As part of our Amazon EC2 design, we decided to use S3 as our persistence layer; durability and easy storage growth for our data are very important to us. We knew that S3 would be too slow on its own for real-time access. Luckily for us, the usage profile of our application fits perfectly with a write-through cache strategy. Our users generally need a single dataset accessible for a week or two before they finish their analysis and move on, most likely never to access it again. Django’s file-based cache seemed ideal. We could point it at the generous (and otherwise unused) instance storage on our EBS-backed AMIs and cache several hundred gigabytes of data. If we lost a server there would be a performance penalty while the working dataset was rebuilt through a read-through strategy, but nothing unacceptable for a short period of time.
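For context, the pattern looks roughly like the sketch below: writes persist to S3 and then populate the local cache, while reads hit the cache first and fall back to S3 on a miss. This is only an illustrative sketch, not our actual code; the bucket name, helper names, and use of boto here are assumptions.

```
from django.core.cache import cache
from boto.s3.connection import S3Connection  # boto assumed purely for illustration

bucket = S3Connection().get_bucket('example-datasets')  # hypothetical bucket name

def save_dataset(name, data):
    # Write-through: persist to S3 first, then populate the local file cache.
    bucket.new_key(name).set_contents_from_string(data)
    cache.set(name, data)

def load_dataset(name):
    # Read-through: serve from the cache; on a miss, fetch from S3 and repopulate.
    data = cache.get(name)
    if data is None:
        data = bucket.get_key(name).get_contents_as_string()
        cache.set(name, data)
    return data
```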
Where there is smoke…
We launched. The app was up and running and everything seemed fine. After two weeks of operation, we had accumulated a few performance complaints that we initially chalked up to a placebo effect on our customers. After all, we had announced the upgrade publicly, so any perceived slowdown would easily be associated with the change. We see this all the time.
We do take performance seriously and we want our customers to have a good experience, so we started a cursory investigation anyway. Our office Internet connection is anemic at the best of times and borderline unusable during the day, so I took a field trip to Starbucks where I sat down with a coffee and hit the app as a “real person” might. It was an abysmal experience. So bad in fact that I started to panic. Were we really forcing our users to endure this crap? I raced back to the office to get ready for some serious damage control. When I arrived, I found (to my surprise) no new complaints. In fact, the app was performing fine from the office. Turns out that the local Starbucks has a reputation for terrible wifi.
No experiment can be concluded from a single data point, so a few of us tried it again that night from our homes. The app absolutely screamed! Armed with the knowledge that the app was fast for us, under low load, I began to suspect that our cache implementation was buggy and we were actually missing more frequently than we realized. I sat down with one of my devs and started getting ready to add some instrumentation to our caching code to make sure we were actually utilizing our massive file cache. Speaking of which, how massive is it?
```
$ du -s -h -H .
33M     /mnt/file_cache/
```
33 MB!? After two weeks of operation!? We were clearly not storing as much in the cache as we intended. After a quick round of debugging, we made a fun discovery in the Django docs: FileBasedCache only stores 300 objects by default, regardless of what the expiration time is set to. We had load tested our EC2 environments, but not in a way that exercised cache turnover. We sheepishly put the additional config values in place (5,000,000 entries, stored for up to 1 year) and the cache size immediately shot up to a couple of gigabytes. We were aiming to have around 200 GB of working cache at any given time. We went home that day confident that we had made the app even faster and better for our customers.
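For reference, the configuration change amounted to something like the following, shown here against the Django CACHES setting; the exact location and values are illustrative:

```
# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/mnt/file_cache',
        'TIMEOUT': 60 * 60 * 24 * 365,     # keep entries for up to a year
        'OPTIONS': {
            'MAX_ENTRIES': 5000000,        # the default is a paltry 300
        },
    },
}
```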
Premature confidence is the root of all evil
Around 8AM the next day, one of the web servers threw a high CPU alert and promptly dropped out of the load balancer pool a few minutes later. Uh oh.
Before I could even bring up top on the afflicted box, it cleared its alarms and jumped back into the pool. A few minutes after that, the other web server threw a high CPU alert and dropped. Clearly something was amiss. The metrics showed that we were using far more CPU than usual across our web servers and that the spike in CPU usage was directly correlated with serving requests. Running top showed that the culprit was Apache and that our servers were simply too loaded to respond to new requests.
[Graph: the last day is post-cache fix. Something isn’t right…]
The immediate suspicion was that our file caching fix was the culprit. It was, after all, the only change we had made to the environment in over a week. But it was hard to believe that cutting out constant S3 requests would actually be slower than servicing the requests from our cache. It was at this point that we started to question the caching implementation itself. Just exactly how was Django handling these requests? It only took a few seconds to pull up the code (filebased.py in the core/cache/backends directory) and see the answer to our problem staring us in the face.
Interesting, _cull() is called in each call to set():
```
def set(self, key, value, timeout=None, version=None):
    key = self.make_key(key, version=version)
    self.validate_key(key)

    fname = self._key_to_file(key)
    dirname = os.path.dirname(fname)

    if timeout is None:
        timeout = self.default_timeout

    self._cull()  # Interesting, what is going on here?

    try:
        if not os.path.exists(dirname):
            os.makedirs(dirname)

        f = open(fname, 'wb')
        try:
            now = time.time()
            pickle.dump(now + timeout, f, pickle.HIGHEST_PROTOCOL)
            pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
        finally:
            f.close()
    except (IOError, OSError):
        pass
```
Wonder how _cull() works?
```
def _cull(self):
    if int(self._num_entries) < self._max_entries:  # This should be true for us...
        return

    try:
        filelist = sorted(os.listdir(self._dir))
    except (IOError, OSError):
        return

    if self._cull_frequency == 0:
        doomed = filelist
    else:
        doomed = [os.path.join(self._dir, k) for (i, k) in enumerate(filelist)
                  if i % self._cull_frequency == 0]

    for topdir in doomed:
        try:
            for root, _, files in os.walk(topdir):
                for f in files:
                    self._delete(os.path.join(root, f))
        except (IOError, OSError):
            pass
```
Well, we had max_entries set very high. Surely we were returning before trying to sort the file list…
```
def _get_num_entries(self):
    count = 0
    for _, _, files in os.walk(self._dir):  # Argh! Burn it with fire!
        count += len(files)
    return count
_num_entries = property(_get_num_entries)
```
Ouch. A nasty O(n) operation on every set() call, by way of _get_num_entries(). Whoops. It seems our use case might not have been what the maintainers had in mind when they wrote these classes. The servers were collapsing under their own weight at a measly 12,000 cached objects; we certainly weren’t going to be able to handle 5 million. We quickly commented out the offending line of code (making set() an O(1) operation) and watched our load plummet.
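If you’re curious, the emergency change was equivalent to the monkey-patch below; in practice we just commented out the offending line directly, so treat this as an illustration rather than what actually shipped:

```
from django.core.cache.backends import filebased

# Equivalent of our hotfix: neuter _cull() so set() no longer walks the
# entire cache directory on every write.
filebased.FileBasedCache._cull = lambda self: None
```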
Ahhh. Much better. Almost half the load we had before this whole ordeal started!
The more permanent fix went in last night. We simply subclassed the backend in filebased.py to skip the cull check altogether when max_entries is -1 (a sketch follows below). We then created a cron job to call our manage.py force-cull command; it will run once a month at midnight.
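Here’s roughly what that looks like. This is a sketch only: the class name, command module, and the expired-entry sweep are illustrative, not verbatim from our codebase.

```
import os
import pickle
import time

from django.core.cache.backends.filebased import FileBasedCache
from django.core.management.base import BaseCommand


class LazyCullFileCache(FileBasedCache):
    """File cache that skips the per-set() cull check when max_entries is -1."""

    def _cull(self):
        if self._max_entries == -1:
            return  # cleanup happens out-of-band via the cron-driven command below
        super(LazyCullFileCache, self)._cull()


# management/commands/force_cull.py (what the monthly cron job invokes)
class Command(BaseCommand):
    help = "Delete expired cache files outside of the request path."

    def handle(self, *args, **options):
        now = time.time()
        for root, _, files in os.walk('/mnt/file_cache'):   # assumed cache LOCATION
            for name in files:
                path = os.path.join(root, name)
                try:
                    with open(path, 'rb') as f:
                        expiry = pickle.load(f)  # first pickle in each file is the expiry
                    if expiry < now:
                        os.remove(path)
                except (IOError, OSError, pickle.UnpicklingError):
                    pass
```

The BACKEND entry in the cache settings then points at the subclass instead of Django’s class, and cron simply runs the management command off the request path.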
In the end we had two positive outcomes. First, we netted a big performance increase for our customers. Second, we saw that our application infrastructure was resilient enough to deal with the non-responsive web servers without losing availability for our logged-in users.
We learned a few lessons over the last couple days:
- Caching is hard.
- Don’t assume that open source maintainers are omniscient. They probably didn’t have your exact use case in mind. Read the code for critical infrastructure before you rely on it.
- Make sure your code is as pre-instrumented as possible. Having to scramble to get additional debug-level logging in place was a huge waste of time and introduced additional risk to our app.
- Don’t be too quick to discount feedback from your customers/users, no matter how confident you are!