In November, my team and I began an epic journey. We left our long-time homes in the bountiful lands of physical hosting in search of the mystical realm of “the clouds”. Cost of living in physical land was simply out of control.
Last weekend, we arrived.
We left Rackspace for Amazon Web Services. Sorry rackers! Your customer service was great, and your core cloud services seem nice, but Amazon’s pricing structure is way better and they just keep rolling out amazing features.
The good news is that we made it. We launched all of our apps into production on AWS last weekend. With thousands of little challenges (and a few huge ones) behind us, I wanted to get my thoughts into a post before it all blurs into the next big project.
One of the biggest challenges we faced was finding a solid architecture for hosting our app on Amazon. Our needs are fairly straightforward: high availability, high performance (page load), no need for scaling, easily managed by a small dev team with no ops folks. Oh, yeah, and be fully HIPAA HITECH compliant.
We had all sorts of questions: How do you handle load balancing? Multi-zone failover? Encryption? Backups? Monitoring? Replication? Recovery? Security? Having little cloud experience ourselves, we turned to the internet for help. Most of the information we found fell into two categories: Big Time Serious Guys Doing Big Time Serious Stuff (think Netflix) or Sounds Like Us, But No Freaking Details. Maybe our needs are unique (I doubt it) or maybe we are just a little ahead of the curve with some of the features we are using (like VPCs). In the end, we borrowed from everyone and rolled it all up into a pretty great solution.
This post is about what we did, what worked for us, what didn’t. Little lessons that kicked our ass along the way. If you have similar needs, you can think of this as a high-level blueprint for your entire environment. If there is enough interest, I will post code and detailed implementation details in subsequent articles.
To get us started, our requirements were:
- HIPAA HITECH compliance. For us, this means tons of crypto
- Availability-Zone (AZ) redundancy without downtime for customer facing components
- No data loss from single AZ failure
- < 1 hour of data loss from full region failure
- Improved security over our physical hosting
- Improved performance over our physical hosting
- Save massive amounts of money
- Fully automate code deployments and mostly automate infrastructure build out
Plan for the worst, hope for something less than the worst
Everyone is going to have different uptime requirements and corresponding budgets. We thought through the failure scenarios and came up with some likely ones that we needed to address:
- Application failure. Data corrupting bugs, crashes, poor scale, whatever. These are problems we need to solve within the scope of the application itself.
- Isolated hardware failure. This means a cluster dropped offline, killing one of our EBS volumes or running instances.
- Entire availability zone failure. Something went bad with an upgrade and took down the entire zone. Or maybe another replication storm. Either way, we have lost access to everything in that zone.
- Entire region failure. Someone nuked a data center. It’s a smoking crater in the ground. Nothing left.
For our business’s requirements (and our budget), we decided that we should attempt to survive 1, 2, and 3 with no customer impact (zero downtime and zero data loss). In the case of an entire region failure, we accepted that our application would be offline for a period of time while we rebuild the environment in another region. In this scenario, we do not want to lose more than an hour of data, but can tolerate a lag time in availability of that data.
You have to deal with number one (which is essentially software design and quality) no matter how you choose to host. We used the standard approaches for this: good development practices such as code reviews, testing (manual and automated), continuous integration, automated deployments, and backups (we use EBS snapshots for this).
In the second scenario, we can potentially lose several different node types: a web server, a database server, the NAT/VPN box, or a load balancer (ELB). Our design ensures availability as long as we have one of each type available. If we lose a web server, the ELB will automatically drop it from the pool. If we lose a database server, our web servers will automatically fail over to the backup database server. ELBs are designed to be redundant by default.
The third scenario is similar to the second. As long as we ensure that we have the full app stack in each availability zone, then we can lose an entire zone and let the load balancer redirect traffic to the surviving one.
The fourth scenario requires a more dynamic strategy. We do not want to bear the financial and administrative burdens of spanning regions (or cloud providers), so we are willing to tolerate downtime in the (hopefully) unlikely scenario that Amazon loses an entire region. Our strategy to handle this type of failure is something like this:
- Assess the extent of the failure based on available information and estimate how long the app will be unavailable.
- If the outage looks likely to last more than a few minutes, redirect DNS to a failure page hosted elsewhere with a short TTL.
- If the outage looks likely to last longer than a few hours, begin bringing up a new environment in a non-affected region using our cold failover VMs and/or utilizing EBS snapshots. In the event of a complete Amazon failure (all regions down everywhere), we would start the build out on another provider.
In a major outage, the time to recover will vary wildly with the availability of various AWS services. If EC2 is down, but S3 is available, our recovery times will be considerably quicker than if we are not able to spool up new VMs, load balancers, etc. You can partially mitigate this with cold VMs and load balancers but there are costs associated with doing so.
It is important to remember that this is what worked for us given our requirements. If you need to survive region failures, you are going to have to go far beyond what we have done here.
Regions and Availability Zones
Before we go any further, we must discuss geography in Amazon’s cloud. Amazon’s various web services are available across multiple regions. Regions are groups of data centers that are geographically dispersed from each other. Amazon considers traffic between regions to be equivalent to regular internet traffic and you are charged as such.
By contrast, “Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region.” In other words, they are somewhat physically isolated from each other even though they reside in the same region. I would imagine they have fully separate (and redundant) power systems, networks, etc. Yet, they are close enough together to be connected at very high speeds.
Did you catch the imagined bit? Amazon provides very few concrete details around their implementation of their services. Perhaps the biggest requirement placed on you for moving onto Amazon’s cloud is that you have to extend quite a bit of trust to them. If you do not trust them, the burden is on you to engineer your way around the issue. Realizing savings in the cloud means trusting one or more 3rd parties. The less you trust them, the less you save.
We need to care about availability zones for two reasons:
- AWS provides tooling and support for spanning availability zones, but not regions. It is possible to build an “off the shelf” VPC that spans two or more AZs through the web interface. If you want to span regions, you are on your own and it’s gonna cost you more too.
- Amazon’s EC2 SLA specifically says that you must have unavailable instances in more than one AZ for that region to be considered unavailable, thus qualifying you for compensation. Amazon does not consider a region “down” unless more than one AZ is toast. Starting to catch my drift? They’ve told you to run your app across availability zones and not to come crying to them if you are caught out when one goes down.
Amazon is practically telling you that AZs can fail. You should assume they are going to fail. They will fail. They have failed. Go read about a big example of this type of failure right now. I’ll wait.
Scared yet? You should be. A healthy fear of the cloud is important for success. You are about to add a massive layer of abstraction and only the prepared will stay up in the next great AWS outage.
Virtual Private Clouds
The foundation of our design starts with a rather new AWS feature: virtual private clouds (VPC). A VPC gives you the ability to have your own subnets that use reserved private network IP blocks (192.168.0.0/16, 10.0.0.0/8). These IP ranges are not internet routable, meaning that a remote client cannot even reach hosts on your VPC subnets–a huge security benefit. In addition to private IPs, VPCs offer quite a few other security improvements, such as bi-directional security group rules, network ACLs, and VPN access points.
VPCs will complicate your environment. Each host will no longer have direct internet access and will have to use network address translation (NAT) for any outbound traffic that originates from inside the environment (like software updates). You will also have to handle routing and network level access controls. There are also pricing impacts. Hosts on a VPC will not have direct access to S3 via sideband interfaces and will incur bandwidth costs when accessing S3 that regular EC2 hosts will not. With one of our guiding goals being to maximize security of our patient health information, using a VPC was an easy choice.
For our design, we opted for a class B network with four subnets spanning two AZs: two DMZs and two private. This allowed us to have both a DMZ and a private subnet in each of our AZs. The DMZ is where any public facing hosts live. In our case, that means our elastic load balancers and our NAT/VPN host. The private subnets are where our web servers and database servers are located. The logical separation is a benefit, but the main reason we need the DMZ subnets is routing. Elastic load balancers will inherit their routing rules from the VPC subnets they face. In order to make NAT work correctly, the private subnets have a default route to the NAT host (outbound traffic must be NAT’ed). For the ELBs and the NAT/VPN host to return responses to incoming traffic, they must route packets to the internet gateway. Thus, the DMZs use the internet gateway (IGW) for their default route and the private subnets use the NAT host.
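To make the layout concrete, here is a minimal sketch of how the subnets and route tables could be scripted with boto. The CIDR blocks, availability zone names, and the NAT instance id are illustrative placeholders, not our actual values:

```python
# Sketch: building the VPC layout with boto. CIDRs, AZ names, and the NAT
# instance id are illustrative placeholders, not our production values.
import boto

vpc_conn = boto.connect_vpc()

vpc = vpc_conn.create_vpc('10.10.0.0/16')            # the class B network
igw = vpc_conn.create_internet_gateway()
vpc_conn.attach_internet_gateway(igw.id, vpc.id)

# One DMZ and one private subnet per availability zone
dmz_a  = vpc_conn.create_subnet(vpc.id, '10.10.1.0/24',  availability_zone='us-east-1a')
dmz_b  = vpc_conn.create_subnet(vpc.id, '10.10.2.0/24',  availability_zone='us-east-1b')
priv_a = vpc_conn.create_subnet(vpc.id, '10.10.11.0/24', availability_zone='us-east-1a')
priv_b = vpc_conn.create_subnet(vpc.id, '10.10.12.0/24', availability_zone='us-east-1b')

# DMZ subnets default-route to the internet gateway
dmz_rt = vpc_conn.create_route_table(vpc.id)
vpc_conn.create_route(dmz_rt.id, '0.0.0.0/0', gateway_id=igw.id)
for subnet in (dmz_a, dmz_b):
    vpc_conn.associate_route_table(dmz_rt.id, subnet.id)

# Private subnets default-route to the NAT instance
priv_rt = vpc_conn.create_route_table(vpc.id)
vpc_conn.create_route(priv_rt.id, '0.0.0.0/0', instance_id='i-0123abcd')  # NAT/VPN host
for subnet in (priv_a, priv_b):
    vpc_conn.associate_route_table(priv_rt.id, subnet.id)
```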
I have mentioned the NAT/VPN host a few times here. I will explain that now. As discussed above, outbound traffic from inside this environment, such as software updates, git pulls, mail, or DNS resolution, must traverse the network through a NAT device. Amazon makes this easy by providing an AMI pre-configured to handle this duty. This does mean that you will have to eat the cost of another instance running 24/7. Luckily, it can be a small, maybe even a micro if they ever release those for VPCs. Another consequence of the VPC is that you will have to VPN to the network in order to directly access the hosts for administrative activities (unless you set up some port forwarding–not recommended for security). We decided to use OpenVPN and run it off the NAT device in order to avoid spinning up another 24/7 instance.
This is our general design.
The NAT/VPN host is in its own security group that allows inbound traffic from our corporate IP range for VPN access. It allows outbound traffic to any IP for NAT duty.
One last point on VPCs. Amazon states that your VPC traffic is isolated from everyone else. You can probably take this at face value; however, they do not provide any specifics about how this is accomplished. In order for us to meet our HIPAA HITECH goal, we needed to be able to guarantee that our PHI will not be exposed. This meant we needed to know how VPC traffic isolation is achieved if we were to rely on it. We tried reaching out to AWS support and they tried to be helpful, but ultimately they could not provide specific details, let alone a documented implementation.
This led us to a core design decision that impacted every aspect of our implementation: all traffic and all data at rest that contains PHI must be encrypted, even traffic within the VPC.
Elastic Load Balancers (ELB)
There are only two ways for traffic originating outside our environment to enter: through an elastic load balancer, or through the NAT/VPN host. Each of our hosted applications has a load balancer assigned to it with a minimum of two web hosts (in two different AZs) assigned to its pool. The load balancers share a security group that allows only inbound (from anywhere) and outbound (to anywhere) TCP traffic on 80 and 443. In addition to balancing the traffic across hosts in the pool, the ELB will take a host out of the pool automatically if it fails. This is handled using a health check (HTTP GET) to a route in our apps.
We also use the ELBs to terminate client SSL connections, so they are configured with our signed certificates and corresponding signing chains. The ELBs then initiate new SSL connections to the web servers in their pools to forward traffic. These “inside” SSL connections do not require signed certificates, as we control both sides of the connection.
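For the curious, here is roughly how one of these load balancers could be set up with boto. The app name, certificate ARN, health check route, and subnet/instance ids are placeholders; the real configuration lives in our automation rather than a one-off script:

```python
# Sketch: one app's ELB with SSL termination and a health check (boto).
# Names, the certificate ARN, and the subnet/instance ids are placeholders.
import boto
from boto.ec2.elb import HealthCheck

elb_conn = boto.connect_elb()

listeners = [
    (80, 80, 'http'),                                  # plain HTTP pass-through
    (443, 443, 'https', 'arn:aws:iam::123456789012:server-certificate/example'),
]

lb = elb_conn.create_load_balancer(
    'app-prod', None, listeners,
    subnets=['subnet-aaaa1111', 'subnet-bbbb2222'],    # the two DMZ subnets
    security_groups=['sg-0123abcd'])

# Pull a web host out of the pool if the app stops answering its health route
health = HealthCheck(interval=30, healthy_threshold=2, unhealthy_threshold=5,
                     target='HTTP:80/health_check/')
lb.configure_health_check(health)

lb.register_instances(['i-web1aaaa', 'i-web2bbbb'])    # one web node per AZ
```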
ELBs can switch IP addresses and can change dynamically at any time. AWS does this for load balancing and availability reasons. Instead of directing your traffic to a single IP address like a traditional load balancer, you are given a DNS name with a short TTL to send your traffic to. Typically, you simply add a CNAME record to your DNS provider so that your customers see your domain.
One interesting consequence of this strategy is that you cannot direct a zone apex, like fzysqr.com, to a CNAME. It is against the DNS spec. You have to use Amazon’s Route53 DNS service to get around the issue.
Web and database nodes
Our products are fairly standard web apps. At a minimum, they need to have at least one web server node and one database server node. In order to meet our fail-over requirements, we went with two web nodes per app and two database servers that are shared across all the apps. We decided to share the database nodes between our apps for cost purposes.*
*Protip: Build a spreadsheet model based on your current and expected usage. For us, it was extremely helpful to determine our cost/performance/complexity trade-off curves.
Amazon provides many different instance types with more becoming available each month. Each instance type can be “backed” by either elastic block stores (EBS) or instance storage. Instance storage is ephemeral, meaning when the instance stops, the data is lost. EBS is distributed, fault tolerant, theoretically faster (debatable), and more importantly, it hangs around even if you stop the VM. Even if you choose to use EBS backed VMs (we did) you still have access to the instance storage if you want it. This can be useful, because you get several hundred GB for “free” with each VM.
For web servers, we ended up with EBS backed c1.medium instances. These are considered “High-CPU”. We initially tried to get away with m1.smalls, but during load testing, we found that a small would fall over at around 15 concurrent users loading our slowest page repeatedly. Under load, the small instances were reporting 50% steal time in top. This meant that our app was getting blocked by other guests on the cluster. We also theorized that our VMs were on a more fully allocated cluster and were getting shorted on network and disk I/O as well. Bumping the size to c1.mediums allowed our app to scale up to 50 concurrent users without breaking a sweat and would easily handle our normal usage patterns.
The web servers are fairly straightforward Apache/mod_wsgi/Django setups. Each application has two, one for each availability zone. The load balancer distributes traffic using a round robin algorithm. There are three special features that we added for the AWS deployment.
Each app is configured to use the instance storage for its Django cache. This allowed us to cache data stored in S3 on disk, reducing S3 usage costs and increasing performance. Having independent caches on each web node presented a new problem with cache consistency and invalidation across the load balanced pool. We ended up writing an API for the servers to remotely invalidate each other’s cache to solve this problem.
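A stripped-down sketch of the idea is below. The endpoint path, peer list, and error handling are hypothetical simplifications of what we actually run:

```python
# Sketch of the cache invalidation idea: each web node exposes an invalidation
# endpoint and a helper that tells its peers to drop a key. The endpoint path
# and peer list are hypothetical, not our actual implementation.
import requests
from django.core.cache import cache
from django.http import HttpResponse

PEER_NODES = ['10.10.11.21', '10.10.12.21']   # the other web hosts in the pool

def invalidate(request, key):
    """Called remotely by a peer after it changes the underlying data."""
    cache.delete(key)
    return HttpResponse('ok')

def invalidate_everywhere(key):
    """Drop a key locally, then ask every peer to do the same."""
    cache.delete(key)
    for host in PEER_NODES:
        try:
            requests.post('https://%s/cache/invalidate/%s/' % (host, key),
                          timeout=2, verify=False)  # unsigned certs inside the VPC
        except requests.RequestException:
            pass  # a dead peer will rebuild its cache on the next miss anyway
```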
Each app needed to be pointed to a primary database server, yet able to fail over to the secondary. To handle this, each web server runs its own instance of haproxy that is configured via a chef recipe to monitor availability of our database servers. The Django app is set to send all database traffic to localhost which is then forwarded on to a database server. Because each app has its own haproxy instance, we avoid having a single point of failure without having to run a MySQL cluster. The haproxy setup was based on Alex Williams’ guide with our own tweaks to the mysqld status scripts.
The apps are configured to send all database traffic over SSL. Remember, we don’t trust the VPC.
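Putting those last two pieces together, the relevant Django database settings end up looking roughly like this sketch (database name, credentials, and certificate paths are placeholders):

```python
# Sketch of the relevant Django settings: all queries go to the local haproxy,
# which forwards them to whichever database server is currently primary, and
# the MySQL connection itself is made over SSL. Names and paths are placeholders.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'appdb',
        'USER': 'app',
        'PASSWORD': 'not-really',
        'HOST': '127.0.0.1',          # local haproxy, not a database host
        'PORT': '3306',
        'OPTIONS': {
            'ssl': {
                'ca':   '/etc/mysql/certs/ca-cert.pem',
                'cert': '/etc/mysql/certs/client-cert.pem',
                'key':  '/etc/mysql/certs/client-key.pem',
            },
        },
    },
}
```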
Our database servers are m1.larges. We run MySQL in a master-master replication setup similar to the technique described in this tutorial. A nifty trick to avoid auto-incremented key collisions is to have each server use only even or only odd integers when allocating new keys (set auto_increment_increment to 2 on both servers, with auto_increment_offset at 1 on one and 2 on the other).
We had initially hoped to distribute our writes across both database servers. We quickly found that this wreaked havoc on session state stored in the databases and would frequently cause replication conflicts. ELBs have a stickiness feature that mostly solved this problem for us. However, occasionally the ELB would seem to forget that it was supposed to stick, breaking replication and causing a huge mess. We tried a few different approaches to tackling this issue before we decided that a proper solution to distribute database writes would require sharding, which we did not have time for. Instead we opted to balance our load by allocating each application a primary and secondary database server. The secondary would only be used in the event of the primary becoming unavailable. This meant a somewhat cruder load distribution, but at our scale the simplicity was a good compromise. We kept the master-master replication in place so that the servers could hot fail over without human or scripted intervention.
Because we are paranoid freaks, we knew we would have to encrypt our databases. We accomplished this by locating the database stores on an encrypted EBS volume. We used dm-crypt to handle the encryption. We wrote some nifty scripts to mount the volumes and start up mysqld when the server is booted. Yes, you have to type in the password every time the server reboots. Yes, if you lose that password, you are toast. Don’t lose it.
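The mount-and-start script boils down to something like the following sketch. The device name and mount point are placeholders, and our real scripts do considerably more sanity checking:

```python
#!/usr/bin/env python
# Sketch of the boot-time unlock script: open the dm-crypt volume (cryptsetup
# prompts for the passphrase), mount it, and start MySQL. The device name and
# mount point are placeholders.
import subprocess

DEVICE = '/dev/xvdf'               # encrypted EBS volume
MAPPER = 'mysql_data'
MOUNT_POINT = '/var/lib/mysql'

# cryptsetup asks for the passphrase on the console -- this is the part that
# requires a human at every reboot
subprocess.check_call(['cryptsetup', 'luksOpen', DEVICE, MAPPER])
subprocess.check_call(['mount', '/dev/mapper/%s' % MAPPER, MOUNT_POINT])
subprocess.check_call(['service', 'mysql', 'start'])
```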
We also set up the master-master replication to work over SSL. Trust no one!
Simple Storage Service (S3)
By far our favorite AWS tool is S3. We have to hold on to our customer data indefinitely and it can grow to be quite large. S3 means never having to worry about running out of space. The service is awesome, and cheap to boot, though not quite fast enough to replace local storage.
To get around potential performance problems, we ported our app to use S3 as primary storage with the VM instance storage as a giant local cache. When a client requests a dataset not in the cache, the app fetches it from S3 and stores it locally on the VM for up to a year. Since we store the datasets in the cache when they are uploaded, the vast majority of the time, our users will never even experience a cache miss. This strategy has a nice side effect: S3 only charges data transfer fees on reads; writes are free. Reducing our reads to almost zero means cheaper bills for us.
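The read path looks roughly like this sketch. The bucket name, key layout, and cache directory are placeholders, and the real code also layers in the encryption described below plus cache expiry:

```python
# Sketch of the read path: check the local instance-storage cache first and
# only fall back to S3 on a miss. Bucket, key layout, and cache directory are
# placeholders.
import os
import boto

CACHE_DIR = '/mnt/dataset-cache'          # instance (ephemeral) storage
BUCKET = 'example-datasets'

def get_dataset(key):
    cached = os.path.join(CACHE_DIR, key)
    if os.path.exists(cached):
        with open(cached, 'rb') as f:     # cache hit: no S3 request at all
            return f.read()

    s3 = boto.connect_s3()
    k = s3.get_bucket(BUCKET).get_key(key)
    data = k.get_contents_as_string()

    with open(cached, 'wb') as f:         # warm the cache for next time
        f.write(data)
    return data
```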
Our biggest concern with S3 was protecting the PHI. S3 offers server side encryption that is disturbingly easy to use. This feature is meant to earn them “check box” compliance with various regulations, including HIPAA. But, unfortunately, we trust no one, remember? How do we know our keys are secure? How do we know that a compromised user account isn’t getting used to bypass the encryption entirely?
We had to roll our own application level encryption with keys under our own control. Our code simply encrypts (and compresses) our data with pycrypto before it is POSTed into S3. We encrypt locally cached objects as well.
Since it was easy, we left the SSE encryption in place too. Double encryption. Overkill? Probably.
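A simplified sketch of the upload path is below. The key handling, padding, and bucket name are placeholders; do not treat this as a vetted crypto implementation:

```python
# Sketch of the client-side encryption: compress, then AES-encrypt with a key
# we control, then upload. Key handling, padding, and the bucket name are
# simplified placeholders, not production code.
import os
import zlib
import boto
from Crypto.Cipher import AES

BUCKET = 'example-datasets'

def pad(data, block=16):
    """PKCS#7-style padding so the payload is a multiple of the AES block size."""
    n = block - len(data) % block
    return data + chr(n) * n

def encrypt_and_store(key_name, data, aes_key):
    iv = os.urandom(16)
    cipher = AES.new(aes_key, AES.MODE_CBC, iv)
    payload = iv + cipher.encrypt(pad(zlib.compress(data)))

    bucket = boto.connect_s3().get_bucket(BUCKET)
    k = bucket.new_key(key_name)
    # SSE on top of our own encryption -- belt and suspenders
    k.set_contents_from_string(payload, encrypt_key=True)
```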
A couple of quick S3 tips:
- If you are using Python, use boto to access S3 (or any AWS service).
- Set up restricted users in IAM that only have access to one bucket (e.g. dev vs prod) and limit them to as few privileges as you can.
Automated EBS snapshots
Part of satisfying the “lose less than one hour of data” requirement was handling catastrophic failure of both database servers in a replication set. Our initial plan was to replicate the data to a slave outside of our region, but this proved to be annoyingly difficult with a VPC. We would need to set up a software region-to-region VPN or relax our security. Neither option appealed to us. Instead, we utilized a little Python/boto script that creates and rotates EBS snapshots once an hour. The EBS snapshots are stored in S3 and thus are as safe as any of our non-database critical data: 99.999999999% durability. If we lose a region (and all database servers within), we can at least recover back to the last snapshot. Of course, if we just lose one database server, we are only going to be missing any non-replicated writes.
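The snapshot job amounts to something like this sketch (volume ids and the retention window are placeholders):

```python
# Sketch of the hourly snapshot job: snapshot the database volumes, then prune
# anything older than the retention window. Volume ids and the retention
# period are placeholders.
from datetime import datetime, timedelta
import boto

DB_VOLUMES = ['vol-1111aaaa', 'vol-2222bbbb']
KEEP = timedelta(hours=48)

ec2 = boto.connect_ec2()

for vol_id in DB_VOLUMES:
    ec2.create_snapshot(vol_id, description='hourly backup of %s' % vol_id)

for snap in ec2.get_all_snapshots(owner='self'):
    if snap.volume_id in DB_VOLUMES:
        started = datetime.strptime(snap.start_time[:19], '%Y-%m-%dT%H:%M:%S')
        if datetime.utcnow() - started > KEEP:
            snap.delete()
```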
Monitoring with CloudWatch
To track all this fancy infrastructure, you need a monitoring system. Lucky for you, Amazon is there once again with a solution: CloudWatch. For a small monthly fee, you can get detailed metrics on disk usage, network IO, and CPU usage across all your VMs, your ELBs, and most of the other AWS products. For another small fee, you can report your own custom metrics on absolutely anything. All this metric reporting comes with a powerful system to set up alarms when the numbers fall below critical thresholds. And graphs. Awesome graphs.
Whoops. Someone left the iron plugged in.
We use custom CloudWatch metrics for a few different alarms:
- Free disk space on our EBS volumes
- MySQL replication status
- MySQL process status
Again, boto makes it easy to write little Python scripts to build up and report metrics. We run them from cron jobs on each server.
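As an example, a free-disk-space reporter can be as small as the sketch below. The namespace, metric name, and mount point are placeholders:

```python
# Sketch of one cron-driven metric script: report free space on the database
# volume as a custom CloudWatch metric. Namespace, metric name, and mount
# point are placeholders.
import os
import boto

MOUNT_POINT = '/var/lib/mysql'

stats = os.statvfs(MOUNT_POINT)
free_gb = stats.f_bavail * stats.f_frsize / float(1024 ** 3)

cw = boto.connect_cloudwatch()
cw.put_metric_data(namespace='OurApp/Database',
                   name='FreeDiskSpaceGB',
                   value=free_gb,
                   unit='Gigabytes',
                   dimensions={'Host': os.uname()[1]})
```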
Chef Server, chef recipes, infrastructure automation
Throughout this post, I have discussed a ton of different configuration and implementation work we have had to do. Some of it probably sounds much worse than it was. This is because we make great use of Chef and Chef Server, an infrastructure automation tool. Chef is a set of different server components that are all tied together around a Ruby domain-specific language. Basically, you organize “recipes” into “cookbooks” and then deploy them out to your server environments.
The upshot of using Chef is that we can write something like our disk space monitoring metric script above, add it to our Chef repository, and within the hour it will be installed automatically on every server in our environment. There are lots of great freely available Chef recipes to build on and we have a ton of our own custom recipes.
We use Chef to:
- replicate ssh keys out to each environment
- install all of our software dependencies
- set up our metrics scripts
- install and update cron jobs
- run EBS snapshots
- set up user groups and file system permissions
… and much more. We chose to host our own installation of Chef Server (because we’re cheap) but you can have Opscode do it for you, for a monthly fee.
Automated deploys using TeamCity
Our last bit of trickery is our fully automated code deployments. We use TeamCity and copious bash scripting to make this work. Because of the VPC, it has to be a bit more complex than we prefer. To deploy, TeamCity does the following (see the sketch after this list):
- checks out source from Kiln
- runs unit tests
- packages code into tarball
- starts a VPN connection to the target VPC
- scp transfers the tarball to the target servers in the VPC
- uncompresses the tarball and runs the deploy script
- runs database migrations
- restarts Apache using a graceful restart
- disconnects the VPN
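The real pipeline is TeamCity driving bash scripts, but the shape of it is captured by this Python-flavored sketch. Host names, paths, and the VPN configuration are hypothetical:

```python
# Sketch of the deploy flow. The real pipeline is TeamCity plus bash; host
# names, paths, and the VPN config here are hypothetical placeholders.
import subprocess

WEB_HOSTS = ['10.10.11.21', '10.10.12.21']
TARBALL = 'release.tar.gz'

def run(*cmd):
    subprocess.check_call(list(cmd))

run('tar', 'czf', TARBALL, 'app/')                           # package the build
run('sudo', 'openvpn', '--config', 'prod.ovpn', '--daemon')  # join the VPC

for host in WEB_HOSTS:
    run('scp', TARBALL, 'deploy@%s:/tmp/' % host)
    run('ssh', 'deploy@%s' % host,
        'cd /tmp && tar xzf %s && ./deploy.sh' % TARBALL)    # migrations + graceful restart

run('sudo', 'killall', 'openvpn')                            # drop the VPN
```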
Having a reliable and well-tested automated deployment process reduces the risk of deployment related downtime. We often deploy mid-day and our users don’t even know it happened. *
*Protip: this requires solid upstream testing automation to work.
Tired of typing
My original idea for this blog post was a bit grand. There are simply too many things I would love to share. All the little gotchas and technical details. So let’s consider this the overview. This can be the framework for me to hang little tidbits of implementation and chunks of code on over the next several months. If there is anything you would love to see first, please let me know at email@example.com.