Operating Dev

Blurring the line between software development and operations

Scaling down – running cost-effective campaigns on the AWS cloud

[Originally posted on the TriNimbus blog]

If you have ever had to handle a big marketing campaign on a website or a promotion for your SaaS product, you probably know that sizing your servers for the anticipated load is notoriously hard and costly. Typically, you need to buy beefy hardware to power a fleet of servers and deploy a load balancer to split the traffic evenly between them, so that the system can absorb millions of views in a window as short as a few hours.

If you read my post on the OperatingDev blog or the Precursor Games case study on the TriNimbus site, you know that we recently worked with Precursor Games to design and implement a scalable deployment architecture for their crowdfunding campaign. We chose to deploy the system on AWS, and our aim was to ensure the servers could handle up to a million page views in the first few days. This meant deploying many servers behind a load balancer, of course, but using AWS gave us the freedom to focus on the architecture without worrying about the resources we would need to build it. This is what we did:

[Diagram: the deployment architecture on AWS]

You can read about some of the architectural details and the results we achieved in the case study. In this post I want to focus on something else – the power of using cloud services like those provided by AWS to enable architectures that would otherwise be too costly or too difficult to implement.

Just look at the laundry list of technologies we could utilize to handle the anticipated load:

  • Our own domain name service (Route 53) – allowed us to keep everything looking neat and tidy and change things behind the scenes without changing any configuration (a record-update sketch follows the list)

  • Scalable load balancer (ELB) – by pre-warming the load balancer instance we were able to take away any network latency between our servers and the load balancer and go from near zero to 20k GET requests in the first minute after opening up the firewall to let the public access the site

  • Auto-scaling web servers (EC2 with Auto Scaling) – in case the traffic suddenly increased beyond our projections, this would allow Precursor to keep running the site as more capacity would be added automatically; more importantly, we were able to scale the system down from the original 15 servers to 9 and later 6 once the initial storm had passed, saving a significant amount of money for the rest of the campaign (see the scale-down sketch after this list)

  • Automated deployment (Elastic Beanstalk) – not only did this service automate the environment management for us (handling auto-scaling and the ELB), it also allowed us to push changes to the sites live without impacting the existing users posting on the forums or donating money (a deployment sketch follows the list)

  • Fully-managed MySQL (RDS) – the database is typically a single point of failure for many systems and is notoriously hard to scale; RDS made it easy to deploy a fault-tolerant database and scale the compute power and storage up and down with virtually zero impact on the live users (sketched after the list)

  • Fully-managed in-memory cache (ElastiCache) – instead of scaling the database through master-slave replication, we could quickly add a memcached cluster to the environment and configure the sites to cache many of the database queries and PHP objects, which hugely improved application performance and server throughput while letting us run a fairly small instance for the database (a cache-aside sketch follows the list)

  • Shared storage for uploads (S3) – not only did S3 serve as shared storage, so that files uploaded through the websites and forums were immediately available to every server in the group, it also let us take advantage of its ability to act as a web server and handle requests for those uploads directly instead of routing them through our web servers, reducing the traffic that had to reach the load balancer (an upload sketch follows the list)

  • Cached content distribution (CloudFront) – as is typical of most sites, much of the content is static (.js, .css, .jpg files), and pushing it to a CDN not only greatly reduced the number of requests reaching the load balancer and our backend servers but, more importantly, served the lion's share of the network traffic and improved server throughput, since each GET request-response handled by the servers was much smaller than the files served by CloudFront

  • Secure and isolated network (VPC) – while listed last, this is probably the most important service we used, as it allowed us to design a highly secure network and isolate many of the resources behind multiple layers of firewalls and gateways that prevented direct access to all but the load balancer itself and a few admin servers
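
To make the list above a bit more concrete, here are a few short sketches of what the individual pieces look like when driven from code. They use the boto3 Python SDK (AWS's current Python SDK) and made-up resource names (hosted zone IDs, bucket names, and so on) purely for illustration; they are not the exact scripts we ran during the campaign. First, the kind of Route 53 record update that lets you repoint a domain in the background:

    import boto3

    route53 = boto3.client('route53')

    # Repoint www at a new load balancer without touching the application.
    # 'Z1EXAMPLE' and the ELB hostname are placeholders.
    route53.change_resource_record_sets(
        HostedZoneId='Z1EXAMPLE',
        ChangeBatch={
            'Comment': 'Point the site at the campaign load balancer',
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'www.example.com.',
                    'Type': 'CNAME',
                    'TTL': 60,
                    'ResourceRecords': [
                        {'Value': 'campaign-elb-123456.us-east-1.elb.amazonaws.com'},
                    ],
                },
            }],
        },
    )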
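
The scale-down described in the Auto Scaling bullet boils down to a single API call; the group name and sizes below are illustrative:

    import boto3

    autoscaling = boto3.client('autoscaling')

    # After the initial storm, shrink the web tier from 15 servers to 6
    # while still allowing it to grow back if traffic spikes again.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName='campaign-web-asg',   # hypothetical group name
        MinSize=6,
        DesiredCapacity=6,
        MaxSize=15,
    )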
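
Pushing a new version of the site through Elastic Beanstalk looks roughly like this; the application, environment, and bundle names are placeholders, and in practice Beanstalk rolls the new version onto the instances for you:

    import boto3

    eb = boto3.client('elasticbeanstalk')

    # Register a new application version from a bundle already uploaded to S3...
    eb.create_application_version(
        ApplicationName='campaign-site',
        VersionLabel='v42',
        SourceBundle={'S3Bucket': 'campaign-deploys', 'S3Key': 'site-v42.zip'},
    )

    # ...and roll it out to the live environment without taking the site down.
    eb.update_environment(
        EnvironmentName='campaign-prod',
        VersionLabel='v42',
    )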
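
Scaling the RDS instance up or down is a similarly small call; with a Multi-AZ deployment the change can be applied with very little disruption to live users. The instance identifier and class are, again, made up:

    import boto3

    rds = boto3.client('rds')

    # Bump the database to a larger instance class for the launch window.
    rds.modify_db_instance(
        DBInstanceIdentifier='campaign-db',
        DBInstanceClass='db.m1.large',
        ApplyImmediately=True,
    )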
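
The application-side caching against the ElastiCache memcached cluster follows the classic cache-aside pattern. Our sites did this in PHP; the Python sketch below (using the pymemcache library, a made-up cluster endpoint, and a stand-in database helper) shows the same idea:

    from pymemcache.client.base import Client

    # Hypothetical ElastiCache memcached endpoint.
    cache = Client(('campaign-cache.abc123.cfg.use1.cache.amazonaws.com', 11211))

    def get_campaign_total(db):
        """Return the donation total, hitting the database only on a cache miss."""
        key = 'campaign:total'
        cached = cache.get(key)
        if cached is not None:
            return int(cached)

        total = db.query_donation_total()      # expensive SQL query (stand-in)
        cache.set(key, str(total), expire=60)  # keep it for a minute
        return total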
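
And the shared uploads: pushing a user-submitted file to S3 with a public ACL and a Cache-Control header makes it immediately visible to every web server, lets S3 (and CloudFront in front of it) serve the file directly, and keeps those requests off the load balancer. Bucket and key names are placeholders:

    import boto3

    s3 = boto3.client('s3')

    # Store an uploaded image where every web server (and CloudFront) can see it.
    s3.upload_file(
        '/tmp/upload.jpg',                     # file received by the web server
        'campaign-uploads',                    # hypothetical bucket name
        'forum/attachments/upload.jpg',
        ExtraArgs={
            'ACL': 'public-read',              # let S3 serve it directly
            'ContentType': 'image/jpeg',
            'CacheControl': 'max-age=86400',   # let CloudFront cache it for a day
        },
    )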

Thanks to the availability of the above services, we could go from the first call with Precursor to the official launch in just 10 days. And with the built-in monitoring and alerting capabilities of Amazon's CloudWatch service, complemented by New Relic (a cloud service well worth using) and Google Analytics (a cloud service as well), we could closely monitor the system at all times and adjust as the traffic ebbed and flowed.
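
As a rough illustration of the kind of monitoring data we leaned on, here is how you could pull the per-minute request count for a classic ELB out of CloudWatch with boto3; the load balancer name is a placeholder:

    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # Per-minute request counts on the load balancer for the last hour.
    now = datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/ELB',
        MetricName='RequestCount',
        Dimensions=[{'Name': 'LoadBalancerName', 'Value': 'campaign-elb'}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=['Sum'],
    )

    for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
        print(point['Timestamp'], int(point['Sum']))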

Using cloud technologies can be empowering: it frees you from worrying about mundane tasks like deploying a load balancer or provisioning enough servers, and lets you focus on building architectures that are not resource-constrained and are resilient to failure. Just look at the first-minute experience in the graph below and think of the effort and money it would take to handle that in a traditional, non-cloud setting.

[CloudWatch monitoring graph showing the web traffic on the load balancer during the first day of the campaign]
