|Chart 1 - 384 page views on Dec 15, 2008|
Based on the CEO's statement that we were going to become one of the one thousand busiest web services on the Internet, I was tasked with building a scalable system that could grow rapidly (along with all other IT-related duties, but that's another story...). Oh, and because we were a start-up funded solely by friends and family of our founder, I had an extremely lean budget for equipment, facilities and personnel. Basically, the budget was zero.
Admittedly I was a little naive, but I'm also optimistic and very determined. So I set out to do what I was asked.
At the time we were sharing a handful of servers with another start-up in a colo across the country. I had 20 year-old 1U Dell servers, a couple of gigabit switches, two entry-level Cisco firewalls, and two low-end load balancers. I quickly put together a fairly lean list of the servers and networking equipment I needed and tried to get a few hundred grand to buy it and set up at least one separate location. The answer came back that I couldn't spend even one tenth of what I needed, and I had to figure out how to make things work without any capital expenditure.
Then, on January 19-24, 2009, while I was trying to figure out how to work miracles, we had our first Slashdot-effect event: one of our partners had an article containing our player featured on politico.com (note: at the time we were mainly politically oriented; now we are a broad-based news, entertainment and sports organization). We went from averaging fewer than 100 player loads (AKA page views) per day to over 500,000 in a single day. Needless to say, our small company was ecstatic, but I was a bit nervous. Our small infrastructure handled the spike, but just barely.
|Chart 2 - January 19-24, 2009 Slashdot effect|
Admittedly, I was somewhat hesitant to use AWS. First, every server had a public IP address; second, they didn't offer an SLA; and finally, the only way to load balance was to build your own solution with something like HAProxy on EC2 instances. But the compelling factors (elasticity, pay-as-you-go pricing, no CapEx, etc.) were really attractive, especially to someone like me who had no money for equipment and couldn't hire anyone to help build and maintain the infrastructure.
Sometime in the spring of 2009, when AWS announced Elastic Load Balancing, I was swayed and fully embraced moving to "the cloud." I started right away copying our ~200 GB video library and other assets to S3, and spun up a few EC2 instances to run our web, database and application stacks. By August of 2009 we were serving our entire customer-facing infrastructure from AWS and averaging a respectable quarter million page views per day. In October of that year we had our second 500,000+ day, and soon that was happening consistently.
|Chart 3 - 2009 Traffic|
|Chart 4 - Explosive Growth in 2010|
We were still a skeleton crew, but we finally had about ten people across the dev, database and operations groups. Through all of this and well beyond, we never had more than one DBA and one operations/infrastructure guy.
I can't say this growth was without pain, though. There were a few times when traffic spikes hit us unexpectedly, or bottlenecks exposed themselves. But throughout this time we were able to optimize our services, making them more efficient, better able to grow and handle load, and capable of handling more calls per server, driving costs (on a per-call basis) down considerably. And, yes, we benefited greatly from Amazon's non-stop price reductions. I regularly reported to our CEO and others that our traffic was growing exponentially but our costs weren't. Win, win, win!
I'm a bit of a data junkie, and I generate and keep detailed information on the number of calls/hits to our infrastructure, the amount of data returned per call, and ultimately the cost per call. This has enabled me to keep a close eye on performance and costs, and to document our numerous wins and fails. I've identified when particular deployments began making more calls or returning more data, usually causing slower performance and always costing more money. I've also been able to identify the big wins where we improved performance and saved money.
The main way I've done this is by leveraging available CPU capacity when servers are underutilized on evenings and weekends. Currently I analyze close to 1 billion log lines on a daily basis, effectively for free. This is a high-level analysis looking at things like the number of particular calls, bandwidth, HTTP responses, browser types, etc.
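To give a feel for what this kind of rollup involves, here's a minimal sketch that tallies calls per endpoint, HTTP status codes and bandwidth from access-log lines. It assumes the common combined log format; the field layout, function name and sample lines are my illustration, not our actual pipeline.

```python
import re
from collections import Counter

# Combined log format (assumed layout for illustration):
# host ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def rollup(lines):
    """Aggregate call counts per endpoint, status-code mix, and total bytes."""
    calls = Counter()
    statuses = Counter()
    total_bytes = 0
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed lines rather than fail the whole run
        calls[m.group("path").split("?")[0]] += 1  # fold query strings together
        statuses[m.group("status")] += 1
        if m.group("bytes") != "-":
            total_bytes += int(m.group("bytes"))
    return calls, statuses, total_bytes

# Hypothetical sample lines for two of the dynamic call types mentioned above.
sample = [
    '1.2.3.4 - - [19/Jan/2009:10:00:00 -0500] "GET /embed/player.js?v=2 '
    'HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '1.2.3.5 - - [19/Jan/2009:10:00:01 -0500] "GET /services/analytics '
    'HTTP/1.1" 204 0 "-" "Mozilla/5.0"',
]
calls, statuses, total_bytes = rollup(sample)
```

At a billion lines a day the real job would be batched and parallelized across the idle servers, but the per-line work is essentially this.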
Starting in 2009 we really focused on making our systems more efficient: faster, more resilient and more scalable. I've been able to measure the results of those efforts, and we recorded several wins, each time making our products faster, better and less expensive to deliver.
|Chart 5 - More Growth in 2011|
2011 was another banner year for us as we crossed the 20 million and then 30 million page views per day thresholds. When one of our video products loads on a page, as many as 20 calls are made, roughly half to static content and half to dynamic content. All the static files (HTML, CSS, JS, images, video, etc.) are served through CDNs, while the dynamic calls (embed, player services and analytics) are served by EC2 instances behind Elastic Load Balancers. The dynamic services are where I think we really shine; they're the ones whose performance we've fine-tuned as mentioned above.
|Chart 6 - Continued Growth in 2012 and 2013|
In 2012 and 2013 we saw more growth, hitting as many as 78 million page views in a single day; at present, on an average day, our products load on 60 million pages across the web. This translates to about 500 million calls to static content served through CDNs, and another 500 million daily calls to our web services (chart 7 shows four of our busiest web services, but not all of them) powered by web and database servers running in EC2 behind Elastic Load Balancers. Half a billion dynamic service calls per day. Rather impressive!
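For what it's worth, the back-of-the-envelope arithmetic behind those daily totals works out like this. The average-calls-per-page figure is my own assumption, pulled down slightly from the up-to-20-call maximum cited earlier:

```python
# Rough daily-volume arithmetic from the figures in this post.
# calls_per_page is an assumed average; the post says "as many as 20"
# calls per load, split roughly half static / half dynamic.
pages_per_day = 60_000_000      # pages our products load on, on an average day
calls_per_page = 17             # assumed average, under the 20-call maximum
total_calls = pages_per_day * calls_per_page   # ~1 billion calls/day
static_calls = total_calls // 2                # ~500M, served by CDNs
dynamic_calls = total_calls - static_calls     # ~500M, EC2 behind ELBs
```

Any average in the mid-to-high teens lands both halves around the half-billion mark, which is consistent with the numbers above.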
|Chart 7 - AWS CloudWatch Stats Showing Over 400,000 Calls Per Day|
Not only have we been able to leverage the zero CapEx, low OpEx, high availability and scalability of AWS, but we built all of this with a very small team. In the fall of 2012 we had a couple of nearly 80-million-page-view days, and at that time we had fewer than 10 people in the dev, database and operations groups (note: to that point we had never had more than 1 DBA and 1 network operations guy). Since I was the operations “group” up until that time, I am blown away that we could build a world-class infrastructure serving at the scale we do with such a small crew. I believe it's unheard of to build and run a system like ours with only 1 operations guy, and I know that wouldn't have been possible without AWS.