Thursday, March 6, 2014

How Did One Guy Build A World Class Web Infrastructure Serving Over One Billion Calls Per Day?

On Monday, December 15, 2008 I went to work for an online video start-up. On my first day our video players loaded on 384 of our partners' web pages. As you can see from the first chart, that was a pretty good day for us (in fact, our biggest day to date), as we were averaging fewer than 100 pages a day. During my interview, the founder and CEO confidently told me the company would be one of the 1,000 busiest sites within one year. I believed him.

Chart 1 - 384 page views on Dec 15, 2008
I had a comfortable job as a Network Operations Director with a mature and stable company, but I felt I needed a change, so I was casually looking for something new. A colleague had recently found a job posted on craigslist.org, so I was trolling the IT job postings there. I was frustrated because most jobs wanted specific skills, like Active Directory, Exchange, routers, firewalls, etc. I didn't want to do just one thing or focus on a narrow area. Then I came across a post on craigslist looking for a utility player, someone with varied experience who could wear many hats. It was a match made in heaven. (I later found out the company had been frustrated at not being able to find someone with a wide range of experience, as all previous applicants had been siloed in specific areas. They had actually given up and pulled the post less than a day after I ran across it.)

Based on the CEO's statement that we were going to become one of the one thousand busiest web services on the Internet, I was tasked with building a scalable system that could grow rapidly (along with all other IT-related duties, but that's another story...). Oh, and because we were a start-up funded solely by friends and family of our founder, I had an extremely lean budget for equipment, facilities and personnel. Basically the budget was zero.

Admittedly I was a little naive, but I'm also optimistic and very determined. So I set out to do what I was asked.

At the time we were sharing a handful of servers with another start-up in a colo across the country. I had 20 year-old 1U Dell servers, a couple of gigabit switches, two entry-level Cisco firewalls, and two low-end load balancers. I quickly put together a fairly lean list of the servers and networking equipment I needed and tried to get a few hundred grand to buy it and set up at least one separate location. The answer came back that I couldn't spend even one tenth of what I needed, and I had to figure out how to make things work without any capital expenditure.

Then, on January 19-24, 2009, while I was trying to figure out how to work miracles, we had our first Slashdot effect event when one of our partners had an article containing our player featured on politico.com (note: at the time we were mainly politically oriented; now we are a broad-based news, entertainment and sports organization). We went from averaging fewer than 100 player loads (AKA page views) per day to over 500,000 in a single day. Needless to say our small company was ecstatic, but I was a bit nervous. Our small infrastructure handled the spike, but just barely.

Chart 2 - January 19-24, 2009 Slashdot effect
Joining this new company was my introduction to Amazon Web Services, and I started dabbling with EC2 and S3 right away. In fact, the company had started running its corporate website on EC2 a little over a month before I started, and we ran it on the exact same server for just over five years.

Admittedly I was somewhat hesitant to use AWS. First, there was the concept of every server having a public IP address; then the fact that they didn't have an SLA; and finally, the only way to load balance was to build your own with something like HAProxy on EC2 servers. But the compelling factors (elasticity, pay as you go, no CapEx, etc.) were really attractive, especially to someone like me who didn't have any money for equipment and couldn't hire anyone to help build and maintain the infrastructure.

Sometime in the spring of 2009 when AWS announced Elastic Load Balancing I was swayed and fully embraced moving to "the cloud." I started right away copying our (~200 GB) video library and other assets to S3, and started a few EC2 servers on which I started running our web, database and application stacks. By August of 2009 we were serving our entire customer-facing infrastructure on AWS, and averaging a respectable quarter million page views per day. In October of that year we had our second 500,000+ day, and that was happening consistently.

Chart 3 - 2009 Traffic
Through most of 2009 our company had 1 architect, 0 DBAs (so this job defaulted to me), and 1 operations/infrastructure guy (me), and we were outsourcing our development. We finally started to hire a few developers and brought all development in-house, and we hired our first DBA, but it was still a skeleton crew. By the end of that year we were probably running 20-30 EC2 servers, had a couple of ELBs, and stored and served (yes, served) static content on S3. Things were going fairly well and we were handling the growth.

Chart 4 - Explosive Growth in 2010
2010 was a banner year for us. In Q1 we surpassed 1 million, 2 million and even 5 million page views per day. And by Q3 we were regularly hitting 10 million per day. Through it all we leveraged AWS to handle this load, adding EC2 servers, up-sizing servers, etc. And (this is one of my favorite parts) we didn't have to do a thing with ELB, as AWS scaled that up for us automatically as needed.

We were still a skeleton crew, but finally had about ten people in the dev, database and operations group(s). Through this all and well beyond we never had more than one DBA, and one operations/infrastructure guy.

I can't say this growth was without pain, though. We did have a few times when traffic spikes would unexpectedly hit us, or bottlenecks would expose themselves. But throughout this time we were able to optimize our services, making them more efficient, better able to grow and handle load, and able to handle more calls per server, which drove costs (on a per-call basis) down considerably. And, yes, we benefited greatly from Amazon's non-stop price reductions. I regularly reported to our CEO and others about how our traffic was growing exponentially but our costs weren't. Win, win, win!

I'm a bit of a data junkie, and I generate and keep detailed information on the number of calls/hits to our infrastructure, the amount of data returned per call, and ultimately the cost per call. This has enabled me to keep a close eye on performance and costs, and to document our numerous wins and fails. I've identified when particular deployments began making more calls or returning more data, which usually caused slower performance and always cost more money. I've also been able to identify when we've had big wins by improving performance and saving money.

The main way I've done this is to leverage available CPU capacity when servers have been underutilized on evenings and weekends. Currently on a daily basis I analyze close to 1 billion log lines, effectively for free. This is a high-level analysis looking at things like numbers of particular calls, bandwidth, HTTP responses, browser types, etc.
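To make the idea concrete, here is a minimal sketch of that kind of high-level pass over access logs. It is not the actual tooling described above: the log format (Apache/Nginx "combined" style), the sample paths, and the names `LOG_PATTERN`, `summarize` and `cost_per_call` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: lines look like the Apache/Nginx "combined" access log format.
# The post does not say which format or tools were actually used.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def summarize(lines):
    """Count calls per path and per HTTP status, and total the bytes returned."""
    calls = Counter()
    statuses = Counter()
    total_bytes = 0
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        path = match.group("path").split("?", 1)[0]  # ignore the query string
        calls[path] += 1
        statuses[match.group("status")] += 1
        size = match.group("size")
        if size != "-":
            total_bytes += int(size)
    return calls, statuses, total_bytes


def cost_per_call(total_cost, total_calls):
    """Hypothetical helper: spread a period's cost evenly over its calls."""
    return total_cost / total_calls if total_calls else 0.0


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        '203.0.113.5 - - [06/Mar/2014:10:00:00 +0000] "GET /embed?id=1 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
        '203.0.113.6 - - [06/Mar/2014:10:00:01 +0000] "GET /embed?id=2 HTTP/1.1" 200 488 "-" "Mozilla/5.0"',
        '203.0.113.7 - - [06/Mar/2014:10:00:02 +0000] "POST /analytics HTTP/1.1" 500 - "-" "Mozilla/5.0"',
    ]
    calls, statuses, total_bytes = summarize(sample)
    print(calls.most_common(), dict(statuses), total_bytes)
```

A script along these lines can be scheduled on servers during their quiet hours, and comparing the per-path counts and bytes from one deployment to the next is one way to spot a release that starts making more calls or returning more data.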

Starting in 2009 we really began to focus on making our systems more efficient, faster, more resilient and more scalable. I've been able to measure the results of those efforts, and we recorded several wins, each time making our products faster, better and less expensive to deliver.

Chart 5 - More Growth in 2011
2011 was another banner year for us, and we crossed the 20 million and 30 million page views per day thresholds. When our video products load on a given page, as many as 20 calls are made to static and dynamic content, roughly half of each type. All the static files (HTML, CSS, JS, images, video, etc.) are served through CDNs. But all the dynamic calls (embed, player services and analytics) are served through EC2 servers behind Elastic Load Balancers. And this is where I think we really shine. These are the services whose performance we've really fine-tuned, as mentioned above.

Chart 6 - Continued Growth in 2012 and 2013
In 2012 and 2013 we saw more growth, hitting as many as 78 million page views in a single day, and at present on an average day our products load on 60 million pages across the web. This translates to about 500 million calls to static content served through CDNs, and another 500 million daily calls to our web services (chart 7 shows four of our busiest web services, but not all of them) powered by web and database servers running in EC2 behind Elastic Load Balancers. That's half a billion dynamic service calls per day. Rather impressive!
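For a sense of scale, the daily totals above can be converted into average per-second rates. The short sketch below uses only the figures quoted in this post; the variable and function names are just for illustration, and the results are daily averages, so real peaks would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in the post for an average day.
pages_per_day = 60_000_000
static_calls_per_day = 500_000_000
dynamic_calls_per_day = 500_000_000


def per_second(daily_total):
    """Average rate per second for a daily total (ignores peaks and troughs)."""
    return daily_total / SECONDS_PER_DAY


total_calls_per_day = static_calls_per_day + dynamic_calls_per_day
calls_per_page = total_calls_per_day / pages_per_day

print(round(per_second(dynamic_calls_per_day)))  # 5787 dynamic calls per second
print(round(per_second(total_calls_per_day)))    # 11574 calls per second overall
print(round(calls_per_page, 1))                  # 16.7 calls per page load
```

The roughly 16.7 calls per page is consistent with the "as many as 20 calls" per load mentioned earlier.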

Chart 7 - AWS CloudWatch Stats Showing Over 400,000 Calls Per Day
Not only have we been able to leverage the zero CapEx, low OpEx, high availability and scalability of AWS, but we were able to build all this with a very small team. In the fall of 2012 we had a couple of nearly 80 million page view days, and at that time we had fewer than 10 people in the dev, database and operations groups (note: to that point we never had more than 1 DBA and 1 network operations guy). Since I was the operations “group” up until that time, I am blown away that we could build a world-class infrastructure serving at the scale we do with such a small crew. I believe it’s unheard of to build and run a system like ours with only 1 operations guy, and I know it wouldn't have been possible without AWS.
