A week ago today we pushed out some code to our IIS and MS SQL servers that has had a huge impact. In a good way - a very good way.
First, a little background. I work for an online video company that has "widgets" and video "players" that load on thousands of websites millions of times a day. Each time one of these entities loads numerous calls get made to our systems for both static and dynamic content. One of the highest volume services is what we call "player services," where calls are made to display playlists, video meta data, etc. based on our business rules. For example, can a particular video be displayed on this website, etc. Anyway, these player services get hit a lot in a given day, about 350,000,000 times (or more) a day. That's over 1/3 of a billion (yes, billion, with a B) times a day to just this one service (this represents about 1/3 of the total hits to the systems I oversee. You do the math...).
We do have a scalable infrastructure which has steadily been growing over the past few years. In fact, we've grown tremendously as you can see by this Quantcast graph.
|Figure 1 - Traffic growth over the past few years.|
Over the years we've stumbled a few times, but have learned as we've grown. (I remember when we had our first Drudge link & it caused all kinds of mayhem. Recently we had five links on Drudge simultaneously and barely felt it.) We've also had many break-through's that have made things more efficient and/or saved us money. This one did both.
We have fairly predictable daily traffic patterns where weekday's are heavier than weekends, and during the day is heavier than at night. Using a couple of my favorite tools (RRDTool & Cacti) I am able to graph all kinds of useful information over time to know how my systems are performing. The graph in figure 2 shows the number of web requests per second to a particular server. It also shows the relative traffic pattern over the course of a week.
|Figure 2 - Cacti RRDTool graph showing daily traffic pattern over one week for one server.|
I use a great little Cacti plug-in called, "CURL BWTEST Response Time Graph Template" to monitor response times for various calls. As figure 3 shows, our average service response times would climb during peak traffic times each day. I monitor these closely every day and if the response times get too high I can add more servers to reduce the response time to an acceptable level.
|Figure 3 - Cacti CURL graph showing service call response time.|
Here's the exciting part, finally. As you can see in figure 3 we no longer have increasing response times during the day when our load increases. How did we do it, you ask? Well, a couple of really sharp guys on our team, our DBA and a .NET developer, worked on several iterations of changes to both SQL stored procedures and front-end code that now deliver the requested data quicker and more efficiently. Believe it or not this took a little work. Earlier I alluded to stumbling a few times as we've grown, and that happened here. A couple weeks ago we made a few additions to our code and deployed it. The first time we basically took down our whole system for a few minutes until we could back it out. Not only did that scare the crap out of us it made us look bad.
So we went back to the drawing board and worked on the "new and improved" addition to our code. After a couple more false starts we finally cracked this one. Without boring you with the details our DBA made the stored procedures much more efficient and our front-end developer made some tweaks to his code & the combination was dynamite.
As figure 3 shows, we aren't suffering from increased response times for calls under heavier load conditions. This in and of itself is fantastic; after all, quicker responses equals quicker load times for our widgets and players which equals a better user experience, which ultimately should translate to increased revenue. But this isn't the only benefit of this "update." The next improvement is the fact that overall the responses are quicker. Much quicker. As you can see in figure 4 on Sept. 14 (before the code deployment) the average response time (broken out per hour) increased dramatically during peak traffic times and averaged a little over 100 ms for the day. After the deployment, looking at data from Oct. 4, not only did the average response time stay flat during the higher traffic times, but it is down considerably overall. Responses now average 25 ms at all hours of the day, even under heavy load. This is a great double whammy improvement! In fact, Oct. 4 was a much heavier load day than 9/14, so not only were the responses quicker, the server handled way more traffic.
|Figure 4 - Spreadsheet comparing average load time before and after code deployment.|
So far I've discussed the benefits on the front-end web servers, but we've also seen a dramatic improvement on the back-end database servers that service the web servers. Figure 5 shows the CPU utilization over the past four weeks of one of our DB servers and how since the deployment a week ago it has averaged almost half what it did before the deployment. This enabled me to shut down a couple of database servers to save quite a bit of money. I've also been able to shut down several front-end servers.
|Figure 5 - Cacti graph of database server CPU utilization over 4 weeks.|
Due to this upgrade I have been able to shut down a total of 12 servers, saving us over $5000 per month; made the calls return quicker causing our widgets and players to load faster - at all hours of the day; and made our infrastructure even more scalable. Win, win, win!
For more information, and if you're feeling especially geeky see my post, "I LOVE LogParser" for details on how I am able to use Microsoft's logparser to summarize hundreds of millions of IIS log lines each day, and on-demand when needed to keep a good pulse on just what my servers are doing.