Loading

Friday, May 17, 2013

Improving Web Services (Part II)

Last year I wrote about how we improved web server performance with some fairly small changes, and about how I keep an eye on these metrics with reports I create analyzing my web server logs with Microsoft's LogParser. This is a follow-up to those.

Recently we did an upgrade to our platform. One of the "improvements" was our amazing DBA (he is truly amazing!) did was to tighten up some of the SQL stored procedures used for returning dynamic data to our video players (playlists, video meta data, etc.). These "player services" get hit around 300,000,000 to 400,000,000 times per day, so even a small improvement can have far-reaching impact.

As I'm sure is common across regions of the web traffic is lower at certain times. Ours is no different, so I leverage lower CPU load in the middle of the night to crunch my web server logs across my fleet of web servers. As this RRD Tool graph shows CPU load is considerably lower overnight, except when the server is processing its own log file analysis. Which takes about an hour or so on each server. It's also worth noting that average response times are not negatively affected during this time - I know, I keep a close eye on that!


Among the various pieces of data gleaned by this log processing is the time (in milliseconds) each response takes, as recorded by the server. This is very valuable information to me as I can definitively know impacts of various factors; like systems deployments (such as the one that spurred this post...), performance under various load conditions (peak times vs. slow times), performance during operations or maintenance windows (crunching logs, system updates, patches, etc.), and last but not least when people come to me saying anecdotally  "customers are saying our system is slow..." I can show them with absolute certainty, both historically and at any point in time (I have some really good methods of running ad hoc reports to get up-to-the minute stats), how our system is performing or has performed.

So any time we roll out a change of any kind I look at the data to understand the impact(s), if any. After this deployment of the new and improved SQL stored procedures I'm seeing approximately a 30% decrease in response times. That's a huge improvement!


Besides loading faster (at the client side) this is also causing a noticeably lower load on both the front end web servers and database servers. Therefore we have more available capacity or head room with the same number of servers, or I could potentially shut down some of our AWS EC2 servers saving money. Now we have set the bar even higher for performance of our systems, and any future "improvements" or modifications can be accurately measured against this.

I love the fact that I have such good insight into these systems and can measure any impact of changes or varying load with great accuracy!