Last month our finance guy came to me in a bit of a panic to point out that our Amazon Web Services (AWS) bill was way higher than expected – by several thousand dollars. After the initial shock wore off I started digging to figure out just what was going on.
From Amazon’s bill I could easily determine that our bandwidth costs were way up, while other expenses (like EC2) were in line with previous months. But since I have several S3 buckets housing content served through both S3 and CloudFront, and since Amazon doesn’t break the costs down any further than that, I had to figure this one out on my own.
We serve millions of page views per day, and each page view triggers several calls to different parts of our infrastructure. Each call gets logged, which makes for a lot of log files: hundreds of millions of log lines per day, to be exact. But this is where the detail I needed would be found, so I had to dig into the log files.
Because I collect so much log information daily, I haven’t built processes (yet) to get detailed summary data from the logs. I do, however, collect the logs and run some high-level analysis for a few reports, then zip everything up and stash it in a location on S3. I like to hang on to these because you never know when a) I might need them (like now), or b) I’ll get around to doing a deeper analysis on them (which I could really use, especially in light of what I’ve uncovered tracking down this current issue).
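My actual tooling for that zip-and-stash step is 7zip plus S3.exe on Windows, but purely as a sketch of the same idea in Python (assuming boto3 with credentials already configured; the bucket name and local directory below are made up for illustration):

```python
import gzip
import shutil
from pathlib import Path

import boto3  # assumption: boto3 installed, AWS credentials configured

s3 = boto3.client("s3")
ARCHIVE_BUCKET = "my-log-archive"   # hypothetical bucket name
LOCAL_LOG_DIR = Path("logs/raw")    # hypothetical local log directory

for log_file in LOCAL_LOG_DIR.glob("*.log"):
    # Compress each raw log file before shipping it off to S3.
    gz_path = Path(str(log_file) + ".gz")
    with open(log_file, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Store under a prefix so old logs are easy to find later.
    key = f"archived-logs/{gz_path.name}"
    s3.upload_file(str(gz_path), ARCHIVE_BUCKET, key)
```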
I have a couple of under-utilized servers, so I copied a number of the log files from S3 to those servers and went to work analyzing them.
I use a plethora of tools on these log files (generally on Windows), such as S3.exe, 7zip, grep (GNU Win32 grep), and logparser. One day I’m going to write a post detailing my log collection and analysis processes…
I used logparser to calculate the bandwidth served for each content type (css, html, swf, jpg, etc.) from each bucket on a daily basis. My main suspect was image files (mostly jpg), because a) we serve a lot of them every day (100 million plus), and b) they are generally the largest of the content we serve from S3/CloudFront.
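The actual queries ran through logparser, but the gist of the calculation looks roughly like this in Python (a sketch, not my exact query: the field positions assume the standard tab-separated CloudFront access-log layout, and the local path is a placeholder):

```python
import gzip
from collections import defaultdict
from pathlib import Path

# Daily bytes served, broken down by file extension (css, html, swf, jpg, ...).
# Field positions assume the standard CloudFront access-log layout:
# date(0) time(1) edge-location(2) sc-bytes(3) c-ip(4) method(5) host(6) uri-stem(7) ...
bytes_by_day_and_type = defaultdict(int)

for log_file in Path("logs/raw").glob("*.gz"):    # hypothetical local path
    with gzip.open(log_file, "rt") as fh:
        for line in fh:
            if line.startswith("#"):              # skip the #Version / #Fields headers
                continue
            fields = line.rstrip("\n").split("\t")
            day = fields[0]
            sent_bytes = int(fields[3])
            extension = fields[7].rsplit(".", 1)[-1].lower()
            bytes_by_day_and_type[(day, extension)] += sent_bytes

# Print totals in GB, largest first, to spot the content type driving the bill.
for (day, ext), total in sorted(bytes_by_day_and_type.items(),
                                key=lambda kv: kv[1], reverse=True):
    print(f"{day}  .{ext:<5}  {total / 1024**3:8.2f} GB")
```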
Since my log files are so voluminous, it actually took several days to cull enough information from them to get a good picture of what was happening. Even though I’d gotten over the initial shock of the increased Amazon bill, I was a bit shocked again when I saw the numbers. Essentially, overnight the daily bandwidth served from the bucket holding my images (thumbs & stills) jumped to 3-4 times its previous level. This told me either a) we were serving more images daily (but other indicators didn’t point to this), or b) our images were larger starting that day than they had been previously.
Well, that’s exactly what it was: the images were bigger. It turned out a developer, trying to overcome a bottleneck, had stopped “optimizing” our images, and the average file size grew to nearly 4 times what it had been, just about matching the bandwidth increase I’d found. So, in order to save a few minutes per day and a couple of bucks on our encoder, this one change ended up costing us thousands of dollars.
Once I identified the issue, I began working with our developers to fast-track a solution that accomplishes all our goals: save time and money encoding/optimizing the images, and get them as small as possible to cut our Amazon bandwidth bill. I also went to work identifying the images that were too big and optimizing them. In fact, this is an ongoing issue, as our dev team hasn’t quite finished their new image optimization deployment, so a few times a day I grab the unoptimized (i.e. too-big) images, optimize them, and upload them back to S3. This process probably justifies its own post, so I’ll write that soon & link to it from here.
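For the curious, that interim optimize-and-reupload step boils down to something like the sketch below (assuming Pillow for the re-encode and boto3 for the upload; the bucket name, key prefix, quality setting, and local directory are placeholders, not our real setup):

```python
from io import BytesIO
from pathlib import Path

import boto3
from PIL import Image   # Pillow; stands in for whatever encoder/optimizer you prefer

s3 = boto3.client("s3")
IMAGE_BUCKET = "my-image-bucket"   # hypothetical bucket name
JPEG_QUALITY = 80                  # assumption; tune against acceptable visual quality

def optimize_and_upload(local_path: Path, key: str) -> None:
    """Re-encode a JPEG at a lower quality and push it back to S3."""
    with Image.open(local_path) as img:
        buf = BytesIO()
        img.save(buf, format="JPEG", quality=JPEG_QUALITY, optimize=True)
    buf.seek(0)
    s3.upload_fileobj(buf, IMAGE_BUCKET, key,
                      ExtraArgs={"ContentType": "image/jpeg"})

# Sweep a local drop folder of too-big images and push optimized copies to S3.
for path in Path("images/unoptimized").glob("*.jpg"):    # hypothetical directory
    optimize_and_upload(path, f"stills/{path.name}")      # hypothetical key prefix
```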