What Really Goes On at a Startup During an Outage
As 2016 kicks off, we’re stoked to see our #TiltTour come to life. We partnered with The Chainsmokers to make #TiltTour happen: it’s the first-ever fan-sourced tour, letting fans set the tour itinerary. In November, college students on campuses across the nation competed for the chance to bring the EDM duo to their city for a January/February college campus tour. The first 6 campuses to pre-sell at least 800 tickets won a stop on #TiltTour.
The results were incredible: all 6 shows sold out in under 2 hours, with each school selling 800+ tickets at $25 apiece, before any dates or venues had even been confirmed.
We were blown away by how our platform powered this viral event and by how college communities were able to mobilize and make something happen on Tilt. In fact, students were eager to create the next movement; the following day, UC Davis students set up a Facebook page to bring Drake to campus and more than 2,500 people joined the group.
But with all this success, there had to be some behind-the-scenes chaos. When #TiltTour went live the morning of November 3, our traffic immediately spiked and our site quickly went down. We’d love to tell you our site and mobile app were down for an hour, maybe even 2-3. But nope, we were down for a full 8 hours.
What really happens behind closed doors when your favorite site or app goes down, you ask?
To sum it up in the most technical terms possible, it looked something like this: [a series of reaction GIFs, escalating from mild alarm to full-blown panic].
Catching The Outage(s)
Within a few minutes of #TiltTour launching, several engineers noticed our site was largely down and spewing 500 errors. We quickly saw that an engineer had deployed new API code just prior to the 500s and thought the deploy must have had something bad in it. We moved fast to do a rollback deploy, thinking this would solve the problem, but the 500s kept spewing and our site was still non-responsive.
Our initial data showed a large inbound request spike and high load on the database. We analyzed some slow queries and found a few very common and very expensive ones – namely our trending campaigns and our global activity feed. We updated the site to return cached data for these, which cut response times in half, but they were still very high and we were still seeing a very large number of 500 errors.
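Tilt hasn’t published their caching code, but the idea – serving slightly stale results for hot, expensive queries instead of hitting the database on every request – can be sketched like this (all names here are hypothetical):

```python
import time

class TTLCache:
    """Serve cached results for expensive queries, recomputing at most once per TTL."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[0] > now:
            return entry[1]           # cache hit: skip the expensive query
        value = compute()             # cache miss: run the query once
        self._store[key] = (now + self.ttl, value)
        return value

def expensive_trending_query():
    # Stand-in for the slow "trending campaigns" database query.
    return ["campaign-1", "campaign-2"]

# Hypothetical wiring for a hot endpoint: under a traffic spike, the
# database sees one query per TTL window instead of one per request.
cache = TTLCache(ttl_seconds=30)

def trending_campaigns():
    return cache.get_or_compute("trending", expensive_trending_query)
```

Even a short TTL (seconds) collapses thousands of identical queries into one, which is usually an acceptable trade-off for feeds like "trending" that don’t need to be real-time.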
High Load Surge: The War Room
Pretty soon, we were all over it. We began investigating several other possibilities, and along the way noticed our Auto Scaling trigger had tried to spin up new API nodes right around the time the outage began, keying us into the large traffic spike that hit just as the site began experiencing issues. We saw that the new Auto Scale boxes had provisioned badly due to recent changes in our provisioning, but assumed they could not be in the Elastic Load Balancer (ELB) rotation since the applications weren’t running on them. Bad assumption!
At this point we started digging for the root cause of the problem, still thinking it might be a simple issue.
We ended up spending quite a bit of time investigating other performance bottlenecks in our code, and even pushing out improvements here and there, while also manually spinning up more nodes in our Auto Scale group. We spun up more database read slaves to ease the burden on the existing ones, which were under high load.
We eventually decided to verify our health checks, and which nodes were actually in rotation, only to find that the ‘bad’ nodes were still in rotation on the ELB, even though they had no applications running on them! We quickly destroyed those nodes and turned off the ELB health check so we could manually control which servers were in rotation. Almost immediately the 500s dropped and our application responses returned to a stable state.
While recapping and cleaning up from this initial outage, we started to see 500s spewing again. We immediately checked the ELB health checks, verified only good boxes were in rotation, and began looking for other sources of error. We noticed our HTTP worker threads were all being consumed, so we figured this was a case of too much load and began scaling up manually again (auto scaling had been turned off by this point) to add more request workers. This seemed to work at first, but new workers would serve requests for a short time, then hang and join the pool of hung workers. If we restarted the workers, they would accept requests for a bit, then hang again.
A few of our engineers began to trace some of our workers to get more insight into what they were hanging on. This investigation led us astray for a while, but after some trial and error it ultimately clued us into the real problem: the workers would eventually hang perpetually on a database query. Our API workers have a pool of database read slaves they can randomly choose from for read queries, and our new hunch was that one of the new read slaves was in a bad state. Sure enough, one of the read slaves we had provisioned earlier during the first outage had a bad disk, and as soon as a worker randomly chose to query that node, it would hold onto the connection forever. We immediately updated the configs to take that database slave out of the pool, and all our workers came roaring back. By this point we had more workers than we actually needed, so the 500s quickly disappeared and service returned to normal.
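The failure mode above – one bad node in a randomly-chosen pool silently eating workers – is a classic argument for making the pool itself track node health. A minimal sketch (hypothetical design, not Tilt’s actual pool) that evicts a slave after repeated failures such as query timeouts:

```python
import random

class ReadSlavePool:
    """Pick a random healthy read slave; evict slaves whose queries keep failing."""

    def __init__(self, slaves, max_failures=3):
        # name -> consecutive failure count
        self.slaves = {name: 0 for name in slaves}
        self.max_failures = max_failures

    def choose(self):
        healthy = [s for s, fails in self.slaves.items() if fails < self.max_failures]
        if not healthy:
            raise RuntimeError("no healthy read slaves available")
        return random.choice(healthy)

    def report_failure(self, slave):
        # A timed-out or hung query counts against the slave; after
        # max_failures it drops out of rotation until reset.
        self.slaves[slave] += 1

    def report_success(self, slave):
        self.slaves[slave] = 0
```

The other half of the fix is a hard timeout on every query: a worker that gives up after a few seconds and calls `report_failure` never joins a pool of permanently hung workers, no matter how sick one slave’s disk is.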
There were really two outages, one caused by a bad autoscaling configuration and another caused by a bad disk on a new server. The massive load from #TiltTour exposed problems in our ability to scale and monitor our systems effectively.
After the outages subsided, we huddled up to figure out WTF happened. The load graphs showed 10x normal traffic when things started to go down, but we didn’t think we were that under-provisioned. We started digging into various timing data and web requests and found a big problem: our request logging happened synchronously as part of the request middleware. The logging used multiple case-insensitive regular expressions to scrub data, adding up to 300ms of overhead on some commonly used routes, which really hindered our ability to scale our systems effectively.
We also discovered our in-house URL shortener was causing a ton of problems, with multi-second response times on individual requests, due to unexpected (and under-monitored) spikes in the number of requests hitting it.
Learning From Our Mistakes
Small changes, when under load, can have a big impact. The regular-expression change (especially the case-insensitive patterns) really hurt us, but we weren’t monitoring response-time changes at that level of granularity.
Assumptions can be risky during an outage. Don’t assume something is working as expected: verify. We should have verified whether our failed Auto Scale boxes were, in fact, out of rotation, and we could have uncovered the real issue much sooner. We assumed our ELB was doing the right thing, but it was not behaving as we expected.
Expensive routes/queries need to be dealt with ASAP. Our commonly used routes should have been cached beforehand.
Autoscale automation needs to be exercised regularly. Our autoscale configuration would have largely mitigated this traffic spike, but a high provisioning failure rate bit us.
Communicate with other teams. We needed better communication with the rest of the organization to anticipate the traffic spike. We knew it was happening but didn’t realize the rush it would cause.
Bake in the right instrumentation and alerting up front. It took too long for us to find each individual problem because we needed much better and more granular instrumentation. In a high load scenario, more than one thing falls over so you need to find things quickly. We would squash one issue and then have to dig into another.
Validate your health checks before putting them into production, even when using a third party (AWS/ELB). Make sure those third parties do what you expect, and actually know when your service is healthy, vs. when it is not.
Know your weakest points! It’s important to have visibility into your slowest routes: which routes or operations could bring your service or application to its knees under a surge? These are insights you can get from tools like New Relic, StatsD, or even simple timing logs. Keep tabs on them, and add caching where possible. Where caching isn’t possible (or easy), set threshold alerts so you’re notified early of spikes in these operations.
Have different teams investigating different possibilities. Once you realize the outage/issue is bigger than you thought, and you aren’t sure of the root cause, it is good to have different groups investigate different possibilities. A few times during the “war room,” we had too many people all focused on the same investigation, rather than splitting up groups to investigate different possibilities.
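The health-check lesson above can be made concrete. A "deep" health check verifies the application can actually serve a request – app process up, database reachable – rather than just that a port accepts connections, so a box with no application on it never stays in rotation. A sketch of the idea (the individual checks are hypothetical; Tilt hasn’t published theirs):

```python
def deep_health_check(checks):
    """Run each named dependency check; healthy only if every one passes.

    A load balancer polling an endpoint backed by this function pulls a
    node out of rotation when any real dependency fails, instead of only
    when TCP connects fail.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()                      # each check raises on failure
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return healthy, results

# Hypothetical wiring:
def check_database():
    pass  # e.g. run "SELECT 1" against a read slave

def check_app_process():
    pass  # e.g. hit the API worker on localhost

healthy, detail = deep_health_check({
    "database": check_database,
    "app": check_app_process,
})
```

Validating this in staging – killing the app process and confirming the ELB actually marks the node unhealthy – is the "make sure third parties do what you expect" step from the lesson above.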
Cleaning Up For The Future
Since the outage, we’ve done a lot of work internally to improve our systems. We added instrumentation and scalability metrics around every system, and we now load test all pieces of it. We need to easily handle 10x our highest spike because we can’t predict the organic virality of Tilts. This has become part of our broader ‘X-Ray’ initiative to get better visibility into our system services across quality, scalability, and information availability. We’ve also designated clear owners and instructions in case we experience any major outages in the future.
After spending about 15 hours in a ‘War Room’ on all of this, it’s safe to say we learned A LOT! Feel free to tweet us with questions @TiltEng. We can also tell you how we handled this across the rest of the organization if you’re interested.
Would love to hear your feedback in the comments.