
Tuesday, January 29, 2019

Rolling Out a New Version of Your Website

"Tricks I Learned at Apple: Steve Jobs Load Testing" is an excellent precursor to this post.

When launching a completely new version of a website, it's best to have both a rollout plan and a rollback plan. Very few brand-new websites will have the problems that HealthCare.gov had in 2013, because new websites typically start with zero traffic. HealthCare.gov was a unique case since it went from zero to millions of users overnight.

Typically, as a website grows, servers will be added and optimized to handle the additional traffic. But if growth happens too quickly, the company can stop new users from creating accounts while it manages its growth and scales up its infrastructure. Facebook was able to manage its growth by rolling out across college campuses, one at a time, whereas Twitter had no way to control its growth since it was open to the public, resulting in the Fail Whale. Again, these are rare cases; the typical problems with websites occur when rolling out a major update.


Rolling out the New Website Version

While growing from zero to millions of users is a high-quality problem, it's actually very rare. A more likely problem arises when an entirely new version of a website is rolled out, since the new version will probably have critical bugs or scaling issues.

When I worked at Apple and Wyndham, we had to handle both bugs and scaling issues. At Apple, we switched from an RDBMS to in-memory caches for read-only data. At Wyndham, we had to roll out more than a dozen different websites at once for brands like Days Inn, Ramada, Howard Johnson's, Super 8, Hawthorn Suites, etc.


Managing Risk

Initially, Wyndham wanted to switch from the old website to the new one all at once. My boss, who's a particularly sharp guy, had enough experience to immediately recognize the risk of doing this. Specifically, what if the new website was broken, with bugs that prevented customers from booking rooms? Instead, he suggested a very simple plan: rather than making the switch overnight, we'd keep the old version of the website running while rolling out the new one over the course of a week or so.

Since both the old and new versions of the website talked to the same database, it was a simple process, at a high level. We'd have an all-hands meeting on Monday morning in our war room (a dedicated conference room). During Monday's meeting, all of the departments (marketing, product management, development, and QA) would give a thumbs up to move forward. Then, we'd have our load balancers begin to randomly send 1% of the traffic to the new version of our website. We'd place a cookie on the customer's browser so that, if they came back later, they'd automatically be directed to the new version of the website; otherwise, they'd end up on the old version.
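
At a high level, the pinning logic looked something like the Python sketch below. To be clear, this is an illustration, not our actual load balancer configuration; the cookie name and the helper function are my inventions:

    import random

    # Dialed up each morning after the go/no-go meeting: 1, 3-5, 10, 50, 100.
    ROLLOUT_PERCENT = 1

    def choose_version(cookies):
        """Pick 'new' or 'old' for a request, keeping returning visitors sticky.

        Returns (version, cookie_to_set); cookie_to_set is None when the
        visitor already carries an assignment from an earlier visit.
        """
        version = cookies.get("site_version")  # hypothetical cookie name
        if version in ("new", "old"):
            return version, None  # returning visitor keeps their assignment
        # First-time visitor: assign randomly at the current rollout percentage.
        version = "new" if random.uniform(0, 100) < ROLLOUT_PERCENT else "old"
        return version, ("site_version", version)

The cookie is what makes the percentage safe to dial up gradually: a customer never flip-flops between versions from visit to visit, and rolling back is as simple as setting the percentage to zero.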


Staging the Rollout 

Just before the close of business on Monday, we'd meet again to confirm that everything was running as expected. On Tuesday morning, we'd meet and give a thumbs up to increase the traffic to the new website, and so on through the week. It looked like this:
Monday: 1%
Tuesday: 3%–5% (based on Monday's performance)
Wednesday: 10%
Thursday: 50%
Friday: 100%

The beauty of starting at 1% and then 3%–5% is that those percentages cap the revenue you'll risk losing (in theory) if something goes wrong.
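
To make that concrete with a made-up number (the dollar figure below is purely illustrative), here's the theoretical worst-case exposure at each stage:

    # Hypothetical daily booking revenue, invented for illustration.
    DAILY_REVENUE = 1_000_000

    # Tuesday uses the upper end of the 3%-5% range.
    for day, pct in [("Mon", 1), ("Tue", 5), ("Wed", 10), ("Thu", 50), ("Fri", 100)]:
        print(f"{day}: at most ${DAILY_REVENUE * pct / 100:,.0f}/day at risk")
    # Mon: at most $10,000/day at risk
    # Tue: at most $50,000/day at risk
    # ...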

By using this week-long rollout process, we all kept our jobs. I only recall one time, when there was a major bug, that we had to stop after the first day or two, and it wasn't a big deal; we simply sent all traffic back to the old website while the new one was fixed, and we got it right on our next rollout.

Thursday, April 21, 2016

Celebrity Server Overload

On June 25, 2009, I listened to Guy Kawasaki speak in San Diego. About halfway through his presentation of social media demos, he gave a shout-out to the audience of 500 about me and my company, Adjix. Everyone seated at my table turned and looked at me: "Who's this guy?" I was feeling great after leaving that breakfast presentation until I got home and learned that Michael Jackson had died. I wasn't a big fan of MJ, but his music is... powerful art.

What quickly got my attention was that a customer had used Adjix to link to the news of MJ's death, creating a huge load on the Adjix app servers. The web and database servers were humming along without a problem, but the apps were bottlenecked by REST calls across the Internet. With the CPU cores pegged at 100%, I began manually spinning up more app instances to balance the celebrity server overload – which lived up to my expectations.
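
What I was doing by hand that day amounts to a crude autoscaler. A minimal Python sketch of the idea follows; psutil's cpu_percent() is a real call, but launch_app_instance() is a hypothetical stand-in for whatever your hosting provider's API actually offers:

    import time

    import psutil  # third-party: pip install psutil

    CPU_THRESHOLD = 85   # scale out when CPU stays above this percentage
    CHECK_INTERVAL = 60  # seconds between checks

    def launch_app_instance():
        """Hypothetical stand-in for a provider's 'start another server' call."""
        print("spinning up another app instance...")

    while True:
        # Average CPU across all cores over a 5-second sample window.
        if psutil.cpu_percent(interval=5) >= CPU_THRESHOLD:
            launch_app_instance()
        time.sleep(CHECK_INTERVAL)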

This morning, Prince died. Prince and the Revolution were my first rock concert when I was a kid. Prince wasn't supposed to be my first... Styx was... but Tommy Shaw hurt his hand, as reported in the news, and the Styx concert was cancelled (not postponed).

To confirm the news of Prince's death, I went to TMZ.com, but their servers were down: "503 Service Unavailable." That HTTP status code simply means, "No more! Uncle! I'm temporarily overloaded."
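
For the curious, here's roughly what saying "Uncle" looks like from the server side, using only Python's standard library; a well-behaved overloaded server returns 503 along with a Retry-After header hinting when to come back:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class OverloadedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # "No more! Uncle!" -- temporarily overloaded, try again later.
            self.send_response(503, "Service Unavailable")
            self.send_header("Retry-After", "120")  # come back in two minutes
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Temporarily overloaded. Please try again.\n")

    if __name__ == "__main__":
        HTTPServer(("", 8080), OverloadedHandler).serve_forever()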

After giving TMZ.com a little time, their servers were handling requests again. "Damn it. Prince is dead. And he's young, too young to die this soon."

1984 and Purple Rain had a powerful impact on me. Prince was a key soundtrack to my youth. ❖

Dearly beloved
We are gathered here today
2 get through this thing called life

Electric word life
It means forever and that's a mighty long time
But I'm here 2 tell u
There's something else
The afterworld

A world of never ending happiness
U can always see the sun, day or night

So when u call up that shrink in Beverly Hills
U know the one - Dr Everything'll Be Alright
Instead of asking him how much of your time is left
Ask him how much of your mind, baby

'Cuz in this life
Things are much harder than in the afterworld
In this life
You're on your own

And if de-elevator tries 2 bring u down
Go crazy - punch a higher floor

Wednesday, December 4, 2013

Scaling Obamacare

From a technical standpoint, there's no way that HealthCare.gov (Obamacare) could have rolled out successfully.

Scaling a website of that magnitude, with millions of users from the get-go, is unheard of. Even if Google, Facebook, and Twitter engineers had developed it, it would still have failed to launch smoothly.

Managing Growth

Generally, rolling out large-scale websites is done in stages. When Gmail launched in 2004, it required an invitation so Google could control its growth. Facebook originally rolled out only to college students at select universities. Twitter, on the other hand, had no way to control its growth, and it frequently went down (AKA the Fail Whale).


Since the world's best web engineers wouldn't have been able to launch a website like Obamacare smoothly, what made government contractors think they could do it?