Wednesday 2: Stuart Broz, "Making Drupal scale" (Trellon sponsored session)

Earth Day, biggest secular holiday

Collecting pledges for responsible acts - "A Billion Acts of Green" - from authenticated users. Slightly scary: the pledge counts are uncacheable numbers.

Events page - Apache Solr as the backend, with geo IP lookup.

Only a fair way in - two weeks before the day - did we realise the site architecture wasn't right:
A single DB server
One web server instance running Pressflow
And a lot of hopes and dreams
Expected traffic was 25 million hits before 10am EST - in 2009 that's when the site went down!
We didn't expect to be involved in performance work
But our audits showed the site was going to go down

New setup
New hardware
Varnish in front, multiple Apache servers serving a single Drupal instance, and a master/many-slave MySQL setup. Repackaged core and contrib to read data off the slaves.
Would have loved to use something like Cassandra, but didn't have time.
Master performance is the bottleneck.
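As a rough illustration of what "read data off slaves" means in Drupal 6 terms - a minimal sketch, not the actual Trellon patch; the wrapper name and the slave connection keys ('slave1' etc., defined in settings.php) are assumptions:

    <?php
    // Route reads to a randomly chosen slave and keep all writes on the
    // master ('default'). Hypothetical wrapper; the real change was made
    // inside the repackaged core/contrib query calls.
    function example_query($sql) {
      $args = func_get_args();
      array_shift($args);

      if (preg_match('/^\s*(INSERT|UPDATE|DELETE|REPLACE|ALTER|LOCK)/i', $sql)) {
        db_set_active('default');               // writes must hit the master
      }
      else {
        $slaves = array('slave1', 'slave2', 'slave3');
        db_set_active($slaves[array_rand($slaves)]);  // spread the reads
      }

      $result = call_user_func_array('db_query', array_merge(array($sql), $args));
      db_set_active('default');                  // switch back afterwards
      return $result;
    }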
Never switch on in a high-performance site:
The Statistics module - a DB write on every page view, which locks the DB table (see the sketch below).
The Solr module queries the entire node table - half a second to a full second each time - we had to hack around it.
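For context, this is roughly what the Drupal 6 Statistics module does on every node page view (paraphrased from memory, not copied from core):

    <?php
    // One UPDATE per page view. On a MyISAM {node_counter} table this takes
    // a table lock, so concurrent page views queue up behind each other -
    // exactly what you cannot afford at Earth Day traffic levels.
    function example_statistics_exit() {
      if (arg(0) == 'node' && is_numeric(arg(1))) {
        db_query('UPDATE {node_counter} SET daycount = daycount + 1, '
          . 'totalcount = totalcount + 1, timestamp = %d WHERE nid = %d',
          time(), arg(1));
      }
    }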

Surprises
Pledge widget featured on other sites
Web services for an iPhone application
Featured on the App Store before Earth Day
Every one of these requests had to be authenticated
Facebook app
... and one more thing: Google featured Earth Day on its homepage
NZ staff saw the traffic really early on (their day starts first): an oh-shit moment.

The DB layer in D6 is just not up to this:
When data doesn't reach your slaves in time
you get stale data
e.g. new user records not yet replicated
In the end these were edge cases
Patched modules and used the CDN module - which requires core to be patched
Lots of caching, using memcache (see the settings sketch below)
Watched the MySQL slow query log
Just don't trust contrib
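A Drupal 6 memcache setup in settings.php looks roughly like this (the module path and server addresses are placeholders):

    <?php
    // Route Drupal's cache API through memcached instead of the {cache_*}
    // tables in MySQL, taking a large chunk of read/write load off the master.
    $conf['cache_inc'] = './sites/all/modules/memcache/memcache.inc';

    // A shared pool of memcached instances used by every web node.
    $conf['memcache_servers'] = array(
      '10.0.0.11:11211' => 'default',
      '10.0.0.12:11211' => 'default',
    );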

On the day
We were worried
This was a site we ourselves didn't build
Emergency measures - drop authenticated users? Sadly that would have crippled the site
Chris and Carl sat in the coder lounge, switching things over piecemeal to memcache, slowly improving
Issue with Varnish - by checking syslog we discovered the caches were rebuilding every four seconds, which removes any value from Varnish. Check your caching mechanism over long periods.

Site stayed up
80 million page views in 48hrs
Average page load below 4 seconds - it was loading in over 20 seconds when we got it!
At the time one of the highest ever Drupal traffic densities.

"MySQL proxy?"

We didn't add much more tech to the stack, and not this either - we didn't have time to investigate many possibilities. It could have provided failover, but also more complexity.

"Hack core to do slave queries?"

Pressflow couldn't do this the way we needed: it chooses a random server, and it includes the master along with the slaves. The master was REALLY sensitive to I/O load and we didn't want to kill it, so we removed all reads from it.
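In Drupal 6 terms the connection pool is just named entries in settings.php (the hostnames and credentials below are placeholders); the fix was to make sure routed reads only ever use the slave entries, never 'default':

    <?php
    // settings.php: one master ('default') plus the slaves that the
    // read-routing code selects with db_set_active().
    $db_url = array(
      'default' => 'mysqli://drupal:secret@db-master/earthday',
      'slave1'  => 'mysqli://drupal:secret@db-slave1/earthday',
      'slave2'  => 'mysqli://drupal:secret@db-slave2/earthday',
    );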

"Varnish problem?"

A module was running cache_clear_all().
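The offending pattern was something like this - a reconstruction, not the actual module code:

    <?php
    // An unconditional cache wipe in a hook that runs constantly. In Drupal 6,
    // cache_clear_all() with no arguments flushes the expirable page and block
    // cache entries, so cached pages are thrown away and rebuilt over and over -
    // here roughly every four seconds, which removes most of the value of
    // having Varnish in front.
    function badmodule_exit() {
      cache_clear_all();   // never do this on every request
    }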

"Timescale to look at architecture?"

We got the site in March for very specific work. It was basically the week of DrupalCon when we realised, while testing our own web services under the expected load. The programmers were notified the day they were leaving.

"Superbowl day - ended up second hit for superbowl video on Google. Didn't even have any superbowl video. Varnish has one thread per conn, if you have more conn than N threads, it will queue up to N more, and then starts referring 503s."

We ran into the same problem.

"nginx zone limit - we'll allow at most 500 concurrent connections per host header. Some were getting 503s but at least Varnish didn't shut down."

Shared files are on an NFS share, which has built-in caching - once a file has been pulled it's cached on each web node, and nothing changes on the server.

Deploying the CDN takes a lot of load off Varnish.

Wishlist
CDN support in D7
Better MySQL replication
Better caching strategies
DBTNG makes horizontal scale-out much easier (see the sketch below)
Field API storage in MongoDB or Cassandra
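For example, D7's DBTNG layer lets you declare slave targets in settings.php and route individual queries to them - a minimal sketch with made-up hostnames and credentials:

    <?php
    // settings.php: master plus a slave target on the 'default' connection.
    $databases['default']['default'] = array(
      'driver' => 'mysql', 'database' => 'earthday',
      'username' => 'drupal', 'password' => 'secret', 'host' => 'db-master',
    );
    $databases['default']['slave'][] = array(
      'driver' => 'mysql', 'database' => 'earthday',
      'username' => 'drupal', 'password' => 'secret', 'host' => 'db-slave1',
    );

    // Module code: ask for the slave target; D7 falls back to the master
    // if no slave is defined.
    $nids = db_query('SELECT nid FROM {node} WHERE status = 1',
      array(), array('target' => 'slave'))->fetchCol();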

"benchmark, stress testing?"

Not really: we benchmarked for the expected concurrency using ab and two EC2 nodes, and used Munin to monitor.

"Soasta will do this for you. High-end solution. Give them a testing plan and they'll spin up."