Zotero and AWS, or How We Learned to Stop Worrying and Love the Cloud (Part 1 of 2)

Zotero’s server infrastructure has evolved in countless ways since the project’s 2006 launch, but most of those changes are super boring and not worth remembering. Over the past two months, however, we moved the bulk of Zotero’s back end to Amazon Web Services, a step that I believe is uniquely noteworthy in the context of digital humanities projects and their long-term sustainability. In this post I describe the recent changes to Zotero’s architecture. In the next post I’ll discuss why these changes are important for the digital humanities. This story is long, but it has a moral, and also a van.

But First, a Little History
Initially operating on physical servers purchased by the Center for History and New Media and hosted by George Mason University, the 2006 version of zotero.org offered support forums, a blog, a documentation wiki, and developer tools including project management, bug tracking, and versioning.

Oh, that beta badge is so 2006! I do miss those sweet, sweet rounded tabs, though.

Many of these tools were cobbled together in just a few hours before our public launch.1 Although zotero.org was at that point one of the most complex websites CHNM had launched, the services behind it didn’t yet really constitute a web application.

Over the next few years we added library syncing (2008), web-based library browsing (2008), and group libraries (2009). These services required the development of a cluster of data servers, which in turn queried a massive and rapidly growing database containing Zotero users’ data. The zotero.org website was just one consumer of these servers, which also provided synchronization functionality to Zotero clients and served the API that we opened to the public in 2009 so that developers could create their own applications using Zotero data.

Yes, at CHNM last year’s April Fools’ joke is always this year’s generously funded project.

Every time a Zotero client synced data, a web browser accessed a Zotero library, or a third-party application used our API, our data servers fielded the request. (By then those servers lived offsite at a commercial data center, because we had outgrown the capacity and quality offered on campus.) By late 2010, we were getting absolutely clobbered: our database usage averaged over 3500 queries per second, and wait times for large sync operations had begun to reach unacceptable levels.

We had no choice but to modify our infrastructure in a way that would not only better handle existing usage but also allow for continued rapid growth. We would need to rework some server code. We would need to purchase more rack space from our commercial data center. We were going to need a lot more hardware. This was not going to be cheap.

Testing the Waters
One possibility that we kicked around was moving part of Zotero’s architecture to Amazon’s Relational Database Service (RDS), which had launched in 2009 and is effectively just cloud-based MySQL. Our idea was that we could keep our cluster of data servers in house, but the data they accessed would now live remotely on RDS, which would save us the trouble of having to buy and install a bunch of new machines to host and serve databases. We actually had good precedent for working with Amazon. When we sketched out how to allow users to sync their PDFs and other files in 2008, we opted to go with Amazon’s cloud storage rather than our own servers because we couldn’t predict how much storage we would actually need.2 More important, we didn’t want to incur the overhead of managing the hardware associated with those storage needs. The benefits have been substantial. Our storage usage has grown dramatically, but we have never needed to worry about purchasing new hard drives, disk arrays, or rack space.

Dan Stillman conducts an early RDS sharding trial, December 2010. Credit: AFP

The new distributed architecture created by our development team made a test migration of a small chunk of our database trivial. In less than an hour, we created an RDS instance at Amazon and moved some test libraries3 to a database shard hosted there (“shard” just being a fancy way to say a partition of the overall database). These libraries synced and worked just as if they continued to operate on our own database server, but now they were just across town in Amazon’s Northern Virginia data center. Unfortunately, despite this proximity, we found in our testing that there was just too much latency sending requests back and forth. The user experience actually worsened slightly.
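To make the idea of a shard concrete, here is a minimal sketch in Python of how a lookup table might route each library to the database server that holds it. This is an illustration only, not Zotero's actual code: the class names, library IDs, and hostnames are all hypothetical. The point it shows is that once such a directory exists, pointing a test library at a shard on a different host is a one-line change.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One partition of the overall database (hypothetical model)."""
    shard_id: int
    host: str  # placeholder hostname, not a real endpoint


@dataclass
class ShardDirectory:
    """Records which shard currently holds each library's data."""
    shards: dict = field(default_factory=dict)            # shard_id -> Shard
    library_to_shard: dict = field(default_factory=dict)  # library_id -> shard_id

    def add_shard(self, shard):
        self.shards[shard.shard_id] = shard

    def assign(self, library_id, shard_id):
        # Refuse to point a library at a shard we do not know about.
        if shard_id not in self.shards:
            raise KeyError(f"unknown shard {shard_id}")
        self.library_to_shard[library_id] = shard_id

    def host_for(self, library_id):
        # Every query for a library is sent to the host of its shard.
        return self.shards[self.library_to_shard[library_id]].host


directory = ShardDirectory()
directory.add_shard(Shard(1, "db1.local.example"))   # assumed in-house server
directory.add_shard(Shard(2, "shard2.rds.example"))  # assumed remote instance

directory.assign(library_id=101, shard_id=1)
directory.assign(library_id=102, shard_id=1)

# Re-pointing a test library at the remote shard is a single update
# (copying the underlying rows is a separate step, omitted here).
directory.assign(library_id=102, shard_id=2)
```

In a sketch like this the application code never needs to know where a shard physically lives, which is also why latency between the data servers and a remote shard becomes the deciding factor.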

Damn the Torpedoes!
Despite this disappointing result, the ease of spinning up virtual database servers really excited our team, and not just because we are huge nerds. Indeed, even though our webmaster has a really big van, it was now impossible to imagine actually purchasing more machines and driving them over to the data center. So instead we decided to virtualize the rest of the application server architecture and move the API and data servers to Amazon, not just the databases behind them. Here we turned to Amazon’s EC2 service, launched in beta in 2006 but only in full production since 2008. EC2 basically lets you run fully functional Linux servers in a totally virtualized environment. You can clone them, start and stop them with a single click, add persistent storage, whatever. Over the 2010 year-end holidays, our developers worked furiously to prepare for this far more ambitious move, and by this past January we were ready to begin migrating our site from our own servers to Amazon’s.

It turns out that our users have a lot of data.

Step one was simply getting a mirror image of the existing server architecture running at Amazon. This process was relatively straightforward, since we already had a scalable system in place. All we really lacked was the ability to ramp up capacity rapidly. Step two involved distributing hundreds of thousands of Zotero libraries among a vastly larger army of database shards. With individual Zotero libraries containing up to an astounding 100,000 items, this process required two full weeks of high-end EC2 instances running full bore. Yet because we only moved idle libraries, users experienced little or no disruption. Wait times for syncing are now basically down to zero, and the website is far more responsive than it used to be.
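For the curious, step two can be pictured roughly as follows. This Python sketch is not our migration code; the function name, the one-hour idle threshold, and the "send each library to the least-loaded shard" rule are all assumptions made for illustration. It shows only the two ideas described above: skip any library that is currently in use, and spread the rest across many shards.

```python
import heapq
from datetime import datetime, timedelta


def plan_migration(libraries, shard_ids, now, idle_for=timedelta(hours=1)):
    """Decide where each idle library should go.

    libraries: list of (library_id, item_count, last_activity) tuples.
    Returns (plan, deferred): plan maps library_id -> shard_id, and
    deferred lists libraries that were active too recently to move.
    The idle threshold is an assumed value, not a documented one.
    """
    # Min-heap of (items placed so far, shard_id): the least-loaded shard pops first.
    heap = [(0, shard_id) for shard_id in shard_ids]
    heapq.heapify(heap)

    plan, deferred = {}, []
    # Place the largest libraries first so the shards end up more evenly loaded.
    for library_id, item_count, last_activity in sorted(
        libraries, key=lambda lib: lib[1], reverse=True
    ):
        if now - last_activity < idle_for:
            deferred.append(library_id)  # in use: try again on a later pass
            continue
        load, shard_id = heapq.heappop(heap)
        plan[library_id] = shard_id
        heapq.heappush(heap, (load + item_count, shard_id))
    return plan, deferred


now = datetime(2011, 1, 15, 12, 0)
libraries = [
    (1, 100_000, now - timedelta(days=3)),
    (2, 500, now - timedelta(minutes=5)),   # recently active, so it is skipped
    (3, 40_000, now - timedelta(hours=2)),
    (4, 30_000, now - timedelta(days=1)),
]
plan, deferred = plan_migration(libraries, shard_ids=[10, 11], now=now)
```

Deferred libraries would simply be picked up on a later pass once they go quiet, which is how a move like this can run for a long time without users noticing.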

So that’s what has been happening behind the scenes at Zotero over the past few months. In the next post, I’ll give my thoughts on why AWS should be an essential part of digital humanities projects more generally.

  1. And it showed, we know!
  2. As it turns out, a lot.
  3. That would be me and Faolan.