Thursday, February 21, 2008

Virtual Computing with Amazon Web Services

Amazon offers to store your data, run your servers and coordinate all the pieces to make an infinitely scalable web-based service, which sounds so unlikely to work that I've been happy to ignore it. But Amazon's Mike Culver passed through Wellington last night on his worldwide evangelical tour, and I have seen the light.

They have some experience at running data centres and serving up data by now, not to mention fairly hefty economies of scale, so it's not unreasonable to consider Amazon as an alternative or complementary IT provider. They offer a suite of services, of course: somewhere to store data, as many virtual servers as you require, and additional things like queues and databases to tie it all together. It's all protected by industrial-strength cryptography, so only you control your servers and you decide which parts of your data the public can access.

Mike Culver demonstrated the virtual servers, or EC2 (Elastic Compute Cloud) service. These aren't just virtual web servers like your webhost might provide; these are true virtual servers that you build and configure from the ground up, to do whatever you like, just as though they were sitting under your desk. You could configure two of them to play chess against each other all day if you wished. Or something rather more useful to you, since each will be costing you 10 cents (US) per hour. Mike described how the New York Times fired up a hundred of them to convert all their back issues to PDFs in 24 hours, for $240. You can have as much compute power as you want, and pay for just the period you want it.
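The arithmetic behind that New York Times figure is worth making explicit. A small sketch, assuming the 2008-era on-demand price of US$0.10 per instance-hour (actual pricing varies by instance type and has changed since):

```python
# Back-of-envelope cost for the New York Times PDF-conversion job.
# The US$0.10/instance-hour rate is the 2008 small-instance price
# quoted in the post, not a current figure.
INSTANCES = 100
HOURS = 24
RATE_PER_INSTANCE_HOUR = 0.10  # USD

total_cost = INSTANCES * HOURS * RATE_PER_INSTANCE_HOUR
print(f"Total: ${total_cost:.2f}")  # $240.00
```

The same sum also shows why the model is attractive: halving the deadline by doubling the fleet costs exactly the same.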

The way all this works under the hood is pure genius. To fire up an EC2 server you specify a disk image (an Amazon Machine Image, or AMI) for that server, which you can select from a list and/or customise yourself. The image includes the operating system, additional software and configuration specific to your needs. You build it so that when the server starts up it knows where to look for its work and what to do, since it has no persistent disk of its own.

Mike's example was a website offering a video conversion service, using Amazon's S3 (Simple Storage Service) and SQS (Simple Queue Service) as well as EC2. The storage service holds the videos uploaded by users, and the queue service records the requests for conversion. The EC2 server image is built so that on startup it looks in the queue for videos to convert, converts them and saves the results back to S3. Notice how this scales: you can have one EC2 server that just monitors the length of the queue, and spawns or kills off the other EC2 servers that do the video conversion accordingly. With this architecture your 10-users-a-day service could serve a million users tonight and you won't be paged.
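The controller pattern described above can be sketched in a few lines. This is a toy model of the decision logic only, not Amazon's API; the function names, the jobs-per-worker ratio and the cap are all illustrative assumptions:

```python
import math

def desired_workers(queue_length, jobs_per_worker=10, max_workers=100):
    """How many conversion workers does the current queue depth justify?

    Assumes each worker can comfortably hold `jobs_per_worker` queued
    jobs; caps the fleet at `max_workers` to bound cost.
    """
    if queue_length == 0:
        return 0
    return min(max_workers, math.ceil(queue_length / jobs_per_worker))

def reconcile(running, queue_length):
    """Return how many instances to launch (+) or terminate (-)."""
    return desired_workers(queue_length) - running

# 45 jobs queued, 2 workers running: launch 3 more.
print(reconcile(running=2, queue_length=45))  # 3
```

In the real architecture the controller would read the queue depth from SQS and start or stop instances through the EC2 API, but the scaling decision itself is just this simple loop run on a timer.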

Is there potential for using this service as a supplementary means of delivering a library's digital material? We focus chiefly on preservation and (let's be honest) small-scale boutique web access; if one of our websites suddenly became really popular it would just as suddenly collapse under the demand. Do we build a bigger server room and add more hardware, more bandwidth, just in case? Faced with the possibility of long periods of under-provisioning or over-provisioning - both of which hurt the organisation - an infinitely and instantly scalable model like AWS looks very attractive. Rather than offering an alternative to existing in-house provision, it could be a way of providing enhanced access to certain digital material without putting a strain on the organisation's technical infrastructure. After all, Amazon has the kind of bandwidth, uptime and resiliency that we can only dream of.

Amazon's S3, the storage service, provides "buckets" (think collections) within which to store digital objects and their metadata (perfect!). Add in some EC2 servers with your preferred technology stack and a little custom development and you're away.
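To make the bucket/object/metadata idea concrete, here is a minimal in-memory stand-in showing how a digital object and its descriptive metadata could sit side by side. The class, method names and sample keys are my own illustrations, not Amazon's API:

```python
# Toy model of the S3 storage idea: a named bucket holding objects
# under keys, each object carrying a metadata dict alongside its data.
# All names here are illustrative, not part of any Amazon interface.
class Bucket:
    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data, metadata=None):
        """Store an object under `key` with an optional metadata dict."""
        self._objects[key] = {"data": data, "metadata": metadata or {}}

    def get(self, key):
        """Fetch the object's data."""
        return self._objects[key]["data"]

    def head(self, key):
        """Fetch only the metadata, without the (possibly large) data."""
        return self._objects[key]["metadata"]

# A hypothetical heritage-images collection.
photos = Bucket("heritage-images")
photos.put("wellington-1908.tif", b"...",
           {"title": "Lambton Quay, 1908", "format": "image/tiff"})
print(photos.head("wellington-1908.tif")["title"])  # Lambton Quay, 1908
```

The real S3 works along the same lines: user-defined metadata travels with each object, and a HEAD-style request retrieves it without pulling the object itself - handy when the objects are large digitised images or video.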

There's also the possibility of using the Mechanical Turk for small tasks that require a human mind (I'm thinking of things like adding descriptive metadata or transcriptions) but that's another post. In the meantime, read more about all this at Wikipedia or the Amazon Web Services website.

1 comment:

lewisb said...

Gen-i's supercomputing centre is aiming to do the same kind of thing within NZ, avoiding the international bandwidth costs of services like S3. They presented on their plans yesterday at the Film Archive's Digital Forum here in Wellington. Really interesting potential for content management and one-off processor intensive transcoding jobs.