Embracing the Cloud - Locating Resources
Posted on 3/3/09 by Felix Geisendörfer
With the rise of affordable cloud computing (especially services like EC2 and S3) we need to learn to apply additional skills to our craft. One of those skills is using hash tables to locate resources in our systems.
Here is an example. Let's say your application stores image files uploaded by its users. At some point your hard disk fills up. Now you have two choices:
- A: Add another hard drive
- B: Distribute the files amongst multiple servers
Choice A is called "vertical scaling" and is usually easier. Choice B is "horizontal scaling". Horizontal scaling is the future, since the free lunch is over.
So once you've decided to be serious and not place your bets on vertical scaling, you face a problem: on any given server X, how do I know which server holds file Y? The answer is simple: you look it up using a hash function. Since the rise of Gravatar and Git, cryptographic hash functions, especially MD5 and SHA1, have become the popular general-purpose hash functions of choice.
So if you previously stored your file URLs in the database like this:
storeAt($file['url'], file_get_contents($file['path']));
You would rewrite it to look like this:
function url($file) {
  $servers = file_get_contents('http://resources.example.org/file_servers.json');
  $servers = json_decode($servers, true);

  // Fingerprint the file contents and pick the server whose pattern matches
  $sha1 = sha1_file($file['path']);
  foreach ($servers as $pattern => $server) {
    if (preg_match($pattern, $sha1)) {
      return $server . '/' . $file['id'];
    }
  }
  throw new Exception('invalid server list');
}
$file['url'] = url($file);
storeAt($file['url'], file_get_contents($file['path']));
And store a file called file_servers.json on resources.example.org:
"/^[a-l]/": "http://bob.files.example.org",
"/^[l-z]/": "http://john.files.example.org",
}
If you now cache the file_servers.json list on each machine (a minimal sketch follows below) and make sure resources.example.org stays available, you can scale pretty much to infinity.
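Here is one way that caching could look, assuming a local cache file and an arbitrary 5 minute refresh interval (neither is prescribed by the setup above):

function cached_server_list() {
  // Local copy of file_servers.json, refreshed every 5 minutes
  // (cache path and TTL are arbitrary choices for this sketch)
  $cacheFile = '/tmp/file_servers.json';
  $ttl = 300;

  if (!file_exists($cacheFile) || filemtime($cacheFile) < time() - $ttl) {
    $json = @file_get_contents('http://resources.example.org/file_servers.json');
    if ($json !== false) {
      file_put_contents($cacheFile, $json);
    }
  }

  if (!file_exists($cacheFile)) {
    throw new Exception('no server list available');
  }
  return json_decode(file_get_contents($cacheFile), true);
}

url() could then call cached_server_list() instead of fetching the list over HTTP on every upload.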
How does the lookup work? Simple. We wrote a hash function called url() which assigns a unique location to each newly uploaded file. It computes the sha1 fingerprint of the file's contents, then picks a server based on the leading character of that fingerprint, using the ranges defined in file_servers.json. Since sha1 fingerprints are hex strings, the patterns only need to cover the characters 0-9 and a-f: a file whose fingerprint starts with "3", for example, matches /^[0-7]/ and ends up on bob.files.example.org.
If you need to remove servers or add new ones, you simply modify the file_servers.json file, as sketched below.
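For example, a later revision of file_servers.json that splits the hex range across three servers might look like this (the jane hostname is made up for illustration):

{
  "/^[0-4]/": "http://bob.files.example.org",
  "/^[5-9]/": "http://john.files.example.org",
  "/^[a-f]/": "http://jane.files.example.org"
}

Files already stored under the old mapping would of course have to be moved to their new servers so that freshly computed URLs keep resolving.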
Probably the largest and most sophisticated implementation of the concept above right now is Amazon S3.
The jury is still out on whether Amazon will win the upcoming cloud war, but there are exciting times ahead of us. The recession is forcing many companies, big and small, to cut costs through innovation. Cloud computing is to us what renewable energy and fuel-efficient cars are to other industries.
Think cloud computing is not going to help with your current clients and applications? Invest an hour to put your client's static files (and uploads) in Amazon S3 and downgrade their shared hosting account. They now need less storage, bandwidth and CPU power, and get redundant storage as a bonus.
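If you want to script that migration, a bare-bones upload over S3's REST API could look roughly like the sketch below. The bucket name, object key and credentials are placeholders, and the request signing shown is the scheme S3 uses today (signature version 2); an off-the-shelf S3 library does the same thing for you.

// Minimal sketch of a PUT to Amazon S3 (placeholders for bucket and credentials)
function s3_put($bucket, $key, $path, $accessKey, $secretKey) {
  $data = file_get_contents($path);
  $contentType = 'application/octet-stream';
  $date = gmdate('D, d M Y H:i:s \G\M\T');

  // String to sign as defined by the S3 REST authentication scheme
  $stringToSign = "PUT\n\n$contentType\n$date\n/$bucket/$key";
  $signature = base64_encode(hash_hmac('sha1', $stringToSign, $secretKey, true));

  $ch = curl_init("https://$bucket.s3.amazonaws.com/$key");
  curl_setopt_array($ch, array(
    CURLOPT_CUSTOMREQUEST => 'PUT',
    CURLOPT_POSTFIELDS => $data,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => array(
      "Date: $date",
      "Content-Type: $contentType",
      "Authorization: AWS $accessKey:$signature",
    ),
  ));
  $response = curl_exec($ch);
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  curl_close($ch);

  return $status == 200;
}

s3_put('my-client-bucket', 'uploads/photo.jpg', '/var/www/uploads/photo.jpg', 'ACCESS_KEY', 'SECRET_KEY');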
-- Felix Geisendörfer aka the_undefined