Blog » Debuggable - Node.js Consulting

Embracing the Cloud - Locating Resources

With the rise of affordable cloud computing (especially services like EC2 and S3) we need to learn to apply additional skills to our craft. One of those is using Hashtables to locate resources in our system.

Here is an example. Lets say your application stores image files uploaded by its users. At some point your hard disk is filling up. Now you have two choices.

Add another hard drive
Distribute the files amongst multiple servers

Choice A is called "Vertical Scaling" and usually easier. Choice B is "Horizontal Scaling". Horizontal scaling is obviously the future, since the free lunch is over.

So once you decided you want to be serious and not place your bets on vertical scaling, you face a problem. On any given server X how do I know which server holds File Y? The answer is simple, you look it up using a hash function. Since the rise of Gravatar and Git, cryptographic hash functions, especially MD5 and SHA1, have become popular general-purpose hashing functions of choice.

So if you previously stored your file urls in the database like this:

$file['url'] = 'http://example.org/files/'. $file['id'];
storeAt($file['url'], file_get_contents($file['path']));

You would rewrite it to look like this:

function url($file) {
$servers = file_get_contents('http://resources.example.org/file_servers.json').
$servers = json_decode($servers, true);
$sha1 = sha1_file($file['path']);

foreach ($servers as $key => $server) {
if (preg_match($key, $sha1)) {
return $server . '/' . $file['id'];
}
}
throw new Exception('invalid server list');
}

$file['url'] = url($file);
storeAt($file['url'], file_get_contents($file['path']));

And store a file called file_servers.json on resources.example.org:

{
"/^[a-l]/": "http://bob.files.example.org",
"/^[l-z]/": "http://john.files.example.org",
}

If you now cache the file_servers.json list on each machine and make sure resources.example.org is available, you can scale to pretty much infinity.

How does it work? Simple. We wrote a hash function called url() which will assign a unqiue location to each newly uploaded file. This works by computing the sha1 fingerprint for the file. Then we pick a server based on the leading characters of the sha1, using the ranges defined in file_servers.json.

If you need to remove full servers or add new ones, you simply modify file_servers.json file.

The probably largest and most sophisticated implementation of the concept above seems to be Amazon S3 right now.

The jury is still out if they will manage to win the upcoming cloud war, but there are exciting times ahead of us. The recession is putting many companies, big and small, in need of saving costs through innovation. So what renewable energies and fuel efficient cars are for other industries, is what cloud computing is to us.

Think that cloud computing is not going to help with your current clients and applications? Invest an hour to put your clients static files (and uploads) in Amazon S3 and downgrade his shared hosting account. He now needs less storage, bandwidth and CPU power and gets redundant backup as a bonus.

-- Felix Geisendörfer aka the_undefined

How To Save Half A Second On Every CakePHP Request

Posted on 26/2/09 by Tim Koschützki

Hey folks,

as an application comes closer and closer to its launch date, not having cared about performance during development becomes more and more of a problem.
There are several ways to improve the performance of your CakePHP application. The first thing one would think about is generally caching of your views and database queries. However, during development stage implementing caching can cause a lot of confusion and phantom bugs. In short, it might waste your time which is better put into features. Any performance improvement that does not effect how data is retrieved, stored and cached is welcome. If it affects your entire site and not only parts of it, it's all the better.

With the help of Mark Story's excellent CakePHP DebugKit I got the idea of disabling Cake's reverse route lookup to gain performance. Almost half a second per request for link-heavy sites.

The Hack

As you use the HtmlHelper to create links with the helper's link() method, you throw an url in the form of an array at Router::url() everytime. With the complex routes parsing done for every $html->link() call, this becomes a big issue for link-heavy sites. There will be the most overhead if you have a lot of routes installed. So I thought if it's easily possible to override the behaviour for standard urls that don't need routes parsing. A classical example would be:

$html->link('Check this Article', array('controller' => 'articles', 'action' => 'view', $article['Article']['id']));

Since this link is dynamic (depending on the article id) it does not need routes parsing (most of the time, more on that later). There are others, like 4 parameters, 2 parameters, pagination links that have named params in the url .. or only one parameter if only the 'controller' is present and you want to access the index action, etc. I have tinkered a little and wrote some code that we are using in most of our projects now and it has worked out quite well. In fact it is saving almost half a second for every request:

<?php
class AppHelper extends Helper {
function url($url = null, $full = false) {
$Router =& Router::getInstance();
if (!empty($Router->__params)) {
if (isset($Router) && !isset($Router->params['requested'])) {
$params = $Router->__params[0];
} else {
$params = end($Router->__params);
}
}

if (isset($params['admin']) && $params['admin'] && !isset($url['admin'])) {
$url['admin'] = $params['admin'];
}

if (is_array($url) && isset($url['controller']) && !isset($url['page'])) {
if (!isset($url['action'])) {
$url['action'] = 'index';
}

$admin = '';
if (isset($url['admin']) && $url['admin']) {
$admin = Configure::read('Routing.admin') . '/';
if (strpos($url['action'], $admin . '_') === 0) {
$url['action'] = substr($url['action'], strlen($admin));
}
}
unset($url['admin']);
unset($url['plugin']);

$count = count($url);
if (4 == $count) {
return '/' . $admin . $url['controller'] . '/' . $url['action'] . '/' . $url[0] . '/' . $url[1];
}

if (3 == $count) {
if (isset($url['id'])) {
$url[0] = $url['id'];
}
return '/' . $admin . $url['controller'] . '/' . $url['action'] . '/' . $url[0];
}

if (2 == $count) {
return '/' . $admin . $url['controller'] . '/' . $url['action'];
}

if (1 == $count) {
return '/' . $admin . $url['controller'] . '/index';
}
}

return parent::url($url, $full);
}
}
?>

How does it work?

As you can see, it's simply overriding the default behaviour for Helper::url() in your AppHelper. It goes through the classical cases and builds out the url via lightweight string concatenation. Not a beauty, but fast.

What it really does is breaking the reverse routing feature. That means if you specify such a route:

Router::connect('/login', array('controller' => 'auth', 'action' => 'login'));

Then create a link with $html->link('Login Now!', array('controller' => 'auth', 'action' => 'login')), the link url will not automagially transform into '/login' anymore. This is pretty bad, however if your site hardcodes the urls used in routes, this is not a problem. This means that instead of providing array('controller' => 'auth', 'action' => 'login'), you just do '/login', which I always do for specific pages since there aren't so many of them. So the hack helps for non-dynamic cases.

Usage

To use the code, acknowledge it's more like a site-specific optimization. By the way, for those of you who think I should just go ahead and contribute a patch to CakePHP that will make it possible to disable reverse route lookup, I talked to Nate already. He said he is in the process of rewriting the Routing for Cake 1.3 and it will be much faster. In the meantime, this might help you.

Copy and paste the url() function to your AppHelper class in app_helper.php. Try it out and see if it works for you. Obviously, it doesn't affect all cases and is pretty specific to how we write urls. However, it falls back to the normal url parsing if no appropriate case is found. Any suggestions, bug reports and contributions are welcome.

Disclaimer:

In no way do I claim that this code will work as well for you as it does for us. It's specific to how urls are written using $html->link() and you might screw over your app by using this. Take extra caution when using it - especially with reverse route lookup - and make sure to test your app thoroughly.

-- Tim Koschuetzki aka DarkAngelBGE

Are we done yet?

Posted on 11/2/09 by Felix Geisendörfer

If you have worked with progress indicators in the past you might have had the same thought that always makes you wonder.

Should I write:

if ($done == $total) {
// We are done, move on
}

or should I write:

if ($done >= $total) {
// We are done, move on
}

that is the question!

Today I went for the '==' route because the $done variable should never be bigger than the $total one.

I think this comes right down to your development philosophy: Should weird conditions in your application that seem "harmless" be silently ignored or should they cause a crash?

I guess I'm the fail often fail early type.

What type are you?

-- Felix Geisendörfer aka the_undefined

Disable strict host checking for git clone

Posted on 4/2/09 by Felix Geisendörfer

Hey folks,

while playing with automated machine configuration in EC2 for a few minutes this morning, I stumbled across a little hurdle. One of the items in my init script was the cloning of a git repository from GitHub.

This normally isn't a very difficult task to automate. However, it can become so if you see the following message:

$ git clone git@github.com:debuggable/secret-project.debuggable.com.git
Initialized empty Git repository in /var/git/secret-project.debuggable.com/.git/
The authenticity of host 'github.com (65.74.177.129)' can't be established.
RSA key fingerprint is 16:27:ac:a5:76:28:2d:36:63:1b:56:4d:eb:df:a6:48.
Are you sure you want to continue connecting (yes/no)?

Interactive questions like this can be really annoying when it comes to automation. Luckily there is an easy fix available.

$ echo -e "Host github.com\n\tStrictHostKeyChecking no\n" >> ~/.ssh/config

This will add a configuration line to your ~/.ssh/config script that will silently ignore the authenticity of github.com.

-- Felix Geisendörfer aka the_undefined

PS: If the topic of passing ssh options to your git commands is interesting to you, make sure to also check out this git wiki page.

Restarting a command line PHP script

Posted on 3/2/09 by Felix Geisendörfer

Hey folks,

ever wrote a PHP script that acts as a daemon? No? Read Kevin's excellent post on the topic as well as this one explaining the low-level plumbing.

If you already have, you might have run into the scenario where you want the script to restart itself. For example you might have changed the code of the script after it started running, and you want to replace the current process with one running the new code.

After a bit of trial and error, I present my glorious hack:

die(exec(join(' ', $GLOBALS['argv']) . ' > /dev/null &'));

Feel free to share your own war stories and suggestions regarding PHP scripts acting as daemons : ).

-- Felix Geisendörfer aka the_undefined

debuggable

Embracing the Cloud - Locating Resources

How To Save Half A Second On Every CakePHP Request

The Hack

How does it work?

Usage

Disclaimer:

Are we done yet?

Disable strict host checking for git clone

Restarting a command line PHP script

RSS Feeds

Recent Posts

Archive

Recent Comments

Keep an eye on