Streaming UTF-8 (with node.js)
Posted on 18/5/10 by Felix Geisendörfer
UTF-8 is a variable-length character encoding for Unicode. This text is encoded in UTF-8, so the characters you are reading can consist of 1 to 4 bytes. As long as the characters fit within the ASCII range (0-127), there is exactly 1 byte used per character.
But if I want to express a character outside the ASCII range, such as '¢', I need more bytes. The character '¢' for example consists of: 0xC2 and 0xA2. The first byte, 0xC2, indicates that '¢' is a 2-byte character. This is easy to understand if you look at the binary representation of 0xC2:
11000010
As you can see, the bit sequence begins with '110', which as per the UTF-8 specification means: "2 byte character ahead!". Another character such as '€' (0xE2, 0x82, 0xAC) would work the same way. The first byte, 0xE2, looks like this in binary:
11100010
The prefix '1110' specifies that there are 3 bytes forming the current character. More exotic characters may even start with '11110', which indicates a 4 byte character.
As you can guess, UTF-8 text is not trivial to stream. Networks and file systems are not UTF-8 aware, so they will often split a chunk of text in the middle of a character.
To make sure you don't process a partial character, you have to analyze the last 3 bytes of any given chunk in your stream to check for the bit-prefixes that are used to announce a multibyte character. If you detect an incomplete character, you need to buffer the bytes you have for it, and then prepend them to the next chunk that comes in.
This way you can completely avoid breaking apart multibyte characters within a UTF-8 text, while still getting great performance and memory usage (only the last 3 bytes need checking / buffering).
As of yesterday, node.js's net / http modules are now fully UTF-8 safe, thanks to the streaming Utf8Decoder (undocumented, API may change) you can see below:
var Utf8Decoder = exports.Utf8Decoder = function() {
this.charBuffer = new Buffer(4);
this.charReceived = 0;
this.charLength = 0;
};
Utf8Decoder.prototype.write = function(buffer) {
var charStr = '';
// if our last write ended with an incomplete multibyte character
if (this.charLength) {
// determine how many remaining bytes this buffer has to offer for this char
var i = (buffer.length >= this.charLength - this.charReceived)
? this.charLength - this.charReceived
: buffer.length;
// add the new bytes to the char buffer
buffer.copy(this.charBuffer, this.charReceived, 0, i);
this.charReceived += i;
if (this.charReceived < this.charLength) {
// still not enough chars in this buffer? wait for more ...
return;
}
// get the character that was split
charStr = this.charBuffer.slice(0, this.charLength).toString();
this.charReceived = this.charLength = 0;
if (i == buffer.length) {
// if there are no more bytes in this buffer, just emit our char
this.onString(charStr)
return;
}
// otherwise cut of the characters end from the beginning of this buffer
buffer = buffer.slice(i, buffer.length);
}
// determine how many bytes we have to check at the end of this buffer
var i = (buffer.length >= 3)
? 3
: buffer.length;
// figure out if one of the last i bytes of our buffer announces an incomplete char
for (; i > 0; i--) {
c = buffer[buffer.length - i];
// See http://en.wikipedia.org/wiki/UTF-8#Description
// 110XXXXX
if (i == 1 && c >> 5 == 0x06) {
this.charLength = 2;
break;
}
// 1110XXXX
if (i <= 2 && c >> 4 == 0x0E) {
this.charLength = 3;
break;
}
// 11110XXX
if (i <= 3 && c >> 3 == 0x1E) {
this.charLength = 4;
break;
}
}
if (!this.charLength) {
// no incomplete char at the end of this buffer, emit the whole thing
this.onString(charStr+buffer.toString());
return;
}
// buffer the incomplete character bytes we got
buffer.copy(this.charBuffer, 0, buffer.length - i, buffer.length);
this.charReceived = i;
if (buffer.length - i > 0) {
// buffer had more bytes before the incomplete char, emit them
this.onString(charStr+buffer.slice(0, buffer.length - i).toString());
} else if (charStr) {
// or just emit the charStr if any
this.onString(charStr);
}
};
I feel like this implementation could still be somewhat simplified, so if you have any suggestions or comments, please let me know!
--fg
PS: Another buffer-based project I'm working on is a fast multipart parser - stay tuned for another post!