Streaming UTF-8 (with node.js)
Posted on 18/5/10 by Felix Geisendörfer
UTF-8 is a variable-length character encoding for Unicode. This text is encoded in UTF-8, so the characters you are reading can consist of 1 to 4 bytes. As long as the characters fit within the ASCII range (0-127), there is exactly 1 byte used per character.
But to express a character outside the ASCII range, such as '¢', I need more bytes. The character '¢', for example, consists of two bytes: 0xC2 and 0xA2. The first byte, 0xC2, indicates that '¢' is a 2-byte character. This is easy to see if you look at the binary representation of 0xC2:
11000010
As you can see, the bit sequence begins with '110', which as per the UTF-8 specification means: "2 byte character ahead!". Another character such as '€' (0xE2, 0x82, 0xAC) would work the same way. The first byte, 0xE2, looks like this in binary:
11100010
The prefix '1110' specifies that there are 3 bytes forming the current character. More exotic characters may even start with '11110', which indicates a 4 byte character.
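You can verify these prefixes yourself. Here is a small sketch (it assumes a modern Node.js with Buffer.from, which didn't exist when this post was written) that shifts away the payload bits of each lead byte so only the length prefix remains:

```javascript
// The UTF-8 bytes of the example characters from above.
const cent = Buffer.from('¢'); // [0xC2, 0xA2]
const euro = Buffer.from('€'); // [0xE2, 0x82, 0xAC]

// Shift away the low bits of the lead byte; the length prefix remains.
console.log((cent[0] >> 5).toString(2)); // '110'  -> 2-byte character
console.log((euro[0] >> 4).toString(2)); // '1110' -> 3-byte character
```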
As you can guess, UTF-8 text is not trivial to stream. Networks and file systems are not UTF-8 aware, so they will often split a chunk of text in the middle of a character.
To make sure you don't process a partial character, you have to analyze the last 3 bytes of any given chunk in your stream to check for the bit-prefixes that are used to announce a multibyte character. If you detect an incomplete character, you need to buffer the bytes you have for it, and then prepend them to the next chunk that comes in.
This way you can completely avoid breaking apart multibyte characters within a UTF-8 text, while still getting great performance and memory usage (only the last 3 bytes need checking / buffering).
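To see why this buffering matters, here is an illustrative sketch (again using the modern Buffer.from API, not code from the original post) where a chunk boundary falls in the middle of the 3-byte '€' and naive per-chunk decoding mangles it:

```javascript
const bytes = Buffer.from('€5'); // [0xE2, 0x82, 0xAC, 0x35]

// Pretend the network delivered the text in two chunks,
// splitting the 3-byte '€' down the middle.
const chunk1 = bytes.slice(0, 2); // 0xE2 0x82 -> incomplete character
const chunk2 = bytes.slice(2);    // 0xAC 0x35

// Decoding each chunk on its own produces replacement
// characters instead of '€':
console.log(chunk1.toString() + chunk2.toString());
```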
As of yesterday, node.js's net / http modules are now fully UTF-8 safe, thanks to the streaming Utf8Decoder (undocumented, API may change) you can see below:
var Utf8Decoder = exports.Utf8Decoder = function() {
  this.charBuffer = new Buffer(4);
  this.charReceived = 0;
  this.charLength = 0;
};

Utf8Decoder.prototype.write = function(buffer) {
  var charStr = '';

  // if our last write ended with an incomplete multibyte character
  if (this.charLength) {
    // determine how many remaining bytes this buffer has to offer for this char
    var i = (buffer.length >= this.charLength - this.charReceived)
      ? this.charLength - this.charReceived
      : buffer.length;

    // add the new bytes to the char buffer
    buffer.copy(this.charBuffer, this.charReceived, 0, i);
    this.charReceived += i;

    if (this.charReceived < this.charLength) {
      // still not enough bytes for this character? wait for more ...
      return;
    }

    // get the character that was split
    charStr = this.charBuffer.slice(0, this.charLength).toString();
    this.charReceived = this.charLength = 0;

    if (i == buffer.length) {
      // if there are no more bytes in this buffer, just emit our char
      this.onString(charStr);
      return;
    }

    // otherwise cut off the character's remaining bytes from the beginning of this buffer
    buffer = buffer.slice(i, buffer.length);
  }

  // determine how many bytes we have to check at the end of this buffer
  var i = (buffer.length >= 3)
    ? 3
    : buffer.length;

  // figure out if one of the last i bytes of our buffer announces an incomplete char
  for (; i > 0; i--) {
    var c = buffer[buffer.length - i];

    // See http://en.wikipedia.org/wiki/UTF-8#Description

    // 110XXXXX
    if (i == 1 && c >> 5 == 0x06) {
      this.charLength = 2;
      break;
    }

    // 1110XXXX
    if (i <= 2 && c >> 4 == 0x0E) {
      this.charLength = 3;
      break;
    }

    // 11110XXX
    if (i <= 3 && c >> 3 == 0x1E) {
      this.charLength = 4;
      break;
    }
  }

  if (!this.charLength) {
    // no incomplete char at the end of this buffer, emit the whole thing
    this.onString(charStr + buffer.toString());
    return;
  }

  // buffer the incomplete character bytes we got
  buffer.copy(this.charBuffer, 0, buffer.length - i, buffer.length);
  this.charReceived = i;

  if (buffer.length - i > 0) {
    // buffer had more bytes before the incomplete char, emit them
    this.onString(charStr + buffer.slice(0, buffer.length - i).toString());
  } else if (charStr) {
    // or just emit the charStr if any
    this.onString(charStr);
  }
};
I feel like this implementation could still be somewhat simplified, so if you have any suggestions or comments, please let me know!
--fg
PS: Another buffer-based project I'm working on is a fast multipart parser - stay tuned for another post!
Marco Rogers: Currently it gets applied to any net or http stream if you use setEncoding('utf-8'). I still have to add it for the fs module.
That's excellent! It's a must-have for Node.js when dealing with localized applications.
Thanks for this great addition, Felix!
Matt
Felix,
Great work, but you need to address a few issues:
- For what use cases is your library intended? Keep in mind that XML and JSON parsing libraries already do their own UTF-8 verification, so duplicating their work does not enhance performance.
- UTF-8 safety also requires that no invalid bytes such as 0xFF show up in the content. Therefore, safety requires checking each byte.
- Consequently, I would not call your implementation UTF-8 "safe". It is not sufficient to just check the last 3 bytes; you are leaving it up to the user to check for invalid UTF-8 sequences anyway.
All of this is unnecessary, as JavaScript already has functions which can facilitate encoding and decoding of UTF-8. Encoding UTF-8 to a byte string is unescape(encodeURIComponent(string)), and decoding UTF-8 is decodeURIComponent(escape(string)).
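For the curious, the round-trip this comment describes works like this (escape/unescape are deprecated globals, but they still exist in browsers and in Node):

```javascript
// Encode: encodeURIComponent percent-encodes the UTF-8 bytes,
// and unescape turns each %XX back into one character 0-255.
const byteString = unescape(encodeURIComponent('¢')); // '\u00C2\u00A2'

// Decode: escape percent-encodes each byte-character, and
// decodeURIComponent reinterprets the sequence as UTF-8.
const text = decodeURIComponent(escape(byteString)); // '¢'

console.log(byteString.length, text); // 2 '¢'
```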
One suggestion that may or may not be of any use: you might be able to match the UTF-8 byte ranges at the end of the string with a single regexp.
David Bender: Eli Grey: This class is merely splitting a bytes stream into portions that are guaranteed to not be in the middle of a multibyte character. The actual conversion to (and verification of) UTF-8 happens in Buffer.toString().
@Felix Thanks for the clarification. Consequently, your code does not just read the last 3 bytes though, as toString() is an O(n) operation.
I'm still not sure when I would actually use this, though. Verifying input tends to be done when parsing structured text, and libraries that do that handle partial multibyte characters. As an example, I would never put this between my socket and my JSON parser. For what situation do you use this?
David Bender: If you wanted to write a streaming JSON parser that operates on a character level (rather than byte level), this would be very useful to put in between your parser and your socket.
Felix: decodeURIComponent(escape(string)) not only decodes the UTF-8, but also verifies it. Why exactly would I want to use your code instead of this simple trick? Is it substantially faster?
Eli Grey: Because decodeURIComponent() can only handle a fixed-length string, not a stream of data. This code is really only useful for node.js, not client side stuff (in case that's what you're referring to).
Ah, thanks. I thought this was just for fixed-length strings too.
Nice. This is definitely a problem I was aware of in node, but wasn't looking forward to handling myself. So does this decoder get applied to any stream if the encoding gets set to UTF-8? Is that the only time it gets applied?