Streaming UTF-8 (with node.js)
Posted on 18/5/10 by Felix Geisendörfer
UTF-8 is a variable-length character encoding for Unicode. This text is encoded in UTF-8, so the characters you are reading can consist of 1 to 4 bytes. As long as the characters fit within the ASCII range (0-127), there is exactly 1 byte used per character.
But to express a character outside the ASCII range, such as '¢', I need more bytes. The character '¢', for example, consists of two bytes: 0xC2 and 0xA2. The first byte, 0xC2, indicates that '¢' is a 2-byte character. This is easy to see if you look at the binary representation of 0xC2:
11000010
As you can see, the bit sequence begins with '110', which as per the UTF-8 specification means: "2 byte character ahead!". Another character such as '€' (0xE2, 0x82, 0xAC) would work the same way. The first byte, 0xE2, looks like this in binary:
11100010
The prefix '1110' specifies that there are 3 bytes forming the current character. More exotic characters may even start with '11110', which indicates a 4 byte character.
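You can verify these prefixes yourself. Here is a small sketch (it assumes a modern Node.js with Buffer.from, which didn't exist when this post was written) that shifts away the payload bits of each lead byte so only the length prefix remains:

```javascript
// The UTF-8 bytes of the example characters from above.
const cent = Buffer.from('¢'); // [0xC2, 0xA2]
const euro = Buffer.from('€'); // [0xE2, 0x82, 0xAC]

// Shift away the low bits of the lead byte; the length prefix remains.
console.log((cent[0] >> 5).toString(2)); // '110'  -> 2-byte character
console.log((euro[0] >> 4).toString(2)); // '1110' -> 3-byte character
```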
As you can guess, UTF-8 text is not trivial to stream. Networks and file systems are not UTF-8 aware, so they will often split a chunk of text in the middle of a character.
To make sure you don't process a partial character, you have to analyze the last 3 bytes of any given chunk in your stream to check for the bit-prefixes that are used to announce a multibyte character. If you detect an incomplete character, you need to buffer the bytes you have for it, and then prepend them to the next chunk that comes in.
This way you can completely avoid breaking apart multibyte characters within a UTF-8 text, while still getting great performance and memory usage (only the last 3 bytes need checking / buffering).
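To see why this buffering matters, here is an illustrative sketch (again using the modern Buffer.from API, not code from the original post) where a chunk boundary falls in the middle of the 3-byte '€' and naive per-chunk decoding mangles it:

```javascript
const bytes = Buffer.from('€5'); // [0xE2, 0x82, 0xAC, 0x35]

// Pretend the network delivered the text in two chunks,
// splitting the 3-byte '€' down the middle.
const chunk1 = bytes.slice(0, 2); // 0xE2 0x82 -> incomplete character
const chunk2 = bytes.slice(2);    // 0xAC 0x35

// Decoding each chunk on its own produces replacement
// characters instead of '€':
console.log(chunk1.toString() + chunk2.toString());
```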
As of yesterday, node.js's net / http modules are now fully UTF-8 safe, thanks to the streaming Utf8Decoder (undocumented, API may change) you can see below:
var Utf8Decoder = exports.Utf8Decoder = function() {
  this.charBuffer = new Buffer(4);
  this.charReceived = 0;
  this.charLength = 0;
};

Utf8Decoder.prototype.write = function(buffer) {
  var charStr = '';

  // if our last write ended with an incomplete multibyte character
  if (this.charLength) {
    // determine how many remaining bytes this buffer has to offer for this char
    var i = (buffer.length >= this.charLength - this.charReceived)
      ? this.charLength - this.charReceived
      : buffer.length;

    // add the new bytes to the char buffer
    buffer.copy(this.charBuffer, this.charReceived, 0, i);
    this.charReceived += i;

    if (this.charReceived < this.charLength) {
      // still not enough bytes for this character? wait for more ...
      return;
    }

    // get the character that was split
    charStr = this.charBuffer.slice(0, this.charLength).toString();
    this.charReceived = this.charLength = 0;

    if (i == buffer.length) {
      // if there are no more bytes in this buffer, just emit our char
      this.onString(charStr);
      return;
    }

    // otherwise cut off the character's remaining bytes from the beginning of this buffer
    buffer = buffer.slice(i, buffer.length);
  }

  // determine how many bytes we have to check at the end of this buffer
  var i = (buffer.length >= 3)
    ? 3
    : buffer.length;

  // figure out if one of the last i bytes of our buffer announces an incomplete char
  for (; i > 0; i--) {
    var c = buffer[buffer.length - i];

    // See http://en.wikipedia.org/wiki/UTF-8#Description

    // 110XXXXX
    if (i == 1 && c >> 5 == 0x06) {
      this.charLength = 2;
      break;
    }

    // 1110XXXX
    if (i <= 2 && c >> 4 == 0x0E) {
      this.charLength = 3;
      break;
    }

    // 11110XXX
    if (i <= 3 && c >> 3 == 0x1E) {
      this.charLength = 4;
      break;
    }
  }

  if (!this.charLength) {
    // no incomplete char at the end of this buffer, emit the whole thing
    this.onString(charStr + buffer.toString());
    return;
  }

  // buffer the incomplete character bytes we got
  buffer.copy(this.charBuffer, 0, buffer.length - i, buffer.length);
  this.charReceived = i;

  if (buffer.length - i > 0) {
    // buffer had more bytes before the incomplete char, emit them
    this.onString(charStr + buffer.slice(0, buffer.length - i).toString());
  } else if (charStr) {
    // or just emit the charStr if any
    this.onString(charStr);
  }
};
I feel like this implementation could still be somewhat simplified, so if you have any suggestions or comments, please let me know!
--fg
PS: Another buffer-based project I'm working on is a fast multipart parser - stay tuned for another post!
Marco Rogers: Currently it gets applied to any net or http stream if you use setEncoding('utf-8'). I still have to add it for the fs module.
That's excellent! It's a must-have for Node.js when dealing with localized applications.
Thanks for this great addition, Felix!
Matt
Felix,
Great work, but you need to address a few issues:
- For what use cases is your library intended? Keep in mind that XML and JSON parsing libraries already do their own UTF-8 verification, so duplicating their work does not enhance performance.
- UTF-8 safety also requires that no invalid bytes such as 0xFF show up in the content. Therefore, safety requires checking each byte.
- Consequently, I would not call your implementation UTF-8 "safe". It is not sufficient to just check the last 3 bytes; you are leaving it up to the user to check for invalid UTF-8 sequences anyway.
All of this is unnecessary, as JavaScript already has functions which can facilitate encoding and decoding of UTF-8. Encoding UTF-8 to a byte string is unescape(encodeURIComponent(string)), and decoding UTF-8 is decodeURIComponent(escape(string)).
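For the curious, the round-trip this comment describes works like this (escape/unescape are deprecated globals, but they still exist in browsers and in Node):

```javascript
// Encode: encodeURIComponent percent-encodes the UTF-8 bytes,
// and unescape turns each %XX back into one character 0-255.
const byteString = unescape(encodeURIComponent('¢')); // '\u00C2\u00A2'

// Decode: escape percent-encodes each byte-character, and
// decodeURIComponent reinterprets the sequence as UTF-8.
const text = decodeURIComponent(escape(byteString)); // '¢'

console.log(byteString.length, text); // 2 '¢'
```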
One suggestion that may or may not be of any use: you might be able to match the UTF-8 byte ranges at the end of the string with a single regexp.
David Bender: Eli Grey: This class is merely splitting a bytes stream into portions that are guaranteed to not be in the middle of a multibyte character. The actual conversion to (and verification of) UTF-8 happens in Buffer.toString().
@Felix Thanks for the clarification. Consequently, your code does not just read the last 3 bytes though, as toString() is an O(n) operation.
I'm still not sure when I would actually use this, though. Verifying input tends to be done when parsing structured text, and libraries that do that handle partial multibyte characters. As an example, I would never put this between my socket and my JSON parser. For what situation do you use this?
David Bender: If you wanted to write a streaming JSON parser that operates on a character level (rather than byte level), this would be very useful to put in between your parser and your socket.
Felix: decodeURIComponent(escape(string)) not only decodes the UTF-8, but also verifies it. Why exactly would I want to use your code instead of this simple trick? Is it substantially faster?
Eli Grey: Because decodeURIComponent() can only handle a fixed-length string, not a stream of data. This code is really only useful for node.js, not client side stuff (in case that's what you're referring to).
Ah, thanks. I thought this was just for fixed-length strings too.
Nice. This is definitely a problem I was aware of in node, but wasn't looking forward to handling myself. So does this decoder get applied to any stream if the encoding gets set to UTF-8? Is that the only time it gets applied?