17 October 2005

Java Character Encoding

The company I work for produces a cross-platform (Windows and Mac OS X) application, and we rely on Java to store text generated on each platform in a database that may be running on an entirely different platform.

That's three platforms, so it helps to have a sound character encoding strategy--you would think. The problem is that Java makes this rather hard.

Consider three characters: the left curly quote, the right curly quote and the em-dash. They have one encoding on Windows (CP1252), a different one on OS X (MacRoman), and no encoding at all in ISO-8859-1. That really isn't too big of a deal, as long as you can get at the bytes. After all, that's all they are to the database--just bytes.
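To make that concrete, here is a small sketch that encodes those three characters against both platform charsets. (The charset name "x-MacRoman" is an assumption--it works on the JREs I know of that ship the extended charsets, but yours may differ.)

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class PlatformBytes {
    // The three troublesome characters: left curly quote, right curly quote, em-dash.
    static final String CURLIES = "\u201C\u201D\u2014";

    static byte[] encode(String s, String charsetName) {
        // Note: getBytes(Charset) silently replaces anything unmappable.
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // CP1252 (Windows) maps them to 0x93, 0x94, 0x97.
        System.out.println(Arrays.toString(encode(CURLIES, "windows-1252")));
        // MacRoman (OS X) maps the very same characters to 0xD2, 0xD3, 0xD1.
        System.out.println(Arrays.toString(encode(CURLIES, "x-MacRoman")));
    }
}
```

Same three characters, two completely different byte sequences--which is exactly why the database only ever sees "just bytes."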

I said: as long as you can get at the bytes.

Here is the deal: if you're dealing with an ISO-8859-1 string and you manage to slip some out-of-range characters in there (java.lang.String accepts this), the characters stay intact. But the second you try to get at the bytes, they are converted to question marks. Yes, question marks--that's 0x3f to me.
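A minimal demonstration of that silent mangling, as I understand it:

```java
import java.nio.charset.Charset;

public class QuestionMarks {
    static byte[] latin1Bytes(String s) {
        // Characters with no ISO-8859-1 mapping are silently replaced,
        // not reported -- the default replacement byte is '?' (0x3f).
        return s.getBytes(Charset.forName("ISO-8859-1"));
    }

    public static void main(String[] args) {
        String s = "a\u2014b"; // 'a', em-dash, 'b' -- the String itself holds them just fine
        byte[] bytes = latin1Bytes(s);
        // The em-dash comes back as 0x3f, a literal question mark.
        System.out.println(Integer.toHexString(bytes[1] & 0xFF)); // prints "3f"
    }
}
```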

Why not accept the fact there is no encoding and just give the original bytes back?

This gets even trickier. Say you create your strings by supplying an array of properly encoded bytes and a charset name. The moment you use the built-in string concatenation operators (or String.substring, in my experiments), you are once again working with strings encoded the default way, which may or may not be the character set you passed to the original String constructor.
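A sketch of the trap: the charset name is used once, to decode, and the resulting String does not remember it--so a later no-argument getBytes() falls back to the JVM's default charset, whatever that happens to be on your platform.

```java
import java.nio.charset.Charset;

public class DefaultCharsetTrap {
    static String decode(byte[] bytes, String charsetName) {
        // The charset is consumed here, during decoding; the String
        // that comes out carries no memory of it.
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] cp1252 = { (byte) 0x97 };                 // em-dash in CP1252
        String s = decode(cp1252, "windows-1252") + "!"; // concatenation: just chars now
        // No-argument getBytes() encodes with the JVM default charset,
        // not with windows-1252 -- so these bytes vary by platform.
        System.out.println(Charset.defaultCharset() + ": " + s.getBytes().length + " bytes");
    }
}
```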

But back to getting the bytes. Assume that getBytes() is giving you busted data. There is a way to convert the string to properly encoded bytes. You have to 1) create a CharsetEncoder (from a Charset that you must also create), 2) encode the string into a ByteBuffer and 3) grab the bytes from the ByteBuffer.
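Spelled out, the three-step dance looks something like this. One consolation: unlike getBytes(), a fresh CharsetEncoder's default action is to throw on unmappable characters rather than quietly writing question marks.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ExplicitEncode {
    // Charset -> CharsetEncoder -> ByteBuffer -> byte[].
    static byte[] toBytes(String s, String charsetName) throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        // Step 2: encode the whole string into a ByteBuffer. The encoder
        // REPORTs unmappable characters by throwing, instead of writing '?'.
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(s));
        // Step 3: copy the encoded bytes out of the buffer.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toBytes("\u2014", "windows-1252");
        System.out.println(Integer.toHexString(bytes[0] & 0xFF)); // prints "97"
    }
}
```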

I don't know about you, but that is kind of drawn out for something that is conceptually fairly simple.

Go Java!