arsd.characterencodings

This is meant to help get data from the wild into utf8 strings so you can work with them easily inside D.

The main function is convertToUtf8(), which takes a byte array of your raw data (a byte array because it isn't really a D string yet until it is utf8), and a runtime string telling it's current encoding.

The current encoding argument is meant to come from the data's metadata, and is flexible on exact format - it is case insensitive and takes several variations on the names.

This way, you should be able to send it the encoding string directly from an XML document, a HTTP header, or whatever you have, and it ought to just work.

Members

Functions

convertToUtf8
string convertToUtf8(immutable(ubyte)[] data, string dataCharacterEncoding)

Takes data from a given character encoding and returns it as UTF-8

convertToUtf8Lossy
string convertToUtf8Lossy(immutable(ubyte)[] data, string dataCharacterEncoding)

Like convertToUtf8, but if the encoding is unknown, it just strips all chars > 127 and calls it done instead of throwing

tryToDetermineEncoding
string tryToDetermineEncoding(ubyte[] rawdata)

Tries to determine the current encoding based on the content. Only really helps with the UTF variants. Returns null if it can't be reasonably sure.

Examples

auto data = cast(immutable(ubyte)[])
	std.file.read("my-windows-file.txt");
string utf8String = convertToUtf8(data, "windows-1252");
// utf8String can now be used

The encodings currently implemented for decoding are:

  • UTF-8 (a no-op; it simply casts the array to string)
  • UTF-16,
  • UTF-32,
  • Windows-1252,
  • ISO 8859 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, and 16.
  • KOI8-R

It treats ISO 8859-1, Latin-1, and Windows-1252 the same way, since those labels are pretty much de-facto the same thing in wild documents (people mislabel them a lot and I found it more useful to just deal with it than to be pedantic).

This module currently makes no attempt to look at control characters.

Suggestion Box / Bug Report