arsd.characterencodings

This module is meant to help get data from the wild into UTF-8 strings so you can work with them easily inside D.

The main function is convertToUtf8(), which takes a byte array of your raw data (a byte array because the data isn't really a D string until it is UTF-8) and a runtime string giving its current encoding.

The current encoding argument is meant to come from the data's metadata, and is flexible about the exact format: it is case insensitive and accepts several variations of the names.

This way, you should be able to send it the encoding string directly from an XML document, an HTTP header, or whatever you have, and it ought to just work.
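As a sketch of the HTTP case, the charset parameter can be pulled out of a Content-Type header value before it is handed over as the encoding argument. charsetFromContentType is a hypothetical helper written for this illustration, not part of arsd.characterencodings, and it only handles the simple name=value parameter form.

```d
import std.algorithm.iteration : splitter;
import std.algorithm.searching : startsWith;
import std.string : strip, toLower;

/// Hypothetical helper (not part of arsd.characterencodings):
/// returns the charset parameter of a Content-Type header value,
/// or null when the header carries none.
string charsetFromContentType(string headerValue)
{
    foreach (part; headerValue.splitter(';'))
    {
        auto param = part.strip();
        // Parameter names are compared case-insensitively.
        if (param.toLower().startsWith("charset="))
        {
            auto value = param["charset=".length .. $].strip();
            // Drop one pair of surrounding double quotes, if present.
            if (value.length >= 2 && value[0] == '"' && value[$ - 1] == '"')
                value = value[1 .. $ - 1];
            return value;
        }
    }
    return null;
}
```

The returned label, for example "ISO-8859-1", could then be passed unchanged as the second argument of convertToUtf8.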

Members

Functions

convertToUtf8
string convertToUtf8(immutable(ubyte)[] data, string dataCharacterEncoding)

Takes data from a given character encoding and returns it as UTF-8

convertToUtf8Lossy
string convertToUtf8Lossy(immutable(ubyte)[] data, string dataCharacterEncoding)

Like convertToUtf8, but if the encoding is unknown, it just strips all chars > 127 and calls it done instead of throwing
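The fallback described here amounts to keeping only the ASCII bytes. A minimal sketch of that idea follows; stripNonAscii is a hypothetical name, and this is an illustration of the behavior rather than the module's actual code.

```d
/// Hypothetical illustration of the lossy fallback:
/// keeps bytes 0..127 and silently drops everything else.
string stripNonAscii(const(ubyte)[] data)
{
    char[] result;
    result.reserve(data.length);
    foreach (b; data)
    {
        if (b < 128)
            result ~= cast(char) b;
    }
    return result.idup;
}
```

With this approach the Latin-1 bytes for "café" would come back as "caf": the output is always valid UTF-8, but any non-ASCII text is lost.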

tryToDetermineEncoding
string tryToDetermineEncoding(in ubyte[] rawdata)

Tries to determine the current encoding based on the content. Only really helps with the UTF variants. Returns null if it can't be reasonably sure.
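Since detection can return null, a caller will usually want some fallback. The sketch below shows one possible arrangement using only the signatures listed on this page; decodeWithFallback and its default of "windows-1252" are assumptions made for this example, and it also assumes that any label returned by tryToDetermineEncoding is one the conversion functions accept.

```d
import arsd.characterencodings;

/// Hypothetical wrapper: prefers the declared encoding, then content
/// detection, then a caller-chosen default (an assumption of this sketch).
string decodeWithFallback(immutable(ubyte)[] data, string declared,
    string fallback = "windows-1252")
{
    string encoding = declared;

    // No metadata available: ask the module to guess from the content.
    if (encoding.length == 0)
        encoding = tryToDetermineEncoding(data);

    // A null result has length 0, so this also covers "could not tell".
    if (encoding.length == 0)
        encoding = fallback;

    // The lossy variant does not throw on an unknown label.
    return convertToUtf8Lossy(data, encoding);
}
```

Putting the declared encoding first reflects the description above: content-based detection only really helps with the UTF variants, so metadata is the better source when it exists.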

Variables

Windows_1252
dchar[] Windows_1252

I'm sure this could be a lot more efficient, but whatever, it works.

Examples

import arsd.characterencodings;
import std.file;

auto data = cast(immutable(ubyte)[]) std.file.read("my-windows-file.txt");
string utf8String = convertToUtf8(data, "windows-1252");
// utf8String can now be used

The encodings currently implemented for decoding are: UTF-8 (a no-op; it simply casts the array to string), UTF-16, UTF-32, Windows-1252, and ISO 8859 parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, and 16.

It treats ISO 8859-1, Latin-1, and Windows-1252 the same way, since those labels are pretty much de facto the same thing in documents found in the wild.
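One reason this is harmless in practice: Windows-1252 differs from ISO 8859-1 only in the 0x80 to 0x9F range, where ISO 8859-1 has control characters and Windows-1252 has printable ones such as the euro sign and curly quotes. The sketch below illustrates that with a deliberately partial table. The function names are hypothetical, and this is not the module's Windows_1252 table or its decoding code.

```d
import std.utf : encode;

/// Sketch only: maps a few of the Windows-1252 bytes in 0x80..0x9F to
/// their Unicode code points. The real table has more entries.
dchar windows1252ToDchar(ubyte b)
{
    switch (b)
    {
        case 0x80: return '\u20AC'; // euro sign
        case 0x85: return '\u2026'; // horizontal ellipsis
        case 0x91: return '\u2018'; // left single quotation mark
        case 0x92: return '\u2019'; // right single quotation mark
        case 0x93: return '\u201C'; // left double quotation mark
        case 0x94: return '\u201D'; // right double quotation mark
        case 0x96: return '\u2013'; // en dash
        case 0x97: return '\u2014'; // em dash
        // Everywhere else (including bytes this partial table omits)
        // the byte value is used as the code point, as in ISO 8859-1.
        default:   return cast(dchar) b;
    }
}

/// Decodes byte by byte and appends each code point as UTF-8.
string decodeWindows1252Sketch(const(ubyte)[] data)
{
    char[] result;
    foreach (b; data)
        encode(result, windows1252ToDchar(b));
    return result.idup;
}
```

Text that is genuinely ISO 8859-1 rarely contains bytes in that range, so decoding it with the Windows-1252 mapping gives the same result, while mislabeled Windows-1252 text keeps its quotes and euro signs.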

This module currently makes no attempt to look at control characters.
