Unicode in DLXS

This topic touches on XPAT indexing, data preparation and middleware configuration and behavior as they relate to Unicode in DLXS.

Documentation on this topic can also be found at:

http://docs.dlxs.org/class/unicode.html (Release 11a)
http://www.dlxs.org/docs/12/class/unicode.html (Release 12)

Unicode in General

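Talk about characters is often imprecise. The term "character set" is best avoided because it blurs the difference between a repertoire of characters and the way those characters are encoded as bytes.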

Definitions

DLXS multi-lingual character support before Unicode

The ASCII encoding supports only 128 characters. Each ISO-8859-* encoding supports up to 256 characters, but only one fixed set of 256 at a time. Latin2 covers German and Polish. Latin5 covers German and Turkish. There is no single-byte encoding that covers both German and Russian.
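
The coverage limits described above can be checked directly. The following is a minimal Python sketch, not part of DLXS. The helper name encodable and the sample words are assumptions chosen for illustration. It tests whether a word can be represented in a given encoding.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the named encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample words chosen for illustration only.
GERMAN, POLISH, TURKISH, RUSSIAN = "Straße", "Łódź", "İstanbul", "Москва"

if __name__ == "__main__":
    # iso8859_2 is Latin2, iso8859_9 is Latin5, iso8859_5 is Latin/Cyrillic.
    for name in ("ascii", "iso8859_2", "iso8859_9", "iso8859_5", "utf-8"):
        covered = [w for w in (GERMAN, POLISH, TURKISH, RUSSIAN) if encodable(w, name)]
        print(name, covered)
```

Only UTF-8 reports all four words as encodable. Each single-byte encoding covers at most two of them.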

Reasons to use Unicode

DLXS data preparation and Unicode

Tools are getting better and more plentiful every day.

Terminal emulators

Tools

The goal is to get your data into UTF-8 encoded XML. You need to know how characters in your data have been encoded in order to transform to another encoding.
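
Why the source encoding must be known can be shown with a small Python sketch. This is an illustration only, not a DLXS tool. The function name to_utf8 and the sample bytes are assumptions. The same bytes are converted twice, once with the correct source encoding and once with a wrong guess.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes using the stated source encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


latin1_bytes = b"caf\xe9"  # the word "café" as stored in Latin1 (ISO-8859-1)

# Correct source encoding: the result is the UTF-8 form of "café".
right = to_utf8(latin1_bytes, "iso8859_1")

# Wrong guess: the conversion succeeds without error, but byte 0xE9 is
# interpreted as a Cyrillic letter, so the output is silently corrupted.
wrong = to_utf8(latin1_bytes, "iso8859_5")
```

Neither call raises an error, which is why a wrong assumption about the source encoding tends to go unnoticed until the text is displayed or searched.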

Unicode XPAT indexing

More on this when we cover data preparation in full. For now, this section just highlights some differences in programs and processes.

DLPS Unicode examples

Middleware configuration, requirements and behavior for Unicode

XPAT version 5.3.2

Perl 5.8.3 or higher is required.

Configuration and behavior

The middleware will transcode all user input that is not valid UTF-8 from Latin1 to UTF-8 under the assumption that the input was Latin1. This implies that non-ASCII searches will fail in Latin1 collections. Unaccented searches will still work because of XPAT mapping in the data dictionary.
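
The input rule described above can be sketched as follows. The DLXS middleware itself is written in Perl. This Python version is only an illustration of the idea, and the function name normalize_input is an assumption, not a DLXS identifier.

```python
def normalize_input(raw: bytes) -> str:
    """Interpret raw user input as UTF-8 if it is valid UTF-8, else as Latin1."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid UTF-8 sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback under the stated assumption that the input was Latin1.
        # Every byte value is defined in Latin1, so this decode cannot fail.
        return raw.decode("iso8859_1")


utf8_query = "Zürich".encode("utf-8")        # b"Z\xc3\xbcrich"
latin1_query = "Zürich".encode("iso8859_1")  # b"Z\xfcrich", not valid UTF-8
```

Both queries normalize to the same string. One limitation of this kind of test is that a few Latin1 byte sequences also happen to be valid UTF-8, and such input would be read as UTF-8.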

The collection manager (collmgr) locale field should be set to en_US.UTF-8. Any value not including "UTF-8" means the middleware will assume Latin1 encoding and will:

All XML templates have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> elements to ensure user input is UTF-8 and to tell the browser to use UTF-8 encoding for the page.

The middleware supports collections with different character encodings in cross-collection mode. This works because the middleware transcodes Latin1 -> UTF-8 on input and Latin1 -> UTF-8 on output.
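
A rough Python sketch of how the collmgr locale rule and the output transcoding could combine in cross-collection mode is given here. It is not the DLXS implementation, which is Perl. The function names, the sample result bytes and the locale value en_US.ISO8859-1 are all hypothetical.

```python
def collection_encoding(locale: str) -> str:
    """Apply the locale rule: any value not including "UTF-8" means Latin1."""
    return "utf-8" if "UTF-8" in locale else "iso8859_1"


def result_to_unicode(raw: bytes, locale: str) -> str:
    """Decode one collection's result bytes according to that collection's locale."""
    return raw.decode(collection_encoding(locale))


# Hypothetical results from two collections with different encodings.
results = [
    (b"Z\xc3\xbcrich", "en_US.UTF-8"),    # UTF-8 collection
    (b"Z\xfcrich", "en_US.ISO8859-1"),    # Latin1 collection (illustrative locale value)
]

# After decoding, both results are ordinary strings and can be merged
# into a single page that is served as UTF-8.
merged = [result_to_unicode(raw, locale) for raw, locale in results]
page_bytes = "\n".join(merged).encode("utf-8")
```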
