Unicode in DLXS

This topic touches on XPAT indexing, data preparation and middleware configuration as they relate to Unicode in DLXS.

Documentation on this topic can also be found at: http://docs.dlxs.org/class/dlxs-unicode.xml


Unicode in General

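Talk about characters is often imprecise. The term "character set" is best avoided, because it blurs the difference between a repertoire of characters, the numbers (code points) assigned to them, and the byte sequences used to encode those numbers.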
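
As a small illustration of why loose talk about characters causes trouble, the sketch below takes one character and shows that its code point and its encoded bytes are separate things. The sample character is an arbitrary choice and is not taken from DLXS data.

```python
# One abstract character: LATIN SMALL LETTER O WITH DIAERESIS.
char = "ö"

# Its Unicode code point is a number, independent of any byte encoding.
code_point = ord(char)

# The same code point becomes different byte sequences in different encodings.
latin1_bytes = char.encode("iso-8859-1")   # one byte
utf8_bytes = char.encode("utf-8")          # two bytes
utf16_bytes = char.encode("utf-16-be")     # two bytes, different values

print(f"U+{code_point:04X}", latin1_bytes, utf8_bytes, utf16_bytes)
```

Saying only that the text "is in the Latin character set" does not tell you which of these byte sequences is on disk.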

Definitions

DLXS multi-lingual character support before Unicode

The ASCII encoding supports only 128 characters. Each ISO-8859-* encoding supports up to 256 characters, but only one fixed set of 256 at a time. Latin2 (ISO-8859-2) covers German and Polish. Latin5 (ISO-8859-9) covers German and Turkish. There is no single-byte encoding covering both German and Russian.
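
The coverage limits of the single-byte encodings can be checked directly. The following is a minimal sketch using Python's built-in codecs. The sample words and the helper name `can_encode` are illustrative assumptions, not part of DLXS.

```python
# Arbitrary sample words chosen to exercise language-specific letters.
german = "Größe"
polish = "Łódź"
turkish = "İstanbul"
russian = "Москва"

def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

pairs = {
    "German + Polish": german + polish,
    "German + Turkish": german + turkish,
    "German + Russian": german + russian,
}

for encoding in ("ascii", "iso-8859-2", "iso-8859-9", "iso-8859-5", "utf-8"):
    for label, text in pairs.items():
        print(encoding, label, can_encode(text, encoding))
```

Among the encodings tried here, only UTF-8 accepts German and Russian text together.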

Reasons to use Unicode

DLXS data preparation and Unicode

Tools are getting better and more plentiful every day.

Terminal emulators

Tools

The goal is to get your data into UTF-8 encoded XML. To transform data from one encoding to another, you first need to know how the characters in your data are currently encoded.
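
A minimal sketch of that transformation follows, assuming the source encoding is already known. The function name `transcode_to_utf8` and the sample bytes are hypothetical. This is not one of the DLXS tools, only an illustration of the decode-then-encode step.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode bytes using the known source encoding, then re-encode as UTF-8.

    Decoding is strict, so bytes that are invalid in the source encoding
    raise UnicodeDecodeError instead of being silently replaced.
    """
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")

# "Größe" as it would be stored in ISO-8859-1 (Latin1).
latin1_raw = b"Gr\xf6\xdfe"

utf8_raw = transcode_to_utf8(latin1_raw, "iso-8859-1")
print(utf8_raw)

# A wrong guess about the source encoding does not necessarily fail.
# It can quietly produce different characters.
wrong_guess = latin1_raw.decode("iso-8859-5")
print(wrong_guess == "Größe")
```

The second half shows why the source encoding has to be known rather than guessed: the same bytes decode without error under ISO-8859-5, but to different characters.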

Unicode XPAT indexing

Indexing is covered in more detail in the fuller discussion of data preparation. For now, this section only highlights some differences in programs and processes.

DLPS production example

Middleware configuration for Unicode

The last pre-Unicode XPAT version is 5.2.3. The first Unicode-aware XPAT version is 5.3.0.

Configuration and requirements.
