Unicode in DLXS

This topic touches on XPAT indexing, data preparation and middleware configuration and behavior as they relate to Unicode in DLXS.

Documentation on this topic can also be found at:

http://www.dlxs.org/docs/13/class/unicode.html

Unicode and Character Sets

Your data may be pure ASCII, which defines 128 characters (codes 0 through 127). Each ISO-8859-* encoding supports 256 characters, but only one set of 256 characters at a time. Latin-2 (ISO-8859-2) covers German and Polish. Latin-5 (ISO-8859-9) covers German and Turkish. There is no single-byte encoding that covers both German and Russian, for example.
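The coverage limits described above can be checked directly. The following is a minimal sketch in Python (not part of DLXS); the sample words and the helper name can_encode are illustrative assumptions. It tries to encode short German, Polish, Turkish and Russian strings in several encodings and reports which languages each encoding can hold.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample words, one per language (assumed, not from DLXS data).
samples = {
    "German": "Grüße",
    "Polish": "Łódź",
    "Turkish": "ışık",
    "Russian": "привет",
}

# Python codec names for ASCII, Latin-2, Latin-5, ISO Cyrillic and UTF-8.
encodings = ["ascii", "iso8859_2", "iso8859_9", "iso8859_5", "utf-8"]

for enc in encodings:
    covered = [lang for lang, text in samples.items() if can_encode(text, enc)]
    print(f"{enc:10} covers: {', '.join(covered) or 'none of the samples'}")
```

Only UTF-8 accepts all four samples. ISO-8859-5 holds the Russian word but not the German one, which is the German-plus-Russian gap mentioned above.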

Methods to represent characters from other alphabets:

Unicode Definitions

Reasons to use Unicode


DLXS data preparation and Unicode

Tools for working with Unicode are getting better and more plentiful.

Viewers, Terminal emulators

Tools

The goal is to get your data into UTF-8 encoded XML. You need to know how characters in your data have been encoded in order to transform to another encoding.
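To show why the source encoding has to be known, here is a minimal Python sketch (not a DLXS tool). The function name transcode_to_utf8 and the sample bytes are assumptions made for illustration. The same four bytes are converted to UTF-8 twice, once assuming Latin-2 and once assuming Latin-1, and the results differ.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes using the stated source encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Four bytes that spell a Polish place name when read as Latin-2 (ISO-8859-2).
raw = b"\xa3\xf3d\xbc"

as_latin2 = transcode_to_utf8(raw, "iso8859_2")
as_latin1 = transcode_to_utf8(raw, "iso8859_1")

# Both conversions succeed without error, but they produce different text.
print(as_latin2.decode("utf-8"))
print(as_latin1.decode("utf-8"))
```

Neither call raises an error, so a wrong guess about the source encoding silently produces wrong characters in the UTF-8 output rather than a visible failure.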


Unicode XPAT indexing

This applies to the XPAT-based classes: TextClass, FindaidClass and BibClass. ImageClass is MySQL-based. There will be more on this when we talk about data preparation for the classes more fully.

DLPS Unicode examples


Middleware configuration, requirements and behavior for Unicode

Use Perl 5.8.3 or higher. 5.8.8 is better. Avoid 5.8.6 (debugger problem).

If your data is UTF-8 encoded Unicode, set the collection manager (collmgr) locale field to en_US.UTF-8. The middleware will then use xpatu to read the index. That is all.
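Before setting the locale to en_US.UTF-8, it can help to confirm that the data really is well-formed UTF-8. The following Python sketch is one way to do that and is not part of the DLXS middleware; the function name is_valid_utf8 is an assumption.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 under strict error handling."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The same accented word stored in two encodings.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

Pure ASCII data also passes this check, since ASCII is a subset of UTF-8. To check a file, read it in binary mode and pass the bytes to the function.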

To make legacy Latin-1 encoded SGML data work:

The basic assumption INSIDE the middleware is that ANY input (user typed or search results from XPAT) is UTF-8 encoded.

Downside: Searches for accented characters will fail in Latin-1 collections because the user's search term will be converted to UTF-8 but the collection data will be Latin-1. Unaccented searches will still work.
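The mismatch behind this downside can be seen at the byte level. This Python sketch only illustrates the byte comparison; it does not model XPAT itself, and the sample text and the function name byte_search are assumptions.

```python
def byte_search(term: str, data: bytes) -> bool:
    """Encode the search term as UTF-8, as the middleware assumes, and look for it in data."""
    return term.encode("utf-8") in data


# Collection text stored as Latin-1: the accented e is the single byte 0xE9.
collection_bytes = "café cafe".encode("latin-1")

# The UTF-8 form of the accented e is two bytes (0xC3 0xA9), so it never matches.
print(byte_search("café", collection_bytes))

# Unaccented ASCII characters have identical bytes in Latin-1 and UTF-8.
print(byte_search("cafe", collection_bytes))
```

The accented term is not found even though the word is present in the data, while the unaccented term matches because ASCII bytes are the same in both encodings.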
