Unicode in DLXS

This topic touches on XPAT indexing, data preparation and middleware configuration and behavior as they relate to Unicode in DLXS.

Documentation on this topic can also be found at:

http://docs.dlxs.org/class/unicode.html (Release 11a)
http://www.dlxs.org/docs/12a/class/unicode.html (Release 12a)

Unicode in General

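Discussion of characters is often imprecise. The term "character set" is considered harmful because it is used ambiguously, both for a repertoire of characters and for the way those characters are encoded as bytes.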

Definitions

Forms of multi-lingual character support

Your data may be pure ASCII, an encoding that supports 128 characters. Each ISO-8859-* encoding supports 256 characters, but only one set of 256 characters at a time. Latin2 covers German and Polish. Latin5 covers German and Turkish. There is no single-byte encoding covering both German and Russian, for example.
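The coverage limits described above can be checked with a minimal sketch. The list of candidate encodings and the sample words are assumptions chosen for illustration; the list is not exhaustive, so an empty result only shows that none of these particular candidates covers the text.

```python
# Candidate single-byte encodings (an illustrative selection, not exhaustive).
CANDIDATES = [
    "ascii",      # 7-bit ASCII
    "iso8859_1",  # Latin1
    "iso8859_2",  # Latin2
    "iso8859_5",  # ISO Cyrillic
    "iso8859_9",  # Latin5
    "koi8_r",     # Russian
    "cp1251",     # Windows Cyrillic
]


def encodings_covering(text, candidates=CANDIDATES):
    """Return the candidate encodings that can represent every character in text."""
    covering = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this encoding
        covering.append(name)
    return covering


german = "Straße"
polish = "Łódź"
turkish = "İstanbul"
russian = "Москва"

german_polish = encodings_covering(german + polish)
german_turkish = encodings_covering(german + turkish)
german_russian = encodings_covering(german + russian)

print(german_polish)
print(german_turkish)
print(german_russian)

# UTF-8 can encode the mixed German and Russian text without error.
utf8_bytes = (german + russian).encode("utf-8")
```

Among these candidates, only Latin2 covers the German and Polish sample, only Latin5 covers the German and Turkish sample, and none covers German and Russian together.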

Methods to represent characters from other alphabets:

Reasons to use Unicode

DLXS data preparation and Unicode

Tools are getting better and more plentiful every day.

Terminal emulators

Tools

The goal is to get your data into UTF-8 encoded XML. You need to know how characters in your data have been encoded in order to transform to another encoding.
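Why the source encoding must be known can be shown with a minimal sketch. The function name `transcode_to_utf8` is hypothetical and is not part of DLXS. The same input byte yields different UTF-8 output depending on which source encoding is assumed.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode raw bytes using the known source encoding, then re-encode as UTF-8."""
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


raw = b"\xb3"  # a single byte whose meaning depends on the encoding

# Interpreted as Latin1, 0xB3 is SUPERSCRIPT THREE (U+00B3).
as_latin1 = transcode_to_utf8(raw, "iso8859_1")

# Interpreted as Latin2, 0xB3 is LATIN SMALL LETTER L WITH STROKE (U+0142).
as_latin2 = transcode_to_utf8(raw, "iso8859_2")

print(as_latin1)
print(as_latin2)
```

Neither call fails, so a wrong guess about the source encoding produces silently incorrect data rather than an error.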

Unicode XPAT indexing

This will be covered in more detail when we discuss data preparation more fully. For now, this section just highlights some differences in programs and processes.

DLPS Unicode examples

Middleware configuration, requirements and behavior for Unicode

XPAT version 5.3.2

Perl 5.8.3 or higher is required. 5.8.8 is better. Avoid 5.8.6 (debugger problem).

Configuration and behavior

To make legacy Latin-1 encoded SGML data work:

The basic assumption is that ANY input (user typed or search results from XPAT) is UTF-8 encoded XML. Why? How? From what encoding?

Downside: Searches for accented characters will fail in Latin-1 collections because the user's search term will be converted to UTF-8 but the collection data will be Latin-1. Unaccented searches will still work.
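The mismatch behind this downside can be illustrated with a minimal sketch. A plain byte-substring test stands in for the search here, which is an assumption made for illustration and not how XPAT actually searches; the sample text and the `found` helper are hypothetical.

```python
# Collection text stored as Latin-1 bytes (sample data for illustration).
collection_data = "café noir".encode("iso8859_1")


def found(term):
    """Byte-level match of a UTF-8 encoded search term against Latin-1 data."""
    return term.encode("utf-8") in collection_data


# "é" is the single byte E9 in Latin-1 but the two bytes C3 A9 in UTF-8,
# so the accented term cannot match the stored bytes.
accented_hit = found("café")

# ASCII characters have the same byte values in both encodings.
unaccented_hit = found("noir")

print(accented_hit, unaccented_hit)
```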

All XML templates have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> elements to ensure user input is UTF-8 and to tell the browser to use UTF-8 encoding when rendering the page content.

The middleware supports collections with different character encodings in cross-collection mode. This works because of the transcoding Latin1 -> UTF-8 on input and Latin1 -> UTF-8 on output.
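The output side of this transcoding can be sketched roughly as follows. The collection names, the encoding table and the `result_to_utf8` helper are all hypothetical and do not reflect the actual DLXS middleware code. The sketch only shows how results from collections in different encodings can be merged once each is converted to UTF-8.

```python
# Hypothetical collection names and their encodings, for illustration only.
COLLECTION_ENCODINGS = {
    "legacy_sgml": "iso8859_1",  # Latin-1 encoded SGML collection
    "new_xml": "utf-8",          # UTF-8 encoded XML collection
}


def result_to_utf8(collection, raw_result):
    """Return raw result bytes from a collection as UTF-8 bytes."""
    encoding = COLLECTION_ENCODINGS[collection]
    return raw_result.decode(encoding).encode("utf-8")


# The same word as it would be stored in each collection.
raw_results = [
    ("legacy_sgml", "résumé".encode("iso8859_1")),
    ("new_xml", "résumé".encode("utf-8")),
]

# After conversion, results from both collections share one encoding.
merged = [result_to_utf8(name, raw) for name, raw in raw_results]
print(merged)
```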
