DLXS Unicode Data Preparation and Online Presentation Issues

Last updated	2004-02-25 12:17:40 EST
Doc Title	DLXS Unicode Data Preparation and Online Presentation Issues
Author 1	Farber, Phillip
CVS Revision	$Revision: 1.1 $

Introduction

This document describes in some detail the issues involved in Unicode data preparation and indexing, middleware configuration, template issues and user input. In its data preparation and indexing aspect, it is mainly applicable to TextClass, BibClass and FindaidClass. With respect to the remaining issues, it relates to all the classes.

For non-Unicode-specific information on data preparation for individual classes, check the following links:

About Unicode

The authoritative source for information about Unicode is the Unicode Consortium. There you will find the complete standard and many helpful links to other sources of information on Unicode.

First, some definitions. A Character Repertoire is a collection of abstract characters independent of how they look when printed. A Coded Character Set is an assignment of a unique number to each character in a Character Repertoire. The ISO/IEC 10646 Coded Character Set assigns a unique number to virtually every character in all the world's alphabets. These numbers are called Code Points. Unicode is a standard built on top of ISO/IEC 10646 that, in addition to specifying the assignment of numbers to characters, deals with things like collation, bi-directionality, normalization and, most importantly, encoding. A Character Encoding Scheme (encoding) specifies how the number that stands for a character is stored in a file or in computer memory.

There are many Character Encoding Schemes defined by the Unicode Standard, but the one of interest to us is called UTF-8. The UTF-8 encoding of the Unicode Coded Character Set is the preferred encoding for Unicode on the Web. It is a multi-byte encoding, meaning that it uses from 1 to 4 bytes to encode the Unicode Code Point (number) of a given character (the original definition allowed up to 6 bytes; RFC 3629 restricts UTF-8 to 4). For code points 0-7F hex, UTF-8 and US-ASCII are identical. Above 7F, 2 or more bytes are required to encode the number assigned to a Unicode character. With Unicode it is possible for one document to contain characters from many different alphabets and to treat them uniformly for search purposes.
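The relationship between code points and UTF-8 byte counts can be checked directly. The following is a minimal Python sketch for illustration only; the helper name utf8_length is hypothetical and is not part of DLXS.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


# One sample character from each UTF-8 length class.
samples = ["A", "\u00e9", "\u03b1", "\u20ac", "\U00010348"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point in U+ notation, then the encoded bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII characters have the same byte value in US-ASCII and in UTF-8.
assert "A".encode("utf-8") == "A".encode("ascii")
```

Characters in the ASCII range take one byte, while the Greek alpha (U+03B1) takes two and a character outside the Basic Multilingual Plane takes four.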

DLXS Background

DLXS depends on a variety of mechanisms to handle non-ASCII character data. These include:

The use of SGML character entity references (CERs) such as &Acirc; in the data. These are mapped to single character gif images to display certain characters unavailable in typical browser fonts. The problem with this mechanism is that unless the user is knowledgeable enough to type the actual 7 character sequence "&Acirc;" instead of A, for example, their search fails.
The replacement of CERs with the corresponding ISO-8859-1 encoded character. By mapping this (typically) accented character to its unaccented ASCII equivalent, DLXS can find words that contain either the accented or unaccented form of the character. This works fine but, as noted in the introduction, limits the document to a single encoding such as Latin1. In a single document one can cover German+Polish with Latin2 or German+Turkish with Latin5 but there is no single-byte encoding to properly mix German+Russian, for instance.
Making certain uppercase letters in the user's input stand for certain characters like Thorn or Eth and "stealing" unused 8bit values to replace these CERs in the data during conversion. This is a very cumbersome process involving custom programming and involved use of mapping in XPAT indexing and searching.

These mechanisms are not required if the data is in Unicode, especially now that Unicode fonts are widely available in the current generation of web browsers.

Platform Requirements

It is advisable to use the latest software versions recommended in DLXS System Requirements.

There are a few terminal emulators that handle UTF-8 encoded Unicode reasonably well:

xterm, run as xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'. If running under Windows, you need at least Version 8 of the Hummingbird Exceed X Server.
Natively under Windows, PuTTY is good. Under PuTTY Preferences->Translation, select UTF-8.

Data Conversion

If your data does not come to you in Unicode UTF-8 encoded XML, conversion is necessary. A typical conversion might be as follows. Note that you may need to perform only one of (A) or (B), depending on what form your data takes: non-ASCII characters in your data may be represented by entities or encoded directly in, for instance, ISO-8859-1. It is also possible that both steps (A) and (B) are required.

A useful reference to Unicode characters is the file UnicodeData.txt available from the Unicode Consortium and delivered with Perl 5.8 under, for example, PERLROOT/perl/lib/5.8.3/unicore/.

(A) Convert the data to the Unicode UTF-8 encoding

Use the iconv program. The following example on Linux assumes your data is initially encoded in ISO-8859-1/Latin1:

iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile

Use the Perl Unicode::MapUTF8 module in a script like the following:

#!/l/local/bin/perl -w
use strict;
use Unicode::MapUTF8 qw(to_utf8);

# Read ISO-8859-1 input line by line and print each line as UTF-8.
while ( <> ) {
    print to_utf8({ -string => $_, -charset => 'ISO-8859-1' });
}

Use a program like XMLSpy to read in your file and write it out UTF-8 encoded.
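For comparison, the same ISO-8859-1 to UTF-8 transcoding can be sketched in Python. This is an illustrative sketch only, not a DLXS tool; the function names latin1_to_utf8 and convert_file are hypothetical.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Interpret bytes as ISO-8859-1 text and re-encode that text as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Transcode a whole file line by line, reading and writing raw bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(latin1_to_utf8(line))
```

Every byte value is valid in ISO-8859-1, so this decode step never fails. That also means it cannot detect input that is really in another encoding such as Windows-1252; bytes in the range 80-9F hex would silently become control characters.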

(B) Convert numeric character references (NCRs) and SGML character entity references (CERs) to Unicode UTF-8 encoded characters

Since your ultimate goal is to have UTF-8 encoded XML, recall that XML has 5 predefined CERs which you do not need to convert and which the utilities described below do not touch. They are &amp;, &lt;, &gt;, &apos; and &quot;.

Programs such as XMLSpy or osx may do the needed conversions for you but vary in their handling of SGML SDATA and NDATA entities. In some cases you may benefit from using the following two utilities in addition.

For NCRs, i.e. references of the form &#DDDD; where D is a decimal digit or &#xXXXX; where X is a hexadecimal digit, you can use the DLXS utility program DLXSROOT/bin/t/text/ncr2utf8 run as:

ncr2utf8 inputfile > outputfile
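What an NCR conversion involves can be illustrated with a short Python sketch. It is not the implementation of ncr2utf8, whose internals are not described here; the function name ncr_to_text and the decision to leave references to the five XML-special characters untouched are assumptions made for this example.

```python
import re

# Matches decimal (&#DDDD;) and hexadecimal (&#xXXXX;) character references.
NCR_PATTERN = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Code points of & < > ' " which must stay escaped in XML (assumed policy).
XML_SPECIAL = {0x26, 0x3C, 0x3E, 0x27, 0x22}


def ncr_to_text(text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        out_of_range = code_point > 0x10FFFF
        surrogate = 0xD800 <= code_point <= 0xDFFF
        if code_point in XML_SPECIAL or out_of_range or surrogate:
            return match.group(0)  # leave the reference as it was
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


# Writing the result with encoding="utf-8" completes the conversion.
print(ncr_to_text("&#913;&#x3B1; &#38; co.").encode("utf-8"))
```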

For CERs, e.g. references like &Aring;, you may need to analyze the references present in your data. The program DLXSROOT/bin/t/text/findEntities.pl will generate a list of CERs in your data.

It is likely that most or even all CERs in your data will come from one of the ISO Character Entity Sets: ISOamsa, ISOamsb, ISOamsc, ISOamsn, ISOamso, ISOamsr, ISOcyr1, ISOcyr2, ISOgrk1, ISOgrk2, ISOgrk3, ISOgrk4, ISOlat1, ISOlat2, ISOmfrk, ISOnum, ISOpub, ISOtech, MMLalias or MMLextra. You can use DLXSROOT/bin/t/text/isocer2utf8 run as:

isocer2utf8 inputfile > outputfile

to translate these CERs directly to UTF-8. Running findEntities.pl after this will identify any CERs outside these ISO sets.
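The kind of inventory described above can be approximated with a few lines of Python. This sketch is not findEntities.pl and makes no claim about its output format; the name count_entities is hypothetical.

```python
import re
from collections import Counter

# A named reference: '&', a name starting with a letter, then ';'.
CER_PATTERN = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")

# The five references predefined by XML, which need no conversion.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}


def count_entities(text: str) -> Counter:
    """Count each character entity reference name found in the text."""
    names = (match.group(1) for match in CER_PATTERN.finditer(text))
    return Counter(name for name in names if name not in XML_PREDEFINED)


for name, count in count_entities("&Aring;ngstr&ouml;m &amp; &Aring;").items():
    print(f"&{name}; occurs {count} time(s)")
```

Any names still reported after conversion are the ones that need a custom mapping.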

Another option is to use an SGML parser like onsgmls together with Character Entity Declarations that substitute the Unicode NCR for the CER in the parsed output, followed by a run of ncr2utf8 to complete the conversion.

Note that if you started with SGML and rely solely on the small utility programs supplied with the DLXS release, you may need to touch up the SGML to make it (and its DTD) XML compliant. This process is outside the scope of this document (but see DLXSROOT/misc/sgml/textclass.stripped.xml.dtd for an example of the XML version of textclass.dtd). At this point you should have UTF-8 encoded XML data ready to index.

Indexing

Refer to files in DLXSROOT/prep/s/sampletc_utf8 and DLXSROOT/bin/s/sampletc_utf8 for the following discussion.

DLXS delivers a Makefile to take you through the process of building the main XPAT index and the fabricated region indexes. The process is very similar for Latin1 encoded SGML data and UTF-8 encoded XML data. This process is outlined in TextClass Indexing. The main difference between the non-Unicode Makefile and the Unicode Makefile is that xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.

Be sure your XML data file begins with the XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

Without this declaration, xmlrgn will not build correct region indexes.
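A quick pre-indexing check for the declaration, and for stray non-UTF-8 bytes, might look like the following Python sketch. The function name check_utf8_xml is hypothetical, and the check is an illustration rather than part of the DLXS Makefile.

```python
import re

# An XML declaration at the very start of the data, capturing the encoding.
# Both single and double quotes are accepted around attribute values.
XML_DECL = re.compile(
    rb'^<\?xml\s+version=["\']1\.[0-9]+["\']\s+encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def check_utf8_xml(data: bytes) -> list:
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    match = XML_DECL.match(data)
    if match is None:
        problems.append("no XML declaration with an encoding at start of data")
    elif match.group(1).lower() != b"utf-8":
        declared = match.group(1).decode("ascii")
        problems.append(f"declared encoding is {declared}, not UTF-8")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"invalid UTF-8 sequence at byte offset {err.start}")
    return problems


good = b'<?xml version="1.0" encoding="UTF-8"?>\n<a>\xce\xb1</a>'
print(check_utf8_xml(good))
```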

The most important input to the indexing process is the XPAT Data Dictionary. If your data spans several languages, especially those languages with non-Latin alphabets, you will need to configure a Data Dictionary that takes this into account. The sampletc_utf8.blank.dd can be used as a starting point and with some editing is sufficient for Latin based languages. There are two sections in the Data Dictionary that need attention: the Index Points and the Mappings.

Once these sections in the Data Dictionary have been configured the indexing process can proceed via the Makefile. Note that if you have XML element or attribute names that contain non-ASCII characters in your document you should use multirgn to generate the region indexes due to a limitation in xmlrgn. It is expected that this case is rare.

Index Point specification

This specification tells XPAT what points in the data to index. Typically, XPAT is directed to index and search beginning at an alphabetic character following a blank space, i.e. a word. Here is the Index Point specification section of the sampletc_utf8.blank.dd in prep:

   <IndexPoints>
        <IndexPt> &printable.</IndexPt>
        <IndexPt>&printable.-</IndexPt>
        <IndexPt>-&printable.</IndexPt>
        <IndexPt>&printable.&lt.</IndexPt>
        <IndexPt>&printable.&amp.</IndexPt>
        <IndexPt> &Latin.</IndexPt>
        <IndexPt>&Latin.-</IndexPt>
        <IndexPt>-&Latin.</IndexPt>
        <IndexPt>&Latin.&lt.</IndexPt>
        <IndexPt>&Latin.&amp.</IndexPt>
        <IndexPt> &Greek.</IndexPt>
        <IndexPt>&Greek.-</IndexPt>
        <IndexPt>-&Greek.</IndexPt>
        <IndexPt>&Greek.&lt.</IndexPt>
        <IndexPt>&Greek.&amp.</IndexPt>
   </IndexPoints>

The sampletc_utf8.xml data file contains characters from the Latin and Greek alphabets. Index points are defined for the characters from each of those alphabets using XPAT Unicode metacharacters like "&Latin." and "&Greek.". These metacharacters group Unicode characters into "blocks" which correspond roughly to alphabets. The document The XPAT Data Dictionary has a list of these Unicode metacharacters together with the characters that belong to each block (about midway through the section). If your character data is Latin-based it will probably suffice to simply remove the Greek elements from sampletc_utf8.blank.dd.

It is not advisable to create a Data Dictionary that specifies all the blocks so as to create a "universal" Data Dictionary. This would impose a performance and memory penalty on XPAT at runtime.
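To decide which blocks a Data Dictionary actually needs, it can help to survey the alphabets present in the data. The Python sketch below uses the first word of each character's Unicode name as a rough stand-in for its alphabet. This is only an approximation: it does not use XPAT's block definitions, and the function name script_inventory is hypothetical.

```python
import unicodedata
from collections import Counter


def script_inventory(text: str) -> Counter:
    """Count alphabetic characters, grouped by the first word of their name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        # For example "GREEK SMALL LETTER ALPHA" is counted under "GREEK".
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


for group, count in script_inventory("logos \u03bb\u03cc\u03b3\u03bf\u03c2").items():
    print(group, count)
```

Groups that appear in the output suggest which metacharacters deserve index points; the authoritative block membership is in The XPAT Data Dictionary.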

Not all languages have a concept of upper and lower case.

Languages such as Chinese do not separate "words" with spaces. This presents a problem for XPAT. A partial solution is to specify every character to be an index point, as in this example for the Hangul block:

<IndexPt>&Hangul.&Hangul.</IndexPt>

This would result in an index 4 times the size of the data and a large runtime memory requirement for the XPAT index point table. As of this writing, this approach should be considered experimental. There is some probability of false hits, but it should decrease as the length of the query increases.

Mappings specification

Case insensitivity makes it easier for users to enter query terms. This is implemented in the Mappings section by mapping uppercase characters to their lowercase equivalent. Keyboards in the United States usually do not have keys for the accented characters used in European languages. These accented characters are mapped to their unaccented forms in the Mappings section. This allows search and retrieval whether the character appears accented or unaccented in the data. Apropos of Unicode, here is a part of the Mappings section devoted to mapping uppercase Greek to lowercase:

 
        ...
        <Map><From>U+0391</From><To>U+03B1</To></Map>
        <Map><From>U+0392</From><To>U+03B2</To></Map>
        <Map><From>U+0393</From><To>U+03B3</To></Map>
        <Map><From>U+0394</From><To>U+03B4</To></Map>
        <Map><From>U+0395</From><To>U+03B5</To></Map>
        ...

Note that the Greek characters are specified using the "U+" Unicode notation. The number following the "U+" is the Unicode Code Point for the character expressed in hexadecimal notation. From this one can see that the Data Dictionary can be built entirely from ASCII characters. It is not necessary to have a UTF-8 enabled editor. The XPAT Unicode implementation currently accepts values up to U+FFFF (65535). This covers all the characters defined in Unicode Plane 0, also referred to as the Basic Multilingual Plane.
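Mapping entries in this "U+" notation are tedious to type by hand, so they can be generated. The Python sketch below produces uppercase-to-lowercase entries for a range of code points in the same form as the sample Mappings section. It relies on Python's own case tables, which is an assumption of the example and not how the delivered Data Dictionary was produced; the function name case_mappings is hypothetical.

```python
def case_mappings(first: int, last: int) -> list:
    """Build <Map> lines for code points in [first, last] that have a
    distinct single-character lowercase form inside the BMP."""
    lines = []
    for code_point in range(first, last + 1):
        upper = chr(code_point)
        lower = upper.lower()
        if len(lower) != 1 or lower == upper:
            continue  # no simple lowercase form (or unassigned code point)
        if ord(lower) > 0xFFFF:
            continue  # beyond the U+FFFF limit described above
        lines.append(
            f"<Map><From>U+{code_point:04X}</From>"
            f"<To>U+{ord(lower):04X}</To></Map>"
        )
    return lines


# Uppercase Greek Alpha through Omega.
for line in case_mappings(0x0391, 0x03A9):
    print(line)
```

The output should be reviewed before use, since the mapping appropriate for a collection is an editorial decision.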

While there are characters in higher planes, they are relatively rare, and this XPAT limitation is not expected to present an obstacle to indexing your Unicode-based texts. Should the need arise, XPAT can be extended to use a full 32 bit word internally. As there is little need for this currently, it is more memory efficient to use a 16 bit word to store characters in memory.

You will need to analyze your texts to decide what sort of mapping may be useful to your target audiences. There are many issues to consider. Input mechanisms dominate these considerations.

Do your users have Western European keyboards? If so, it is not necessary to map accented to unaccented characters, though it is harmless to do so for the benefit of users who do not have such keyboards. The accented characters are indexed and accepted as input and can be retrieved from the text.
Do your target users have Input Method Editors readily available and know how to use them to enter non-Latin characters?
Do your users have antiquated browsers with poor font support for Unicode?

DLXS is exploring the addition of a configurable JavaScript popup virtual keyboard to allow users to enter characters from alphabets for which they lack a physical keyboard.

Collmgr Fields / Configuration

To put your data online you will naturally need to define a collection in the collection database using Collmgr. There are two differences between a non-Unicode collection and a Unicode collection. Currently there is no support for a Unicode Wordwheel, so leave the wwappmodule, wwdd, wwrealms and wwrealmseng fields blank.

To configure a Unicode collection set the locale field to a UTF-8 locale value such as en_US.UTF-8. You can get a list of locale values recognized by your Unix system by typing locale -a at the shell prompt. A UTF-8 locale setting affects several areas of functionality in the middleware.

The middleware will use the xpatu search engine to search the collection data. This implies that the data was indexed by xpatbldu and xmlrgn/multirgn. This does not apply to ImageClass, which is migrating to MySQL searching. DLXS release 11 was the first release offering xpatu and xpatbldu.
The middleware will expect user input to be UTF-8 encoded. More on this below.
The middleware will send charset=UTF-8 to the browser when outputting processed HTML templates. This will cause the browser to interpret the output from the middleware as UTF-8 and select a Unicode font for display purposes. Browsers lacking a Unicode font will display characters in a garbled manner that includes the hollow rectangular box for some characters.
Perl's internal UTF-8 flag is set on string data in the middleware to handle multi-byte characters.
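When choosing a value from the locale -a listing, note that some systems print the codeset in a different spelling, for example en_US.utf8 rather than en_US.UTF-8. The Python sketch below recognizes either spelling. It is a hypothetical helper for picking a locale value and does not reproduce how the middleware itself tests the locale field.

```python
def is_utf8_locale(name: str) -> bool:
    """Return True if a locale name such as 'en_US.UTF-8' names a UTF-8 codeset."""
    # A locale name has the form language_TERRITORY.codeset@modifier.
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


# Example lines as they might appear in the output of locale -a.
listing = ["C", "POSIX", "en_US", "en_US.iso88591", "en_US.utf8", "de_DE.UTF-8"]
print([name for name in listing if is_utf8_locale(name)])
```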

Templates

At the present time a large number of HTML templates have a <META> tag with charset=iso-8859-1. These templates must continue to work for data from non-Unicode collections while at the same time supporting Unicode data. Rather than adding a processing instruction (PI) to all these templates or duplicating them, we have chosen to process them automatically on output from the middleware. The middleware changes occurrences of charset=iso-8859-1 (if present) to charset=UTF-8 when outputting processed HTML templates. Templates intended to support Unicode character data should have the <META> tag with charset=iso-8859-1 (<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">) in their headers. The upshot is that, since all templates probably conform to this requirement already, no changes should be needed.
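The substitution described above amounts to a simple text replacement on the processed template. The following Python sketch shows the idea. The DLXS middleware is not written this way; the function name rewrite_charset, its collection_is_utf8 parameter and the case-insensitive match are assumptions of the example.

```python
import re

# Match the charset value regardless of letter case (assumed behaviour).
CHARSET_LATIN1 = re.compile(r"charset=iso-8859-1", re.IGNORECASE)


def rewrite_charset(html: str, collection_is_utf8: bool) -> str:
    """Switch the declared charset to UTF-8 for Unicode collections only."""
    if not collection_is_utf8:
        return html  # non-Unicode collections keep the template as written
    return CHARSET_LATIN1.sub("charset=UTF-8", html)


template = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(rewrite_charset(template, collection_is_utf8=True))
```

Because the replacement happens only on output, one template can serve both kinds of collection.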

Unicode, User Input and Form Submission

The encoding of user input to HTML forms is a complex area, not made any easier by browser bugs and standards that do not address the problem fully. The best discussion of this topic is by A. J. Flavell. Basically, the problem is that there is no reliable way for the browser to convey to the middleware what encoding is in effect for the data entered into a form by the user. Quoting Mr. Flavell:

In practice, browsers normally display the contents of text fields according to the character coding (charset) that applies for the HTML page as a whole; and when it submits the text fields they are effectively in this same coding. Thus if the server sent out the (page containing the) form with a definite charset specification, it could normally assume that the submitted data can be interpreted in accordance with the same charset. There are however anomalies of various kinds, some of which have been seen and understood by the author of this note, some of which have been seen and not understood, and some of which are only anecdotal at the moment.

In addition to these considerations, some users may be typing-in or pasting-in text from an application that uses their local character coding (practical examples being macRoman on a Mac; or MS-DOS CP850 being copied out of a DOS window on an MS Windows PC), into a text field of a document that used the author's - different - character coding (let's say for the simplest example, iso-8859-1): the user might then submit the form, disregarding that what they are seeing in the text area is not what they intended to send. [...]

Given this state of affairs we can see that user data entry is not 100% reliable. Nonetheless, it is reasonable to assume the following in a page sent by the middleware with charset=UTF-8:

Users typing at a plain old US keyboard are generating ASCII codes, which are also valid UTF-8. So if a text contains a mixture of non-Latin or accented Latin characters and character data from the ASCII character set (UTF-8 single-byte-encoded Unicode characters), it has the potential to be searched effectively from an ASCII keyboard.
Users copying from DLXS results in their browser window and pasting back into a DLXS search form are generating the UTF-8 encoded data expected by the middleware.
Users typing input via an Input Method Editor (IME) will generate UTF-8 data as expected by the middleware.
Users entering search strings via a JavaScript virtual keyboard will generate UTF-8 encoded data.
Users typing from national keyboards may enter UTF-8 if their system is properly configured.

Beyond these assertions it is impossible to generalize about how copying and pasting characters from arbitrary sources into an input field might be expected to behave.
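One consequence of expecting UTF-8 is that input submitted in another encoding can often be detected, because most non-ASCII byte sequences from single-byte encodings are not valid UTF-8. The Python sketch below illustrates this. It is a hypothetical helper, not DLXS middleware code, and what to do with rejected input is left to the caller.

```python
from typing import Optional


def decode_form_value(raw: bytes) -> Optional[str]:
    """Decode submitted form bytes as UTF-8, or return None if they are not
    valid UTF-8 (for example, text pasted in ISO-8859-1)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Plain ASCII input is valid UTF-8 as it stands.
print(decode_form_value(b"plato"))
# The UTF-8 encoding of an accented character decodes cleanly.
print(decode_form_value(b"caf\xc3\xa9"))
# The same word encoded in ISO-8859-1 is rejected.
print(decode_form_value(b"caf\xe9"))
```

Detection is not guaranteed: some byte sequences are valid in both encodings, so a successful decode does not prove that the user's intent was preserved.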

Current Limitations in DLXS Middleware

The middleware does not support collections with different character encodings in cross-collection mode. For coherent results, collections must all be of a single encoding in cross-collection mode, either all UTF-8 or all ISO-8859-1. If collections exist in both UTF-8 and ISO-8859-1, they will be treated as ISO-8859-1 in cross-collection mode (with predictably strange results). UTF-8 encoded Unicode collections should be handled solely in single collection mode under these circumstances.

One obvious reason for this is that without output conversion of disparate encodings to UTF-8 the browser will be forced to misinterpret some of the data. Only one charset at a time can apply to an HTML page. A similar issue applies with user input and is made even more complex by issues raised in the previous section.

DLXS will continue to explore the possibility of support for multiple encodings across collections. Ultimately the most desirable scenario is to convert all collection data to UTF-8 encoded Unicode.