Unicode in DLXS
This topic touches on XPAT indexing, data preparation and middleware configuration and behavior as they relate to Unicode in DLXS.
Documentation on this topic can also be found at:
http://docs.dlxs.org/class/unicode.html (Release 11a)
http://www.dlxs.org/docs/12/class/unicode.html (Release 12)
Unicode in General
Terminology around characters is fuzzy; the phrase "character set" in particular is ambiguous enough to be considered harmful. Some precise terms:
- Character Repertoire - collection of abstract characters independent of how they look when printed.
- Coded Character Set - assignment of a unique number to each character in a Character Repertoire.
- Code Point - the unique number assigned to a character in a Coded Character Set.
- ISO/IEC 10646 - The Coded Character Set for Unicode. The Unicode standard specifies additional properties for each character.
- Character Encoding Scheme - or Encoding for short, specifies how the number assigned to a character is stored in a file or in computer memory.
- UTF-8, UTF-16
- UTF-8 is a multi-byte, variable-length encoding: a byte is usually not a whole character.
- ASCII text is valid UTF-8 by design; code points 0-127 encode as single, identical bytes.
- The Basic Multilingual Plane holds 65,536 code points, divided into blocks of varying size by alphabet.
The ASCII encoding only supports 128 characters. The ISO-8859-* encodings each support 256 characters, but only one set of 256 characters at a time. Latin2 covers German and Polish; Latin5 covers German and Turkish; there is no single-byte encoding covering both German and Russian.
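The two points above can be seen directly in code. This illustrative sketch (not a DLXS tool) shows UTF-8's variable byte lengths and the one-alphabet-at-a-time limit of single-byte encodings:

```python
# Each character's UTF-8 byte length grows with its code point.
for ch in "A\u00e9\u044f\u4e2d":  # Latin, accented Latin, Cyrillic, Han
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")

# A single-byte encoding covers only one repertoire at a time:
print("Straße".encode("iso-8859-1"))   # German fits in Latin1
print("Москва".encode("iso-8859-5"))   # Russian fits in ISO-8859-5
try:
    "Straße Москва".encode("iso-8859-1")  # German + Russian together
except UnicodeEncodeError as err:
    print("no single-byte encoding covers both:", err.reason)
```

UTF-8 sidesteps the limit entirely: `"Straße Москва".encode("utf-8")` succeeds.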
- Use of .gif images of characters and <img src="Agrave.gif"> tags.
- Not font dependent.
- Not searchable, not scalable, slow, and a lot of work to generate.
- Character Entity References (CER), e.g. &Agrave; in SGML for LATIN CAPITAL LETTER A WITH GRAVE
- Font dependent.
- Not searchable. Not easily entered by the user. Not XML. Less support in browsers than for NCRs.
- Numeric Character References (NCR), e.g. &#xC0;, where 0xC0 is the Unicode Code Point for LATIN CAPITAL LETTER A WITH GRAVE
- Font dependent.
- Not searchable. Not easily entered by the user.
- Convert NCR and CER to ISO-8859-1 encoding.
- Font dependent.
- Searchable. Easily entered by the user via XPAT mapping functionality.
- Limited to one alphabet per document.
- See previous section.
- Can represent more than one alphabet in a single document or web page.
- Searchable. (xpatu)
- Programming is simpler.
- Latin characters can be easily entered by users via XPAT mapping functionality.
- Non-ASCII characters can be entered by users via national keyboards, virtual keyboards, IMEs, copy/paste.
- Can be collated.
- Fundamental to XML.
- Better font support than for character entity references.
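The CER and NCR forms above name the same code points; decoding them yields identical text that can then be stored and searched as UTF-8. A rough illustration using Python's `html.unescape`, whose HTML5 named-entity table overlaps the ISO Latin entity sets used in SGML (an analogue of, not a substitute for, the DLXS conversion tools):

```python
import html

# &Agrave; (CER) and &#xC0; (NCR) both name U+00C0.
cer = "&Agrave; bient&ocirc;t"
ncr = "&#xC0; bient&#xF4;t"
print(html.unescape(cer))  # À bientôt
print(html.unescape(ncr))  # À bientôt
print(html.unescape(cer) == html.unescape(ncr))  # True
```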
DLXS data preparation and Unicode
Tools are getting better and more plentiful every day.
- The Ã© (instead of é) problem: UTF-8 bytes displayed as if they were Latin1.
- Linux
- GNOME terminal
- xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
- Bitstream Cyberbit and MS Arial Unicode fonts
- Windows
- PuTTY with Hummingbird Exceed X Server version 8 or higher on Windows
- MS Arial Unicode
- XMLSpy
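The display problem in the first bullet is easy to reproduce: UTF-8 é is the two bytes 0xC3 0xA9, and a terminal or editor that misreads them as Latin1 shows two characters instead of one. A minimal demonstration:

```python
# UTF-8 "é" is two bytes; read back as Latin1 they become "Ã©".
utf8_bytes = "é".encode("utf-8")
print(utf8_bytes.hex())                  # c3a9
print(utf8_bytes.decode("iso-8859-1"))   # Ã© -- the classic mojibake
```

A Unicode-aware terminal and font (the tools listed above) avoid the misreading in the first place.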
The goal is to get your data into UTF-8 encoded XML. You need to know
how characters in your data have been encoded in order to transform to
another encoding.
- iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
- DLXSROOT/bin/t/text/ncr2utf8
- DLXSROOT/bin/t/text/isocer2utf8
- OpenSP osx
- XMLSpy
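The iconv and ncr2utf8 steps above can be combined in a few lines. This is an illustrative sketch, not one of the DLXS scripts: the function name and file handling are invented, and `html.unescape` covers HTML5 named entities rather than every SGML ISO entity set:

```python
import html

def latin1_ncr_to_utf8(infile, outfile):
    """Read Latin1 text, resolve &#xC0; / &Agrave; style references,
    write the result as UTF-8. (Hypothetical helper for illustration.)"""
    with open(infile, encoding="iso-8859-1") as f:
        text = f.read()
    text = html.unescape(text)  # NCRs and CERs -> real characters
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(text)
```

As with iconv, you must know the source encoding; guessing wrong produces mojibake rather than an error.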
Unicode XPAT indexing
More on this when we cover data preparation more fully. For now, this just highlights some differences in programs and processes.
- xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.
- Sample Makefile in DLXSROOT/bin/s/sampletc_utf8
- <?xml version="1.0" encoding="UTF-8"?> Important for xmlrgn
- Index point meta characters (e.g. &Greek. or &Cyrillic.) are based on the Unicode block definitions in the Perl Unicode library (e.g. lib/5.8.3/unicore/lib/Latin.pl), modified as described in the XPAT data dictionary document.
- Specify index points in the data dictionary (.dd file) based on the alphabets in your data.
<IndexPoints>
<IndexPt> &printable.</IndexPt>
<IndexPt>&printable.-</IndexPt>
<IndexPt>-&printable.</IndexPt>
<IndexPt>&printable.<.</IndexPt>
<IndexPt>&printable.&.</IndexPt>
<IndexPt> &Latin.</IndexPt>
<IndexPt>&Latin.-</IndexPt>
<IndexPt>-&Latin.</IndexPt>
<IndexPt>&Latin.<.</IndexPt>
<IndexPt> &Greek.</IndexPt>
<IndexPt>&Greek.-</IndexPt>
<IndexPt>-&Greek.</IndexPt>
<IndexPt>&Greek.<.</IndexPt>
</IndexPoints>
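To decide which alphabet-specific index points your data needs, it helps to see which scripts actually occur in it. A quick illustrative scan (not a DLXS tool) using the character names from the Unicode character database:

```python
import unicodedata
from collections import Counter

def scripts_in(text):
    """Count letters per script, keyed by the first word of the
    Unicode character name (e.g. LATIN, GREEK, CYRILLIC)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            counts[unicodedata.name(ch).split()[0]] += 1
    return counts

print(scripts_in("Αθήνα and Ann Arbor"))  # LATIN and GREEK letters counted
```

Data showing LATIN and GREEK here would call for the &Latin. and &Greek. index points above.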
- Specify character mappings in the data dictionary, also based on the characters that occur in your data. Note the U+XXXX notation; refer to the Unicode character database. This is mainly for case mapping in alphabets that have case.
...
<Map><From>U+0391</From><To>U+03B1</To></Map>
<Map><From>U+0392</From><To>U+03B2</To></Map>
<Map><From>U+0393</From><To>U+03B3</To></Map>
<Map><From>U+0394</From><To>U+03B4</To></Map>
<Map><From>U+0395</From><To>U+03B5</To></Map>
...
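The sample <Map> entries above are the standard Unicode case mappings for Greek: each capital (U+0391 GREEK CAPITAL LETTER ALPHA, and so on) maps to its lowercase form. This can be checked against the Unicode data:

```python
# Verify that the dd <Map> pairs match Unicode simple case mapping:
# U+0391..U+0395 (Greek capitals) lowercase to U+03B1..U+03B5.
for upper in range(0x0391, 0x0396):
    lower = ord(chr(upper).lower())
    print(f"U+{upper:04X} -> U+{lower:04X}")
```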
- Run Makefile
- OAIster is 100% UTF-8 encoded XML, indexed by xpatbldu and multirgn and searched using xpatu.
- Supports Latin, Greek, Cyrillic, Han, Hiragana, Katakana and Hangul characters.
- Highlighting based on .dd file character mappings.
- OAIster data dictionary
- The workshop example is 100% UTF-8 encoded XML containing French, indexed by xpatbldu and xmlrgn and searched using xpatu. It has a Unicode wordwheel.
Middleware configuration, requirements and behavior for Unicode
XPAT version 5.3.2
- 5.3 XPAT can read 5.2 indexes, i.e. 5.3 is backward compatible
- 5.2 XPAT cannot read 5.3 indexes
Perl 5.8.3 or higher is required.
Configuration and behavior
The middleware transcodes all user input that is not valid UTF-8 from Latin1 to UTF-8, on the assumption that such input is Latin1. This implies that non-ASCII searches will fail against Latin1-encoded collections. Unaccented searches still work because of the XPAT mapping in the data dictionary.
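The normalization just described can be sketched as a decode-with-fallback (an illustrative Python sketch of the behavior, not the Perl middleware code; the function name is invented):

```python
def normalize_input(raw: bytes) -> str:
    """Keep input that is already valid UTF-8; otherwise assume it
    is Latin1 and transcode. (Sketch of the described behavior.)"""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")  # assumption: non-UTF-8 input is Latin1

print(normalize_input("é".encode("utf-8")))       # é  (already UTF-8)
print(normalize_input("é".encode("iso-8859-1")))  # é  (0xE9 is invalid UTF-8)
```

The fallback is a heuristic: some Latin1 byte sequences happen to be valid UTF-8 and would be passed through unchanged.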
The collection manager (collmgr) locale field should be set to en_US.UTF-8. Any value not including "UTF-8" means the middleware will assume Latin1 encoding and will:
- use xpat instead of xpatu to read the index.
- transcode XPAT results from Latin1 encoding to UTF-8.
- change SGML-style singletons (e.g. <LB>) to XML-style singletons (e.g. <LB/>).
All XML templates have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> elements to ensure user input is UTF-8 and to tell the browser to use UTF-8 encoding for the page.
The middleware supports collections with different character encodings in cross-collection mode. This is possible because of the Latin1 -> UTF-8 transcoding of user input and the Latin1 -> UTF-8 transcoding of XPAT output.