Image Class Character Set Conversion

Last updated	2002-03-13 11:15:54 EST
Doc Title	Image Class Character Set Conversion
Author 1	Weise, John
CVS Revision	$Revision: 1.6 $

dlxs-help@umich.edu

This document explains how to convert illegal text characters to SGML-compliant characters. The conversion is closely tied to the database transformation process.

Introduction

The Image Class requires that all text characters comply with the ISO 8859 Latin-1 character set standard. Noncompliant characters can be ignored: validation/normalization will report errors, but the XPAT index will still build. The problem is that the illegal characters will not be searchable in any usable way.
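Before running validation, it can help to get a rough count of the bytes that are likely to be rejected. The following is only a sketch: the function name count_suspect_bytes is hypothetical, and the byte ranges it flags (control characters other than tab, line feed, and carriage return, plus 0x7F through 0x9F) are an assumption that is consistent with the examples in this document but is not confirmed by it. The validator remains the authority on what is illegal.

```perl
use strict;
use warnings;

# Count bytes that are probably not legal SGML characters.
# ASSUMPTION: the suspect ranges below are a guess based on the examples
# in this document, not taken from the actual SGML declaration.
sub count_suspect_bytes {
    my ($data) = @_;
    my %count;
    while ($data =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F])/g) {
        # Key each byte in the same \xHH form used in a charconv table.
        $count{ sprintf('\\x%02X', ord($1)) }++;
    }
    return \%count;
}

my $counts = count_suspect_bytes("Bartolom\x8E Esteban Murillo\n");
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

To check a real file, read it in binary mode (for example with binmode on the filehandle) so that no encoding layer alters the bytes before they are counted.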

The transformation program can optionally convert any character to any other character. This involves building a character conversion table in the form of a Perl module. Here is an example Perl module.

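    %gCharConv = (
        '\x88' => '\xE0',           # E0 is a with grave
        '\x89' => '\xE2',           # E2 is a with circumflex
        '\x8A' => '\xE4',           # E4 is a with diaeresis
        '\x87' => '\xE1',           # E1 is a with acute
        '\x80' => '\xC7',           # C7 is capital C with cedilla
        '\x9C' => 'o',              # replace with a plain letter
        '\xB2' => '<\x3D',          # two-character replacement "<=" (3D is "=")
        '\x10' => '',               # empty replacement
        '\x85' => '\x2E \x2E \x2E', # three periods separated by spaces (2E is ".")
        'eee'  => 'ee',             # plain strings are also allowed
    );

    1; #TRUTH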

In this example, most of the characters are represented by their hexadecimal (base 16) values. This is usually the easiest way to specify characters that may not be displayable in the text file that holds the Perl module. For instance, it would be difficult to type a Latin small letter e with circumflex directly into the module.
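How the transformation program applies the table is not shown in this document. As an illustration only, the sketch below shows one way such a table could be applied to a string, assuming each \xHH escape in a key or value stands for a single byte. The names %char_conv, expand_hex, and convert_chars are hypothetical and are not part of the Image Class software.

```perl
use strict;
use warnings;

# Hypothetical table written in the same style as the example above.
my %char_conv = (
    '\x88' => '\xE0',
    '\x10' => '',
    '\x85' => '\x2E \x2E \x2E',
    'eee'  => 'ee',
);

# Turn each \xHH escape in a single-quoted string into the byte it names.
sub expand_hex {
    my ($s) = @_;
    $s =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Apply every mapping in a single pass so replaced text is never re-scanned.
sub convert_chars {
    my ($text, $table) = @_;
    my %map = map { ( expand_hex($_) => expand_hex($table->{$_}) ) } keys %$table;
    return $text unless %map;
    # Try longer keys first so that a key like 'eee' wins over shorter ones.
    my $pattern = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b } keys %map;
    $text =~ s/($pattern)/$map{$1}/g;
    return $text;
}

print convert_chars("d\x88j\x85 freee", \%char_conv), "\n";
```

A single pass is used here so that the output of one mapping cannot be picked up by another mapping. Whether the real program behaves that way is an assumption.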

The character conversion Perl module is simply a text file that must reside in the collection's prep directory. It must be named using the following convention: collid-charconv

Our favorite example, the French Architecture collection, does not actually have a character conversion table, but if it did, the path and name would be this...

$DLXSROOT/prep/f/frarch/frarch-charconv
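The path can be built from the collection ID. The sketch below assumes that the single-letter directory is always the first letter of the collid, which matches the frarch example but is otherwise an inference. The function name charconv_path is hypothetical.

```perl
use strict;
use warnings;

# Build the expected location of a collection's charconv file.
# ASSUMPTION: the one-letter directory is the first letter of the collid.
sub charconv_path {
    my ($dlxs_root, $collid) = @_;
    my $first_letter = substr($collid, 0, 1);
    return join '/', $dlxs_root, 'prep', $first_letter, $collid,
        "${collid}-charconv";
}

# The root used here is a placeholder; $ENV{DLXSROOT} could be passed instead.
print charconv_path('/path/to/dlxs', 'frarch'), "\n";
```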

Determining Character Mappings

This section explains how to determine the hexadecimal character mappings that go into a charconv file. Illegal characters most often become evident during the validation/normalization process, that is, when the "idb norm collid" command is issued. We'll use the "musart" collection as an example.

    dev:musart % 78 idb norm musart
    Normalizing musart SGML.
    $DLXSROOT/bin/i/image/sgmlnorm:/l1/prep/m/musart/ic.musart.unnorm.sgm:
    111:50:E: non SGML character number 142

In the command line example above, the normalization process reports an illegal character in the ic.musart.unnorm.sgm file. It gives the exact location of the character within the file (line 111, character 50) and its decimal value (142).
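When there are many such messages, the three numbers can be pulled out with a regular expression. This is a sketch based on the single sample message shown above; the function name parse_norm_error is hypothetical, and other message layouts may exist.

```perl
use strict;
use warnings;

# Extract line, character position, and decimal value from one message.
# The message layout is inferred from the one sample in this document.
sub parse_norm_error {
    my ($message) = @_;
    if ($message =~ /(\d+):(\d+):E: non SGML character number (\d+)/) {
        return { line => $1, column => $2, decimal => $3 };
    }
    return;    # not a "non SGML character" message
}

my $err = parse_norm_error('111:50:E: non SGML character number 142');
printf "line %d, character %d, decimal %d\n",
    @{$err}{qw(line column decimal)};
```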

If you convert decimal 142 to hexadecimal, the value is 8E. So now we know that we must convert the illegal character with hexadecimal value 8E to a legal SGML character. The question is, what should it be converted to? To determine this, it is necessary to look at the context of the illegal character.
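The decimal-to-hexadecimal conversion can be done with Perl's sprintf. In this short sketch, the helper names dec_to_hex and charconv_key are hypothetical.

```perl
use strict;
use warnings;

# Format a decimal character value as two uppercase hexadecimal digits.
sub dec_to_hex {
    my ($decimal) = @_;
    return sprintf('%02X', $decimal);
}

# Build the key as it would be written in a charconv table, e.g. \x8E.
sub charconv_key {
    my ($decimal) = @_;
    return '\x' . dec_to_hex($decimal);
}

print dec_to_hex(142), "\n";      # prints 8E
print charconv_key(142), "\n";    # prints \x8E
```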

Continuing with this example, we would...

1. Use a text editor to go to line 111, character 50 of the ic.musart.unnorm.sgm file.
2. Look at the word or words the character falls within, which in this case is "Bartolom\216 Esteban Murillo". (Note that the Emacs text editor used in this example displays illegal characters in octal, not in decimal or hexadecimal. That is why it shows "\216".)
3. Determine what the appropriate character should be. In this case it should be a Latin small letter e with acute.
4. Look at a table of ISO 8859 Latin 1 characters to see whether Latin small letter e with acute exists. Since it does, note its hexadecimal value, which is E9.
5. Create or modify the musart-charconv file so that it performs the character conversion '\x8E' => '\xE9'. Important: the charconv file must have executable permissions set (e.g., chmod +x musart-charconv).
```
    %gCharConv = (
        '\x8E' => '\xE9',
    );

    1; #TRUTH
```
6. Execute the "idb transform" and "idb norm" programs again for the musart collection. All occurrences of the illegal character are converted to the legal character, and there should no longer be any validation/normalization errors, for that character at least.
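The octal notation shown by the editor can be converted to the hexadecimal form a charconv file needs, and the surrounding text can be printed with unusual bytes already in \xHH form. This is a sketch; the helper names octal_to_hex and show_bytes are hypothetical.

```perl
use strict;
use warnings;

# Convert an octal value as displayed by the editor (e.g. 216) to hex.
sub octal_to_hex {
    my ($octal) = @_;
    return sprintf('%02X', oct($octal));
}

# Show a string with every byte outside printable ASCII written as \xHH.
sub show_bytes {
    my ($s) = @_;
    $s =~ s{([^\x20-\x7E])}{ sprintf('\\x%02X', ord($1)) }ge;
    return $s;
}

print octal_to_hex('216'), "\n";                      # prints 8E
print show_bytes("Bartolom\x8E Esteban Murillo"), "\n";
```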

What If the Desired Character Is Not Included in the ISO 8859 Latin 1 Standard Character Set?

If the desired character is not in the ISO Latin 1 character set, there are a few options.

Convert the character to a similar character, for example the same character without its diacritic.
Use character entities in the SGML. Please contact dlxs-help@umich.edu for more information.

How Should Special Characters Be Represented in the Web Search Form?

Most of the ISO 8859 Latin 1 characters that are not readily available on keyboards may be searched by entering a simpler, unaccented equivalent. For example, e with acute may be searched by entering a plain "e". Entering an actual e with acute also works. This mapping is handled in the XPAT index configuration, which is beyond the scope of this document. Please contact dlxs-help@umich.edu for more information.
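The idea of matching an accented character by its plain equivalent can be illustrated in Perl. This sketch is not how XPAT implements the mapping, which is not described in this document. The function name fold_latin1 is hypothetical, and it covers only the lowercase accented vowels.

```perl
use strict;
use warnings;

# Fold lowercase accented Latin 1 vowels to their plain ASCII letters.
# This only illustrates the concept; it is not the XPAT configuration.
sub fold_latin1 {
    my ($s) = @_;
    $s =~ tr/\xE0-\xE5/a/;    # a with grave, acute, circumflex, tilde, diaeresis, ring
    $s =~ tr/\xE8-\xEB/e/;    # e with grave, acute, circumflex, diaeresis
    $s =~ tr/\xEC-\xEF/i/;    # i with grave, acute, circumflex, diaeresis
    $s =~ tr/\xF2-\xF6/o/;    # o with grave, acute, circumflex, tilde, diaeresis
    $s =~ tr/\xF9-\xFC/u/;    # u with grave, acute, circumflex, diaeresis
    return $s;
}

# A search for "Bartolome" would match the stored "Bartolom\xE9".
print fold_latin1("Bartolom\xE9"), "\n";    # prints Bartolome
```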

Handling of Special Characters by FileMaker Pro 5

When HTML is exported from FileMaker Pro 5, FileMaker normalizes special characters to ISO 8859 Latin 1 in the form of decimal character entities (e.g., &#200; for È). The idb transformation program converts these to legal Latin 1 characters in the SGML. This does not, however, completely eliminate the need to do conversions.
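The conversion of decimal character entities to Latin 1 characters can be sketched as follows. This is not the idb program's code, which is not shown in this document. The function name decode_decimal_entities is hypothetical, and the choice to leave values above 255 untouched is an assumption.

```perl
use strict;
use warnings;

# Replace decimal character entities that fit in one byte (0-255) with the
# corresponding character; leave larger values as they are.
# ASSUMPTION: this only illustrates the idea and is not idb's behavior.
sub decode_decimal_entities {
    my ($text) = @_;
    $text =~ s{(&#(\d+);)}{ $2 <= 255 ? chr($2) : $1 }ge;
    return $text;
}

# &#233; is e with acute, which is byte E9 in Latin 1.
my $decoded = decode_decimal_entities('Bartolom&#233;');
printf "%vX\n", $decoded;    # prints the character values in hex, dot-separated
```

Entities whose values fall outside Latin 1 are left in place here, which is one reason a charconv table may still be needed.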