Data Conversion to Unicode/XML

Workshop Day 2 -- Tuesday Morning

Data Conversion: Unicode, XML, and Normalization

Data Conversion: Unicode, XML, and Normalization

To make the most of Finding Aids Class and Text Class in DLXS Release 14, you will want to convert or otherwise handle the character entities, numeric entities, or Latin 1 8-bit characters that have been the staples of SGML (and XML, despite the default encoding of UTF-8) for so long. Even with finding aids that are already in EAD 2002 XML, you will need to do some testing of character encodings, conversion of these encodings (if they exist) to UTF-8, normalization, and conversion of SGML to XML (strange but true).

Determining the Character Encodings Present in Your Data

There are a number of possibilities you may encounter:

Plain ASCII (aka the Basic Latin block)
Character entity references (ISO and otherwise)
Numeric character references (decimal and/or hexadecimal)
Latin 1 characters
UTF-8 characters

You may very well find a mixture of 1, 2, 3, and 4 or even 2, 3, and 5 in the wild, simply because many encoders are not clear on what they should be doing with special characters and some tools they use to create and examine their files are not as careful with characters as you would hope (Windows smart quotes are a recurrent problem). One hopes you will not encounter a document with a mixture of Latin 1 and UTF-8 characters, although it's possible that misidentified files could end up concatenated together and create such a mess.

There are a number of tools you can use to identify what you have before you.

findentities.pl

A perl script written by Phil that is part of the DLXS package, it prints the names and frequencies of the entities (CERs and NCRs) it encounters. Fairly quick, regardless of the size of the file. Can be run on more than one file at once, which is handy if you have a batch of texts.

xpatutf8check

Another perl script written by Phil, it exists to answer the question, "Will xpatu index this?" It will report the line number of the first non-UTF character it encounters when it has failed and it runs very quickly, so it's great as a first step in checking your material, but it is not authoritative enough to identify all of the problems you may have.

jhove

The JSTOR/Harvard Object Validation Environment has a UTF-8 module that reports whether your document is or is not valid UTF-8, and which Unicode blocks are contained in the document. Can be slow checking large documents, but very informative. Available at http://hul.harvard.edu/jhove/ and invoked with

jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul file.xml

utf8chars

Yet another perl script written by Phil, it identifies the characters used in a document and the Unicode blocks to which they belong. It assumes your document is UTF-8 and will report each instance (by line number) where a non-UTF character is encountered. Because it is identifying and counting each character in a document, it is rather slow, but very very useful. Runs on one file at a time and prints to standard out, but can be invoked through a foreach to check many files in one command.

Converting Those Character Encodings to UTF-8

If you have a mixed bag of encodings and entities in your documents, there's a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin 1 and UTF-8 in one document at any point in the transformation.

First, if you have Latin 1 characters like â, run iconv, part of the Gnu C library, to convert files from one encoding to another.
```
iconv -f iso88591 -t utf8 oldfile > newfile
```
Next, convert character entity references like â using isocer2utf8, a perl script written by Phil to convert character entity references to UTF-8 characters. Although it references ISO in the name, it's been expanded to handle all the CERs we've encountered, including TEI Greek and the Chadwyck-Healey custom entities.
```
/l1/bin/t/text/isocer2utf8 oldfile > newfile
```
Finally, if you have numeric character references like â or â, run ncr2utf8, also written by Phil, to convert decimal and hexadecimal entities to UTF-8 characters.
```
/l1/bin/t/text/ncr2utf8 oldfile > newfile
```

This would be a good point to run findentities.pl again to see what (if anything) you have left, and to re-validate using jhove or utf8chars to ensure that you have done no harm.

Test Driving the Tools

In the directory /l1/workshop-samples/sooty, you will find four sample files that we'll examine for character encoding and then convert to UTF-8: findaid1.xml, findaid2.xml, text1.xml, and text2.sgm. Copy these to your own directory -- they are completely expendable and won't serve a purpose in tomorrow's Text Class implementation. They are merely illustrative of all the possibilities you might encounter and how you may want to handle them.

First, we'll look at which character or numeric entities, if any, are used in these documents.

foreach file (findaid*)
echo $file 
$DLXSROOT/bin/t/text/findEntities.pl $file 
end

foreach file (text*)
echo $file 
$DLXSROOT/bin/t/text/findEntities.pl $file 
end

We have some CERs and NCRs to deal with, aside from the five XML-approved entities (&, >, <, ', and "). So, we know we'll be needing both isocer2utf and ncr2utf. Next, we'll see what characters we have (Latin 1? UTF-8? something else?). We'll run through all three tools, just for the sake of completeness, in the order of speediness and terseness.

foreach file (findaid*)
echo $file 
xpatutf8check $file 
end

foreach file (text*)
echo $file 
xpatutf8check  $file 
end

We now know that both the text files are either UTF-8 or plain ASCII (because of the output of these two tests), but there's a problem with one of the finding aids. jhove will tell us a bit more about our materials. You'll note we don't need to echo the filename as that's part of the jhove report. You'll also notice jhove is not so fast.

foreach file (findaid*)
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul  $file 
end

foreach file (text*)
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul  $file 
end

So, the second file in each set is plain ASCII (the Basic Latin block) with entities, the first finding aid is not UTF-8, and the first text file is. Let's look a bit more at the two non-ASCII files with the slowest and most verbose tool of them all. We're not doing a foreach this time, but we wouldn't need to echo the filename either, as it is again part of what the tool reports.

utf8chars findaid1.xml

utf8chars text1.xml

We can see the exact problem with findaid1.xml -- there's an 8-bit Latin 1 e acute before Boekeloo on line 37. We also can see all the UTF-8 characters in text1.xml -- this is the sort of information that is useful when time comes to map characters and encodings in the xpatu data dictionary.

Now that we know which items need what character treatments, we'll convert them. text1.xml is completely fine, so we'll leave it as is. findaid1.xml has the one Latin 1 character, so we'll use iconv to convert it to UTF-8. It had no entities of any kind, so we'll be done with it after this step.

iconv -f iso88591 -t utf8 findaid1.xml > findaid1.xml.utf

Next, findaid2.xml had numeric character references. It is fine and can be indexed as-is, but users would need to search for the hexadecimal string in the midst of words ( é for é, for example). So, we'll use ncr2utf to convert the entities into the characters. WARNING! & is the ampersand (as is &) -- if you convert these to the character, you will run into validation problems down the road, as bare ampersands are not permitted in XML. Don't get carried away!

ncr2utf8 findaid2.xml > findaid2.xml.utf

Finally, text2.sgm has ISO character entity references (from Latin 1, Greek, and Publishing) that need to be converted to UTF-8 with isocer2utf.

isocer2utf8 text2.sgm > text2.sgm.utf

Note that the ampersand CER was not processed. This is perfectly correct.

Normalization and Converting SGML to XML

Some of you may be in a position where you'll want to be converting your SGML files to XML. Many of you will be fortunate enough to have files already in XML -- say, finding aids in EAD 2002. However, these will have to be normalized, too, to avoid problems with xpatu and xmlrgn down the road by ensuring that all the attributes are in the same order as specified in the DTD. Because of known but uncorrected problems in the normalization tools, you will end up with SGML and will need to convert that to XML.

Because the file we want to work with is now UTF-8, we need to set some environment variables for the tools from the sp package to let them know this is UTF-8. It doesn't matter that you've set your puTTy window to UTF-8, if you are using osx, osgmlnorm, or onsgmls, you must set your environment properly.

setenv SP_CHARSET_FIXED YES

setenv SP_ENCODING utf-8

First we normalize, invoking a declaration to handle the non-SGML UTF-8 characters without claiming that the material itself is XML.

osgmlnorm $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp text2.sgm.utf > text2.sgm.norm

Now I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and it's fine. Now to convert our SGML to XML using osx.

osx -x no-nl-in-tag -x empty -E 500 -f errors $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp  text2.sgm.norm > text2.xml

Again I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and again it's fine.

Just for fun, we'll normalize the files already in XML, just to show that things get changes from XML to SGML against their will.

osgmlnorm $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp   text1.xml > text1.xml.norm

osx -x no-nl-in-tag -x empty -E 5000 -f error $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp  text1.xml.norm > text1.xml.norm.xml