Workshop Day 2 -- Tuesday Afternoon

Data Conversion to Unicode and XML

Although we've always stated that data conversion is not something we can officially support through DLXS, it's clearly a complex part of the process, and one where we have a great deal more experience than most, so it's only fair that we cover it, especially at this point. DLXS Release 12 requires that you convert or otherwise handle the character entities, numeric entities, or Latin 1 8-bit characters that have been the staples of SGML (and XML, despite its default encoding of UTF-8) for so long.


Determining the Character Encodings Present in Your Data

There are a number of possibilities you may encounter:

  1. Plain ASCII (aka the Basic Latin block)
  2. Character entity references (ISO and otherwise)
  3. Numeric character references (decimal and/or hexadecimal)
  4. Latin 1 characters
  5. UTF-8 characters

You may very well find a mixture of 1, 2, 3, and 4 -- or even 2, 3, and 5 -- in the wild, simply because many encoders are not clear on what they should be doing with special characters. One hopes you will not encounter a document with a mixture of Latin 1 and UTF-8 characters, although it's possible that misidentified files could end up concatenated together and create such a mess.
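As a made-up illustration, a single line mixing plain ASCII (1) with a character entity reference (2) and both numeric reference forms (3) might look like this:

The caf&eacute; reopened &#8212; or so we&#x2019;re told.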

There are a number of tools you can use to identify what you have before you.

findEntities.pl
A perl script written by Phil, it prints the names and frequencies of the entities (CERs and NCRs) it encounters. Fairly quick, regardless of the size of the file. Can be run on more than one file at once, which is handy if you have a batch of texts.
xpatutf8check
Another perl script written by Phil, it exists to answer the question, "Will xpatu index this?" It reports the line number of the first non-UTF-8 character it encounters when a file fails, and it runs very quickly, so it's great as a first step in checking your material, but it is not authoritative enough to identify all of the problems you may have.
jhove
The JSTOR/Harvard Object Validation Environment has a UTF-8 module that reports whether your document is or is not valid UTF-8, and which Unicode blocks are contained in the document. Can be slow checking large documents, but very informative. Available at http://hul.harvard.edu/jhove/ and invoked with
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul file.xml
utf8chars
Yet another perl script written by Phil, it identifies the characters used in a document and the Unicode blocks to which they belong. It assumes your document is UTF-8 and will report each instance (by line number) where a non-UTF-8 character is encountered. Because it identifies and counts each character in a document, it is rather slow, but very useful. It runs on one file at a time and prints to standard out, but can be invoked through a foreach to check many files in one command, as shown below.
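For instance, to sweep a directory of files in one pass (a sketch following the foreach pattern used in the exercises below; the glob is illustrative):

foreach file (*.xml)
utf8chars $file
end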


Converting Those Character Encodings to UTF-8

If you have a mixed bag of encodings and entities in your documents, there's a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin 1 and UTF-8 in one document at any point in the transformation.

  1. First, if you have Latin 1 characters like é, run iconv, part of the GNU C library, to convert files from one encoding to another.
    iconv -f iso88591 -t utf8 oldfile > newfile
  2. Next, convert character entity references like &acirc; using isocer2utf8, a perl script written by Phil to convert character entity references to UTF-8 characters. Although it references ISO in the name, it's been expanded to handle all the CERs we've encountered, including TEI Greek and the Chadwyck-Healey custom entities.
    /l1/bin/t/text/isocer2utf8 oldfile > newfile
  3. Finally, if you have numeric character references like &#226; or &#xE2;, run ncr2utf8, also written by Phil, to convert decimal and hexadecimal entities to UTF-8 characters. (A batch version of all three steps is sketched after this list.)
    /l1/bin/t/text/ncr2utf8 oldfile > newfile
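If a batch of files needs all three treatments, the steps can be chained in one csh loop. This is a sketch, with illustrative filenames and the tool paths from above; the intermediate files keep Latin 1 and UTF-8 from ever mixing in one document:

foreach file (*.sgm)
iconv -f iso88591 -t utf8 $file > ${file}.step1
/l1/bin/t/text/isocer2utf8 ${file}.step1 > ${file}.step2
/l1/bin/t/text/ncr2utf8 ${file}.step2 > ${file}.utf
rm ${file}.step1 ${file}.step2
end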

This would be a good point to run findEntities.pl again to see what (if anything) you have left, and to re-validate using jhove or utf8chars to ensure that you have done no harm.


Test Driving the Tools

In the directory /l1/workshop-samples/sooty, you will find four sample files that we'll examine for character encoding and then convert to UTF-8. Copy these to your own directory -- they are completely expendable and won't serve a purpose in tomorrow's Text Class implementation. They are merely illustrative of all the possibilities you might encounter and how you may want to handle them.

First, we'll look at which character or numeric entities, if any, are used in these documents.

foreach file (findaid*)
echo $file
$DLXSROOT/bin/t/text/findEntities.pl $file
end
foreach file (text*)
echo $file
$DLXSROOT/bin/t/text/findEntities.pl $file
end

Since most of you are set up for bash, here are the same commands in that shell:

for file in findaid*
do
echo $file
$DLXSROOT/bin/t/text/findEntities.pl $file
done
for file in text*
do
echo $file
$DLXSROOT/bin/t/text/findEntities.pl $file
done

We have some CERs and NCRs to deal with, aside from the five XML-approved entities (&amp;, &gt;, &lt;, &apos;, and &quot;). So, we know we'll be needing both isocer2utf8 and ncr2utf8. Next, we'll see what characters we have (Latin 1? UTF-8? something else?). We'll run through all three tools, just for the sake of completeness, from the fastest and tersest to the slowest and most verbose.

foreach file (findaid*)
echo $file
xpatutf8check $file
end
foreach file (text*)
echo $file
xpatutf8check $file
end

Since most of you are set up for bash, here are the same commands in that shell:

for file in findaid*
do
echo $file
xpatutf8check $file
done
for file in text*
do
echo $file
xpatutf8check $file
done

We now know that both the text files are either UTF-8 or plain ASCII (because of the output of these two tests), but there's a problem with one of the finding aids. jhove will tell us a bit more about our materials. You'll note we don't need to echo the filename as that's part of the jhove report. You'll also notice jhove is not so fast.

foreach file (findaid*)
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
end
foreach file (text*)
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
end

Since most of you are set up for bash, here are the same commands in that shell:

for file in findaid*
do
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
done
for file in text*
do
jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
done

So, the second file in each set is plain ASCII (the Basic Latin block) with entities, the first finding aid is not UTF-8, and the first text file is. Let's look a bit more at the two non-ASCII files with the slowest and most verbose tool of them all. We're not doing a foreach this time, but we wouldn't need to echo the filename either, as it is again part of what the tool reports.

utf8chars findaid1.xml
utf8chars text1.xml

We can see the exact problem with findaid1.xml -- there's an 8-bit Latin 1 e acute before Boekeloo on line 37. We also can see all the UTF-8 characters in text1.xml -- this is the sort of information that is useful when the time comes to map characters and encodings in the xpatu data dictionary.

Now that we know which items need what character treatments, we'll convert them. text1.xml is completely fine, so we'll leave it as is. findaid1.xml has the one Latin 1 character, so we'll use iconv to convert it to UTF-8. It had no entities of any kind, so we'll be done with it after this step.

iconv -f iso88591 -t utf8 findaid1.xml > findaid1.xml.utf
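To double-check, running utf8chars on the converted file should now show the e acute as a proper UTF-8 character (in the Latin-1 Supplement block) rather than an error:

utf8chars findaid1.xml.utf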

Next, findaid2.xml had numeric character references. It is fine and can be indexed as-is, but users would need to search for the hexadecimal string in the midst of words (&#xE9; for é, for example). So, we'll use ncr2utf8 to convert the entities into the characters. WARNING! &#38; is the ampersand (as is &#x26;) -- if you convert these to the character, you will run into validation problems down the road, as bare ampersands are not permitted in XML. Don't get carried away!

ncr2utf8 findaid2.xml > findaid2.xml.utf
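A rough extra check (plain grep, not a DLXS tool): look for any ampersand that does not begin an entity reference, since a bare ampersand will fail validation later. The pattern is approximate -- it misses an ampersand at the very end of a line:

grep -n '&[^A-Za-z#]' findaid2.xml.utf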

Finally, text2.sgm has ISO character entity references (from Latin 1, Greek, and Publishing) that need to be converted to UTF-8 with isocer2utf8.

isocer2utf8 text2.sgm > text2.sgm.utf

Note that the ampersand CER was not processed. This is perfectly correct.
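As recommended above, this is a good moment to re-run findEntities.pl over both converted files (it takes several files at once); only the five XML-approved entities should remain:

$DLXSROOT/bin/t/text/findEntities.pl findaid2.xml.utf text2.sgm.utf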


Converting SGML to XML and Normalization

Many of you may be in a position where you'll want to convert your SGML files to XML. Many of you will be fortunate enough to have files already in XML -- say, finding aids in EAD 2002. However, these will have to be normalized too, to avoid problems with xpatu and xmlrgn down the road by ensuring that all the attributes are in the order specified in the DTD.

When dealing with Text Class material in SGML that needs to be converted to XML, the first thing I do is normalize and then convert. Why? Then I don't have to normalize the XML I create. Because the file we want to work with is now UTF-8, we need to set some environment variables to let the tools from the sp package know this is UTF-8. It doesn't matter that you've set your PuTTY window to UTF-8: if you are using osx, osgmlnorm, or onsgmls, you must set your environment properly.

setenv SP_CHARSET_FIXED YES
setenv SP_ENCODING utf-8

For those of you in bash, it's

export SP_CHARSET_FIXED=YES
export SP_ENCODING=utf-8

Then we normalize, invoking a declaration to handle the non-SGML UTF-8 characters without claiming that the material itself is XML.

osgmlnorm $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp text2.sgm.utf > text2.sgm.norm

Now I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and it's fine. Concretely, that check might look like this:
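xpatutf8check text2.sgm.norm
$DLXSROOT/bin/t/text/findEntities.pl text2.sgm.norm

Now to convert our SGML to XML using osx.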

osx -x no-nl-in-tag -x empty -E 500 -f errors sample.inp text2.sgm.norm > text2.xml

Note that we don't need to invoke a declaration of any type or declare an encoding. That's because it's UTF-8. Again I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and again it's fine:
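xpatutf8check text2.xml
$DLXSROOT/bin/t/text/findEntities.pl text2.xml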


Using Unconverted Collections with Release 12 Middleware

Recognizing that there may be situations when you do not wish to migrate a collection to XML and UTF-8 immediately, Release 12 includes mechanisms that allow the middleware to handle SGML-style empty elements (aka singletons), Latin 1 characters, and character entity references. How do you make this happen? In DlpsUtils.pm, there is a subroutine called Sgml2XmlFilter that has a hard-coded list of empty elements (<PB>, <LB>, <CAESURA>, etc.) that are converted upon discovery to XML-style (<PB/>, <LB/>, <CAESURA/>, etc.), and a feature that converts Latin 1 (ISO-8859-1) characters to UTF-8. This subroutine comes into play if the locale field in collmgr is not set to en_US.UTF-8 (locale used to be optional but is now required if you are using UTF-8 and xpatu).

In order to declare your entities, you need to put a file called entitiesdoctype.chnk in the web directory for your collection, declaring the entities like so:

<!DOCTYPE Top [
<!ENTITY Dstrok   "&#x0110;">
<!ENTITY Sacute   "&#x015A;">
<!ENTITY Scaron   "&#352;">
<!ENTITY Ubreve   "&#x016C;">
<!ENTITY Zdot     "&#x017B;">
]>
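One quick way to discover which entities a collection needs declared is to run findEntities.pl (described above) across the collection's source files and declare whatever it reports -- e.g. (filenames illustrative):

$DLXSROOT/bin/t/text/findEntities.pl *.sgm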

That being the case, why would anyone ever bother to go through the trouble of converting their material? First off, the value of having UTF-8 is apparent if you have material that used more than one entity set (and even the lowliest collections have both an e acute and an em-dash in them somewhere). Now that &mdash; is one character that can be mapped to a space in the data dictionary like other punctuation, phrases that were obscured in searches now turn up, and characters that we used to flatten (transforming ā to a, for example) can be displayed. Second, this facility comes at a cost: all of the material returned needs to be run through this filter, which will take some time. In a results list, the lag is negligible, but in larger sections of text, it could be noticeable. Finally, some confusion might arise when a user cuts and pastes material he received as a result and cannot retrieve it again, because the results and input are UTF-8 (which is the encoding of the search form) but the material being searched is not.