Converting collections to Text Class

Overview:

The task is to conform an SGML-encoded collection to the Text Class DTD while preserving and enhancing the structure of the source files as much as possible.

Step 1	Convert non-XML-compliant character entities using `sgml2xml.pl`
Step 2	Convert source files to XML using `sx`
Step 3a	Generate a quick-and-dirty overview of element use in source files using `xpatgen.pl`
Step 3b	Get a summary of elements and corresponding attributes in vendor DTD using `dtdparse.pl`
Step 4	Develop XSLT script for conforming source files to Text Class DTD
Step 5	Transform source files
Step 6a	Convert transformed files back to SGML using `xml2sgml.pl`
Step 6b	Validate transformed files against Text Class DTD using `nsgmls`
External link	Index collection
Step 6c	Check integrity of transformed files: examine element counts using `xpatgen.pl`

Typographic conventions:

Files, commands, and URLs appear in this bold fixed width font.
Generic arguments and variables are in BOLD FIXED WIDTH ALL CAPS ITALICS.

When batch-processing source files, copy them after each step (placing them in a well-named directory) so that expensive steps don't have to be re-executed to recover an earlier state. Example directory names: 0raw, 1sgml2xml, 2sx, 3xslt, 4xml2sgml. An overview of helpful tools for this conversion is available here.

1. Convert non-XML-compliant character entities using sgml2xml.pl

Usage:

perl sgml2xml.pl SOURCEFILE

E.g., for the first source file in the collection Early English Prose Fiction (EEPF)

perl sgml2xml.pl eepf01.pgp

The script modifies the source file in place, so its output overwrites its input. It is therefore advisable to keep the original source files untouched: copy them into a directory for processing (e.g., one called sgml2xml) and run the script on the copies. That way you still have the source files in their original state.

What's going on here?
XML allows only limited use of character entities in the form &entityname;. An example of a non-compliant entity is &hyphen;. The script sgml2xml.pl handles such entities by converting the initial ampersand to &amp;. For example:

&mdash; becomes &amp;mdash;
&amp; becomes &amp;amp;
This preserves character entity information without using non-XML-compliant entities. Later in this conversion process, character entities will be recovered by globally converting &amp; back to the & character.
See section 2.4 of the XML specification for more information on the use of character entities in XML.
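
The escaping idea can be sketched in a few lines of Python. This is only an illustration, not the actual sgml2xml.pl: the function names are hypothetical, and the regular expression assumes an entity reference is an ampersand, a name, and a semicolon. Numeric character references such as &#233; are legal XML and are left alone in this sketch.

```python
import re

# Assumption: a general entity reference is "&", a name, then ";".
# Numeric character references (&#233;) are deliberately not matched.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9._-]*;)")


def escape_entity_refs(text: str) -> str:
    """Rewrite &name; as &amp;name; so an XML parser sees plain text."""
    return ENTITY_REF.sub(r"&amp;\1", text)


def restore_entity_refs(text: str) -> str:
    """Undo the escaping by converting every &amp; back to &."""
    return text.replace("&amp;", "&")


sample = "Tom&mdash;Jones &amp; Co. caf&#233;"
escaped = escape_entity_refs(sample)
print(escaped)
print(restore_entity_refs(escaped) == sample)  # the round trip is lossless
```

Because the restore step converts every escaped ampersand, it is only safe to apply once to a given file.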

2. Convert source files to XML using sx

Usage:

sx -x XML_OUTPUT_OPTION -b ENCODING -E MAXERRORS -f ERROR_OUTPUTFILE doctype SOURCEFILE > OUTPUTFILE.XML

...where doctype is a file in the form:

<!DOCTYPE TOP_LEVEL_TAGNAME SYSTEM "ABSOLUTE_DTD_PATH"[
<!ENTITY % entrefs SYSTEM "ABSOLUTE_CHARENTS_PATH">
%entrefs;
]>

The top-level tag name should be the top level tag of the source file.
E.g.:

sx -x no-nl-in-tag -x empty -b iso-8859-1 -E 500 -f eepf01.errs doctype eepf01.pgp > eepf01.xml

...where the file doctype contains:

<!DOCTYPE EEPFGRP SYSTEM "/dlxs/prep/e/eepf/conversion/eepf.dtd"[
<!ENTITY % entrefs SYSTEM "/dlxs/prep/e/eepf/charents.frag">
%entrefs;
]>
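
If a collection needs more than one doctype file (for example, because its source files have different top-level tags), the wrapper can be generated rather than typed by hand. The following Python sketch simply fills in the template shown above; the function name is hypothetical.

```python
def doctype_wrapper(top_level_tag: str, dtd_path: str, charents_path: str) -> str:
    """Return the text of a doctype file in the form shown above."""
    return (
        f'<!DOCTYPE {top_level_tag} SYSTEM "{dtd_path}"[\n'
        f'<!ENTITY % entrefs SYSTEM "{charents_path}">\n'
        "%entrefs;\n"
        "]>\n"
    )


# Reproduces the EEPF example; redirect the output to a file named doctype.
print(doctype_wrapper(
    "EEPFGRP",
    "/dlxs/prep/e/eepf/conversion/eepf.dtd",
    "/dlxs/prep/e/eepf/charents.frag",
), end="")
```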

More information about sx is available here at James Clark's site.
More information on charents.frag files is here.

What's going on here?

An XML document should begin with a declaration such as <?xml version="1.0"?>. Sx inserts this declaration.

XML does not permit SGML-style empty tags. Sx converts empty tags such as <BR> to <BR/>.

Sx converts lowercase tags to uppercase, eliminating any problems due to XML's case-sensitivity.

3a. Generate a quick-and-dirty overview of element use in the collection using xpatgen.pl

This step generates a list of each element in the DTD and the number of times it appears in the collection. This can reveal that an element is not used at all in a collection, meaning that the XSLT script (see step 4) does not have to handle it. This step is also required in order to compare, in step 6c, the counts of elements in the collection before and after XSLT transformation. The collection must be Xpat-searchable for this step.

xpatgen.pl requires installation of the Perl module SGML::DTD by Earl Hood. See http://www.nacs.uci.edu/indiv/ehood/perlSGML/doc/html/SGML..DTD.html for more information on the module.

Create a file of queries based on the DTD by running the script xpatgen.pl

Usage:

perl xpatgen.pl FILENAME.DTD > XPATQUERYFILENAME.XPT

The file will contain a query for each element in the form:

region ELEMENT1
region ELEMENT2...

For example:

perl xpatgen.pl eepf.dtd > eepf.xpt

Run the Xpat queries

xpat DATA_DICTIONARY_PATH.DD < XPATQUERYFILENAME.XPT > RESULTS.XPT

For example:

xpat eepf.dd < /dlxs/prep/e/eepf/conversion/eepf.xpt > eepf.xpt.out

This runs the Xpat queries in eepf.xpt against the data dictionary for EEPF and saves the results in eepf.xpt.out. The output will look something like this:

>> 1: 100 matches

>> 2: 3 matches

>> 3: 10 matches

...

If an element is not used in a collection, Xpat will return, for example:

>> No information for region COPYR in the data dictionary.

Eliminate blank lines from the results file using Excel or Word. Then paste the results into column B of an Excel spreadsheet, lined up with the elements from the query file in column A. Columns A and B should have the same number of rows. Save this spreadsheet for use in step 6c.
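
As an alternative to lining the two files up by hand, the pairing can be scripted. The Python sketch below is only an illustration: it assumes the results file contains exactly one non-blank line per query, in the same order, and that each line looks like one of the two sample forms shown above. The function name is hypothetical, and elements with no region in the data dictionary are recorded as 0.

```python
import re

# Matches result lines of the sample form ">> 1: 100 matches".
MATCHES = re.compile(r">>\s*\d+:\s*(\d+)\s+match")


def pair_counts(query_text: str, result_text: str) -> list[tuple[str, int]]:
    """Pair each 'region ELEMENT' query with the count Xpat reported for it."""
    queries = [q.strip() for q in query_text.splitlines() if q.strip()]
    results = [r.strip() for r in result_text.splitlines() if r.strip()]
    if len(queries) != len(results):
        raise ValueError(f"{len(queries)} queries but {len(results)} result lines")
    rows = []
    for query, result in zip(queries, results):
        element = query.split()[-1]
        found = MATCHES.match(result)
        if found:
            count = int(found.group(1))
        elif "No information for region" in result:
            count = 0  # element is declared in the DTD but never used
        else:
            raise ValueError(f"unrecognized result line: {result!r}")
        rows.append((element, count))
    return rows


queries = "region DATE\nregion DATES\nregion COPYR\n"
results = (
    ">> 1: 100 matches\n\n"
    ">> 2: 3 matches\n\n"
    ">> No information for region COPYR in the data dictionary.\n"
)
for element, count in pair_counts(queries, results):
    print(f"{element}\t{count}")  # tab-separated rows paste cleanly into a spreadsheet
```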

3b. Get a summary of elements and corresponding attributes in vendor DTD using dtdparse.pl

This summary puts information from the source DTD into a more human-readable form, showing the attributes that may be used for each element. The summary is helpful in making sure all attributes are accounted for in the XSLT script.

Requires Earl Hood's Perl module SGML::DTD. See http://www.nacs.uci.edu/indiv/ehood/perlSGML/doc/html/SGML..DTD.html for more information.

Usage:

perl dtdparse.pl SOURCE.DTD > OUTPUTFILENAME

For example:

perl dtdparse.pl eepf.dtd > eepf.dtd.summ

Sample output:
[Element]
               [Attribute]
          date
               align
               lang
               pn
               quote
               r
          dates
               born
               died
          desc
          ...

4. Develop XSLT script for conforming source files to Text Class DTD

The Text Class DTD uses elements and attributes as defined in the TEI Guidelines.

All attributes in the Text Class DTD may belong to any of its elements, with the following two exceptions:

The only attributes of the element DATES are BORN, DIED, and CERT.

COLL, the top-level element, has no attributes.

The task is now to plan the transformation of vendor source files to Text Class, preserving and enhancing the structure and semantics of the source files as much as possible. Editorial notes on selection and encoding of the source text should be preserved. Tags and content that exist only for the publisher's internal use may be dropped.

This will be an iterative process:

Refer to the vendor's printed guide to understand the meaning of tags in the source files. Refer to the TEI Guidelines to understand the usage of Text Class tags.

The grid view in XML Spy is extremely helpful for visualizing the structure of a well-formed XML document (but it's slow for very large documents). Using the Projects feature of XML Spy, you can make a project for the collection you are working on and keep all your XML source files (produced in steps 1 and 2) in the appropriate folder.

Configure XML Spy so that it can transform individual documents. This makes it easy to quickly check on whether XSLT is behaving the way you want it to. But for batch processing, it's more efficient to run the transformations either on the server or using the DOS command prompt.

Do directed Xpat searching to check on how elements are used in the original collection markup. Xpat documentation is available here. See also some tips on Xpat searching here.

Make sure all relevant elements and attributes used in the source files are accounted for. Handle elements in a first pass and attributes in a second pass.

Edit XSLT scripts in XML Spy. We provide a brief XSLT cookbook showing common transformations.

The distinction between push and pull approaches in XSLT is an important one:

Pulling is when XPath expressions (which refer to a node or nodes in an XML document) selectively pull information out of the XML document. This method, often used to restructure or reformat data, is convenient for creating the tightly constrained header used in Text Class.

Pushing is when certain parts of the document are pushed through a set of templates. This approach, better suited to preserving the structure of the source document, is usually preferred for processing the main body of texts.
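
The difference between the two approaches can be illustrated outside XSLT. The Python sketch below is an analogy only, using the standard library's ElementTree rather than an XSLT processor; the sample document, the tag names, and the RENAME table are all hypothetical. The pull function fetches specific values by path and places them in a fixed layout, while the push function walks the tree and lets each node's tag decide what is emitted, which keeps the source structure intact.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document and tag names, for illustration only.
SOURCE = """<DOC>
  <FRONT><TITLE>A Tale</TITLE><AUTHOR>Anon.</AUTHOR></FRONT>
  <BODY><PARA>First <ITAL>word</ITAL>.</PARA><PARA>Second.</PARA></BODY>
</DOC>"""

# Hypothetical source-to-target renaming; unlisted tags pass through unchanged.
RENAME = {"PARA": "P", "ITAL": "HI"}


def pull_header(doc):
    """Pull: select particular values by path and build a fixed structure."""
    header = ET.Element("HEADER")
    ET.SubElement(header, "TITLE").text = doc.findtext("FRONT/TITLE")
    ET.SubElement(header, "AUTHOR").text = doc.findtext("FRONT/AUTHOR")
    return header


def push(node):
    """Push: visit every node in document order; the tag picks the output."""
    out = ET.Element(RENAME.get(node.tag, node.tag), dict(node.attrib))
    out.text, out.tail = node.text, node.tail
    for child in node:
        out.append(push(child))
    return out


doc = ET.fromstring(SOURCE)
print(ET.tostring(pull_header(doc), encoding="unicode"))
print(ET.tostring(push(doc.find("BODY")), encoding="unicode"))
```

In XSLT terms, the pull function corresponds roughly to a template that uses explicit select expressions, and the push function to a set of match templates driven by apply-templates.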

For your reference, here is the XSLT script used to transform Early English Prose Fiction, a Chadwyck-Healey collection of medium complexity.

The XSLT specification is available at the W3C site.

XSLT Programmer's Reference by Michael Kay (available from Wrox) is a helpful reference book.

5. Transform source files

Download Saxon by Michael Kay and follow the installation instructions (Saxon is written in Java; Instant Saxon can be installed and executed on a Windows box). If Instant Saxon reports that it needs the Java Virtual Machine, download and install Internet Explorer from Microsoft (even if you already have Explorer installed), making sure that the Virtual Machine is bundled with it.

There are two ways to run transformations in batches. Using Instant Saxon on a Windows box is somewhat faster; however this ties up your machine. If transformations are run on the server, other work can be done on the local machine.

1) Using Saxon installed on a server:

java com.icl.saxon.StyleSheet SOURCE.XML STYLESHEET.XSL > OUTPUTFILE.XML
(For batch processing, use the foreach command.)

2) Using Instant Saxon from the DOS command prompt in the form:

saxon SOURCE.XML STYLESHEET.XSL > OUTPUTFILE.XML
You need to be in a directory containing the Instant Saxon .exe file when running this command.
(For batch processing, make a file with one command for each source file, separated by newlines. Then run the command file using the sh command in Cygwin.)
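
The command file for Instant Saxon can be generated instead of typed. The following Python sketch builds one command per source file in the form given above. The function name and the stylesheet name eepf.xsl are hypothetical; the directory names follow the examples suggested earlier (2sx for the XML source files, 3xslt for the transformed files).

```python
from pathlib import PurePosixPath


def saxon_commands(sources, stylesheet, out_dir):
    """Return command-file text: one Instant Saxon command per source file."""
    lines = []
    for src in sources:
        # Keep the source file name, but write the result into out_dir.
        out = PurePosixPath(out_dir) / PurePosixPath(src).name
        lines.append(f"saxon {src} {stylesheet} > {out}")
    return "\n".join(lines) + "\n"


sources = ["2sx/eepf01.xml", "2sx/eepf02.xml"]
print(saxon_commands(sources, "eepf.xsl", "3xslt"), end="")
```

Saving the printed text to a file and running that file with sh in Cygwin then works as described above.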

6a. Convert transformed files back to SGML using xml2sgml.pl

What's going on here?

The script strips off the XML declaration at the top of the file.

The COLL element, if present, is removed.

&amp; is converted back to &, restoring SGML character entities.

By definition, non-empty elements in the Text Class DTD require a closing tag. Empty elements in XML files will be in the form <ELEMENT/> (not valid in SGML). This script should be written to transform any such elements to the form <ELEMENT></ELEMENT>.

If in the next step nsgmls points to invalid empty tags, add those tags to the xml2sgml.pl script. Then re-run xml2sgml.pl on the files as they were after step 5 (xml2sgml.pl cannot be run twice on the same file, because ampersands will be lost).
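
The operations listed above can be sketched in Python as follows. This is not the actual xml2sgml.pl; it is a minimal illustration under stated assumptions: the function name and the expand parameter (the set of element names whose empty form should be rewritten) are hypothetical, only the COLL tags are removed (their content is kept), and attribute values are assumed not to contain angle brackets.

```python
import re

XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")
COLL_TAGS = re.compile(r"</?COLL>\s*")
# <NAME/> or <NAME attr="value"/>; group 1 is the name, group 2 the attributes.
EMPTY_TAG = re.compile(r"<([A-Za-z][\w.-]*)((?:\s[^<>]*?)?)\s*/>")


def xml_to_sgml(text: str, expand: set[str]) -> str:
    """Apply the four conversions described above to one file's text."""
    text = XML_DECL.sub("", text, count=1)   # strip the XML declaration
    text = COLL_TAGS.sub("", text)           # drop <COLL> and </COLL> tags

    def expand_empty(match):
        name, attrs = match.group(1), match.group(2)
        if name in expand:
            return f"<{name}{attrs}></{name}>"
        return match.group(0)                # leave other tags untouched

    text = EMPTY_TAG.sub(expand_empty, text)
    return text.replace("&amp;", "&")        # restore SGML entity references


xml = '<?xml version="1.0"?>\n<COLL>\n<DIV1><P/>Tom&amp;mdash;Jones</DIV1>\n</COLL>\n'
print(xml_to_sgml(xml, expand={"P"}), end="")
```

The last step also shows why the conversion must not be applied twice: a literal ampersand that survives the first pass as an escaped ampersand would be reduced to a bare & by a second pass.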

6b. Validate transformed files against Text Class DTD using nsgmls

This tool by James Clark checks that the XSLT-transformed documents conform to the Text Class DTD.

Usage:

nsgmls -s -f FILENAME.ERRS doctype FILENAME
(Use foreach for batch processing.)

For example:

nsgmls -s -f eepf01.xml.errs doctype eepf01.xml

Any errors will be recorded in the .errs files. If a .errs file is zero bytes in size, the corresponding file is valid. Once all files are validated, delete the .errs files.
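
With many source files, checking the size of each .errs file by hand is tedious. The Python sketch below lists only the error files that are not empty; the function name is hypothetical, and it assumes the error files sit in a single directory and end in .errs as in the example above.

```python
from pathlib import Path


def failed_validations(directory) -> list[str]:
    """Return the names of the non-empty .errs files in a directory, sorted."""
    return sorted(
        path.name
        for path in Path(directory).glob("*.errs")
        if path.stat().st_size > 0
    )


# Each name printed corresponds to a file that still needs attention.
for name in failed_validations("."):
    print(name)
```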

Add <COLL> to the beginning of the first file and </COLL> to the end of the last file in the directory list. If the files are too big to edit in a text editor (e.g., xemacs), make two small files, one containing only "<COLL>" and one containing only "</COLL>", and use the cat command to join them to the first and last files in the proper place.

Now cat all files together:

cat *.xml > COLL.sgm

For good measure, check the validity of this large file using nsgmls in the form:

nsgmls -s -f COLL.errs doctype COLL.sgm

The error file should be empty. Move the .sgm file to the obj directory, or whatever directory you are using to store the collection object:

mv COLL.sgm /l1/obj/C/COLL

For example:

cat *.xml > eepf.sgm
mv eepf.sgm /l1/obj/e/eepf

There is another, critical step for checking the integrity of the transformations you have run, but the collection must first be indexed. Documentation on indexing is in the next section. Go to that section now and index the files you have transformed. Once that's done, continue from 6c.

6c. Check integrity of transformed files: examine element counts using xpatgen.pl

Open the spreadsheet created in step 3a. Label the worksheet containing the query summary "source files". Then go to a new worksheet within the same file. Label the worksheet "Text Class".

Run the script xpatgen.pl once again, this time on the Text Class DTD:

perl xpatgen.pl TEXTCLASS.DTD > XPAT_QUERY_FILENAME.XPT

This creates a file of Xpat queries which can then be run on the files just transformed and indexed using a command in the form:

xpat COLL.DD < XPAT_QUERY_FILENAME.XPT > RESULTS.XPT

As in step 3a, copy the queries and place them in column A of the new worksheet. Then copy the results file, eliminate blank lines, and paste them into column B of the new worksheet. There should be the same number of rows in columns A and B.

You may wish to print out the two worksheets (source files and Text Class) for side-by-side comparison.

Check that the counts of tags are the same before and after transformation when expected. Particularly important elements to check are lines (often L), FIGURE, and PB.

If you are consolidating two elements in the source collection into one Text Class element, check that the sum of the counts in the source collection equals the count of the Text Class tag (e.g., when TRAILER and CONCLUDE are both transformed to TRAILER).

Note that when an element is nested within the same element (e.g., P within NOTE within P), Xpat counts only the P's at the outermost level. To get a count of P's that are nested at the first level within outer P's, run "region P1" in Xpat. For P's within that second level of P's, run "region P2". All the P tags are just P tags in the markup itself, but Xpat indexes them with numbers to keep track of any nesting.
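
The before-and-after comparison described in this step can also be scripted once the two sets of counts are available. The Python sketch below is an illustration only: the function name is hypothetical, the counts are invented for the example, and the mapping from each Text Class element to the source elements that feed it has to be written by hand for each collection.

```python
def compare_counts(source, target, mapping):
    """Report Text Class elements whose count differs from the source total.

    mapping: Text Class element name -> list of source element names that
    were transformed into it.
    """
    problems = []
    for element, feeders in mapping.items():
        expected = sum(source.get(name, 0) for name in feeders)
        actual = target.get(element, 0)
        if expected != actual:
            problems.append((element, expected, actual))
    return problems


# Invented counts, for illustration only.
source = {"TRAILER": 40, "CONCLUDE": 2, "L": 1000, "PB": 500}
target = {"TRAILER": 42, "L": 1000, "PB": 498}
mapping = {"TRAILER": ["TRAILER", "CONCLUDE"], "L": ["L"], "PB": ["PB"]}

for element, expected, actual in compare_counts(source, target, mapping):
    print(f"{element}: expected {expected}, found {actual}")
```

A reported difference is not necessarily an error (some source elements may be dropped deliberately), but each one should be explainable.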