Text Class Collection Implementation

DLXS Workshop, August 2005

Text Class Instructor: Chris Powell

If you have questions, please address them to dlxs-info@umich.edu.

This portion of the DLXS Workshop focuses on implementing a collection in the Text Class. It is organized as a hands-on lesson, with the entire process outlined in detail. All of the steps are included so that it can be repeated or used as a guide later. Links to the detailed Text Class documentation are included.

A printed copy of this document will be available at the workshop.

Workshop Day 3 -- Wednesday Morning

Workshop Day 3 -- Wednesday Afternoon

For a simplified Data Flow Diagram overview of TextClass data prep and delivery, including the directories in which files are created, see the TextClass Prep DFD.

Workshop Day 3 -- Wednesday Morning

Text Class Content Preparation

In Text Class Content Prep we discuss the elements and attributes required for Text Class delivery and the necessary architecture for storing texts and collections, and we review strategies and methods for normalizing texts and converting them to UTF-8 XML that conforms to the Text Class DTD.

Text Class XML DTD Overview

It is assumed that any texts to be converted to Text Class already validate against another DTD for encoding monographic materials, such as TEI Lite, that represents corresponding structures (chapters, pages, etc.). Because of the extremely lax content modelling (almost every element is defined to permit ANY of the named elements), the Text Class DTD is only useful to ensure that the nomenclatures have been changed appropriately. No markup changes were made to accommodate release 12 aside from the conversion to UTF-8 XML. Any collections you have already converted to the Text Class DTD do not need structural markup changes.

If you elect to modify the Text Class XML DTD to validate your source documents, you may need to change the Text Class middleware; you will almost certainly have to adjust XML to HTML XSLT stylesheets, and changes may affect searching and results list behaviors.

The following elements and attributes are required:

The Text Class XML DTD is a fluid document; more attributes, and occasionally elements, are added as the need arises in processing new collections. Because of differences in the syntax of SGML and XML DTDs, things that validated against the SGML version may not validate against the current XML version -- the SGML inclusions of floating elements like page breaks and line breaks throughout the entire TEXT element, for example, are gone, and these must be declared explicitly in the elements in which they occur. If you run into a case where your old documents do not validate against the Text Class XML DTD, please let me know.

Text Conversion Strategies

DLPS does not have any preferred methods or quick and easy tools for this stage of the process. Only you, looking at your texts and your encoding practices, can do the intellectual work required to convert the texts. You should do this with the tools you are most comfortable using, whether they be macros in your favorite editor, perl scripts if you have strong programming skills, OmniMark if you like that, or XSLT (my personal choice). We have a fairly detailed XSLT strategy on the documentation website, which uses freely-available or ubiquitous tools, and if you are creating XML documents anyway, this might be a reasonable route to pursue.

We have also used perl scripts to do conversions of TEI Lite-encoded SGML into Text Class SGML, and are willing to make these (largely undocumented) scripts available. We are happy to offer suggestions and our historical experience in converting collections, but cannot really support you with specific tools or methods in your conversion, as it is particular to the encoding of your texts.

For today, we are going to be working with some texts that are already in Text Class. We will be building them into a collection we are going to call workshoptc.

This documentation will make use of the concept of the $DLXSROOT, which is the place at which your DLXS directory structure starts. We generally use /l1/, but for the workshop, we each have our own $DLXSROOT in the form of /l1/workshop/userX. To determine what your $DLXSROOT is, type the following commands at the command prompt:
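The variable can also be inspected directly. The following is a minimal sketch in bash syntax, assuming DLXSROOT is set as an environment variable in your login shell; the show_dlxsroot function name is illustrative only and is not part of DLXS.

```shell
# Print the DLXS root, or a short notice if the variable is not set.
# (show_dlxsroot is a hypothetical helper name used for illustration.)
show_dlxsroot() {
    echo "${DLXSROOT:-DLXSROOT is not set}"
}

show_dlxsroot
```

If the notice appears instead of a path, the variable has not been set for your session.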


Create the directory $DLXSROOT/prep/w/workshoptc, along with its data subdirectory, with the following command:

mkdir -p $DLXSROOT/prep/w/workshoptc/data

Move into the data subdirectory with the following command:

cd $DLXSROOT/prep/w/workshoptc/data

The workshoptc directory will be your staging area for all the things you will be doing to your texts, and ultimately to your collection. At present, all it contains is the data subdirectory you created a moment ago. We will be populating it further over the course of the next two days. Unlike the contents of other directories, everything in prep should be ultimately expendable in the production environment.

Copy the necessary files into your data directory with the following commands:

cp $DLXSROOT/obj/b/a/b/bab3633.0001.001/bab3633.0001.001.sgm $DLXSROOT/prep/w/workshoptc/data/.

cp $DLXSROOT/obj/a/b/e/abe5413.0001.001/abe5413.0001.001.sgm $DLXSROOT/prep/w/workshoptc/data/.
cp $DLXSROOT/obj/a/b/u/abu0246.0001.001/abu0246.0001.001.sgm $DLXSROOT/prep/w/workshoptc/data/.

The files, as you see, are not yet in UTF-8 XML, so we'll be repeating some of the steps we took yesterday. These files are much less interesting so we won't see as much variety. First we'll look at what entities they contain, and then whether or not they can be indexed by xpatu.

$DLXSROOT/bin/t/text/findEntities.pl *.sgm

Looks good, nothing surprising.

foreach file (*.sgm)
    echo $file
    xpatutf8check $file
end

Since most of you are set up for bash, here are the same commands in that shell:

for file in *.sgm; do
    echo "$file"
    xpatutf8check "$file"
done

So we need to convert the CERs to UTF-8, just as we did to the sample files yesterday.

foreach file (*.sgm)
    isocer2utf8 $file > $file.utf
end

Or, in bash:

for file in *.sgm; do
    isocer2utf8 "$file" > "$file.utf"
done

And convert the now UTF-8 SGML files to XML. Unlike yesterday, I'm going to normalize the XML, to give those people working with materials already in XML a chance to see how that works. We need to make sure our environment is properly set for the tools in the OpenSP set.

setenv SP_ENCODING utf-8

For those of you in bash, it's

export SP_ENCODING=utf-8

And now we use osx to convert to XML. Because of the way we're renaming the output files, I suggest that those using bash switch to the tc shell first (typically by typing tcsh).

foreach file (*.utf)
    osx -x no-nl-in-tag -x empty -E 500 -f $file.errors $DLXSROOT/misc/sgml/xmlentities.dcl $DLXSROOT/prep/s/sampletc/sampletc.text.inp  $file > $file:r:r.xml
end

If you switched shells, you can just type exit to get back to bash.
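For those who would rather stay in bash, the tc shell's $file:r:r renaming can be approximated with parameter expansion. This is only a sketch: strip_two_ext is an illustrative helper name, and the loop merely prints the names it would use rather than running osx.

```shell
# Strip the last two extensions from a file name, like tcsh's $file:r:r.
# (strip_two_ext is a hypothetical helper, not a DLXS tool.)
strip_two_ext() {
    local name=$1
    name=${name%.*}   # drop the last extension, e.g. .utf
    name=${name%.*}   # drop the next one, e.g. .sgm
    printf '%s\n' "$name"
}

# Show the output name each .utf file would get.
for file in *.utf; do
    [ -e "$file" ] || continue   # skip if the glob matched nothing
    echo "$file -> $(strip_two_ext "$file").xml"
done
```

Replacing the echo line with the osx command shown above, redirected to "$(strip_two_ext "$file").xml", should give the same result as the tc shell loop.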

Other Text Modifications

One way to help the cgi identify specific text structures, such as divisions, exactly is to insert unique attributes based on a combination of the IDNO and the sequence of the division in the text. This is an expendable ID and is not meant to permanently identify a structure -- use your own thoughtfully assigned and permanent ID attributes for that. Before indexing, check whether node attributes were applied when the documents were converted to Text Class -- they will appear in the DIV tags for each division and will look like this: <DIV1 NODE="AAN8938.0001.001:1">. If they were not, use the following commands to insert them:

$DLXSROOT/bin/t/text/nodefy $DLXSROOT/prep/w/workshoptc/data/bab3633.0001.001.xml
mv $DLXSROOT/prep/w/workshoptc/data/bab3633.0001.001.xml.noded $DLXSROOT/prep/w/workshoptc/data/bab3633.0001.001.xml
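Before running nodefy, a quick way to see whether a file already carries NODE attributes is to search its DIV tags. The sketch below assumes uppercase tag and attribute names, as in the example above; has_node_attrs is an illustrative helper name.

```shell
# Succeed if the file contains at least one DIV tag carrying a NODE attribute.
# (has_node_attrs is a hypothetical helper; it assumes uppercase DIV/NODE names.)
has_node_attrs() {
    grep -Eq '<DIV[0-9]*[^>]*[[:space:]]NODE="' "$1"
}

# Report on every XML file in the current directory.
for file in *.xml; do
    [ -e "$file" ] || continue   # skip if the glob matched nothing
    if has_node_attrs "$file"; then
        echo "$file: NODE attributes present"
    else
        echo "$file: no NODE attributes found"
    fi
done
```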

Normalize and Validate XML

This step checks the XML against the Text Class DTD to validate the XML. It also normalizes the XML, which, if necessary, adjusts the XML tagging so that it is consistent in terms of case and order of element attributes.

There are not likely to be any errors with the workshoptc data, but tell the instructor if there are.

foreach file (*.xml)
    osgmlnorm $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp $file > $file.norm
end

Since most of you are set up for bash, here's the same command in that shell:

for file in *.xml; do
    osgmlnorm $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp "$file" > "$file.norm"
done

This will normalize the texts and result in new texts with a .norm extension added. There is, however, an unfortunate side effect -- all of our XML-style "singletons" are now back to SGML-style. This has been a known bug (Bug#112685) since 2001, but still it plagues us. It's related to the problem of warnings about XML DTDs that some of you have no doubt seen when normalizing finding aids. Very annoying, because now we have to run osx again. And again, we're renaming the output by stripping off the modifiers, so those of you using bash need to change shells again (for the last time!).

foreach file (*.norm)
    osx -x no-nl-in-tag -x empty -E 5000 -f $file.errors $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp  $file > $file:r:r.norm.xml
end

And again, if you like bash, just type exit to get back to your usual environment.

As you can see, we get lots and lots of errors this time around. Why? Because the sampletc_utf8 DTD is an XML DTD, and these files have sadly turned into SGML. Because we do have an SGML DTD on hand, we can change from the sampletc_utf8.text.inp to the sampletc.text.inp and get along fine. But what of cases where you do not have an identical SGML DTD, as with EAD 2002? I am the last person to EVER advise ignoring errors, but in this case they are irrelevant. You can check this by immediately validating the newly created norm.xml files against this same DTD using onsgmls.

foreach file (*.norm.xml)
    onsgmls -w xml -w no-explicit-sgml-decl  -s -f $file.errors $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp  $file
end

Or, in bash:

for file in *.norm.xml; do
    onsgmls -w xml -w no-explicit-sgml-decl  -s -f "$file.errors" $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp  "$file"
done

Our norm.xml error files are all zero bytes, as we like them. I have a handy command for removing the zero byte files and leaving the others (if any) which I like to use when processing dozens of files at once.

find . -type f -size 0 -prune -exec rm {} \;
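Once the empty files are gone, whatever error files remain are the ones worth reading. As a sketch, and assuming the .errors suffix used in the commands above, the remaining non-empty error files could be listed like this (list_nonempty_errors is an illustrative name):

```shell
# List error files that still have content.
# (list_nonempty_errors is a hypothetical helper; it defaults to the current directory.)
list_nonempty_errors() {
    find "${1:-.}" -type f -name '*.errors' ! -size 0
}

list_nonempty_errors .
```

No output means every validation run finished without reporting anything.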

One last thing: although the files are all lovely, valid XML, because they are going to be concatenated to build a single collection, the XML declarations at the top of each need to be removed (a single declaration will be added to the collection as it is concatenated).

foreach file (*.norm.xml)
    /l1/workshop-samples/sooty/attliststrip.pl $file
end

Or, in bash:

for file in *.norm.xml; do
    /l1/workshop-samples/sooty/attliststrip.pl "$file"
done

Our norm.xml files are now all in order and ready to be used to build the collection.
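If you want to confirm that no XML declaration survived the stripping step, a simple search will do. This sketch only looks for the declaration itself and makes no assumption about what else attliststrip.pl changes; has_xml_decl is an illustrative helper name.

```shell
# Succeed if the file still contains an XML declaration (<?xml ...?>).
# (has_xml_decl is a hypothetical helper, not part of DLXS.)
has_xml_decl() {
    grep -q '<?xml ' "$1"
}

# Flag any normalized file that still carries a declaration.
for file in *.norm.xml; do
    [ -e "$file" ] || continue   # skip if the glob matched nothing
    if has_xml_decl "$file"; then
        echo "$file: XML declaration still present"
    fi
done
```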

Storing Texts and Page Images

As you may have noticed from our file copying steps earlier, we store each digitized text in its own directory, based on its DLPS ID, along with the related page images. The DLPS ID is a unique ID for each text, based on the ID assigned to its MARC record by the OPAC. Directories are created in the form $DLXSROOT/obj/d/l/p/dlpsid (the DLPS ID can consist of a mix of number and letter characters). Pageviewer defaults to searching for page images stored in a directory of this form, although the method that determines this can be overridden.

To facilitate links between the texts and the images stored in the $DLXSROOT/obj directories, the middleware is configured to read a several million row table on our MySQL server containing page image metadata. If you have created pageview.dat files in the past and would like to upgrade to the new middleware, we are delivering a program ($DLXSROOT/bin/t/text/importpageviewdata.pl) that will convert pageview.dat files into MySQL rows. Invocation is simple (don't do it -- just FYI):

$DLXSROOT/bin/t/text/importpageviewdata.pl [-f] -d "$DLXSROOT/obj"

The -f flag indicates a "full run", i.e., process all files regardless of whether they've changed since the last run (otherwise, there is a timestamp file to determine which files have changed since the last run). Based on what database format you chose during DLXS installation, this process will populate the database with the information from any pageview.dat files it encounters as it runs through the directory you specified recursively.

More Documentation

Text Class Index Building with XPATu

In this section the workshoptc normalized XML will be concatenated and indexed with the XPATu search engine, preparing it for use with the DLXS middleware.

Set Up Directories and Files for XPATu Indexing

Previously, we did what we needed to do with our materials "by hand" -- today, we will work with the materials packaged in the sampletc_utf8 collection and adapt them for use with workshoptc. This should parallel what you'll be doing back at your institutions. First, we need to create the rest of the directories in the workshoptc environment with the following commands:

mkdir -p $DLXSROOT/bin/w/workshoptc
mkdir -p $DLXSROOT/obj/w/workshoptc
mkdir -p $DLXSROOT/idx/w/workshoptc

The bin directory holds any scripts or tools used for the collection specifically; obj holds the "object" or XML file for the collection, and idx holds the XPATu indexes. Now we need to populate the directories. First, change directories into $DLXSROOT/prep/w/workshoptc/data and concatenate the texts into one collection with the following command:

cat /l1/workshop-samples/sooty/collstart bab3633.0001.001.norm.xml abe5413.0001.001.norm.xml abu0246.0001.001.norm.xml /l1/workshop-samples/sooty/collend > $DLXSROOT/obj/w/workshoptc/workshoptc.xml
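With more than a handful of texts, listing every file by name becomes tedious. The sketch below wraps the same concatenation in a small function that takes the opening file, the closing file, the output file, and then any number of texts. The build_collection name is illustrative, and note that a glob such as *.norm.xml expands alphabetically, which differs from the order used in the command above; whether order matters for your collection is an assumption to check.

```shell
# Concatenate: opening wrapper, then each text, then closing wrapper.
# (build_collection is a hypothetical helper, not part of DLXS.)
build_collection() {
    local start=$1 end=$2 out=$3
    shift 3
    cat "$start" "$@" "$end" > "$out"
}

# Example call, run from the data directory (paths as used in this workshop):
# build_collection /l1/workshop-samples/sooty/collstart \
#     /l1/workshop-samples/sooty/collend \
#     "$DLXSROOT/obj/w/workshoptc/workshoptc.xml" *.norm.xml
```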

Next, we'll copy and edit the necessary files from sampletc_utf8 to get our workshoptc collection together.

cp $DLXSROOT/bin/s/sampletc_utf8/Makefile $DLXSROOT/bin/w/workshoptc/Makefile
cp $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.blank.dd $DLXSROOT/prep/w/workshoptc/workshoptc.blank.dd
cp $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.extra.srch $DLXSROOT/prep/w/workshoptc/workshoptc.extra.srch
cp $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.inp $DLXSROOT/prep/w/workshoptc/workshoptc.inp 

Each of these files needs to be edited to reflect the new collection name and the paths to your particular directories. This will be true when you use these at your home institution as well, even if you use the same directory architecture as we do, because they will always need to reflect the unique name of each collection. Failure to change even one file can result in puzzling errors, because the scripts are working, just not necessarily in the directories you think they are.

If you are comfortable editing in the unix environment, in the Makefile, workshoptc.blank.dd, workshoptc.extra.srch, and workshoptc.inp, change all references to /s/ to /w/ and sampletc_utf8 to workshoptc. Otherwise, run the following command:

sh /l1/workshop-samples/sooty/paths
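For reference, the two substitutions described above can be sketched with sed. This is not the contents of the paths script, which is not reproduced here; retarget is an illustrative name, and the first expression is deliberately narrower than "every /s/", changing only an /s/ that directly precedes the collection name.

```shell
# Rewrite sampletc_utf8 paths and names to workshoptc, reading stdin, writing stdout.
# (retarget is a hypothetical helper, not the workshop's paths script.)
retarget() {
    sed -e 's|/s/sampletc_utf8|/w/workshoptc|g' \
        -e 's|sampletc_utf8|workshoptc|g'
}

# Example use on one file, keeping the original until the result is checked:
# retarget < workshoptc.inp > workshoptc.inp.new && mv workshoptc.inp.new workshoptc.inp
```

Any other /s/ references in the files, for example ones built from Makefile variables, would still need to be changed by hand.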

Build the Collection Specific Text Class DTD

Before indexing your collection, you will need to create a collection-specific Text Class DTD. Because the class DTD supports any element having any of the declared attributes (for convenience of DTD creation), indexing "as-is" with XPATu will cause your index to be unnecessarily large. This may also cause problems for XML validation tools. You can create your own collection specific version of the Text Class DTD by running the following command: (don't do it -- just FYI)

egrep -i "<\!ELEMENT" $DLXSROOT/misc/xml/textclass.xml.dtd > textclass.stripped.xml.dtd

We'll use the "make dtd" command from the Makefile to determine which attributes are used in your collection and build a custom DTD by concatenating it with $DLXSROOT/misc/xml/textclass.stripped.xml.dtd. Using the "make validate" command will then validate your collection against the new DTD. If the individual texts validated before, they should validate as a concatenated collection now.

cd $DLXSROOT/bin/w/workshoptc
make dtd
make validate

Build the XPATu Index

Everything is now set up to build the XPATu index. The Makefile in the bin directory contains the commands necessary to build the index, and can be executed easily.

To create an index for use with the Text Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" structures based on a combination of elements (for example, defining what the "main entry" is, without adding a <MAINENTRY> tag around the appropriate <AUTHOR> or <TITLE> element). The following commands can be used to make the index, alone or in combination. We will be using "make dd," "make xml," and "make post."

make dd indexes words for texts that have been concatenated into one large file for a collection.

make xml indexes the XML structure by reading the DTD. Validates as it indexes. Slower than multiregion indexing (see below) for this reason. However, necessary for collections that have nested elements of the same name (for example a P within a NOTE1 within a P).

make multi (multiregion structure indexing) indexes the XML structure and relies on a "tags file" (included in the sample collection) to know what XML elements and attributes to index. Rarely used with fully-encoded full-text collections because of the nesting problem mentioned above. If you'd like to try this on your own, index only the new text (bab3633.0001.001.norm.xml).

make post builds and indexes fabricated regions based on the XPATu queries stored in the workshoptc.extra.srch file.

make dd
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.blank.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xpatbldu -m 256m -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd
make xml
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xmlrgn -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/misc/sgml/xml.dcl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.inp /l1/workshop/sooty/dlxs/obj/w/workshoptc/workshoptc.xml
/l/local/bin/xmlrgn:/l1/workshop/sooty/dlxs/misc/sgml/xml.dcl:1:W: SGML declaration was not implied
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd
make post
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
touch /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.init
/l/local/bin/xpatu -q /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd < /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.srch | /l1/workshop/sooty/dlxs/bin/t/text/output.dd.frag.pl /l1/workshop/sooty/dlxs/idx/w/workshoptc/ > /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd
/l1/workshop/sooty/dlxs/bin/t/text/inc.extra.dd.pl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd

Sorting and browse building require that you have only one maintitle, mainauthor and maindate per text, so that there is a single value on which to sort. If your extra.srch queries match more than one of any of these in a text, some sort operations will give you a sortkey assertion failure, and the queries will need to be made more specific.

Some examples of more specific searches in your extra.srch are provided below. The first relies on identifying metadata that has been specified through the use of attributes; the second merely chooses the first occurrence as an indication that it is the "main" value.

(((region TITLE incl "type=main") within region TITLESTMT) within region SOURCEDESC);
{exportfile "/l1/idx/e/example/maintitle.rgn"}; export; ~sync "maintitle";

(((region AUTHOR within (region "<TITLESTMT".."</AUTHOR>")) within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC));
{exportfile "/l1/idx/e/example/mainauthor.rgn"}; export; ~sync "mainauthor";

More Documentation

Workshop Day 3 -- Wednesday Afternoon

Text Class Collection to Web

These are the final steps in deploying a Text Class collection online. Here the Collection Manager will be used to review the Group Database. The Collection Manager will also be used to update the Collection Database for workshoptc, including setting up the collection for dynamic browsing. Finally, we need to work with the collection map and set up the collection's web directory.

Review the Groups Database Entry with CollMgr

One function of CollMgr allows the grouping of collections for cross-collection searching. Any number of collection groups may be created for Text Class. Text Class supports a group with the groupid "all". It is not a requirement that all collections be in this group, though that's the basic idea. Groups are created and modified using CollMgr. For this workshop, the group "all" record has already been edited to include the workshoptc collection. Take a look at the record to become familiar with it.


We won't be doing anything with groups, but the question recently came up from another DLXS partner.

Review the Collection Database Entry with CollMgr

Each collection has a record in the collection database that holds collection specific configurations for the middleware. CollMgr (Collection Manager) is a web based interface to the collection database that provides functionality for editing each collection's record. Collections can be checked-out for editing, checked-in for testing, and released to production. A collection database record for workshoptc has already been created and we will edit it. In general, a new collection needs to have a CollMgr record created from scratch before the middleware can be used. Take a look at the record to become familiar with it.


More Documentation

Configure the Collection for Dynamic Browsing Using CollMgr

Dynamic browsing is a new feature introduced in DLXS release 12. Adding dynamic browsing to a collection is a matter of simple configuration in CollMgr and then running a script on the command line to populate the browse tables with data to facilitate browsing.

Collmgr field: browseable

To enable browsing, the browseable field must be set to "yes".

Collmgr field: browsenav

The browsenav field must have a value of 0, 1 or 2: small collections should use 0, medium collections 1, and large collections 2. This is the number of layers of browse tabs you want for the collection. 0 means that all the items are on one page, with no tabs; 1 means you have one layer of tabs with the letters of the alphabet; and 2 means you have two layers of tabs, one for a letter and another for the two-letter subdivisions under it.

Collmgr field: browsefields

browsefields holds the list of fields you would like to be browseable. This list is used to prepare the data for browsing, and also to present browsing options to the user. Currently, author and title are the canonical Text Class browse fields.

Now that we are finished updating CollMgr, we need to release our collection to production.

With the above fields properly configured and CollMgr released, the updatebrowsedb.pl script can be run. It populates the ItemColl, ItemBrowse and ItemBrowseCounts tables with information from the collection's data dictionary.

cd $DLXSROOT/bin/browse
$DLXSROOT/bin/browse/updatebrowsedb.pl class=text c=workshoptc host=jolt.umdl.umich.edu row=production

More Documentation

Make Collection Map

Collection mapper files exist to identify the regions and operators used by the middleware when interacting with the search forms. Each collection will need one, but most collections can use a fairly standard map file, such as the one in the sampletc_utf8 collection. The map files for all Text Class collections are stored in $DLXSROOT/misc/t/text/maps.

Map files take language that is used in the forms and translate it into language for the cgi and for XPAT. For example, if you want your users to be able to search within chapters, you would need to add a mapping for how you want it to appear in the search interface (case is important, as is pluralization!), how the cgi variable would be set (usually all caps, and not stepping on an existing variable), and how XPAT will identify and retrieve this natively.

The first part of the file is operator mapping, for the form, the cgi, and XPAT. The second part is for region mapping, as in the example above. There is an optional third part for collections with metadata applied bibliographically, such as genre categories.

cd $DLXSROOT/misc/t/text/maps
cp sampletc_utf8.map workshoptc.map

In DLXS post release 10, the map must have a mapping for the SYNTHETIC value ID. To facilitate sorting, the system must be able to associate exactly one ID with each text.

<label>unique item identifier</label>
<native>region id</native>

Mappings are also needed for maintitle, mainauthor, and maindate (if the latter are applicable).

More Documentation

Set Up the Collection's Web Directory

Each collection may have a web directory with custom Cascading Style Sheets, interface templates, graphics, and javascript. The default is for a collection to use the web templates at $DLXSROOT/web/t/text. A collection specific web directory may be created, and it is necessary if you have any customization at all. For a minimal collection, you will want two files: index.html and textclass-specific.css.

mkdir -p $DLXSROOT/web/w/workshoptc
cp $DLXSROOT/web/s/sampletc_utf8/index.html $DLXSROOT/web/w/workshoptc/index.html
cp $DLXSROOT/web/s/sampletc_utf8/textclass-specific.css $DLXSROOT/web/w/workshoptc/textclass-specific.css

Or, for a simpler set of pages to edit:

cp /l1/workshop/test01/dlxs/web/s/sampletc_utf8/* $DLXSROOT/web/w/workshoptc

As always, we'll need to change the collection name and paths. You might want to change the look radically, if your HTML skills are up to it.

In release 12, web templates have disappeared, replaced by XML and XSL. If you have done a great deal of customization, you will have to change the default stylesheets.

Try It Out


More Documentation

Reviewing Existing Collections After a Move to Release 12

Check the Fabricated Regions

Check the CollMgr

Update the Map