Text Class Collection Implementation

DLXS Workshop, August 2008

Text Class Instructor: Chris Powell

If you have questions, please address them to dlxs-help@umich.edu.

This portion of the DLXS Workshop focuses on the differences in implementing a collection in Text Class, as compared to what we covered with Finding Aids Class on Tuesday. Links to the detailed Text Class documentation are included.

A printed copy of this document will be available at the workshop.

Workshop Day 3 -- Wednesday Morning

For a simplified Data Flow Diagram overview of Text Class data prep and delivery, including the directories in which files are created, see the TextClass Prep DFD.

Text Class Content Preparation

In Text Class Content Prep we discuss the elements and attributes required for Text Class delivery and the necessary architecture for storing texts and collections, and we review strategies and methods for normalizing texts and converting them to conform to the Text Class DTD, XML, and UTF-8.

Text Class XML DTD Overview

It is assumed that any texts to be converted to Text Class already validate against another DTD for encoding monographic materials, such as TEI Lite, that represents corresponding structures (chapters, pages, etc.). Because of the extremely lax content modelling (almost every element is defined to permit ANY of the named elements), the Text Class DTD is only useful to ensure that the nomenclatures have been changed appropriately.

If you elect to modify the Text Class XML DTD to validate your source documents, you may need to change the Text Class middleware; you will almost certainly have to adjust XML to HTML XSLT stylesheets, and changes may affect searching and results list behaviors.

The following elements and attributes are required:

The Text Class XML DTD is a fluid document; more attributes, and occasionally elements, are added as the need arises in processing new collections. Because of differences in the syntax of SGML and XML DTDs, things that validated against the SGML version may not validate against the current XML version -- the SGML inclusions of floating elements like page breaks and line breaks throughout the entire TEXT element, for example, are gone, and these must be declared explicitly in the elements in which they occur.

Text Conversion Strategies

DLPS does not have any preferred methods or quick and easy tools for this stage of the process. Only you, looking at your texts and your encoding practices, can do the intellectual work required to convert the texts. You should do this with the tools you are most comfortable using, whether they be macros in your favorite editor, perl scripts if you have strong programming skills, OmniMark if you like that, or XSLT (my personal choice). We have a fairly detailed XSLT strategy on the documentation website, which uses freely-available or ubiquitous tools, and if you are creating XML documents anyway, this might be a reasonable route to pursue.

We have also used Perl scripts to convert TEI Lite-encoded SGML into Text Class SGML in the past, and are willing to make these (largely undocumented) scripts available. We are happy to offer suggestions and our historical experience in converting collections, but cannot really support you with specific tools or methods in your conversion, as it is particular to the encoding of your texts.

Other Text Modifications

One way to help the cgi identify specific text structures, like divisions, exactly is to insert unique attributes based on a combination of the IDNO and the sequence of the division in the text. This is an expendable ID and not meant to permanently identify a structure -- use your own thoughtfully assigned and permanent ID attributes for that. Before indexing, check to see if node attributes were applied when the documents were converted to Text Class -- they will appear in the DIV tags for each division and will look like this: <DIV1 NODE="AAN8938.0001.001:1">. If they have not, use the following tool provided in the DLXS installation to insert them:


Normalize and Validate XML

This step validates the XML against the Text Class DTD. It also normalizes the XML, adjusting the tagging where necessary so that it is consistent in terms of case and order of element attributes. It is exactly the same as for Finding Aids Class, though it is not built into the sample Makefile.

Storing Texts and Page Images

Unlike finding aids, we store each digitized text in its own directory, based on its DLPS ID, along with the related page images. The DLPS ID is a unique ID for each text, based on the ID assigned to its MARC record by the OPAC. Directories are created in the form $DLXSROOT/obj/d/l/p/dlpsid (the DLPS ID can consist of a mix of number and letter characters). Pageviewer defaults to search for page images stored in a directory based on this form, although there is a method that can be overridden to handle different storage options.
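
As a minimal sketch of the directory convention described above, the shell function below derives the storage directory from a DLPS ID. The function name obj_dir_for_id is hypothetical (it is not part of DLXS), and the sketch assumes that the three intermediate directories are simply the first three characters of the ID, as the form $DLXSROOT/obj/d/l/p/dlpsid suggests.

```shell
# Hypothetical helper (not part of DLXS): print the storage directory for a
# text. Assumes the three intermediate directories are the first three
# characters of the DLPS ID.
# Usage: obj_dir_for_id DLPSID [ROOT]   (ROOT defaults to $DLXSROOT)
obj_dir_for_id() {
    id=$1
    root=${2:-$DLXSROOT}
    c1=$(printf '%s' "$id" | cut -c1)
    c2=$(printf '%s' "$id" | cut -c2)
    c3=$(printf '%s' "$id" | cut -c3)
    printf '%s/obj/%s/%s/%s/%s\n' "$root" "$c1" "$c2" "$c3" "$id"
}

# Example with the workshop text and the workshop DLXSROOT:
obj_dir_for_id bab3433.0001.001 /l1/workshop/sooty/dlxs
```

If your texts are stored differently, this is the kind of logic you would replace when overriding the Pageviewer method mentioned above.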

To facilitate links between the texts and the images stored in the $DLXSROOT/obj directories, the middleware is configured to read a several million row table on our MySQL server containing page image metadata. If you have created pageview.dat files in the past and need to populate the SQL database, we deliver a program ($DLXSROOT/bin/t/text/makepageviewdata.pl) that will convert pageview.dat files into MySQL rows. Invocation is simple (don't do it -- just FYI):

$DLXSROOT/bin/t/text/importpageviewdata.pl [-f] -d "$DLXSROOT/obj"

The -f flag indicates a "full run", i.e., process all files regardless of whether they've changed since the last run (otherwise, there is a timestamp file to determine which files have changed since the last run). Based on what database format you chose during DLXS installation, this process will populate the database with the information from any pageview.dat files it encounters as it runs through the directory you specified recursively.
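
The difference between a full run and an incremental run can be pictured with a small shell sketch. This is not how the DLXS script is implemented; it only illustrates the general timestamp-file technique using find -newer. The function name changed_pageview_files and the stamp file argument are hypothetical.

```shell
# Hypothetical illustration (not the DLXS implementation): list the
# pageview.dat files that a run would need to process.
# Usage: changed_pageview_files DIR STAMPFILE [full]
changed_pageview_files() {
    dir=$1
    stamp=$2
    mode=${3:-}
    if [ "$mode" = "full" ] || [ ! -e "$stamp" ]; then
        # Full run (or no previous run recorded): take every file.
        find "$dir" -type f -name pageview.dat
    else
        # Incremental run: only files modified after the stamp file.
        find "$dir" -type f -name pageview.dat -newer "$stamp"
    fi
}
```

A caller would process the listed files and then touch the stamp file to record the time of the run.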

More Documentation

Text Class Index Building with XPATu

This is largely the same as indexing with Finding Aids Class. The only major difference is the preparation of a custom DTD for your collection.

Set Up Directories and Files for XPATu Indexing

As with Finding Aids Class, you need the same directories for scripts, the collection concatenated XML, and the collection index. The bin directory holds any scripts or tools used for the collection specifically; obj holds the "object" or XML file for the collection, and idx holds the XPATu indexes. There is no instruction for concatenation in the Text Class Makefile. I tend to concatenate the texts into one collection with a command in the form:

cat /l1/workshop-samples/sooty/collstart *.noded /l1/workshop-samples/sooty/collend > $DLXSROOT/obj/c/coll/coll.xml

You can examine the files referenced -- they just include a <COLL> and </COLL> root element for the collection.
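
A minimal sketch of that concatenation step as a reusable shell function follows. The function name concat_collection and its argument order are hypothetical, not part of DLXS; it assumes the start file holds the opening <COLL> tag and the end file the closing </COLL> tag, as described above.

```shell
# Hypothetical helper (not part of DLXS): wrap the given texts in the
# collection start and end files and write a single collection file.
# Usage: concat_collection OUTFILE STARTFILE ENDFILE TEXT...
concat_collection() {
    out=$1
    start=$2
    end=$3
    shift 3
    if [ "$#" -eq 0 ]; then
        # Refuse to write a collection that contains no texts.
        echo "concat_collection: no texts given" >&2
        return 1
    fi
    cat "$start" "$@" "$end" > "$out"
}
```

Called with the output path, the collstart and collend files, and *.noded as the list of texts, it does the same work as the cat command above, but refuses to write an empty collection if no texts are passed.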

Build the Collection Specific Text Class DTD

Before indexing your collection, you will need to create a collection-specific Text Class DTD. Because the class DTD supports any element having any of the declared attributes (for convenience of DTD creation), indexing "as-is" with XPATu will cause your index to be unnecessarily large. This may also cause problems for XML validation tools. You can create your own collection specific version of the Text Class DTD by running the following command: (don't do it -- just FYI)

egrep -i "<\!ELEMENT" $DLXSROOT/misc/sgml/textclass.xml.dtd > textclass.stripped.xml.dtd

There is a "make dtd" command from the Makefile to determine which attributes are used in your collection and build a custom DTD by concatenating it with $DLXSROOT/misc/xml/textclass.stripped.xml.dtd. Using the "make validate" command will then validate your collection against the new DTD. If the individual texts validated before, they should validate as a concatenated collection.

Build the XPATu Index

As in Finding Aids Class, the Makefile in the bin directory contains the commands necessary to build the index, and can be executed easily.

To create an index for use with the Text Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" structures based on a combination of elements (for example, defining what the "main entry" is, without adding a <MAINENTRY> tag around the appropriate <AUTHOR> or <TITLE> element). The following commands can be used to make the index, alone or in combination. We will be using "make dd," "make xml," and "make post."

make dd indexes words for texts that have been concatenated into one large file for a collection.

make xml indexes the XML structure by reading the DTD. Validates as it indexes. Slower than multiregion indexing (see below) for this reason. However, necessary for collections that have nested elements of the same name (for example a P within a NOTE1 within a P).

make multi (multiregion structure indexing) indexes the XML structure and relies on a "tags file" (included in the sample collection) to know what XML elements and attributes to index. Rarely used with fully-encoded full-text collections because of the nesting problem mentioned above. If you'd like to try this on your own, index only the new text (bab3433.0001.001.norm.xml).

make post builds and indexes fabricated regions based on the XPATu queries stored in the workshoptc.extra.srch file.

make dd
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.blank.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xpatbldu -m 256m -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd
make xml
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xmlrgn -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/misc/sgml/xml.dcl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.inp /l1/workshop/sooty/dlxs/obj/w/workshoptc/workshoptc.xml
/l/local/bin/xmlrgn:/l1/workshop/sooty/dlxs/misc/sgml/xml.dcl:1:W: SGML declaration was not implied
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd
make post
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
touch /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.init
/l/local/bin/xpatu -q /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd < /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.srch | /l1/workshop/sooty/dlxs/bin/t/text/output.dd.frag.pl /l1/workshop/sooty/dlxs/idx/w/workshoptc/ > /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd
/l1/workshop/sooty/dlxs/bin/t/text/inc.extra.dd.pl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd

Sorting and browse building require that you have only one maintitle, mainauthor and maindate per text, so that you have one value on which to sort. Your extra.srch files may need to be changed to make the queries more specific. If you do not narrow them, some sort operations will give you a sortkey assertion failure.

Some examples of more specific searches in your extra.srch are provided below. The first relies on identifying metadata that has been specified through the use of attributes; the second merely chooses the first occurrence as an indication that it is the "main" value.

(((region TITLE incl "type=main") within region TITLESTMT) within region SOURCEDESC);
{exportfile "/l1/idx/e/example/maintitle.rgn"}; export; ~sync "maintitle";
(((region AUTHOR within (region "<TITLESTMT".."</AUTHOR>")) within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC));
{exportfile "/l1/idx/e/example/mainauthor.rgn"}; export; ~sync "mainauthor";

More Documentation

Text Class Collection to Web

These are the final steps in deploying a Text Class collection online. As with Finding Aids Class, the Collection Manager is used to create the Collection Database entry for the new collection, including setting up the collection for dynamic browsing. The Collection Manager is also used to review/amend the Group Database. Finally, you need to work with the collection map and set up the collection's web directory.

Review the Collection Database Entry with CollMgr

Each collection has a record in the collection database that holds collection specific configurations for the middleware. CollMgr (Collection Manager) is a web based interface to the collection database that provides functionality for editing each collection's record. Collections can be checked-out for editing, checked-in for testing, and released to production. In general, a new collection needs to have a CollMgr record created before the middleware can be used. The copy functionality can make this easier by allowing you to clone a collection with characteristics like your new collection.


More Documentation

Configure the Collection for Dynamic Browsing Using CollMgr

Adding dynamic browsing to a collection is a matter of simple configuration in CollMgr and then running a script on the command line to populate the browse tables with data to facilitate browsing.

Collmgr field: browseable

To enable browsing, the browseable field must be set to "yes".

Collmgr field: browsenav

The browsenav field must have a value of 0, 1 or 2, which is the number of layers of browse tabs you want for the collection. Small collections should use 0, medium collections 1, and large collections 2. A value of 0 means that all the items are on one page, with no tabs; 1 means you have one layer of tabs with the letters of the alphabet; and 2 means you have two layers of tabs, one for a letter and another for the two-letter subdivisions under it.
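
To make the browsenav levels concrete, the sketch below lists the tab labels that a set of sort keys would produce at level 1 (first letter) and level 2 (first two letters). It only illustrates the idea; the function name browse_tabs is hypothetical, and how DLXS itself computes and stores its tabs is not shown here.

```shell
# Hypothetical illustration (not part of DLXS): print the distinct tab
# labels for a browse level. Reads one sort key per line on stdin.
# Usage: browse_tabs LEVEL   (LEVEL is 0, 1 or 2)
browse_tabs() {
    level=$1
    case $level in
        0)
            # No tabs: everything is on one page, so there is nothing to list.
            cat > /dev/null
            ;;
        1|2)
            # Keep the first one or two characters, fold case, drop repeats.
            cut -c1-"$level" | tr '[:lower:]' '[:upper:]' | sort -u
            ;;
        *)
            echo "browse_tabs: level must be 0, 1 or 2" >&2
            return 1
            ;;
    esac
}
```

For the keys Alcott, alger and Bronte, level 1 yields the tabs A and B, and level 2 yields AL and BR.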

Collmgr field: browsefields

browsefields holds the list of fields you would like to be browseable. This list is used to prepare the data for browsing, and also to present browsing options to the user. Currently, author, title, and subject are the canonical Text Class browse fields. You will need fabricated regions of mainauthor, maintitle, and subject to support browsing. A quick XPAT query of our workshop collection shows we don't have a fabricated subject region, but we do have TERM regions with subjects in them. We could add subject browsing but would need to do some work to support it.

Now that we are finished updating CollMgr, we need to release our collection to production.

With the above fields properly configured and CollMgr released, the updatebrowsedb.pl script can be run. It populates the ItemColl, ItemBrowse and ItemBrowseCounts tables with information from the collection's data dictionary. You should use the "wrapper" shell script provided in the same subdirectory, ub.

More Documentation

Review the Groups Database Entry with CollMgr

One function of CollMgr allows the grouping of collections for cross-collection searching. Any number of collection groups may be created for Text Class. Text Class supports a group with the groupid "all". It is not a requirement that all collections be in this group, though that's the basic idea. Groups are created and modified using CollMgr. Take a look at the record to become familiar with it.


Make Collection Map

Collection map files exist to identify the regions and operators used by the middleware when interacting with the search forms. Each collection will need one, but most collections can use a fairly standard map file, such as the one in the sampletc_utf8 collection. The map files for all Text Class collections are stored in $DLXSROOT/misc/t/text/maps.

Map files take language that is used in the forms and translate it into language for the cgi and for XPAT. For example, if you want your users to be able to search within chapters, you would need to add a mapping for how you want it to appear in the search interface (case is important, as is pluralization!), how the cgi variable would be set (usually all caps, and not stepping on an existing variable), and how XPAT will identify and retrieve this natively.

The first part of the file is operator mapping, for the form, the cgi, and XPAT. The second part is for region mapping, as in the example above. There is an optional third part for collections with metadata applied bibliographically, such as genre categories.

If you want to make a map file specifically for your collection (because you want to change the values in pulldown menus, perhaps), you need to make a copy of the existing map used and alter the values in the newly-copied map file, and then change the values in collmgr to refer to the new map and the new values.

In DLXS post release 10, the map must have a mapping for the SYNTHETIC value ID. To facilitate sorting, the system must be able to assign one unique ID to each text. If you have a sortkey error, check this first!

<label>unique item identifier</label>
<native>region id</native>

Mappings are also needed for maintitle, mainauthor, and maindate (if the latter are applicable).
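
Since sorting depends on each text having exactly one unique ID, it can be worth confirming that no ID occurs twice before you index. The sketch below assumes you have already extracted one ID per line into a plain text file, by whatever means suits your encoding; the function name duplicate_ids is hypothetical and not part of DLXS.

```shell
# Hypothetical check (not part of DLXS): print every ID that occurs more
# than once in a file containing one ID per line.
# Usage: duplicate_ids IDFILE
duplicate_ids() {
    sort "$1" | uniq -d
}
```

No output means every ID in the file is unique; any ID that is printed belongs to more than one text and should be investigated.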

More Documentation

Set Up the Collection's Web Directory

Each collection may have a web directory with custom Cascading Style Sheets, interface templates, graphics, and javascript. The default is for a collection to use the web templates at $DLXSROOT/web/t/text. A collection specific web directory may be created, and it is necessary if you have any customization at all. For a minimal collection, you will want two files: index.html and textclass-specific.css.

As always, you'll need to change the collection name and paths. You might want to change the look radically, if your HTML skills are up to it.


Now, suppose you want to change a few things. Perhaps you want to change the word "Availability" to "Rights" in the ToC view. You may ask yourself, "How do I change this? I don't even know where this comes from!" Just as search form labels are set in the collmgr and the mapfile used by the collection, text labels are set in the langmap. (But grepping in the $DLXSROOT/web/t/text directory is always a good initial strategy for interface changes, if you haven't memorized every detail of DLXS yet.)

If you want to change this label for all Text Class collections, you can edit the page in the class directory. If this change is only relevant to one collection, you will need to make a langmapextra file and put it in the collection web directory. In my /l1/workshop-samples/sooty directory, there is a langmapextra.en.xml file that you can place in your $DLXSROOT/web/w/workshoptc directory.

  <Lookup id="headerutils">
    <Item key="headerutils.str.22">Rights</Item>
  </Lookup>

You might also want to change the color scheme for the navheader bars, as Suz discussed earlier. You could edit the textclass-specific.css that we copied over (changing the td.mainnavcell background-color and the .navcolor) to whatever color you choose, or you could replace the textclass-specific.css with the version in my /l1/workshop-samples/sooty directory, changing these to some particularly muted shades of purple.

td.mainnavcell {
 background-color: #A2A0AB;
 border-bottom: 1px solid #666666;}

.navcolor { background-color: #8A7B90; }

Finally, you could change the handling of low-level encoding elements. If we do a few XPAT queries, we'll see that there are a lot of FOREIGN elements, with language codes. If we grep in $DLXSROOT/web/t/text for FOREIGN, we'll see it appears in text.components.xsl, but the values just get passed through with no additional styling. If we want to give it a style, we can add a local text.components.xsl with a template for FOREIGN. There is a version in my /l1/workshop-samples/sooty directory that you can copy to your $DLXSROOT/web/w/workshoptc directory and use as-is, or adapt as your XSLT abilities permit. The template below italicizes the content of the FOREIGN element, and then follows it with the expanded version of the LANG attribute language code.

<?xml version="1.0" encoding="UTF-8" ?>
<!-- The func and dlxs namespace URIs below are assumptions; copy the declarations from the class-level text.components.xsl. -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:func="http://exslt.org/functions"
  xmlns:dlxs="http://dlxs.org"
  exclude-result-prefixes="func dlxs">

 <xsl:import href="../../t/text/text.components.xsl"/>

<!-- Italicize the FOREIGN content, then append the language name for known LANG codes -->
<xsl:template match="FOREIGN">
<span class="rend-i"><xsl:value-of select="."/></span>
<xsl:if test="@LANG='fle'"><xsl:text> [Flemish] </xsl:text></xsl:if>
<xsl:if test="@LANG='fre'"><xsl:text> [French] </xsl:text></xsl:if>
<xsl:if test="@LANG='ger'"><xsl:text> [German] </xsl:text></xsl:if>
<xsl:if test="@LANG='gre'"><xsl:text> [Greek] </xsl:text></xsl:if>
<xsl:if test="@LANG='hu'"><xsl:text> [Hungarian] </xsl:text></xsl:if>
<xsl:if test="@LANG='it'"><xsl:text> [Italian] </xsl:text></xsl:if>
<xsl:if test="@LANG='lat'"><xsl:text> [Latin] </xsl:text></xsl:if>
<xsl:if test="@LANG='sp'"><xsl:text> [Spanish] </xsl:text></xsl:if>
</xsl:template>

</xsl:stylesheet>

Note that this is just one template, not the entire text.components.xsl copied over. The class-level version is imported, and this FOREIGN template overrides the default one.

Back to the browse issue for a moment. Since subject browsing is available, and we know we have subject terms in our collection, we will probably want to add this functionality. So what do we need to do?

More Documentation

Questions Always Worth Asking When You Add a New Collection

Check the Paths in All Your Files

Check the Fabricated Regions

Check the CollMgr

Update the Map