Findaid Class Collection Implementation (brief version)

DLXS Workshop

Findaid Class Instructor Tom Burton-West

This portion of the DLXS Workshop focuses on implementing a collection in the Findaid Class. Since many of the steps in implementing a Findaid Class collection are very similar to implementing a Text Class collection, the emphasis at the workshop will be on the differences between Text Class and Findaid Class and on dealing with practical implementation issues. Since EAD encoding practices vary widely we will highlight issues arising from different encoding practices and how to resolve them.

This document provides a relatively brief overview of encoding, data prep and indexing issues. A more detailed version of this document which includes descriptions of commonly occuring problems and solutions as well as more in depth discussion is Findaid Class workshop details

More general documentation: Findaid Class


Findaid Class Encoding Practices and Processes

In Findaid Class Encoding Practices and Processes we discuss the elements and attributes required for "out of the box" Findaid Class delivery, various encoding issues, and preparing the work environment and validating the data.


EAD 2002 DTD Overview

These instructions assume that you have already encoded your finding aids files in the XML-based EAD 2002 DTDIf you have finding aids encoded using the older EAD 1.0 standard or are using the SGML version of EAD2002, you will need to convert your files to the XML version of EAD2002.   

The EAD standard was designed as a “loose” standard in order to accommodate the large variety in local practices for paper finding aids and make it easy for archives to convert from paper to electronic form.  As a result, conformance with the EAD standard still allows a great deal of variety in encoding practices.

The DLXS software is primarily designed as a system for mounting University of Michigan collections.  In the case of finding aids, the software has been designed to accommodate the encoding practices of the Bentley Historical Library. The more similar your data and setup is to the Bentley’s, the easier is will be to integrate your finding aids collection with DLXS.  If your practices differ significantly from the Bentley’s, you will probably need to do some preprocessing of your files and/or modifications to various files in DLXS.  We have found that the largest number of issues in implementing Findaid Class for member institutions is dealing with differences in encoding practices. Much of the following material will cover various issues that commonly arise.

 


Practical EAD Encoding Issues

There are a number of encoding issues that may affect the data preparation, indexing, searching, and rendering of your finding aids. Some of them are:


Data Preparation

For today, we are going to be working with some texts that are already in Findaid Class. We will be building them into a collection we are going to call workshopfa.

This documentation will make use of the concept of the $DLXSROOT, which is the place at which your DLXS directory structure starts. We generally use /l1/, but for the workshop, we each have our own $DLXSROOT in the form of /l1/workshop/userX/dlxs/. To check your $DLXSROOT, type the following commands at the command prompt:

cd $DLXSROOT
pwd

The prep directory under $DLXSROOT is the space for you to take your encoded finding aids and "package them up" for use with the DLXS middleware. Create your basic directory $DLXSROOT/prep/w/workshopfa and its data subdirectory and then copy sample data files into the newly created subdirectory with the following commands:

mkdir -p $DLXSROOT/prep/w/workshopfa/data
cd $DLXSROOT/prep/w/workshopfa
cp $DLXSROOT/prep/s/samplefa/data/*.xml $DLXSROOT/prep/w/workshopfa/data/.

We'll also need a few files to get us started working. They will need to be copied over as well, and also have paths adapted and collection identifiers changed. Follow these commands:

cp $DLXSROOT/prep/s/samplefa/validateeach.csh $DLXSROOT/prep/w/workshopfa/.
cp $DLXSROOT/prep/s/samplefa/samplefa.xml.inp $DLXSROOT/prep/w/workshopfa/workshopfa.xml.inp
cp $DLXSROOT/prep/s/samplefa/samplefa.text.inp $DLXSROOT/prep/w/workshopfa/workshopfa.text.inp
mkdir -p $DLXSROOT/obj/w/workshopfa
mkdir -p $DLXSROOT/bin/w/workshopfa
cp $DLXSROOT/bin/s/samplefa/preparedocs.pl $DLXSROOT/bin/w/workshopfa/.
cp $DLXSROOT/bin/s/samplefa/Makefile $DLXSROOT/bin/w/workshopfa/Makefile

Now you'll need to edit these files to ensure that the paths match your $DLXSROOT and that the collection name is workshopfa instead of samplefa.

STOP!! Make sure you edit the files before going to the next steps!!

Make sure you change these files:

You can run this command to check to see if you forgot to change samplefa to workshopfa:

grep "samplefa" $DLXSROOT/bin/w/workshopfa/* $DLXSROOT/prep/w/workshopfa/* |grep -v "#"

 

With the ready-to-go ead2002 encoded finding aids files in the data directory, we are ready to begin the preparation process. This will include:

  1. validating the files individually against the EAD 2002 DTD
    make validateeach
  2. concatenating the files into one larger XML file
    make prepdocs
  3. validating the concatenated file against the dlxsead2002 DTD:
    make validate
  4. "normalizing" the concatenated file.
    make norm
  5. validating the normalized concatenated file against the dlxsead2002 DTD
    make validate

These steps are generally handled via the Makefile in $DLXSROOT/bin/s/samplefa which we have copied to $DLXSROOT/bin/w/workshopfa. To see the Makefile and how it is used, click here.

Make sure you changed your copy of the Makefile to reflect "/w/workshopfa" instead of "/s/samplefa". You will want to change lines 2 and 3 accordingly:

   1  
   2  NAMEPREFIX = samplefa
   3  FIRSTLETTERSUBDIR = s

Tip: Be sure not to add any space after the workshopfa or w.

If you are doing this at your home institution you will also want to make sure you change $DLXSROOT, and the locations of the various binaries to match your installation. We will not need to do this for the workshop.


Step 1: Validating the files individually against the EAD 2002 DTD

cd $DLXSROOT/bin/w/workshopfa
make validateeach


The Makefile runs the following command:
% /l1/workshop/userXX/dlxs/prep/w/workshopfa/validateeach.csh

What's happening: The makefile is running the c-shell script validateeach.sh in the prep directory. The script creates a temporary file without the public DOCTYPE declaration, runs onsgmls on each of the resulting XML files in the data subdirectory to make sure they conform with the EAD 2002 DTD. If validation errors occur, error files will be in the data subdirectory with the same name as the finding aids file but with an extension of .err. If there are validation errors, fix the problems in the source XML files and re-run.

Check the error files by running the following commands

 ls -l $DLXSROOT/prep/w/workshopfa/data/*err
if there are any *err files, you can look at them with the following command:
 less  $DLXSROOT/prep/w/workshopfa/data/*err

There are not likely to be any errors with the workshopfa data, but tell the instructor if there are.


Step 2: Concatentating the files into one larger XML file (and running some preprocessing commands)

cd $DLXSROOT/bin/w/workshopfa
make prepdocs
The Makefile runs the following command:
$DLXSROOT/bin/w/workshopfa/preparedocs.pl $DLXSROOT/prep/w/workshopfa/data $DLXSROOT/obj/w/workshopfa/workshopfa.xml $DLXSROOT/prep/w/workshopfa/logfile.txt
This runs the preparedocs.pl script on all the files in the specified data directory and writes the output to the workshopfa.xml file in the appropriate /obj subdirectory. It also outputs a logfile to the /prep directory:
The Perl script does two sets of things:
  1. Concatenates all the files
  2. Runs a number of preprocessing steps on all the files

Concatenating the files

The script finds all XML files in the data subdirectory,and then strips off and xml declaration and doctype declaration from each file before concatenating them together. It also wraps the concatenated EADs in a <COLL> tag . The end result looks like:

<COLL>
<ead><eadheader><eadid>1</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>2</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>3</eadid>...</eadheader>... content</ead>
</COLL>

Preprocessing steps

The perl program also does some preprocessing on all the files. These steps are customized to the needs of the Bentley. You should look at the perl code and modify it so it is appropriate for your encoding practices.

The preprocessing steps are:

The output of the combined concatenation and preprocessing steps will be the one collection named xml file which is deposited into the obj subdirectory.

 

Step 3: Validating the concatenated file against the dlxsead2002 DTD

make validate
The Makefile runs the following command:
onsgmls -wxml -s -f $DLXSROOT/prep/w/workshopfa/workshopfa.errors $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/w/workshopfa/workshopfa.xml.inp $DLXSROOT/obj/w/workshopfa/workshopfa.xml

This runs the onsgmls command against the concatenated file using the dlxs2002dtd, and writes any errors to the workshopfa.errors file in the appropriate subdirectory in $DLXSROOT/prep/c/collection.. More details

Note that we are running this using workshopfa.xml.inp not workshop.text.inp. The workshopfa.xml.inp file points to $DLXSROOT/misc/sgml/dlxsead2002.ead which is the dlxsead2002 DTD. The dlxsead2002 DTDis exactly the same as the EAD2002 DTD, but adds a wrapping element, <COLL>, to be able to combine more than one ead element, more than one finding aid, into one file. The larger file will be indexed with XPAT tomorrow. It is, of course, a good idea to validate the file now before going further.

ls -l $DLXSROOT/prep/w/workshopfa/workshopfa.errors

$ less $DLXSROOT/prep/w/workshopfa/workshopfa.errors

Step 4: Normalizing the concatenated file

make norm
The Makefile runs a series of copy statements and two main commands:
1.)   /l/local/bin/osgmlnorm -f $DLXSROOT/prep/s/samplefa/samplefa.errors $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT$DLXSROOT/prep/s/samplefa/samplefa.xml.inp $DLXSROOT/obj/s/samplefa/samplefa.xml.prenorm > /l1/dev/tburtonw/obj/s/samplefa/samplefa.xml.postnorm
2.)   /l/local/bin/osx -bUTF-8 -xlower -xempty -xno-nl-in-tag -f /l1/dev/tburtonw/prep/s/samplefa/samplefa.errors /l1/dev/tburtonw/misc/sgml/xml.dcl /l1/dev/tburtonw/prep/s/samplefa/samplefa.xml.inp /l1/dev/tburtonw/obj/s/samplefa/samplefa.xml.postnorm > /l1/dev/tburtonw/obj/s/samplefa/samplefa.xml.postnorm.osx 

These commands ensure that your collection data is normalized. What this means is that any attributes are put in the order in which they were defined in the DTD. Even though your collection data is XML and attribute order should be irrelevant (according to the XML specification), due to a bug in one of the supporting libraries used by xmlrgn (part of the indexing software), attributes must appear in the order that they are definded in the DTD. If you have "out-of-order" attributes and don't run make norm, you will get "invalid endpoints" errors during the make post step.

Step one, which normalizes the document writes its errors to $DLXSROOT/prep/s/samplefa/samplefa.errors. Be sure to check this file.

Step 2, which runs osx to convert the normalized document back into XML produces lots of error messages which are written to standard output. These can generally be ignored.

Step 5: Validating the normalized file against the dlxsead2002 DTD

make validate 

We run this step again to make sure that the normalization process did not produce an invalid document. This is necessary because under some circumstances the "make norm" step can result in invalid XML. One known cause of this is the presense of XML processing instructions. For example: "<?Pub Caret1?>".

More Documentation

 

Findaid Class Index Building with XPAT

In this section the workshopfa XML will be indexed with the XPAT search engine, preparing it for use with the DLXS middleware.


Set Up Directories and Files for XPAT Indexing

First, we need to create the rest of the directories in the workshopfa environment with the following commands:

mkdir -p $DLXSROOT/idx/w/workshopfa

The bin directory we created yesterday holds any scripts or tools used for the collection specifically; obj ( created earlier) holds the "object" or XML file for the collection, and idx holds the XPAT indexes. Now we need to finish populating the directories.


cp $DLXSROOT/prep/s/samplefa/samplefa.blank.dd $DLXSROOT/prep/w/workshopfa/workshopfa.blank.dd
cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/w/workshopfa/workshopfa.extra.srch

Each of these files need to be edited to reflect the new collection name and the paths to your particular directories. This will be true when you use these at your home institution as well, even if you use the same directory architecture as we do, because they will always need to reflect the unique name of each collection. Failure to change even one file can result in puzzling errors, because the scripts are working, just not necessarily in the directories you are looking at.

grep -l "samplefa" $DLXSROOT/prep/w/workshopfa/*

will check for changing s/samplefa to w/workshopfa. If you are at the workshop that should be all you need. However if you are doing this at your home institution you need to replace "/l1/" by whatever $DLXSROOT is on your server. If you don't have an /l1 directory on your server (which is very likely if you are not here using a DLPS machine) you can check with:

grep -l "l1" $DLXSROOT/prep/w/workshopfa/*

 


Build the XPAT Index

Everything is now set up to build the XPAT index. The Makefile in the bin directory contains the commands necessary to build the index, and can be executed easily.

To create an index for use with the Findaid Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" structures based on a combination of elements (for example, defining who the "main author" of a finding aid is, without adding a <mainauthor> tag around the appropriate <author> in the eadheader element). The following commands can be used to make the index:

make singledd indexes words for texts that have been concatenated into on large file for a collection. This is the recommended process.

make xml indexes the XML structure by reading the DTD. Validates as it indexes.

make post builds and indexes fabricated regions based on the XPAT queries stored in the workshopfa.extra.srch file.

cd $DLXSROOT/bin/w/workshopfa
make singledd
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.blank.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
/l/local/xpat/bin/xpatbld -m 256m -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd
make xml
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
/l/local/xpat/bin/xmlrgn -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	/l1/workshop/test02/dlxs/misc/sgml/xml.dcl
	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.inp
	/l1/workshop/test02/dlxs/obj/w/workshopfa/workshopfa.xml

cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.prepost.dd
make post
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.prepost.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
touch /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.init
/l/local/xpat/bin/xpat -q /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	< /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.srch
	| /l1/workshop/test02/dlxs/bin/t/text/output.dd.frag.pl
	/l1/workshop/test02/dlxs/idx/w/workshopfa/
	> /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd
/l1/workshop/test02/dlxs/bin/t/text/inc.extra.dd.pl
	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd

Testing the index

At this point it is a good idea to do some testing of the newly created index. Invoke xpat with the following commands

xpatu $DLXSROOT/idx/w/workshopfa/workshopfa.dd
Try searching for some likely regions. Its a good idea to test some of the fabricated regions. Here are a few sample queries:
>> region "ead"
  1: 3 matches

>> region "eadheader"
  2: 3 matches

>> region "mainauthor"
  3: 3 matches

>> region "maintitle"
  4: 3 matches

>> region "admininfo"
  5: 3 matches


Fabricated Regions in FindaidClass

The make post step and the testing steps above leads us into a discussion of the use of fabricated regions in FindaidClass. uses the workshopfa.extra.srch file to add to the XPAT index. The Makefile in the bin directory conta

"Fabricated" is a term we use to describe what are essentially virtual regions in an XPat indexed text. See a basic description of what a fabricated region is and how they are created.

In Finding Aids, we use fabricated regions for certain uninteresting regions simply so that some code can be shared. For example, the fabricated region "main" is set to refer to <ead> in FindaidClass with:

(region ead); {exportfile "/l1/idx/b/bhlead/main.rgn"}; export; ~sync "main";

whereas in TextClass "main" can refer to <TEXT>. Therfore, both FindaidClass and TextClass can share the Perl code, in a higher level subclass, that creates searches for "main".

Other fabricated regions are used for searching such as the maintitle and mainauthor regions.

The majority of the fabricated regions for Findaid Class are used for the creation and display of the left hand table of contents in the "outline" view. The findaidclass.cfg file contains a hash called %gSectHeadsHash which is normally loaded into FindaidClass.pm's tocheads hash in the FindaidClass::_initialize method. The elements of the hash and the corresponding fabiricated regions are used to create the table of contents and to output the XML for the corresponding section of the EAD when one of the TOC links is clicked on by a user. The fabricated regions are used so XPAT can have binary indexes ready to use for fast retrieval of these EAD sections.

Some of the more interesting regions extracted from the samplefa.extra.srch file are listed below.

One of these regions is the add. This used to be <ADD> in the EAD 1.0 DTD, but now, is created based on the ead2002 DTD's <descgrp> tag which contains a type attribute of add.

A number of issues related to varying encoding practices can be resolved by the appropriate edits to the *.extra.srch file. (Although some of them may require changes to other files as well)




  (region ead); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/main.rgn"}; export; ~sync "main";
   
    ##
    (((region "<c01".."</did>" + region "<c02".."</did>" + region "<c03".."</did>" + region "<c04".."</did>" + region "<c05".."</did>" + region "<c06".."</did>" + region "<c07".."</did>" + region "<c08".."</did>" + region "<c09".."</did>") not incl ("level=file" + "level=item")) incl "level="); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/c0xhead.rgn"}; export; ~sync "c0xhead";
       ##
    ((region "<origination".."</unittitle>") within ((region did within region archdesc) not within region dsc)); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/maintitle.rgn"}; export; ~sync "maintitle";
    ##
       
    ((region "persname" + region "corpname" + region "famname" + region "name") within (region "origination" within ( region "did" within (region "archdesc")))); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/mainauthor.rgn"}; export; ~sync "mainauthor";
    ##
   
    (region "abstract" within ((region did within region archdesc) not within region "c01")); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/mainabstract.rgn"}; export; ~sync "mainabstract";
       ##
       ((region unitdate incl "encodinganalog=245$f") within ((region did within region archdesc) not within region dsc)); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/colldate.rgn"}; export; ~sync "colldate";
    ##
    
    (region dsc); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/contentslist.rgn"}; export; ~sync "contentslist";
    ##
     ########## admininfo ########
    admininfot = (region "descgrp-T" incl (region "A-type" incl "admin")); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/admininfo-t.rgn"}; export; ~sync "admininfo-t";
    ##
    ## ########## add ######
    addt = (region "descgrp-T" incl (region "A-type" incl "add")); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/add-t.rgn"}; export; ~sync "add-t";
  ## ########## frontmatter/titlepage ########
  frontmattert = region "frontmatter-T"; {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/frontmatter-t.rgn"}; export; ~sync "frontmatter-t";
    ##
    # frontmatter itself not needed as fabricated region since it exists
    # as a regular xml region
    ##
  ## ########## bioghist ########
    bioghist = ((region "bioghist" within region "archdesc") not within region "dsc"); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/bioghist.rgn"}; export; ~sync "bioghist";
    
  ##bioghisthead = ((region "<bioghist" .. "</head>" within region "archdesc") not within region "dsc"); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/bioghisthead.rgn"}; export; ~sync "bioghisthead";
    ##
  ((region did within region archdesc) not within region dsc); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/summaryinfo.rgn"}; export; ~sync "summaryinfo";;
    ##
  ##
  #############################
  (region "subject" + region "corpname" + region "famname" + region "name" + region "persname" + region "geogname"); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/subjects.rgn"}; export; ~sync "subjects";
  (region "corpname" + region "famname" + region "name" + region "persname"); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/names.rgn"}; export; ~sync "names";
  
   
  #(region "odd-T" ^ (region odd not within region dsc)); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/odd-t.rgn"}; export; ~sync "odd-t";  

See a full listing of the extra.srch file of the Bentley Historical Library's finding aids.

More Documentation