Findaid Class Collection Implementation

DLXS Workshop, August 2004

Findaid Class Instructors: Chris Powell, Alan Pagliere, with Greg Kinney (of the Bentley Historical Library)

This portion of the DLXS Workshop focuses on implementing a collection in the Findaid Class. It is organized as a hands-on lesson, with the entire process outlined in detail. All of the steps are included so that it can be repeated or used as a guide later. Links to the detailed Findaid Class documentation are included.

A printed copy of this document will be available at the workshop.


Workshop Day 2 -- Tuesday Afternoon

Workshop Day 3 -- Wednesday Morning

Workshop Day 3 -- Wednesday Afternoon

Workshop Day 2 -- Tuesday Afternoon

Findaid Class Encoding Practices and Processes

In Findaid Class Encoding Practices and Processes we discuss the elements and attributes required for "out of the box" Findaid Class delivery, preparing the work environment and validating the data, and linking from finding aids using DAOs. Greg Kinney, Associate Archivist, Bentley Historical Library, will give a short presentation on the Bentley Library's encoding practices from the point of view of the Library's interpretation of the EAD 2002 DTD along with a description of the specific tools and workflow used to create the XML encoded finding aids files.


EAD 2002 DTD Overview

It is assumed that your Finding Aids have been encoded in the XML-based EAD 2002 DTD. More specific DTD topics can be found at: http://www.loc.gov/ead/tglib/index.html.

DLPS does not have any preferred methods or quick and easy tools for this stage of the process. Only you, looking at your texts and your encoding practices, can do the intellectual work required to encode your finding aids in XML using the EAD 2002 DTD. Greg Kinney will discuss how the University of Michigan's Bentley Historical Library handles this process.


Practical EAD Encoding Issues

There are, however, two areas of practice that can affect your online collection and that fall outside hands-on encoding or conversion of word-processed finding aids. One is the use of IDs as attributes on elements. To be clear, we are NOT talking about the eadid here, but about IDs used to identify an element so that it can be referred to, or referenced from, somewhere else. You no doubt all know that each ID within a document must be unique (and the DTD enforces this). However, you may not have thought about the consequences of joining all your finding aids into one collection: your IDs will need to be unique across the entire collection. One way to ensure this is to prefix ID values with the eadid for a given document. At this time, there is no functionality in DLXS that requires you to have IDs on any elements, but you may have used them for your own internal purposes. We have run into this ourselves and wanted to give everyone a heads-up, on the theory that our problems are fairly typical.
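The eadid-prefixing idea can be sketched in a few lines of Python. This is only an illustration, not a DLXS tool: the function name prefix_ids is hypothetical, and it assumes that IDs live in id attributes, that references to them live in target attributes, and that the eadid text is itself usable as the start of an XML ID. Note that ElementTree discards the DOCTYPE declaration when it parses, so treat this as a sketch of the logic rather than a drop-in converter.

```python
import xml.etree.ElementTree as ET


def prefix_ids(ead_xml: str) -> str:
    """Return the finding aid with id/target values prefixed by its eadid."""
    root = ET.fromstring(ead_xml)
    eadid = (root.findtext(".//eadid") or "").strip()
    if not eadid:
        raise ValueError("no eadid found; cannot build a prefix")
    prefix = eadid + "-"
    for elem in root.iter():
        # Assumption: 'id' holds the ID and 'target' holds references to it.
        for attr in ("id", "target"):
            value = elem.get(attr)
            if value and not value.startswith(prefix):
                elem.set(attr, prefix + value)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<ead><eadheader><eadid>umich-bhl-0001</eadid></eadheader>"
    '<archdesc level="collection"><did id="d1"><unittitle>Papers</unittitle></did>'
    '<ref target="d1">see above</ref></archdesc></ead>'
)
fixed = prefix_ids(sample)
```

Because the same prefix is applied to both the ID and the references to it, links inside a finding aid keep working after the rewrite.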

Another issue that you might run into, especially if you are migrating finding aids from SGML EAD 1.0 to XML EAD 2002, is that of handling special characters. If you are authoring finding aids in multiple languages in XML using an XML authoring tool, this is unlikely to be a problem for you: you are aware of the issues, UTF-8 is the default encoding for XML, and you will have no problems. You'll just want to make sure to index with the UTF-8 enabled version of XPAT, as was discussed earlier. If you have finding aids with multiple languages and/or special characters, you've probably thought this through already. However, if you have the occasional e acute (é) in your SGML finding aid, you'll need to think about what you want to do with these characters. A straight conversion from SGML to XML will probably convert the character entities in your files (for example, &eacute;) to numeric entities (for example, &#233;). While this is valid, it will present a problem with regard to searching. XPAT will treat the entity as a string of characters, so in order to search for blesséd, you would need to key in bless&#233;d. If all your special characters are ISO Latin 1, you can convert them to their 8-bit equivalents and index as usual. If you have a mixture, UTF-8 is the way to go. Again, this is merely a heads-up that will have no bearing on the sample finding aids, which were chosen for their size and linking behaviors, and which are sadly conventional in their use of character entities (ampersand only, in fact).
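If you decide to go the UTF-8 route, numeric character references such as &#233; can be turned back into the characters they stand for before indexing. The following Python sketch shows one way to do that; the function name decode_numeric_refs is hypothetical and this is not part of the DLXS tools. It deliberately leaves references to the XML-special characters (ampersand, angle brackets, quotes) untouched so that the file stays well-formed.

```python
import re

# Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Code points that must stay escaped for the XML to remain well-formed.
_KEEP_ESCAPED = {ord(c) for c in "&<>\"'"}


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the characters themselves."""
    def replace(match: re.Match) -> str:
        if match.group(1):
            code = int(match.group(1), 16)
        else:
            code = int(match.group(2))
        if code in _KEEP_ESCAPED:
            return match.group(0)
        return chr(code)

    return _NUMERIC_REF.sub(replace, text)
```

The result should be written out as UTF-8 and indexed with the UTF-8 enabled version of XPAT.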


Data Preparation

For today, we are going to be working with some texts that are already in Findaid Class. We will be building them into a collection we are going to call workshopfa.

This documentation will make use of the concept of the $DLXSROOT, which is the place at which your DLXS directory structure starts. We generally use /l1/, but for the workshop, we each have our own $DLXSROOT in the form of /l1/workshop/userX/dlxs/. To check your $DLXSROOT, type the following commands at the command prompt:

cd $DLXSROOT
pwd

The prep directory under $DLXSROOT is the space for you to take your encoded finding aids and "package them up" for use with the DLXS middleware. Create your basic directory $DLXSROOT/prep/w/workshopfa and its data subdirectory with the following command:

mkdir -p $DLXSROOT/prep/w/workshopfa/data

Move into the prep directory with the following command:

cd $DLXSROOT/prep/w/workshopfa

This will be your staging area for all the things you will be doing to your texts, and ultimately to your collection. At present, all it contains is the data subdirectory you created a moment ago. We will be populating it further over the course of the next two days. Unlike the contents of other collection-specific directories, everything in prep should be ultimately expendable in the production environment.

Copy the necessary files into your data directory with the following command:

cp $DLXSROOT/prep/s/samplefa/data/*.xml $DLXSROOT/prep/w/workshopfa/data/.

We'll also need a few files to get us started working. They will need to be copied over as well, and also have paths adapted and collection identifiers changed. Follow these commands:

cp $DLXSROOT/prep/s/samplefa/validateeach.csh $DLXSROOT/prep/w/workshopfa/.
cp $DLXSROOT/prep/s/samplefa/samplefa.text.inp $DLXSROOT/prep/w/workshopfa/workshopfa.text.inp
cp $DLXSROOT/prep/s/samplefa/samplefa.inp $DLXSROOT/prep/w/workshopfa/workshopfa.inp
mkdir -p $DLXSROOT/obj/w/workshopfa
mkdir -p $DLXSROOT/bin/w/workshopfa
cp $DLXSROOT/bin/s/samplefa/preparedocs.pl $DLXSROOT/bin/w/workshopfa/.
cp $DLXSROOT/bin/s/samplefa/validate.pl $DLXSROOT/bin/w/workshopfa/.

Now you'll need to edit these files to ensure that the paths match your $DLXSROOT and that the collection name is workshopfa instead of samplefa.
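The collection-name part of this edit is a plain search and replace, which can be sketched in Python as follows. The function name retarget is hypothetical, and the sketch assumes the copied files refer to the collection only as samplefa and s/samplefa. It does not touch the $DLXSROOT portion of any path, which still has to be checked by hand.

```python
def retarget(text: str, old: str = "samplefa", new: str = "workshopfa") -> str:
    """Swap one collection identifier for another in a file's text."""
    # First fix the first-letter directory that precedes the collection name,
    # e.g. /s/samplefa becomes /w/workshopfa.
    text = text.replace(f"/{old[0]}/{old}", f"/{new[0]}/{new}")
    # Then fix any remaining bare occurrences, e.g. samplefa.inp.
    return text.replace(old, new)


line = "/l1/prep/s/samplefa/samplefa.inp"
updated = retarget(line)
```

Reading each copied file, passing its contents through a function like this, and writing it back would cover the collection name; reviewing the result by eye is still worthwhile.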

With the ready-to-go ead2002 encoded finding aids files in the data directory, we are ready to begin the preparation process. This will include:

  1. validating the files individually against the EAD 2002 DTD
  2. concatenating the files into one larger XML file
  3. validating the concatenated file against the dlxsead2002 DTD

These steps are generally handled via the Makefile in $DLXSROOT/bin/s/samplefa, but during this workshop we will run through the steps "manually." See the Makefile documentation for how it is used.


Step 1: Validating the files individually against the EAD 2002 DTD

cd $DLXSROOT/prep/w/workshopfa
./validateeach.csh

What's happening: The script creates a temporary copy of each XML file in the data subdirectory without the public DOCTYPE declaration, then runs onsgmls on each copy to make sure it conforms to the EAD 2002 DTD. If validation errors occur, error files will appear in the data subdirectory with the same name as the finding aid file but with an extension of .err. Fix the problems in the source XML files and re-run the script.

There are not likely to be any errors with the workshopfa data, but tell the instructor if there are.


Step 2: Concatenating the files into one larger XML file

cd $DLXSROOT/bin/w/workshopfa
./preparedocs.pl $DLXSROOT/prep/w/workshopfa/data $DLXSROOT/obj/w/workshopfa/workshopfa.xml $DLXSROOT/prep/w/workshopfa/logfile.txt

The Perl script finds all XML files in the data subdirectory, checks the encoding type, and removes the XML and DOCTYPE declarations. It then adds a prefix string to DAO links, removes empty persname, corpname, and famname elements, concatenates the files, and wraps them in a collection (<COLL>) element.

If your collections need to be transformed in any way, or if you do not want the transformations to take place (the DAO changes, for example), edit this file to effect the changes.

The output is a single XML file named for the collection (workshopfa.xml), which is deposited into the obj subdirectory.
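To make the core of this step concrete, here is a minimal Python sketch of the strip-and-wrap logic only. It is not preparedocs.pl and does not reproduce its encoding check, DAO prefixing, or empty-element removal; the function name concatenate and the regular expressions are assumptions made for illustration.

```python
import re

# XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(r"<\?xml[^>]*\?>\s*")
# DOCTYPE declaration, with an optional internal subset in square brackets.
_DOCTYPE = re.compile(r"<!DOCTYPE[^>\[]*(\[.*?\])?\s*>\s*", re.DOTALL)


def concatenate(docs):
    """Strip per-file declarations and wrap the finding aids in <COLL>."""
    bodies = []
    for doc in docs:
        doc = _XML_DECL.sub("", doc, count=1)
        doc = _DOCTYPE.sub("", doc, count=1)
        bodies.append(doc.strip())
    return "<COLL>\n" + "\n".join(bodies) + "\n</COLL>\n"


docs = [
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE ead PUBLIC "+//ISBN 1-931666-00-8//DTD ead.dtd '
    '(Encoded Archival Description (EAD) Version 2002)//EN" "ead.dtd">\n'
    "<ead><eadheader/></ead>\n",
    '<ead><archdesc level="collection"/></ead>',
]
combined = concatenate(docs)
```

The declarations have to go because the combined file can carry only one of each, and the individual ead elements need a single wrapping element to form one well-formed document.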


Step 3: Validating the concatenated file against the dlxsead2002 DTD

onsgmls -s -f $DLXSROOT/prep/w/workshopfa/workshopfa.errors $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/w/workshopfa/workshopfa.inp $DLXSROOT/obj/w/workshopfa/workshopfa.xml

The dlxsead2002 DTD is exactly the same as the EAD2002 DTD, but adds a wrapping element, <COLL>, to be able to combine more than one ead element, more than one finding aid, into one file. The larger file will be indexed with XPAT tomorrow. It is, of course, a good idea to validate the file now before going further.

If there is an error, the file $DLXSROOT/prep/w/workshopfa/workshopfa.errors will be present and contain messages about the invalidities found.

 

More Documentation

Workshop Day 3 -- Wednesday Morning

Findaid Class Index Building with XPAT

In this section the workshopfa XML will be indexed with the XPAT search engine, preparing it for use with the DLXS middleware.


Set Up Directories and Files for XPAT Indexing

Yesterday, we did what we needed to do with our materials "by hand" -- today, we will work with the materials packaged in the samplefa collection and adapt them for use with workshopfa. This should parallel what you'll be doing back at your institutions. First, we need to create the rest of the directories in the workshopfa environment with the following commands:

mkdir -p $DLXSROOT/idx/w/workshopfa

The bin directory we created yesterday holds any scripts or tools used for the collection specifically; obj (again, created yesterday) holds the "object" or XML file for the collection, and idx holds the XPAT indexes. Now we need to finish populating the directories.

cp $DLXSROOT/bin/s/samplefa/Makefile $DLXSROOT/bin/w/workshopfa/Makefile
cp $DLXSROOT/prep/s/samplefa/samplefa.blank.dd $DLXSROOT/prep/w/workshopfa/workshopfa.blank.dd
cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/w/workshopfa/workshopfa.extra.srch

Each of these files needs to be edited to reflect the new collection name and the paths to your particular directories. This will be true when you use these at your home institution as well, even if you use the same directory architecture as we do, because they will always need to reflect the unique name of each collection. Failure to change even one file can result in puzzling errors, because the scripts are working, just not necessarily in the directories you are looking at.


Build the XPAT Index

Everything is now set up to build the XPAT index. The Makefile in the bin directory contains the commands necessary to build the index, and can be executed easily.

To create an index for use with the Findaid Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" structures based on a combination of elements (for example, defining who the "main author" of a finding aid is, without adding a <mainauthor> tag around the appropriate <author> in the eadheader element). The following commands can be used to make the index, alone or in combination. We will be using make singledd, make xml, and make post.

Aside: note that the Makefile also contains targets for the commands you entered "by hand" above to validate and prepare the data. Read more about the Makefile.

make singledd indexes words for texts that have been concatenated into one large file for a collection. This is the recommended process.

make sgml indexes the SGML structure by reading the DTD. Validates as it indexes. Slower than multirgn indexing (see below) for this reason. However, necessary for collections that have nested elements of the same name.

make xml indexes the XML structure by reading the DTD. Validates as it indexes. Slower than multirgn indexing (see below) for this reason. However, necessary for collections that have nested elements of the same name.

make multi (multiregion structure indexing) indexes the XML structure and relies on a "tags file" (included in the sample collection) to know what XML elements and attributes to index. Rarely used with fully-encoded collections because of the nesting problem mentioned above.

make mfsdd (multi-file system indexing) indexes words and structure for each XML text listed in the data dictionary (dd) individually. Seems like a good idea -- no redundant copies of files! -- but searching is slower than an index built of concatenated files. Also, if any one of the files referenced changes in any way, the entire index fails. We no longer use MFS indexes ourselves for this reason.

make post builds and indexes fabricated regions based on the XPAT queries stored in the workshopfa.extra.srch file.

make singledd
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.blank.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
/l/local/xpat/bin/xpatbld -m 256m -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd
make xml
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
/l/local/xpat/bin/xmlrgn -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	/l1/workshop/test02/dlxs/misc/sgml/xml.dcl
	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.inp
	/l1/workshop/test02/dlxs/obj/w/workshopfa/workshopfa.xml

cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.prepost.dd
make post
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.prepost.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
touch /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.init
/l/local/xpat/bin/xpat -q /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
	< /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.srch
	| /l1/workshop/test02/dlxs/bin/t/text/output.dd.frag.pl
	/l1/workshop/test02/dlxs/idx/w/workshopfa/
	> /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd
/l1/workshop/test02/dlxs/bin/t/text/inc.extra.dd.pl
	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd
	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd

Fabricated Regions in FindaidClass

The post step above leads us into a discussion of the use of fabricated regions in FindaidClass. It uses the workshopfa.extra.srch file to add these regions to the XPAT index.

"Fabricated" is a term we use to describe what are essentially virtual regions in an XPat indexed text. See a basic description of what a fabricated region is and how they are created.

In Finding Aids, we use fabricated regions for certain uninteresting regions simply so that some code can be shared. For example, the fabricated region "main" is set to refer to <ead> in FindaidClass with:

(region ead); {exportfile "/l1/idx/b/bhlead/main.rgn"}; export; ~sync "main";

whereas in TextClass "main" can refer to <TEXT>. Therefore, both FindaidClass and TextClass can share the Perl code, in a higher level subclass, that creates searches for "main".

More interesting and more specific to FindaidClass are some of those listed below, which are taken from the Bentley Historical Library's bhl.extra.srch file. (See the whole file here). You can see the creation of generic regions like "c0xhead", everything from a <c0x> open tag to the following closing </did> tag.

Another interesting fabricated region is add. This used to be <ADD> in the EAD 1.0 DTD, but is now created from the ead2002 DTD's <descgrp> tag when it has a type attribute of add.

One of the main reasons these are used is in the creation and display of the "outline" view. The FindaidClass.pm's _initialize method sets a hash called "tocheads" whose elements correspond to most of these fabricated regions. In this way, the CGI has a shorthand way of asking XPAT to return these regions, and XPAT has binary indexes ready to use for fast retrieval.


(((region "<c01".."</did>" + region "<c02".."</did>" +
  region "<c03".."</did>" +
    region "<c04".."</did>" + region "<c05".."</did>" + region "<c06".."</did>" +
    region "<c07".."</did>" + region "<c08".."</did>" + region "<c09".."</did>")
    not incl ("level=file" + "level=item")) incl "level="); {exportfile "/l1/idx/b/bhlead/c0xhead.rgn"};
    export; ~sync "c0xhead";
	
((region "<origination".."</unittitle>") within ((region did within region
    archdesc) not within region dsc)); {exportfile "/l1/idx/b/bhlead/maintitle.rgn"};
    export; ~sync "maintitle";

(region "abstract" within ((region did within region archdesc) not within region "c01"));
    {exportfile "/l1/idx/b/bhlead/mainabstract.rgn"}; export; ~sync "mainabstract";
	
(region "eadid"); {exportfile "/l1/idx/b/bhlead/callnum.rgn"}; export; ~sync "callnum";

(region "dsc-T"); {exportfile "/l1/idx/b/bhlead/contentslist-t.rgn"}; export;
~sync "contentslist-t";

(region dsc); {exportfile "/l1/idx/b/bhlead/contentslist.rgn"}; export; ~sync "contentslist";

admininfot = (region "descgrp-T" incl (region "A-type" incl "admin")); 
	{exportfile "/l1/idx/b/bhlead/admininfo-t.rgn"};
	export; ~sync "admininfo-t";

(region "descgrp" incl *admininfot); {exportfile "/l1/idx/b/bhlead/admininfo.rgn"};
	export; ~sync "admininfo";

addt = (region "descgrp-T" incl (region "A-type" incl "add")); 
	{exportfile "/l1/idx/b/bhlead/add-t.rgn"};
	export; ~sync "add-t";

(region "descgrp" incl *addt); {exportfile "/l1/idx/b/bhlead/add.rgn"}; export;
	~sync "add";

region "controlaccess-T" ^ region "controlaccess"; 
	{exportfile "/l1/idx/b/bhlead/controlaccess-t.rgn"};
	export; ~sync "controlaccess-t";

(region "controlaccess"); {exportfile "/l1/idx/b/bhlead/controlaccess.rgn"};
	export; ~sync "controlaccess";

(region "subject" + region "occupation" + region "corpname" + region "famname" +
	region "name" + region "persname" + region "geogname"); 
	{exportfile "/l1/idx/b/bhlead/subjects.rgn"};
	export; ~sync "subjects";

(region "corpname" + region "famname" + region "name" + region "persname"); 
	{exportfile "/l1/idx/b/bhlead/names.rgn"};
	export; ~sync "names";

(region geogname); {exportfile "/l1/idx/b/bhlead/places.rgn"}; export; ~sync "places";

See a full listing of the extra.srch file of the Bentley Historical Library's finding aids.


More Documentation

Workshop Day 3 -- Wednesday Afternoon

Findaid Class Collection to Web

These are the final steps in deploying a Findaid Class collection online. Here the Collection Manager will be used to review the Collection Database entry for workshopfa. The Collection Manager will also be used to check the Group Database. Finally, we need to work with the collection map and set up the collection's web directory.


Review the Collection Database Entry with CollMgr

Each collection has a record in the collection database that holds collection-specific configurations for the middleware. CollMgr (Collection Manager) is a web-based interface to the collection database that provides functionality for editing each collection's record. Collections can be checked out for editing, checked in for testing, and released to production. A collection database record for workshopfa has already been created and we will edit it. In general, a new collection needs to have a CollMgr record created from scratch before the middleware can be used. Take a look at the record to become familiar with it.

http://username.ws.umdl.umich.edu/cgi/c/collmgr/collmgr

Notice that it thinks it's the samplefa collection. Change references to s/samplefa to w/workshopfa. Let's change the name as well: remove the reference to Sample DLXS Finding Aids Collection and change it to text:whatever you want to call it.

More Documentation


Review the Groups Database Entry with CollMgr

Another function of CollMgr allows the grouping of collections for cross-collection searching. Any number of collection groups may be created for Findaid Class. Findaid Class supports a group with the groupid "all". It is not a requirement that all collections be in this group, though that's the basic idea. Groups are created and modified using CollMgr. For this workshop, the group "all" record has already been edited to include the workshopfa collection. Take a look at the record to become familiar with it.

http://username.ws.umdl.umich.edu/cgi/c/collmgr/collmgr

We won't be doing anything with groups; I'm sure you will in Image Class.


Make Collection Map

Collection mapper files exist to identify the regions and operators used by the middleware when interacting with the search forms. Each collection will need one, but most collections can use a fairly standard map file, such as the one in the samplefa collection. The map files for all Findaid Class collections are stored in $DLXSROOT/misc/f/findaid/maps.

Map files take language that is used in the forms and translate it into language for the cgi and for XPAT. For example, if you want your users to be able to search within names, you would need to add a mapping for how you want it to appear in the search interface (case is important, as is pluralization!), how the cgi variable would be set (usually all caps, and not stepping on an existing variable), and how XPAT will identify and retrieve this natively (in XPAT search language).

The first part of the map file is operator mapping, for the form, the cgi, and XPAT. The second part is for region mapping, as in the example above.

cd $DLXSROOT/misc/f/findaid/maps
cp samplefa.map workshopfa.map

 

You might note that some of the fields that are defined in the map file correspond to some of the fabricated regions.

More Documentation


Set Up the Collection's Web Directory

Each collection may have a web directory with custom Cascading Style Sheets, interface templates, graphics, and javascript. The default is for a collection to use the web templates at $DLXSROOT/web/f/findaid. Of course, collection specific templates and other files can be placed in a collection specific web directory, and it is necessary if you have any customization at all. DLXS Middleware uses fallback to find HTML related templates, chunks, graphics, js and css files.

For a minimal collection, you will want three files: index.html, home.tpl, and findaidclass-specific.css. You'll also need a browse.tpl if you want the collection to be browseable.

mkdir -p $DLXSROOT/web/w/workshopfa
cp $DLXSROOT/web/s/samplefa/index.html $DLXSROOT/web/w/workshopfa/index.html
cp $DLXSROOT/web/s/samplefa/home.tpl $DLXSROOT/web/w/workshopfa/home.tpl
cp $DLXSROOT/web/s/samplefa/findaidclass-specific.css $DLXSROOT/web/w/workshopfa/findaidclass-specific.css

As always, we'll need to change the collection name and paths. You might want to change the look radically, if your HTML skills are up to it.

More Documentation


Try It Out

http://username.ws.umdl.umich.edu/cgi/f/findaid/findaid-idx

More Documentation

Linking from Finding Aids Using the ID Resolver

How do you do this?

ID Resolver Data Transformation and Deployment

The ID Resolver is a CGI that takes as input a unique identifier and returns a URI. It is used, for example, by Harper's Weekly to link the text pages in Text Class middleware to the image pages in the Image Class middleware, and vice versa.


Master Data

Resolver master data is managed in a FileMaker Pro database named "resolverdata.FP5" which is stored on the HTIWork server.

In order to help keep track of changes to the idresolver database, duplicate the most recent idresolver folder. Rename the folder to include the current date and your uniqname. Copy the new folder to your local machine and work with it there. When done, be sure to move your copy of the FileMaker file back to the server replacing the folder with your name on it. Finally, remove your name from the folder.

Data should be provided in a tab delimited ascii text file that has two fields.

  1. id
  2. URI

Data additions can simply be imported into new records.

Data updates should in most cases be done for an entire collection at a time, rather than selectively for records within a collection. In such a case, it is necessary to delete all of the existing records for a collection so that duplicate data does not end up in the system. FileMaker is not good about selectively replacing records.

In some cases IDs have been altered to achieve uniqueness within the system. For example, some of the BHL IDs get prefixed with "dao-href-". There are other situations like this. So, if updating data, be aware.
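Before importing or loading a tab-delimited file, it can be worth checking that every row really has the two expected fields and that no id appears twice. The Python sketch below is one way to do that; the function name check_resolver_rows is hypothetical and this check is not part of the ID Resolver itself.

```python
def check_resolver_rows(lines):
    """Return (line number, message) pairs for malformed or duplicate rows."""
    first_seen = {}
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2 or not all(fields):
            problems.append(
                (lineno, "expected two non-empty tab-separated fields"))
            continue
        ident = fields[0]
        if ident in first_seen:
            problems.append(
                (lineno,
                 f"duplicate id (first seen on line {first_seen[ident]})"))
        else:
            first_seen[ident] = lineno
    return problems


rows = [
    "a\thttp://example.org/1\n",
    "b\thttp://example.org/2\n",
    "a\thttp://example.org/3\n",
    "broken line\n",
]
report = check_resolver_rows(rows)
```

An empty result means the file has the expected shape; anything else points at the line to fix before the data goes any further.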


Export from FileMaker


Upload


Load Data into MySQL

First, clean off any Macintosh line breaks (if the data was exported from FileMaker on the Macintosh).

perl -pi -e 's,\x0d,\x0a,g' idresolver.tab

Use the mysqldump utility to write out the SQL commands needed to drop the existing production table and rebuild it without data. You'll need to know the password for the dlxsadm MySQL account.

mysqldump -u dlxsadm -p -h mysql.umdl.umich.edu --add-drop-table -d dlxs idresolver > /tmp/idresolver

Run those commands to drop the table and recreate it, empty.

mysql -u dlxsadm -p -h mysql.umdl.umich.edu dlxs < /tmp/idresolver

Load the new data into the fresh table.

mysql -u dlxsadm -p -h mysql.umdl.umich.edu dlxs

load data local infile '/l1/prep/i/idresolver/idresolver.tab' into table idresolver;

Use the following as a quick test...

select * from idresolver limit 1;
select count(*) from idresolver;

Quit mysql

quit


Test

Plug something like the following in to your web browser and you should get something back. If you choose to test middleware on a development machine that uses the id resolver, make sure that the middleware on that machine is calling the resolver on the machine with the data, and not the resolver on the production server.