Image Class XPAT Index Building

This document describes the steps necessary to build an XPAT index for the Image Class.

New in DLXS 12:

About image-blank

Image Class is distributed with a preconfigured XPAT index directory named "image-blank" that can be used as boilerplate for building new Image Class XPAT indexes. You can find image-blank at...


Set Up New Collection-Specific Directories (if necessary)

In DLXS, all content data (SGML for Image Class) is stored under $DLXSDATAROOT/obj, with the exception of continuous tone images, which are stored under $DLXSDATAROOT/img. Each collection needs its own collection-specific obj and idx directories.

The shell script $DLXSROOT/bin/i/image/setupcollindex automatically creates and configures the idx and obj directories for a new collection. It also copies the SGML file from $DLXSDATAROOT/prep/c/collid to $DLXSDATAROOT/obj/c/collid. It stops short of actually building the index.

usage: $DLXSROOT/bin/i/image/setupcollindex c/collid

example: $DLXSROOT/bin/i/image/setupcollindex s/sampleic
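After running setupcollindex, it can be useful to confirm that both collection directories exist before building. The following is a minimal sketch, not part of DLXS: the function name check_coll_dirs is hypothetical, and it assumes only the idx and obj layout described above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with DLXS): verify that the idx and obj
# directories for a collection exist under the data root.
check_coll_dirs() {
    data_root=$1    # value of $DLXSDATAROOT
    coll=$2         # c/collid, for example s/sampleic
    status=0
    for kind in idx obj; do
        if [ ! -d "$data_root/$kind/$coll" ]; then
            echo "missing: $data_root/$kind/$coll" >&2
            status=1
        fi
    done
    return $status
}

# Usage:
#   check_coll_dirs "$DLXSDATAROOT" s/sampleic && echo "directories are in place"
```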

Build the XPAT Index

With all of the SGML files properly placed in the $DLXSDATAROOT/obj/c/collid directory and the $DLXSDATAROOT/idx/c/collid directory set up, the XPAT index can be built. Most collections of several thousand records build in less than an hour; large collections can take several hours, depending on the amount of data and the available computing power. For a first attempt, build an index with a small amount of data: a few hundred records is enough to start with and takes only a few minutes to run.

  1. Navigate to the $DLXSDATAROOT/idx/c/collid directory
  2. Issue the command make all (previously make dd)
  3. Wait until it reports that it is done

Tip: To build the index in the background, without the process dying if the session is lost, run: nohup make all &
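The tip above can be extended to capture the build output in a log file, so progress can be checked after reconnecting. This is a minimal sketch: the function name run_in_background, the variable BG_PID, and the log file name build.log are illustrative assumptions, not part of DLXS.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background, immune to hangups,
# with stdout and stderr written to a log file. The process ID is left in
# BG_PID so the caller can wait on it or check on it later.
run_in_background() {
    log_file=$1
    shift
    nohup "$@" > "$log_file" 2>&1 &
    BG_PID=$!
}

# Usage, from within the collection's idx directory:
#   run_in_background build.log make all
#   tail -f build.log      # watch progress
#   wait "$BG_PID"         # optional: block until the build finishes
```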

Test the Index

It is possible to test the index by starting an XPAT session on the command line from within $DLXSDATAROOT/idx/c/collid.

jweise@sangria% xpatu image.dd
        Digital Library eXtension Service, XPAT, Release 5.3
        COPYRIGHT (c) 2000, 2003, 2004 The Regents of the University of Michigan
        All Rights Reserved
>> region "ENTRY"
  1: 8 matches
>> pr sample
     1327, ..D</BASE></GEN><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-34" CA="samp..
     4245, ..D></I></ENTRY><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-49" CA="samp..
     5090, ..D></I></ENTRY><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-51" CA="samp..
     5970, ..D></I></ENTRY><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-52" CA="samp..
     6802, ..D></I></ENTRY><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-59" CA="samp..
     7581, ..D></I></ENTRY><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-62" CA="samp..
    10101, ..D></I></ENTRY><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-77" CA="samp..
    14959, ..D></I></ENTRY><ENTRY COLLID="MCSAMPLEIC" ENTRYID="X-84" CA="samp..

Moving an Index to a Different Machine and Into Service

It is possible, and often preferable, to move a built index to a new location. For example, at Michigan, an XPAT index is built on a development machine and then moved to a production machine. Building an index is a CPU-intensive process that can take from a few minutes to several hours, so building it on the development machine removes that burden from the production machine. It also allows an index to be tested thoroughly in the development environment before being moved to production.

The steps for moving an index and associated SGML files from one machine to another, and into production are:

  1. Create a tar file of the $DLXSDATAROOT/idx/c/collid directory (cd $DLXSDATAROOT/idx/c; tar cf idxcollid.tar ./collid)
  2. Create a tar file of the $DLXSDATAROOT/obj/c/collid directory
  3. Transfer the tar files to the destination machine.
  4. Remove any existing $DLXSDATAROOT/idx/c/collid and $DLXSDATAROOT/obj/c/collid directories from the destination.
  5. Extract the files from the tar files.

It is important to know that since paths are hard-coded in the index, the index must be put into an identical directory location at the destination; otherwise it will not work.
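The packing and unpacking steps above can be sketched as two shell functions. This is an illustration under stated assumptions, not a DLXS tool: the function names pack_collection and unpack_collection are hypothetical, the tar file for obj is named by analogy with idxcollid.tar, and the transfer itself (step 3) is left to whatever copy tool is in use. The data root should be the same path on both machines, since paths are hard-coded in the index.

```shell
#!/bin/sh
# Hypothetical helper: create idx<collid>.tar and obj<collid>.tar.
# out_dir must be an absolute path because the function changes directory.
pack_collection() {
    data_root=$1    # value of $DLXSDATAROOT on the source machine
    c=$2            # first-letter directory, for example s
    collid=$3       # collection id, for example sampleic
    out_dir=$4      # absolute path where the tar files are written
    for kind in idx obj; do
        ( cd "$data_root/$kind/$c" &&
          tar cf "$out_dir/${kind}${collid}.tar" "./$collid" ) || return 1
    done
}

# Hypothetical helper: remove any existing copies at the destination,
# then extract the tar files in their place.
unpack_collection() {
    data_root=$1    # value of $DLXSDATAROOT on the destination machine
    c=$2
    collid=$3
    tar_dir=$4      # absolute path holding the transferred tar files
    for kind in idx obj; do
        mkdir -p "$data_root/$kind/$c" || return 1
        rm -rf "$data_root/$kind/$c/$collid"
        ( cd "$data_root/$kind/$c" &&
          tar xf "$tar_dir/${kind}${collid}.tar" ) || return 1
    done
}

# Usage:
#   pack_collection   "$DLXSDATAROOT" s sampleic /tmp     # on the source
#   (transfer /tmp/idxsampleic.tar and /tmp/objsampleic.tar)
#   unpack_collection "$DLXSDATAROOT" s sampleic /tmp     # on the destination
```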

Other Ideas

It might be useful to manage multiple instances of idx and obj directories for a single collection and then use a symlink to point to the index that is to be used by the middleware. For example, one could have $DLXSDATAROOT/idx/c/collid-a and $DLXSDATAROOT/idx/c/collid-b plus a symlink $DLXSDATAROOT/idx/c/collid that points to the a or b instance. This approach might simplify the deployment of collection updates with minimal disruption of service.
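The symlink idea can be sketched as follows. The function name switch_index is hypothetical, and the sketch assumes an ln that supports the -n option (GNU and BSD ln do), which replaces the symlink itself rather than creating a new link inside the directory it currently points to.

```shell
#!/bin/sh
# Hypothetical helper: point <idx_parent>/<collid> at one of the index
# instances, for example collid-a or collid-b.
switch_index() {
    idx_parent=$1   # for example $DLXSDATAROOT/idx/c
    collid=$2       # name of the symlink used by the middleware
    instance=$3     # directory name of the instance to activate
    if [ ! -d "$idx_parent/$instance" ]; then
        echo "no such instance: $idx_parent/$instance" >&2
        return 1
    fi
    # A relative target keeps the link valid if the data root is relocated.
    ln -sfn "$instance" "$idx_parent/$collid"
}

# Usage:
#   switch_index "$DLXSDATAROOT/idx/s" sampleic sampleic-b
```

Because paths are hard-coded in the index, it is worth verifying that an instance built under one directory name still works when reached through the symlink.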

A better approach is to build indexes in a development environment (preferably on a separate machine) and use a tool such as rdist to transfer the index files to the production location.