DLXS Programmers & Technical Workshop

Indexing with XPat

How To

See Directory Structure for how we recommend naming and setting up directories.

Once all the transformation and preprocessing is done, the finished, ready-to-index SGML is placed in the proper obj directory.

The collection's Makefile has variables that should be edited for the proper paths to the collection's data, dtd, doctype, etc.

Makefile: management of the normalization and indexing steps
coll.blank.dd: This file is a skeleton of a data dictionary configuration. The Makefile copies this "blank" or template file onto the corresponding coll.dd which is then expanded during the actual indexing process.
coll.init: This can conceivably contain any valid XPat commands, but they would then get loaded each time XPat is invoked. Since most invocations are via a CGI script, we leave this empty. Unfortunately, empty does not mean non-existent; the file must be there. So, a zero-length file will do!
coll.inp (aka doctype): This file contains a DOCTYPE declaration; it points to a dtd, which we usually also place in this directory, although some common dtds are placed in the lib directory for access from many collections. It may also contain other things, like entity references or APPINFO. Note: Although the older pat50 was able, with some difficulty, to use PUBLIC identifiers, we find it much easier to use SYSTEM identifiers. The proper use of catalogs may be an enhancement brought to XPat.
coll.extra.srch: used by Makefile to create fabricated regions (see below).

These next files are created or modified by the XPat indexing process (dbbuild50 or its subprocesses) and the region indexing process sgmlrgn50 (which are part of the Makefile):

coll.idx: The actual index into the text. A binary file.
coll.rgn: Another index into the file. This one is an index into the regions (usually SGML) that exist in the file.
coll.dd: This file contains:

the specifications for the full paths to the text itself and the text and region indexes.
information about index points; that is, points in the stream of text at which to start a "semi-infinite string".
the specification of how characters are mapped to other characters when receiving a user search term. You're allowed to map any character to any other character. The <Map> elements in the plain vanilla coll.blank.dd file account for case insensitivity, and for the mapping of punctuation characters to spaces (so that word boundaries are recognized). Other mappings are also possible.
a list of stopwords. We don't use these: since most of our text is literary in nature, searching is also allowed on "the" and "of", etc.
NOTE: this file starts from the coll.blank.dd file but it is fleshed out after indexing. Interestingly, as you saw before, this is a tagged ASCII text file and is therefore editable. That could be dangerous, but it is also handy. More on this later, for fabricated regions.
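
To make the effect of the character mappings concrete, here is a minimal Python sketch of what such a mapping does to a user's search term. It is not XPat's implementation and it does not read <Map> elements; the name normalize_term and the use of string.punctuation as the set of mapped characters are assumptions made purely for illustration.

```python
import string

# Hypothetical mapping table, in the spirit of the <Map> elements in
# coll.blank.dd: uppercase letters map to lowercase, punctuation maps to a
# space. The exact set of characters XPat maps is an assumption here.
MAPPING = {ord(c): c.lower() for c in string.ascii_uppercase}
MAPPING.update({ord(c): " " for c in string.punctuation})


def normalize_term(term: str) -> str:
    """Apply the character mapping to a search term, one character at a time."""
    return term.translate(MAPPING)
```

With a table like this, a search for "Sarajevo," and a search for "sarajevo " look the same to the index, which is the sense in which the mappings give case insensitivity and word-boundary recognition.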

The idx directory in which all these files live will also serve as the scratch file space used by the indexing process, and as the permanent home of all the index and region files made by XPat and of all the fabricated region files we make ourselves. The scratch file space is not to be overlooked if you're tight on space: with a normal token definition and relatively simple SGML, an indexing run will need between 150% and 175% of the space the final index and region files take up.
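
As a rough planning aid, the 150% to 175% figure can be turned into a quick calculation. The sketch below is only arithmetic on that rule of thumb; the function name scratch_space_range is hypothetical, and it reads the figure as a percentage of the final index and region file size.

```python
from typing import Tuple


def scratch_space_range(final_index_size: int) -> Tuple[int, int]:
    """Return (low, high) space estimates for an indexing run.

    final_index_size is the combined size of the final index and region
    files, in any unit; the result is in the same unit. The 150% and 175%
    bounds are the rule of thumb quoted in the text, not measured values.
    """
    low = final_index_size * 150 // 100
    high = final_index_size * 175 // 100
    return low, high
```

For example, if the final index and region files are expected to total 400 MB, scratch_space_range(400) suggests allowing roughly 600 to 700 MB for the run.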

One way to index with XPat is with dbbuild50, which is a wrapper for several other indexing subprocesses. We used to use dbbuild50 or even simply patbld50, like this:

patbld50 -m 64m -D coll.dd

However, we have begun to standardize on more explicit invocations of the subprocesses. See a typical Makefile for an example of all the proper commands in order or check the printed XPat documentation.

The Makefile, while indexing, also runs the sgmlrgn50 command. This gives us the all-important index built on the regions declared in the DTD and allows searching within SGML regions.

Indexing and Region Building: Hands On

To get more of a feel for the process we'll use the bosnia Makefile to do the above steps. Recall that prior to this step we did normalization (along with nodefy and validate steps).

% cd $HOME/dlxs/idx/b/bosnia
% make dd

What dbbuild50 (or its equivalent subprocesses) Gets Us

What the dd step gets us depends almost entirely on what was in the coll.dd file: the Index Points and Character Mappings. The Index Points are descriptions of the type of text where XPat will begin a semi-infinite string. The Mappings we usually use allow for case insensitivity and, in the case of some collections, for search mappings of 8 bit characters, e.g., thorns and eths.
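
For those who have not met semi-infinite strings before, the following Python sketch may help. It is a conceptual illustration only, not how XPat builds coll.idx: it assumes that an index point is the start of every run of letters or digits, and the function names index_points and sistrings are hypothetical.

```python
import re
from typing import List, Tuple


def index_points(text: str) -> List[int]:
    """Return the offsets at which a word begins (the assumed index-point rule)."""
    return [m.start() for m in re.finditer(r"[A-Za-z0-9]+", text)]


def sistrings(text: str, width: int = 20) -> List[Tuple[int, str]]:
    """Return (offset, prefix) pairs, one per index point, sorted by prefix.

    Each semi-infinite string runs from its index point to the end of the
    text; only the first `width` characters are kept here for display.
    """
    pairs = [(p, text[p:p + width]) for p in index_points(text)]
    return sorted(pairs, key=lambda pair: pair[1])
```

Because the strings are sorted, every index point whose string begins with a given search term sits in one contiguous run, which is what makes phrase searching on such an index straightforward.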

We can now fire up the database we've just indexed successfully with:

pat50 coll.dd

and try some searching.

After the dd command, we continue with the extra make target. This is a series of perl scripts that use the coll.extra.srch file, containing export-ed XPat searches, to create a bit of text that is then added in the <Regions> area of the .dd file. In this way, XPat .rgn files (region indexes) will be created. In the past, until we knew that we needed fabricated regions, we would usually comment this out in the Makefile. However, the new TextClass model requires them for regions like page and mainheader.

The coll.extra.srch file contains the searches for all the fabricated regions we want to make. See fabricated regions.

Fabricated Region Building: Hands On

% cd $HOME/dlxs/idx/b/bosnia
% make extra

For those of you who have used ot60, a note

It is here that pat50 and ot60 diverge significantly. When pat50 is reporting location information of points and regions, it always refers to an absolute byte offset in the database. ot60 always refers to an absolute token offset in the database, or that token at which the point or region begins. Unless the definition of token is "any byte", then tokens != bytes. This did not normally affect most systems' functionality: many of the SSP platform stunts can be done either way, because they do not depend on absolute byte offset information. But there are several stunts the XPat based cgi programs do that now can and do assume bytes instead of token offsets. Therefore, since from now on we are dealing only with pat50/XPat, we needn't worry about how ot60 deals with token offset. Yay!
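
The difference between the two kinds of offset is easy to see on a small example. The Python sketch below is an illustration only and reflects neither pat50's nor ot60's internals: it assumes a token is any run of non-whitespace bytes, and the function name byte_and_token_offsets is hypothetical.

```python
import re
from typing import List, Tuple


def byte_and_token_offsets(text: str, encoding: str = "latin-1") -> List[Tuple[str, int, int]]:
    """Return (token, byte_offset, token_offset) for each token in the text.

    byte_offset is the position of the token's first byte in the encoded
    text; token_offset is the token's ordinal position, counting from 0.
    """
    data = text.encode(encoding)
    return [
        (m.group().decode(encoding), m.start(), i)
        for i, m in enumerate(re.finditer(rb"\S+", data))
    ]
```

For the text "war in Bosnia", the word "Bosnia" starts at byte offset 7 but at token offset 2, so code that slices the raw data at a reported position only works if it knows which kind of offset it was given.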

By the way, some of those byte offset dependent stunts are:

Egregious Inline Markup

XPat

Ordering Results

Aside: one rather interesting bit of functionality that is more easily accomplished with token-based ot60 than with byte-offset-based pat50 is some of the interesting context and ordering performed in the MiCASE collection.

After the sgmlrgn50 target and the optional post target for fabricated regions, depending on the collection, the Makefile may continue with the pageview and/or the wordwheel steps.