Indexing with XPat

How To

See Directory Structure for how we recommend that directories be named and set up.

Once all the transformation and preprocessing is done, the finished, ready-to-index SGML is placed in the proper obj directory.

The collection's Makefile has variables that should be edited to give the proper paths to the collection's data, DTD, doctype, etc.

While indexing, the Makefile also runs the sgmlrgn50 command. This gives us the all-important index built on the regions declared in the DTD, which is what allows searching within SGML regions.
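To make the idea of a region index concrete, here is a minimal Python sketch. It is not how sgmlrgn50 works internally; it only illustrates the concept of recording the start and end byte offsets of each occurrence of an element and then restricting search hits to those spans. The function names (find_regions, hits_within), the sample markup, and the assumption that elements of the same name do not nest are all hypothetical.

```python
import re

def find_regions(data: bytes, tag: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets for each <tag>...</tag> span.

    Assumption: elements with this name are not nested inside each other.
    """
    name = re.escape(tag)
    pattern = re.compile(
        rb"<" + name + rb"(?:\s[^>]*)?>.*?</" + name + rb">", re.DOTALL
    )
    return [(m.start(), m.end()) for m in pattern.finditer(data)]

def hits_within(hits: list[int], spans: list[tuple[int, int]]) -> list[int]:
    """Keep only the hit offsets that fall inside one of the spans."""
    return [h for h in hits if any(start <= h < end for start, end in spans)]

# Hypothetical sample document, not taken from any real collection.
sample = b"<DIV1><HEAD>War</HEAD><P>War and peace</P></DIV1>"

head_regions = find_regions(sample, b"HEAD")
word_hits = [m.start() for m in re.finditer(rb"War", sample)]

print(head_regions)                          # [(6, 22)]
print(word_hits)                             # [12, 25]
print(hits_within(word_hits, head_regions))  # [12]
```

Only the occurrence of "War" inside the HEAD element survives the filter, which is the kind of question a region index lets the search engine answer without rescanning the text.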


Indexing and Region Building: Hands On

To get more of a feel for the process, we'll use the bosnia Makefile to do the above steps. Recall that prior to this step we did normalization (along with the nodefy and validate steps).
% cd $HOME/dlxs/idx/b/bosnia
% make dd

What dbbuild50 (or its equivalent subprocesses) Gets Us

What the dd step gets us depends almost entirely on what was in the coll.dd file: the Index Points and Character Mappings. The Index Points are descriptions of the type of text where XPat will begin a semi-infinite string. The Mappings we usually use allow for case insensitivity and, in the case of some collections, for search mappings of 8-bit characters, e.g., thorns and eths.
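As a rough illustration of what index points and a case mapping buy us, the Python sketch below sorts the semi-infinite strings that begin at each index point and then looks up a prefix among them. It is a toy model, not XPat's implementation. The index-point rule used here (the start of any run of ASCII letters or digits) and the use of lower() as the character mapping are assumptions made for the example, not the contents of any real coll.dd file, and the function names are hypothetical.

```python
import bisect
import re

def index_points(data: bytes) -> list[int]:
    # Assumed rule: an index point is the byte offset where a run of
    # ASCII letters or digits begins.
    return [m.start() for m in re.finditer(rb"[A-Za-z0-9]+", data)]

def build_index(data: bytes) -> list[int]:
    # Sort index points by the case-folded semi-infinite string
    # (everything from the index point to the end of the text).
    folded = data.lower()  # stand-in for a case-insensitive mapping
    return sorted(index_points(data), key=lambda off: folded[off:])

def search(data: bytes, index: list[int], prefix: bytes) -> list[int]:
    # Binary-search the sorted strings for the first one >= prefix,
    # then collect the contiguous run that starts with the prefix.
    folded = data.lower()
    needle = prefix.lower()
    keys = [folded[off:] for off in index]
    pos = bisect.bisect_left(keys, needle)
    hits = []
    while pos < len(keys) and keys[pos].startswith(needle):
        hits.append(index[pos])
        pos += 1
    return sorted(hits)

text = b"The cat sat. the Cathedral"
idx = build_index(text)
print(search(text, idx, b"the"))  # [0, 13]
print(search(text, idx, b"cat"))  # [4, 17]
```

Because the mapping is applied both when sorting and when searching, "The" and "the" match the same query, and because every match is a prefix of a semi-infinite string, "cat" also finds "Cathedral". The results are byte offsets into the original text.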

We can now fire up the database we've just indexed successfully with:

pat50 coll.dd

and try some searching.


After the dd command, we continue with the extra make target. This is a series of Perl scripts that use the coll.extra.srch file, containing export-ed XPat searches, to create a bit of text that is then added in the <Regions> area of the .dd file. In this way, XPat .rgn files (region indexes) will be created. In the past, until we knew that we needed fabricated regions, we would usually comment this out in the Makefile. However, the new TextClass model requires them for regions like page and mainheader.

The coll.extra.srch file contains the searches for all the fabricated regions we want to make. See fabricated regions.



For those of you who have used ot60, a note

It is here that pat50 and ot60 diverge significantly. When pat50 reports the location of points and regions, it always refers to an absolute byte offset in the database. ot60 always refers to an absolute token offset in the database, that is, the number of the token at which the point or region begins. Unless a token is defined as "any byte", tokens != bytes. This did not normally affect most systems' functionality: many of the SSP platform stunts can be done either way, because they do not depend on absolute byte offset information. But several of the stunts that the XPat-based CGI programs do can, and now do, assume byte offsets rather than token offsets. Therefore, since from now on we are dealing only with pat50/XPat, we needn't worry about how ot60 deals with token offsets. Yay!

By the way, some of those byte-offset-dependent stunts are:

Aside: one rather interesting bit of functionality that is more easily accomplished with token-based ot60 than with byte-offset-based pat50 is some of the context and ordering performed in the MiCASE collection.
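To see why token offsets and byte offsets are not interchangeable, the short Python sketch below lists both for the same text. It assumes, purely for illustration, that a token is a whitespace-delimited run of bytes; neither pat50 nor ot60 necessarily tokenizes this way, and the function name token_table is hypothetical.

```python
import re

def token_table(data: bytes) -> list[tuple[int, int, bytes]]:
    """Return (token_number, byte_offset, token) for each token.

    Assumption: a token is any run of non-whitespace bytes.
    """
    return [
        (number, m.start(), m.group())
        for number, m in enumerate(re.finditer(rb"\S+", data))
    ]

sample = b"a bb ccc"
for token_number, byte_offset, token in token_table(sample):
    print(token_number, byte_offset, token)
# 0 0 b'a'
# 1 2 b'bb'
# 2 5 b'ccc'
```

The third token is token number 2 but begins at byte 5. A program that seeks directly into the data file needs the byte offset; a token number would have to be translated first.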

Depending on the collection, after the sgmlrgn50 target and the optional post target for fabricated regions, the Makefile may continue with the pageview and/or the wordwheel steps.