Indexing with XPat
How To
See Directory Structure for how we recommend naming and setting up
directories.
Once all the transformation
and preprocessing is done, the finished, ready-to-index SGML is placed
in the proper obj directory.
The collection's Makefile
has variables that should be edited to point to the proper paths for the
collection's data, DTD, doctype, etc.
The following files are set up by hand before indexing:
- Makefile:
management of the normalization and indexing steps
- coll.blank.dd:
This file is a skeleton of a data dictionary configuration. The Makefile copies
this "blank" or template file onto the corresponding coll.dd
which is then expanded during the actual indexing process.
- coll.init: This file can contain any valid XPat command, but any
commands in it are loaded each time XPat is invoked.
Since most invocations are via a CGI script, we leave it empty. Note that
empty does not mean non-existent: the file must be there, so a zero-length
file will do.
- coll.inp
(aka doctype): This file contains a DOCTYPE declaration that
points to a DTD.
We usually place the DTD in this directory as well, although we place some
common DTDs in the lib directory so that many collections can use them.
The file may also contain other things, like entity references or APPINFO. Note:
although the older pat50 was able to use PUBLIC identifiers, with some difficulty,
we find it much easier to use SYSTEM identifiers. The proper use of catalogs
may be a future enhancement to XPat.
- coll.extra.srch:
used by Makefile
to create fabricated regions (see below).
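The requirement that coll.init exist even when it is empty is easy to meet from a script. The following is a minimal Python sketch, not part of the Makefile; the function name ensure_empty_init and the use of a temporary directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def ensure_empty_init(idx_dir: Path, coll: str = "coll") -> Path:
    """Make sure <coll>.init exists in idx_dir, creating it empty if absent."""
    init_path = idx_dir / f"{coll}.init"
    # touch() creates a zero-length file when none exists and leaves an
    # existing file's contents untouched.
    init_path.touch(exist_ok=True)
    return init_path


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than a real idx directory.
    with tempfile.TemporaryDirectory() as tmp:
        created = ensure_empty_init(Path(tmp))
        print(created.name, created.stat().st_size, "bytes")
```

Because an existing file is left alone, the helper is safe to run repeatedly, including against a coll.init that someone has deliberately filled in.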
These next files are created or modified by the XPat indexing process (dbbuild50
or its subprocesses) and the region indexing process sgmlrgn50,
both of which are part of the Makefile:
- coll.idx: The actual index into the text. A binary file.
- coll.rgn: Another index into the file. This one is an index
into the regions (usually SGML) that exist in the file.
- coll.dd:
This file contains:
- the specifications for the full paths to the text itself and the text
and region indexes.
- information about index points; that is, points in the stream of text
at which to start a "semi-infinite string".
- the specification of how characters are mapped to other characters when
receiving a user search term. You're allowed to map any character to any
other character. The <Map> elements in the plain vanilla coll.blank.dd
file account for case insensitivity, and for the mapping of punctuation
characters to spaces (so that word boundaries are recognized). Other mappings
are also possible.
- a list of stopwords. We don't use these: since most of our text is literary
in nature, we also allow searching on "the" and "of", etc.
- NOTE: this file starts from the coll.blank.dd
file but it is fleshed out after indexing. Interestingly, as you saw before,
this is a tagged ASCII text file and is therefore editable. That could be
dangerous, but it is also handy. More on this later, under fabricated
regions.
NOTE: All of the pathnames and directories to which the .dd file
refers must always exist, either as real paths or as symbolic links. XPat has the
interesting feature that you can open up the .dd file and change directories and
filenames at whim (to suit whatever moving of largish files you need to do).
Though it feels a bit like cheating, it is sometimes easier to edit the paths after
moving text and/or index files than to reindex a large amount of text.
- The idx directory in which all these files live also
serves as the scratch file space used by the indexing process and as the permanent
home of all the index and region files made by XPat and all the fabricated
region files we make ourselves. Do not overlook the scratch file space
if you're tight on space: with a normal token definition and relatively simple
SGML, an indexing run will need between 150% and 175% of the space the final
index and region files take up.
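The 150% to 175% rule of thumb for scratch space can be turned into a quick estimate. This is a rough sketch only: the function name scratch_space_estimate and the 400 MB figure are hypothetical, and the percentages are simply the ones quoted above for a normal token definition and relatively simple SGML.

```python
from typing import Tuple


def scratch_space_estimate(final_size: int) -> Tuple[int, int]:
    """Return (low, high) scratch space for a given final index + region size.

    The result is in the same unit as the input (bytes, MB, ...).
    """
    low = final_size * 150 // 100   # 150% of the final files
    high = final_size * 175 // 100  # 175% of the final files
    return low, high


if __name__ == "__main__":
    final_mb = 400  # hypothetical combined size of the .idx and .rgn files
    low, high = scratch_space_estimate(final_mb)
    print(f"plan for roughly {low} to {high} MB of scratch space")
```

For the hypothetical 400 MB of final index and region files, this suggests planning for about 600 to 700 MB of scratch space during the run.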
One way to index with XPat is with dbbuild50, which is a wrapper for
several other indexing subprocesses. We used to use dbbuild50 or even
simply patbld50 thusly:
patbld50 -m 64m -D coll.dd
However, we have begun to standardize on more explicit invocations of the
subprocesses. See a typical Makefile
for an example of all the proper commands in order or check the printed XPat
documentation.
The Makefile,
while indexing, also runs the sgmlrgn50 command. This gives us the
all-important index built on the regions declared in the DTD, and allows searching
within SGML regions.
Indexing and Region Building: Hands On
To get more of a feel for the process we'll use the bosnia Makefile to do the
above steps. Recall that prior to this step we did normalization
(along with nodefy and validate steps).
% cd $HOME/dlxs/idx/b/bosnia
% make dd
What dbbuild50 (or its equivalent subprocesses) Gets Us
What the dd step gets us depends almost entirely on what was in the
coll.dd
file: the Index Points and Character Mappings. The Index Points
are descriptions of the type of text where XPat will begin a semi-infinite
string. The Mappings we usually use allow for case insensitivity and, in the case
of some collections, for search mappings of 8-bit characters, e.g., thorns and
eths.
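To make index points, semi-infinite strings, and character mappings more concrete, here is a toy Python model. It is a sketch under stated assumptions, not a description of XPat's internals: it assumes an index point at the start of every word, a mapping that folds case and turns punctuation into spaces, and a plain sorted list in place of a real index. All names (MAPPING, apply_mapping, index_points, sistrings, search) are hypothetical.

```python
import string

# Assumed mapping: punctuation becomes a space, upper case becomes lower case.
MAPPING = {ord(c): " " for c in string.punctuation}
MAPPING.update({ord(c): c.lower() for c in string.ascii_uppercase})


def apply_mapping(text: str) -> str:
    """Apply the character mapping; each character maps to one character."""
    return text.translate(MAPPING)


def index_points(text: str) -> list:
    """Offsets at which a word starts in the mapped text (assumed rule)."""
    mapped = apply_mapping(text)
    return [
        i
        for i, ch in enumerate(mapped)
        if not ch.isspace() and (i == 0 or mapped[i - 1].isspace())
    ]


def sistrings(text: str) -> list:
    """Sorted (semi-infinite string, offset) pairs, one per index point."""
    mapped = apply_mapping(text)
    return sorted((mapped[p:], p) for p in index_points(text))


def search(text: str, term: str) -> list:
    """Offsets whose semi-infinite string starts with the mapped term."""
    prefix = apply_mapping(term)
    return sorted(p for s, p in sistrings(text) if s.startswith(prefix))


if __name__ == "__main__":
    sample = "The cat. The hat!"
    print(index_points(sample))   # word-start offsets
    print(search(sample, "THE"))  # case-insensitive thanks to the mapping
```

Because the same mapping is applied to the text and to the search term, a search for "THE" finds both occurrences of "The", and the punctuation mapped to spaces keeps "cat." from being treated as a different word than "cat".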
We can now fire up the database we've just indexed successfully with:
pat50 coll.dd
and try some searching.
After the dd
command, we continue with the extra make target. This is
a series of Perl scripts that use the coll.extra.srch
file, which contains exported XPat searches, to create a bit
of text that is then added in the <Regions> area of the .dd
file. In this way, XPat .rgn files (region indexes) are created.
In the past, until we knew that we needed fabricated regions, we would usually
comment this out in the Makefile.
However, the new TextClass model requires them for regions like page
and mainheader.
The coll.extra.srch
file contains the searches for all the fabricated regions we want to make. See
fabricated regions.
For those of you who have used ot60, a note
It is here that pat50 and ot60 diverge significantly.
When pat50 reports the location of points and
regions, it always refers to an absolute byte offset in the database. ot60
always refers to an absolute token offset in the database, that is, the
token at which the point or region begins. Unless a token is defined
as "any byte", tokens != bytes. This did not normally affect
most systems' functionality: many of the SSP platform stunts can be done
either way, because they do not depend on absolute byte offset information.
But several stunts that the XPat-based CGI programs do now can
and do assume byte offsets instead of token offsets. Since from
now on we are dealing only with pat50/XPat, we needn't worry about
how ot60 deals with token offsets. Yay!
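The difference between the two kinds of offset is easy to see on a small example. The Python sketch below is illustrative only: it assumes a token is a run of ASCII letters, which is a hypothetical definition and not the one either pat50 or ot60 uses, and the names TOKEN and token_and_byte_offsets are made up for this example.

```python
import re

# Assumed token definition: a run of ASCII letters.
TOKEN = re.compile(rb"[A-Za-z]+")


def token_and_byte_offsets(data: bytes) -> list:
    """Return (token offset, byte offset, token) for every token in data."""
    return [
        (token_no, match.start(), match.group().decode("ascii"))
        for token_no, match in enumerate(TOKEN.finditer(data))
    ]


if __name__ == "__main__":
    for token_no, byte_no, token in token_and_byte_offsets(b"To be, or not to be"):
        print(f"token {token_no} starts at byte {byte_no}: {token}")
```

In this sample, "or" is token 2 but begins at byte 7, so any program that slices the raw text by offset has to know which of the two numbering schemes it has been handed.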
By the way, some of those byte-offset-dependent stunts are:
- Egregious Inline Markup:
Briefly, sometimes we want to be able to provide phrase searching but
are hindered by copious markup. Under XPat we have been known to
do the following: make a mirror database that has whitespace substituted
in for the "egregious" markup in question.
- Ordering Results:
As shall be seen in the CGI programs, there is a lot of the DLPS platform
that depends on the offset order of results as they are returned from XPat.
Aside: some of the context and ordering performed in the MiCASE
collection is a rather interesting bit of functionality that is more
easily accomplished with token-based ot60 than with byte-offset-based
pat50.
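The mirror database described under Egregious Inline Markup can be sketched in a few lines of Python. This is an illustration under assumptions, not the script we actually use: the function name blank_out_tags and the sample hi element are hypothetical, and the sketch assumes the point of substituting whitespace (rather than deleting the tags) is that every remaining byte keeps its original offset.

```python
import re


def blank_out_tags(data: bytes, tag_names: list) -> bytes:
    """Replace start and end tags of the named elements with spaces.

    Each tag is replaced by a run of spaces of the same length, so the
    result is exactly as long as the input.
    """
    names = b"|".join(re.escape(name) for name in tag_names)
    tag = re.compile(rb"</?(?:" + names + rb")\b[^>]*>")
    return tag.sub(lambda m: b" " * len(m.group()), data)


if __name__ == "__main__":
    original = b"the <hi>quick</hi> brown <hi>fox</hi>"
    mirror = blank_out_tags(original, [b"hi"])
    print(mirror)
    # Offsets found in the mirror text point at the same bytes in the original.
    print(original.find(b"quick"), mirror.find(b"quick"))
```

With the lengths kept identical, a hit located in the mirror text can be looked up at the same byte offset in the fully marked-up text.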
After the sgmlrgn50 target and the optional post target for fabricated
regions, depending on the collection, the Makefile may continue
with the pageview and/or
the wordwheel steps.