XPat Details

Indexing will be covered in detail during the Text Class Data Preparation section.

Discussion of Text Indexing and Region indexing
- XPat indexes strings rather than words
- XPat indexes SGML regions, allowing searching of text within regions, regions including text or other regions, etc.
Query Language Syntax
“quieton raw” mode and programming XPat queries in Perl
- {quieton raw}
- Programming XPat queries in Perl
Additional Details (not covered explicitly during the workshop)

A full list of XPat commands can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html

Query Language Syntax

Invoking XPat

To start an interactive session with XPat, enter xpat followed by the path to the data dictionary (dd) file:

% xpat $DLXSROOT/idx/s/sampletc/sampletc.dd

Identifying Points

In XPat, a point is a unique byte offset in the full text where XPat has indexed a string. Enter a string, or a byte offset in square brackets, and a set of points is returned:

>> "prince"
1: 375 matches

>> "prince "
2: 304 matches

>> [290947]
3: one match

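The first query finds all semi-infinite strings (sistrings) that begin with "prince". The second finds those that begin with "prince" followed by a space, that is, the word "prince" on its own rather than longer words that merely start with it. The third finds the sistring beginning at byte 290947.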
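The difference between the first two queries can be illustrated with a small Python sketch. This is only a model of prefix matching on sistrings, not XPat's actual index: the sample sentence is made up, index_points and search are hypothetical names, and we assume index points fall at the start of each word with no character mapping applied.

```python
# Hypothetical sample text; XPat itself works on byte offsets in an indexed file.
text = "the prince met two princes and a princess in a princely hall"

def index_points(text: str) -> list:
    """Return the offset of every word start (a simplification of XPat's index points)."""
    return [i for i, ch in enumerate(text)
            if ch.isalnum() and (i == 0 or not text[i - 1].isalnum())]

def search(text: str, points: list, query: str) -> list:
    """Return the index points whose sistring (text from that point onward) begins with query."""
    return [p for p in points if text.startswith(query, p)]

points = index_points(text)
prefix_hits = search(text, points, "prince")   # prince, princes, princess, princely
exact_hits = search(text, points, "prince ")   # only "prince" followed by a space

print(len(prefix_hits))  # 4
print(len(exact_hits))   # 1
```

In this model a hit followed directly by a tag or punctuation would not match "prince ", which is the reason a later query in this section combines "Branivoj " with "Branivoj<".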

Identifying Regions

A region in XPat is a span of text comprising zero or more bytes. Regions are created by sgmlrgn50 or multirgn (discussed in the TextClass Collection Implementation/Indexing section).

To find how many of a particular region type exist, enter region plus the name of the region (double quotes are needed if the name contains non-alphanumeric characters).

>> region "DIV1"

  1: 120 matches

>> region "A-NODE"

  2: 132 matches

Also see the {ddinfo regionnames} command.
Also see the history command.

Identifying Sets (Named Sets)

Any collection of points or regions can be grouped together in a set. Sets can be combined or split with XPat's boolean operators. Every set created during a session has a unique numeric identifier. Sets can also be given names (name = ), printed (pr), saved, or exported (useful in the creation of "fabricated regions"). Here are just a few examples:

>> long
  1: 532 matches

>> help
  2: 133 matches

>> 1 + 2
  3: 665 matches

>> "subsequently"
  4: 5 matches

>> pr 4
   819525, ..eparture, and subsequently confirmed in their position by the So..
  2764281, ..ra, and often subsequently during our stay, we walked on the mou..
  2936185, .. Kara George, subsequently he returned, but unexpectedly, and at..
   201591, .., whom we met subsequently, however, at Castelnuovo, seemed to r..
  2104209, .. of Russia.   Subsequently,  however, they showed more discrimin..

>> mysearch = "lasting"
  5: mysearch = 2 matches

>> pr *mysearch
  1380924, ..tion, nothing lasting could be established. The Servians were de..
  2465605, .. room.  After lasting out five hundred years !</P><P>Perhaps a l..

Also see the subset command.
Also see the {sortorder} setting.
Also see other operators and relations.

Using the Operators to Make Sets of Interest

Using some basic XPat operators, we can build very specific searches that take advantage of the SGML markup. Here is an actual example from the TextClass implementation. The following query is the basis for the fabricated region called mainauthor in most of our text collections. Note that this query depends on knowing the structure of the document's markup (in the case of TextClass documents, the regions here are essentially the same as in the TEIHEADER of the TEI.2 DTD).

>> ((region AUTHOR within (region TITLESTMT within region FILEDESC)) 
     not within (region SOURCEDESC))
   6: 4 matches 

>> pr.region.6
      143, ..<AUTHOR> Holbach, Maude M. </AUTHOR>..
   298344, ..<AUTHOR> Yriarte, Charles, 1832-1898. </AUTHOR>..
   792438, ..<AUTHOR> Laveleye, Emile de, 1822-1892. </AUTHOR>..
  1689410, ..<AUTHOR> Sebright, Georgina Mary Muir (Mackenzie), Lady, d. 1874- </AUTHOR>..

Here we construct a query that returns a point set (PSet) of hits on a user-entered search term. We want to display a line containing the immediate context of the hit, along with a title from an enclosing division.

The query for the user's search is simply:

>> firstsearch = ("Branivoj " + "Branivoj<")
7: firstsearch = one match

To get a division title for the hit we need to build up regions based on the hit:

>> slicesearch = subset.1.25 *firstsearch
8: slicesearch = one match
>> mainslicesearch = (region "DLPSTEXTCLASS" incl *slicesearch)
9: mainslicesearch = one match
>> mainheader = (region "HEADER" within *mainslicesearch)
10: mainheader = one match

Finally, to view the content of the region we have constructed, we enter:

>> pr.region."HEADER" (region *mainheader)
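The subset.1.25 step above takes a slice of the result set: subset.X.Y keeps Y members starting at the Xth member, counting from 1. A minimal Python sketch of that arithmetic follows. The function name subset and the sample offsets are hypothetical. Returning fewer than Y members when the set runs out is an assumption based on the one-match result in the session above.

```python
def subset(members: list, x: int, y: int) -> list:
    """Mimic subset.X.Y: up to Y members of the set, starting at the Xth (1-based)."""
    if x < 1 or y < 0:
        raise ValueError("x must be >= 1 and y must be >= 0")
    return members[x - 1 : x - 1 + y]

# 60 hypothetical byte offsets standing in for a point set.
hits = list(range(1000, 1060))

first_slice = subset(hits, 1, 25)    # like subset.1.25
second_slice = subset(hits, 26, 25)  # like subset.26.25
third_slice = subset(hits, 51, 25)   # only 10 members remain

print(len(first_slice), len(second_slice), len(third_slice))  # 25 25 10
```

Slicing this way lets a program page through a large result set without printing all of it at once.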

“quieton raw” mode and programming XPat queries in Perl

“quieton raw” mode

The default mode in an interactive XPat session is "quietoff". This gives the result messages you have seen so far: numbered sets, byte offsets followed by snippets of SGML with ".." on either end, etc. Another mode, and the most useful for interacting with XPat programmatically, is "quieton raw". Nothing seems to happen when one enters:

>> {quieton raw}

However, entering queries now produces results that are tagged in a way that is easily parsable from within a program. First enter an earlier point search:

firstsearch = ("Branivoj " + "Branivoj<")
<SSize>1</SSize>
pr
<PSet><Start>313615</Start><Raw><Size>64</Size>res du nom de Branivoj
 s'emparent du territoire qu'ils gouvernen</Raw></PSet>

Now enter an earlier region search:

((region AUTHOR within (region TITLESTMT within region FILEDESC)) 
  not within (region SOURCEDESC))
<SSize>4</SSize>
pr.region.AUTHOR
<RSet><Start>143</Start><End>178</End><Raw><Size>36</Size>
<AUTHOR>Holbach, Maude M. </AUTHOR></Raw><Start>298344</Start>
<End>298391</End><Raw><Size>48</Size><AUTHOR>Yriarte, Charles, 1832-1898. </AUTHOR></Raw>
<Start>792438</Start><End>792487</End><Raw><Size>50</Size>
<AUTHOR>Laveleye, Emile de, 1822-1892. </AUTHOR></Raw><Start>1689410</Start>
<End>1689486</End><Raw><Size>77</Size>
<AUTHOR>Sebright, Georgina Mary Muir (Mackenzie), Lady, d. 1874- </AUTHOR></Raw></RSet>

Some of these tags are self-explanatory (e.g., SSize = set size). But some may need a bit of explanation.

PSet: These tags surround an entire set of point results.
RSet: These tags surround an entire set of region results.
Start: Byte offset of beginning of one result, either point or region.
End: Byte offset of end of one result, either point set string or region.
Raw: The "raw" information of one particular result.
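Size: Length in bytes of the retrieved text that follows the </Size> tag. In the region results above, Size equals End minus Start plus one (for example, 178 - 143 + 1 = 36).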
text following the </Size> tag: Actual retrieved text of result.
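Because each result announces its own Size, a program can read exactly that many bytes instead of scanning for the closing </Raw> tag, which matters when the retrieved text itself contains markup. The following is a minimal Python sketch of that idea, not the DLXS middleware's Perl code. The names parse_rset and make_result are hypothetical, and the sample records are built by hand so that Size and End are consistent with the Start offsets.

```python
import re

# Header of one region result in {quieton raw} output, following the tags described above.
_RESULT_HEAD = re.compile(
    rb"\s*<Start>(\d+)</Start>\s*<End>(\d+)</End>\s*<Raw><Size>(\d+)</Size>"
)

def parse_rset(data: bytes) -> list:
    """Parse one <RSet>...</RSet> block into a list of result records."""
    open_tag = b"<RSet>"
    pos = data.index(open_tag) + len(open_tag)
    results = []
    while True:
        head = _RESULT_HEAD.match(data, pos)
        if head is None:
            break  # no further result header; </RSet> is expected here
        start, end, size = (int(g) for g in head.groups())
        text_begin = head.end()
        # Trust Size rather than searching for the next "<": the text may contain tags.
        text = data[text_begin:text_begin + size]
        pos = text_begin + size
        if not data.startswith(b"</Raw>", pos):
            raise ValueError(f"expected </Raw> at offset {pos}")
        pos += len(b"</Raw>")
        results.append({"start": start, "end": end, "size": size, "text": text})
    return results

def make_result(start: int, text: bytes) -> bytes:
    """Build one hypothetical result record with a consistent Size and End."""
    end = start + len(text) - 1
    return b"<Start>%d</Start><End>%d</End><Raw><Size>%d</Size>%s</Raw>" % (
        start, end, len(text), text)

# Hand-built sample modelled on the AUTHOR results shown earlier.
sample = (b"<RSet>"
          + make_result(143, b"<AUTHOR> Holbach, Maude M. </AUTHOR>")
          + make_result(298344, b"<AUTHOR> Yriarte, Charles, 1832-1898. </AUTHOR>")
          + b"</RSet>")

regions = parse_rset(sample)
for r in regions:
    print(r["start"], r["end"], r["text"].decode("latin-1"))
```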

Programming XPat queries in Perl

XPat's ability to return results with tags allows a program to parse the results into pieces. In the DLXS Middleware this is done by a group of DLXS Perl modules. These modules have methods to let the CGI program interact with XPat (an XPat process is forked off by the CGI program and queries can be made of it at any time). The main object the code uses is the xpat object. It has methods for making queries in different ways and for interacting with the forked off XPat process.

Here is some code (from TextClass.pm) that illustrates how the middleware uses a method of the Perl-based XPat object (created in an earlier part of the code).

...
# Look for the mainheader region that contains the requested idno.
my $query = qq{(region mainheader incl ( $idnorgn incl "$idno" ) );};
my ( $error, $result) = $xpat->GetSimpleResultsFromQuery( $query );
if ( $error )
{ &DlpsUtils::errorBail( qq{Query error in FindXPatContainingIdno: $result} ); }
&DlpsUtils::StripAllRSetCruft( \$result );
# Read the number of hits from the <SSize> tag.
$result =~ m,<SSize>(\d+)</SSize>,;
my $hit = $1;
# At least one hit: keep this XPat object and stop looking.
if ( $hit > 0 )
{
     $returnXpat = $xpat;
     last;
}
...

While some code, such as this, makes a query via a method, most queries in the middleware are actually made by other means, through other objects and methods. Once data has been prepared according to the DLXS Class DTDs, the middleware can be thought of, in terms of searching, as an engine that simply "runs" the data. If DLXS users need to make code changes at all, it is usually because the data needs to be displayed differently ("filtering"). That is outside the scope of this section of the workshop.

Additional Details (not covered explicitly during the course of this workshop)

Viewing Sets

The pr command is the heart of viewing sets. In an interactive XPat session, it lets you view the results you've searched for. Within the middleware, getting the data back from XPat is just one step; it is followed by "filtering" operations (Perl substitutions using regular expressions) to remove or change other tags in the content and to change the appearance of the content, e.g. highlighting hits.

The format of the results that XPat returns with pr or save is determined by the current {quieton} setting. There is a big difference between the normal user-sitting-at-the-pat-terminal interactive mode and the machine-readable modes.

pr (point-set): This prints out the members of the point-set, starting with the first, according to the current {sortorder} setting.
pr.X shift.-Y (point-set): Prints each result in the point-set as a string X bytes wide, starting Y bytes to the left of the matching point. X and Y override the settings of {printlength} and {leftcontext} respectively (which are described below).
pr.region."region-name" (region-set of type "region-name"): Prints the entire span of each of the members in the region set. It seems redundant to have to tell XPat the "format" of the region you would like to see, when it should already know!

In interactive mode, pr with no argument prints the last set created:
pr

Note: The save command is, in a sense, the same as the pr command: pr displays to STDOUT, while save outputs (appends) to a file whose name is given by {savefile}. The format of the output is the same.

{settings}

Settings control certain behaviors of XPat during a search session. DLXS middleware explicitly uses the {quieton} command. A full list of XPat commands, which includes the { } settings, can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html

{printlength #}: This setting controls the default print window size for point sets, how many total bytes are given when a point set result is printed. See the discussion of pr above. Default is 64.
{leftcontext #}: This setting controls how many characters before the matching text will be given when a point set is printed. If {printlength} is 100 and {leftcontext} is 14, then the point where the matching text starts will be the 15th character. See the discussion of pr above. Default is 14.
{sortorder <order>}: This determines in what order a given set of results is sorted by XPat. There are other modes, but DLXS middleware always uses {sortorder occur}, which is to say that results are returned in the byte order in which they occur in the source text.
{savefile "file"}: Changes the default save file name.
{exportfile "file"}: Changes the default export file name. When the export command is given, results are appended to the file.
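The interaction between {printlength} and {leftcontext} is simple offset arithmetic, sketched here in Python under the defaults given above (64 and 14). The function name print_window and the sample bytes are hypothetical. Clamping the window at the start of the file is an assumption, since the behavior for points near offset 0 is not described here.

```python
def print_window(source: bytes, point: int,
                 printlength: int = 64, leftcontext: int = 14) -> bytes:
    """Return the bytes that would be shown for one point-set result."""
    # Assumption: the window is clamped at the start of the source.
    begin = max(0, point - leftcontext)
    return source[begin:begin + printlength]

# Hypothetical source: filler bytes around a single hit at offset 50.
source = b"." * 50 + b"Branivoj" + b"." * 100
point = 50

window = print_window(source, point)
print(len(window))               # 64
print(window.index(b"Branivoj")) # 14, i.e. the match starts at the 15th character
```

With the defaults, 14 bytes of left context precede the match, which agrees with the "res du nom de " prefix in the {quieton raw} point result shown earlier.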

Miscellaneous and Useful Commands

{ddinfo regionnames}: Lists all the currently-defined regions. A very useful command for document analysis.
history: List of results sets from previously issued searches and the commands that created them.

subset.X.Y A: Make a new set that consists of Y members of A, starting at the Xth member of A. Members of A start numbering at 1. Note: This command is used in the middleware to get results in slices.
~sync "string": A very useful command; basically an echo sort of command. This is used in the Middleware to signal when XPat is done sending results. In any of the {quieton} modes, this returns:

Operators and Relations

These are the operators most used in the Middleware.

A ^ B: the "and" or "intersection" operator: A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have the exact same start offsets.
A + B + C + ...: the "or" or "union" operator: A, B, C... are sets. The resulting set (which is a point set if at least one of the sets being combined is a point set) consists of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) will be removed from the resultant set.
A incl B: A is a region set, B is either a point or region set. The result is a region set of all members of A that contain at least one member of B, containment meaning that a given B has a start offset within the inclusive range of a given A's start and end offsets.
A within B: In many ways the complement to incl: A is a point or region set, B is a region set, the resulting set is all members of A that are contained (by the start offset rule as under incl) in any B. This also takes the not operator to return all A's that are not within any B.
A near B: A and B are either points or regions. The result is all A's whose start offsets are within # number of bytes of the start offset of any B (# is either explicitly stated (with near.#), or taken from the {proximity} setting). The not form returns all A's whose start offsets are not within the specified number of bytes from the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B: This is like the near operator, except that an A must be followed within the specified number of bytes by a B to be in the result set. This can also take the not operator.
not: This reverses the sense of the expression it modifies, usable with incl, within, near, and fby.
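The start-offset rule shared by incl, within, near, and fby can be sketched in a few lines of Python. This is a model of the descriptions above, not XPat's implementation. The Region type, the function names, and all offsets are hypothetical, and a point is represented as a region whose start and end are equal. Treating fby as "a B starts after the A, no more than n bytes later" is our reading of the description.

```python
from typing import NamedTuple

class Region(NamedTuple):
    start: int
    end: int

def incl(a: list, b: list) -> list:
    """Members of A that contain the start offset of at least one B."""
    return [r for r in a if any(r.start <= o.start <= r.end for o in b)]

def within(a: list, b: list, negate: bool = False) -> list:
    """Members of A whose start offset lies inside some B (or none, with negate)."""
    def contained(r: Region) -> bool:
        return any(o.start <= r.start <= o.end for o in b)
    return [r for r in a if contained(r) != negate]

def near(a: list, b: list, n: int) -> list:
    """Members of A starting within n bytes of the start of some B, on either side."""
    return [r for r in a if any(abs(r.start - o.start) <= n for o in b)]

def fby(a: list, b: list, n: int) -> list:
    """Members of A followed within n bytes by the start of some B."""
    return [r for r in a if any(0 < o.start - r.start <= n for o in b)]

# Hypothetical header layout: SOURCEDESC sits inside FILEDESC and has its own TITLESTMT.
filedesc = [Region(100, 500)]
sourcedesc = [Region(300, 480)]
titlestmt = [Region(120, 200), Region(310, 400)]
author = [Region(150, 180), Region(320, 350)]

# Same shape as the mainauthor query shown earlier.
main_author = within(
    within(author, within(titlestmt, filedesc)),
    sourcedesc, negate=True)
print(main_author)  # [Region(start=150, end=180)]

p10, p30, p40, p90 = (Region(x, x) for x in (10, 30, 40, 90))
print(near([p10, p90], [p30], 25))  # [Region(start=10, end=10)]
print(fby([p10, p40], [p30], 25))   # [Region(start=10, end=10)]
```

The negate flag plays the role of the not form: it keeps exactly the members that the plain relation would have rejected.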