XPAT Details

Indexing will be covered in detail during the Text Class Data Preparation section.

A full list of XPAT commands can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html


Query Language Syntax

Invoking XPAT

To start an interactive session with XPAT, enter xpat or xpatu (for UTF-8 data indexing/searching) followed by the name of the data dictionary (dd) file:
% xpatu $DLXSROOT/idx/s/sampletc_utf8/sampletc_utf8.dd

Identifying Points

In XPAT, a point is a unique byte offset in the full text where XPAT has indexed a string. Enter a string, or a byte offset in square brackets, and a set of points is returned:
 >> "prince"
1: 134 matches

>> "prince "
2: 123 matches

>> sample
3: 10 matches

>> pr
 539939, ..was said that Prince Alexander of Battenberg had changed into a ..
 957348, ..e only child, Prince Alexander, who came in before we went to ta..
1390470, ..TEM>Bismarck,Prince, and the Austro-German alliance ~ <REF>xxiv..
 552103, ..alliance that Prince Bismarck, in 1879, entered into the very cl..
 208247, .. sceptre d'un prince de religion orthodoxe.</P> <P> <..
1016444, ..n the streets Prince Michael and Teresia, 20 to 30 dinars toward..
 943446, ..ian statue of Prince Michael, whose name and portrait are found ..
 483031, ..la volonté du prince Nicolas, ses résolutions personnelles au su..
1411801, ..udolph, Crown Prince, Popularity of ~ <REF>69</REF> </ITEM..
1141121, ..raged it. The Prince suspected nothing of what was taking place ..

>> [290947]
4: one match

The first query finds all "semi-infinite strings" that begin with "prince"; the second finds those that are "prince" exactly (followed by a space, or by anything that has been mapped to a space); and the last query, [290947], finds the string beginning at byte 290947.
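The difference between the first two queries can be illustrated with a naive sketch in Python. It is not how XPAT works internally (XPAT consults a prebuilt index), and it rests on assumptions not stated above: that points fall at word starts, that matching ignores case, and that punctuation is mapped to a space. The function name find_points is hypothetical.

```python
import re

def find_points(text, query):
    """Naively list offsets of word starts whose following text begins with query.

    Assumptions for illustration only: case is ignored and every
    non-word character is mapped to a single space.
    """
    normalized = re.sub(r"[^\w]", " ", text.lower())  # same length as text
    wanted = query.lower()
    points = []
    for match in re.finditer(r"\b\w", normalized):    # first letter of each word
        if normalized.startswith(wanted, match.start()):
            points.append(match.start())
    return points

sample = "The Prince and the princess met a prince."
prefix_hits = find_points(sample, "prince")    # strings beginning with "prince"
exact_hits = find_points(sample, "prince ")    # "prince" followed by a space
```

With this sample text the prefix query also matches "princess", while the query with the trailing space does not.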


Identifying Regions

A region in XPAT is a span of text comprising zero or more bytes. sgmlrgn50, xmlrgn, or multirgn (discussed in the TextClass Collection Implementation/Indexing Section) handles the creation of these regions.

To find how many of a particular region type exist, enter region plus the name of the region (double quotes are needed if the name contains non-alphanumeric characters).

>> region "DIV1"
1: 38 matches
>> region "A-NODE" 
2: 46 matches

Also see the {ddinfo regionnames} command.
Also see the history command.


Identifying Sets (Named Sets)

Any collection of points or regions can be grouped together in a set. Sets can be combined or split with XPAT's boolean operators. Every set created during a session has a unique numeric identifier. Sets can be given names (name = ...), printed out (pr), saved, or exported (useful in the creation of "fabricated regions"). Here are just a few examples:
>> long
1: 244 matches

>> help
2: 54 matches

>> 1 + 2
3: 298 matches

>> "alternate" 
4: 5 matches

>> pr 4
1175485, ..most from the alternate advance and retreat of the Russian and T..
1165090, ..in. Vineyards alternated with fields of barley, oats, and maize;..
 967310, ..men and women alternately; <EPB/> <PB REF="00000208.tif" S..
1313659, ..a and Austria alternately. But, when able to repel aggression, s..
1303571, .. each country alternately. It should be composed of three secti..

>> mysearch = "pair"
5: mysearch = 3 matches

>> pr *mysearch
1170568, ..and a half; a pair of buffaloes, 600 francs (£24).</P> <P>B..
 848085, ..s dress was a pair of large Turkish trousers of white wool, a sh..
1085132, ..nd thick; two pairs of oxen drew it by means of a pole which was.. 

Also see the subset command.
Also see the {sortorder} setting.
Also see other operators and relations.


Using the Operators to Make Sets of Interest

Using some basic XPAT operators, we can build very specific searches that take advantage of the SGML markup. Here is an actual example from the TextClass implementation: the following query is the basis for the fabricated region called mainauthor in most of our text collections. Note that this query depends on knowing the structure of the document's markup (in the case of TextClass documents, the regions here are essentially the same as in the TEIHEADER of the TEI.2 DTD).

>> ((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))
6: 2 matches

>> pr.region.6
   235, ..<AUTHOR> Yriarte, Charles, 1832-1898. </AUTHOR> ..
513768, ..<AUTHOR> Laveleye, Emile de, 1822-1892. </AUTHOR>.. 

Here we construct a query to return a PSet consisting of hits on a user-entered search term. We want to display a line containing the immediate context of the hit and also a title from an enclosing division.

The query for the user's search is simply:

>> firstsearch = ("Branivoj " + "Branivoj<")
7: firstsearch = one match

To get a division title for the hit we need to build up regions based on the hit:

>> slicesearch= subset.1.25 *firstsearch
8: slicesearch = one match

>> mainslicesearch = (region "DLPSTEXTCLASS" incl *slicesearch)
9: mainslicesearch = one match

>> mainheader = (region "HEADER" within *mainslicesearch)
10: mainheader = one match

Finally to view the content of the region we have constructed we enter:

>> pr.region."HEADER" (region *mainheader) 
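A program that issues this sequence for an arbitrary search term could assemble the commands as strings. The following is a minimal Python sketch of that idea, not the middleware's own (Perl) code; the function name context_queries and its parameters are hypothetical, while the set names and region names are the ones used in the walkthrough above.

```python
def context_queries(term, first=1, count=25):
    """Build the walkthrough's XPAT commands for one search term.

    first and count feed subset.X.Y: take count members starting at
    member number first (members are numbered from 1).
    """
    if '"' in term:
        # A double quote would end the quoted XPAT string early.
        raise ValueError("term must not contain a double quote")
    return [
        # The hit itself: the term followed by a space or by a tag.
        f'firstsearch = ("{term} " + "{term}<")',
        # One slice of the hits.
        f"slicesearch = subset.{first}.{count} *firstsearch",
        # The enclosing document, then its header.
        'mainslicesearch = (region "DLPSTEXTCLASS" incl *slicesearch)',
        'mainheader = (region "HEADER" within *mainslicesearch)',
    ]

queries = context_queries("Branivoj")
```

Each string would then be sent to XPAT in order; how the commands are delivered to the XPAT process is left out of this sketch.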

See also viewing sets.



“quieton raw” mode and programming XPAT queries in Perl

“quieton raw” mode

The default mode, in an interactive XPAT session, is "quietoff". This gives the results messages you have seen so far: numbered sets, byte offsets followed by snippets of SGML with ".." on either end, etc. Another mode, and the most useful for interacting with XPAT programmatically, is "quieton raw". Nothing seems to happen when one enters:

>> {quieton raw}

However, entering queries now produces results that are tagged in a way that is easily parsable from within a program. First enter an earlier point search:

firstsearch = ("Branivoj " + "Branivoj<")
<SSize>1</SSize>
pr
<PSet><Start>313615</Start><Raw><Size>64</Size>res du nom de Branivoj s'emparent du territoire qu'ils gouvernen</Raw></PSet>

Now enter an earlier region search:

((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))
<SSize>4</SSize>
pr.region.AUTHOR
<RSet><Start>143</Start><End>178</End><Raw><Size>36</Size> <AUTHOR>Holbach, Maude M. </AUTHOR></Raw><Start>298344</Start> <End>298391</End><Raw><Size>48</Size><AUTHOR>Yriarte, Charles, 1832-1898. </AUTHOR></Raw> <Start>792438</Start><End>792487</End><Raw><Size>50</Size> <AUTHOR>Laveleye, Emile de, 1822-1892. </AUTHOR></Raw><Start>1689410</Start> <End>1689486</End><Raw><Size>77</Size> <AUTHOR>Sebright, Georgina Mary Muir (Mackenzie), Lady, d. 1874- </AUTHOR></Raw></RSet>

Some of these tags are self-explanatory (e.g., SSize = set size). But some may need a bit of explanation.

PSet
These tags surround an entire set of point results.
RSet
These tags surround an entire set of region results.
Start
Byte offset of beginning of one result, either point or region.
End
Byte offset of end of one result, either point set string or region.
Raw
The "raw" information of one particular result.
Size
Length in bytes of the retrieved text that follows.
text following the </Size> tag
Actual retrieved text of result.

Programming XPAT queries in Perl

XPAT's ability to return results with tags allows a program to parse the results into pieces. In the DLXS Middleware this is done by a group of DLXS Perl modules. These modules have methods to let the CGI program interact with XPAT (an XPAT process is forked off by the CGI program and queries can be made of it at any time). The main object the code uses is the xpat object. It has methods for making queries in different ways and for interacting with the forked off XPAT process.

Here is some code (from TextClass.pm) that illustrates how the middleware uses a method of the Perl-based XPAT object (created in an earlier part of the code).

...
my $query = qq{(region mainheader incl ( $idnorgn incl "$idno" ) );};
my ( $error, $result) = $xpat->GetSimpleResultsFromQuery( $query );
if ( $error )
{
     &DlpsUtils::errorBail( qq{Query error in FindXPATContainingIdno: $result} );
}

&DlpsUtils::StripAllRSetCruft( \$result );
$result =~ m,<SSize>(\d+)</SSize>,;
my $hit = $1;
if ( $hit > 0 )
{
    $returnXpat = $xpat; last;
}
... 
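For readers more comfortable with Python, the set-size check in the excerpt can be sketched as follows. This is an illustration only, not part of the DLXS middleware; the function name set_size is hypothetical. Unlike the Perl excerpt, it reports explicitly when no SSize tag is present instead of relying on the capture variable.

```python
import re

def set_size(result):
    """Return the number inside <SSize>...</SSize>, or None if absent."""
    match = re.search(r"<SSize>(\d+)</SSize>", result)
    return int(match.group(1)) if match else None

size = set_size("<SSize>4</SSize>")
has_hits = size is not None and size > 0
```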

While some code, such as this, makes a query via a simple method, most queries in the middleware are actually made by other means, through other objects and their methods. Once data has been prepared according to the DLXS Class DTDs, in terms of searching, the middleware can be thought of as an engine that simply "runs" the data.

NOTE: In Release 11a and earlier, when DLXS users needed to change code, it was usually to alter how data was displayed ("filtering"). Now, nearly all "filtering" of data for display is done via XSLT stylesheets. Occasionally, collection-specific searches need to be made (based on, for example, idiosyncratic markup), and the query building for those searches may still need to be subclassed. However, most text collections that use the admittedly loose Text Class DTD will run through the middleware with little if any modification, since most standard searches rely on mechanisms that abstract away many idiosyncrasies of markup: fabricated regions, mapped search region names, etc.


Introduction to fabricated regions

A fabricated region is a "virtual" region that has been indexed. You can use any valid XPAT query to create a result set. Then, with the {export} command, you can have XPAT create a binary index of the points in the result.

Why would you want to do this? If you, or your program, will often be making a query that is somewhat complex, you can have XPAT consult a previously created index rather than run the complex query each time using the usual idx and SGML rgn indexes.

For examples and more discussion of fabricated regions, see: Fabricated Regions.

Once the fabricated regions are created and indexed, they can be searched for and printed just like any other region.

>> region maindate
1: 4 matches

>> pr.region.maindate region maindate
    990, ..<DATE>1910.</DATE>..
 299182, ..<DATE>1876.</DATE>..
 793555, ..<DATE>1887.</DATE>..
1690542, ..<DATE>1877.</DATE>..



Additional Details (not covered explicitly during the course of this workshop)

For more information about all XPAT commands, see the regular DLXS documentation about XPAT.

Viewing Sets

The pr command is the heart of viewing sets. In an interactive XPAT session, it lets you view the results you've searched for. Within the middleware, getting the data back from XPAT is just one step. Before Release 12, it was followed by "filtering" operations (Perl substitutions using regular expressions) to remove or change other tags in the content and to change the appearance of the content (e.g., highlighting hits), eventually resulting in HTML. As of Release 12, though there is some small amount of manipulation of the XML that is returned from XPAT queries, essentially all "filtering" (conversion to HTML) is done via XSLT stylesheets.

The format of the results that XPAT returns with pr or save is determined by the current {quieton} setting. There is a big difference between the normal user-sitting-at-the-pat-terminal interactive mode and the machine-readable modes.

pr (point-set)
This prints out the members of the point-set, starting with the first, according to the current {sortorder} setting.
pr.X shift.-Y (point-set)
Prints the results in the point-set in a string X bytes wide, shifted Y bytes to the left of the matching point. X and Y override the settings of {printlength} and {leftcontext} respectively (which are described below).
pr.region."region-name" (region-set of type "region-name")
Prints the entire span of each of the members in the region set. The region name must be given explicitly, even though it seems redundant to tell XPAT the "format" of a region it should already know.
 
 
In interactive mode, the following print the last set created.
pr

pr %
pr.X shift.-Y
 

Note: The save command is, in a sense, the same as the pr command: pr displays to STDOUT, save outputs (appends) to a file whose name is given by {savefile}. The format of the output is the same.


{settings}

Settings control certain behaviors of XPAT during a search session. DLXS middleware explicitly uses the {quieton} command. A full list of XPAT commands, which includes the { } settings, can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html
 
{printlength #}
This setting controls the default print window size for point sets, how many total bytes are given when a point set result is printed. See the discussion of pr above. Default is 64.

{leftcontext #}
This setting controls how many characters before the matching text will be given when a point set is printed. If there are 100 characters of {printlength}, and 14 of {leftcontext}, then the point where the matching text starts will be the 15th character. See the discussion of pr above. Default is 14.
 
{sortorder <order>}
This determines in what order a given set of results is sorted by XPAT. There are other modes, but DLXS middleware always uses {sortorder occur}, which is to say that results are returned in the byte order in which they occur in the source text.

{savefile "file"}
Changes the default save file name.
When the save command is given, results are appended to the file.
 
{exportfile "file"}
Changes the default export file name. When the export command is given, results are appended to the file.
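The relationship between {printlength} and {leftcontext} comes down to simple arithmetic, sketched here in Python. The function name print_window is hypothetical, and the sketch assumes the window is simply the match offset shifted left by {leftcontext}; how XPAT behaves when the match is closer than that to the start of the file is not covered.

```python
def print_window(point, printlength=64, leftcontext=14):
    """Return (start, end) byte offsets of the printed window, end exclusive."""
    start = point - leftcontext
    return start, start + printlength

start, end = print_window(1000, printlength=100, leftcontext=14)
# Zero-based position of the match inside the window; adding 1 gives
# the ordinal position described in the text.
match_position = 1000 - start
```

With 100 bytes of {printlength} and 14 of {leftcontext}, the match sits at zero-based position 14 in the window, which is the 15th character as described above.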

Miscellaneous and Useful Commands

{ddinfo regionnames}
Lists all the currently defined regions in the .idx, .rgn and even fabricated region .rgn files. A very useful command for document analysis.

 
history
Lists the result sets from previously issued searches and the commands that created them.
 
subset.X.Y A
Make a new set that consists of Y members of A, starting at the Xth member of A. Members of A start numbering at 1. Note: This command is used in the middleware to get results in slices.
 
~sync "string"
A very useful command; basically an echo sort of command. This is used in the Middleware to signal when XPAT is done sending results. In any of the {quieton} modes, this returns:
<Sync>string</Sync>
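The way a program can use ~sync to detect the end of a response can be sketched in Python as follows. This is an illustration of the idea, not the middleware's Perl code: the function name read_until_sync is hypothetical, and the stream is assumed to be any binary file-like object carrying XPAT's output (for example, the output pipe of a child process) after a query and a ~sync "done" command have been sent.

```python
import io

def read_until_sync(stream, token):
    """Read from stream until <Sync>token</Sync> and return what preceded it."""
    marker = f"<Sync>{token}</Sync>".encode("utf-8")
    buffer = bytearray()
    while not buffer.endswith(marker):
        chunk = stream.read(1)  # one byte at a time so nothing past the marker is consumed
        if not chunk:
            raise EOFError("stream ended before the sync marker was seen")
        buffer += chunk
    return bytes(buffer[:-len(marker)])

# Simulated output standing in for a real XPAT process.
fake_output = io.BytesIO(b"<SSize>1</SSize><Sync>done</Sync>next response")
response = read_until_sync(fake_output, "done")
```

Choosing a token that cannot occur in the results themselves avoids stopping early.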

Operators and Relations

These are the operators most used in the Middleware.
A ^ B
the "and" or "intersection" operator: A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have the exact same start offsets.
A + B + C + ...
the "or" or "union" operator: A, B, C... are sets. The resulting set (which is a point set if at least one of the sets being combined is a point set) consists of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) will be removed from the resultant set.
A incl B

A not incl B
A is a region set, B is either a point or region set. The result is a region set of all members of A that contain at least one member of B, containment meaning that a given B has a start offset within the inclusive range of a given A's start and end offsets. The not form returns all members of A that contain no member of B.
A within B

A not within B
In many ways the complement to incl: A is a point or region set, B is a region set, the resulting set is all members of A that are contained (by the start offset rule as under incl) in any B. This also takes the not operator to return all A's that are not within any B.
A near B
A and B are either points or regions. The result is all A's whose start offsets are within # number of bytes of the start offset of any B (# is either explicitly stated (with near.#), or taken from the {proximity} setting). The not form returns all A's whose start offsets are not within the specified number of bytes from the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B
This is like the near operator, except that an A must be followed within the specified number of bytes by a B to be in the result set. This can also take the not operator.
not
This reverses the sense of the expression it modifies, usable with incl, within, near, and fby.
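The start-offset containment rule behind incl and within can be modeled in a few lines of Python. This is only a model of the semantics described above, not XPAT's implementation; regions are represented as (start, end) tuples and points as integers, and the function names are hypothetical.

```python
def contains(region, item):
    """True if item's start offset lies within region's inclusive range."""
    start = item[0] if isinstance(item, tuple) else item
    return region[0] <= start <= region[1]

def incl(a, b, negate=False):
    """Members of region set a that contain (or, negated, lack) a member of b."""
    return [r for r in a if any(contains(r, x) for x in b) != negate]

def within(a, b, negate=False):
    """Members of a that are (or, negated, are not) inside any region of b."""
    return [x for x in a if any(contains(r, x) for r in b) != negate]

divisions = [(0, 99), (100, 199), (200, 299)]
hits = [50, 250]
with_hits = incl(divisions, hits)                  # like: region incl points
without_hits = incl(divisions, hits, negate=True)  # like: region not incl points
inside_first = within(hits, [(0, 99)])             # like: points within region
```

Only start offsets are compared, so a region in B counts as contained even if it ends after the enclosing A does.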
 