XPAT Details

Indexing TextClass and FindaidClass data will be covered in detail during the Text Class Data Preparation and FindaidClass Data Preparation sections.

A full list of XPAT commands can be found at: http://quod.lib.umich.edu/sgml/pat/pat50manual.html


Collmgr and System Configuration for XPAT

Collmgr fields

To invoke collmgr: http://______.ws.umdl.umich.edu/cgi/c/collmgr/collmgr (replace ______ with your account user ID).

Troubleshooting

Discussion of Text Indexing and Region indexing

Semi-infinite strings

XPAT indexes strings (semi-infinite strings) rather than words. Consider this text:

... called Kitchee-Gumeeng, also great lake. The
words Mitchee and Kitchee both seem to mean
the same thing, great, large; whether there is a
shade of difference in applying ...

Searching for same thing great retrieves the string that begins with "same" and is followed by "thing great". However, searching for same great retrieves nothing. XPAT searches for strings anchored at index points, and each such string extends virtually to the end of the document.

Index points are offsets into the text where XPAT looks for matches. Generally, index points are characters following spaces.

Searching for several words with XPAT is implicitly a phrase search. To search for "same" AND/OR "great" requires the use of boolean operators (^ and +) and regions.

XPAT compresses multiple spaces in the text to a single space when indexing and searching.
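
The behavior described above can be imitated in a few lines of Python. This is only a sketch of the idea, not XPAT's implementation: it assumes index points are the characters that follow a space, that a handful of punctuation marks are mapped to a space, and that case is folded to lower case. The names normalize, index_points and search are made up for the illustration, and the offsets are offsets into the normalized text rather than byte offsets in the source file.

```python
import re

# Assumption: these punctuation marks are mapped to a space, in the spirit of
# the character mappings kept in the data dictionary (.dd) file.
PUNCT_TO_SPACE = str.maketrans({c: " " for c in ',.;:!"$%'})


def normalize(text):
    """Lower-case, map punctuation to spaces, collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower().translate(PUNCT_TO_SPACE))


def index_points(text):
    """Offsets of characters that follow a space (plus 0 if the text starts with a non-space)."""
    points = [0] if text and text[0] != " " else []
    points.extend(m.end() for m in re.finditer(r" (?=\S)", text))
    return points


def search(text, query):
    """Offsets (in the normalized text) whose semi-infinite string starts with the query."""
    text, query = normalize(text), normalize(query)
    return [p for p in index_points(text) if text.startswith(query, p)]


SAMPLE = """... called Kitchee-Gumeeng, also great lake. The
words Mitchee and Kitchee both seem to mean
the same thing, great, large; whether there is a
shade of difference in applying ..."""

print(search(SAMPLE, "same thing great"))  # one offset: the phrase matches
print(search(SAMPLE, "same great"))        # empty list: not a contiguous string
```

Because every query is a prefix match against the string that starts at an index point, a multi-word query behaves as a phrase search without any extra logic.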

XPAT also can perform case mapping and character mapping. This is specified in the data dictionary (.dd) file.


Characters, Unicode, Tools

This section treats issues of character encoding as they apply to XPAT and mentions a few tools we've written that you may find useful. There is also an expanded treatment of this subject.

Some Unicode / XPAT facts:

There are many reasons to use Unicode.

We deliver a few locally developed tools you may find useful.


Indexing and the Data Dictionary

This section applies only to the XPAT-based classes, TextClass and FindaidClass; ImageClass is MySQL-based. More on this when we cover data preparation for each class.

Here's a look at the resulting files:

ls -al /l1/workshop/pfarber/dlxs/idx/s/sampletc_utf8 
                  
-rw-rw-r--  1 pfarber  dlps    816 Jun 16  2005 div1head.rgn
-rw-rw-r--  1 pfarber  dlps    576 Jun 16  2005 div2head.rgn
-rw-rw-r--  1 pfarber  dlps    528 Jun 16  2005 id.rgn
-rw-rw-r--  1 pfarber  dlps    528 Jun 16  2005 mainauthor.rgn
-rw-rw-r--  1 pfarber  dlps    528 Jun 16  2005 maindate.rgn
-rw-rw-r--  1 pfarber  dlps    528 Jun 16  2005 mainheader.rgn
-rw-rw-r--  1 pfarber  dlps    528 Jun 16  2005 main.rgn
-rw-rw-r--  1 pfarber  dlps    528 Jun 16  2005 maintitle.rgn
-rw-rw-r--  1 pfarber  dlps   6704 Apr  4  2007 page.rgn
-rw-rw-r--  1 pfarber  dlps   6704 Apr  4  2007 page-t.rgn
-rw-rw-r--  1 pfarber  dlps 138040 Jun 16  2005 sampletc_utf80.rgn
-rw-rw-r--  1 pfarber  dlps  50907 Apr  5  2007 sampletc_utf8.dd
-rw-rw-r--  1 pfarber  dlps 968452 Apr  4  2007 sampletc_utf8.idx
-rw-rw-r--  1 pfarber  dlps      0 Jan 30  2004 sampletc_utf8.init
              

The data dictionary is an XML file in a collection subdirectory of the idx directory. It ties all the index files together and holds the specifications for index points and character mappings.

Here's a bit of the section of the data dictionary that specifies the index points, i.e. the points in your data where XPAT will look for matches to your query string:

            <IndexPoints>
            <IndexPt> &printable.</IndexPt>
            <IndexPt>&printable.-</IndexPt>
            <IndexPt>-&printable.</IndexPt>

            <IndexPt> &Latin.</IndexPt>
            <IndexPt>&Latin.-</IndexPt>
            <IndexPt>-&Latin.</IndexPt>

            <IndexPt> &Greek.</IndexPt>
            <IndexPt>&Greek.-</IndexPt>
            <IndexPt>-&Greek.</IndexPt>
            </IndexPoints>
            

Note the metacharacters such as &printable., &Latin. or &Greek., each of which represents all characters from one of the blocks of Unicode Plane 0. Index point metacharacters are based on the Unicode block definitions and the Perl Unicode library (e.g. lib/5.8.3/unicore/lib/Latin.pl), modified as described in the XPAT data dictionary document.

Here's a bit of the section of the data dictionary where character mapping is specified. Refer to the Unicode character database. This is mainly used for upper to lower case mapping for alphabets that have case.

            ...

            <Map><From>!</From><To> </To></Map>
            <Map><From>"</From><To> </To></Map>
            <Map><From>$</From><To> </To></Map>
            <Map><From>%</From><To> </To></Map>
                             ...
            <Map><From>U+0391</From><To>U+03B1</To></Map>

            <Map><From>U+0392</From><To>U+03B2</To></Map>
            <Map><From>U+0393</From><To>U+03B3</To></Map>

            <Map><From>U+0394</From><To>U+03B4</To></Map>
            <Map><From>U+0395</From><To>U+03B5</To></Map>

            ...
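
To make the effect of the Map entries concrete, here is a small Python sketch that reads a fragment shaped like the one above and applies it to a string. It is an illustration only: the regular expression, the decode and build_table helpers, and the assumption that every From value is a single character (either literal or written as U+XXXX) are mine, not part of XPAT.

```python
import re

# A fragment in the same shape as the data dictionary excerpt above.
DD_FRAGMENT = """
<Map><From>!</From><To> </To></Map>
<Map><From>$</From><To> </To></Map>
<Map><From>U+0391</From><To>U+03B1</To></Map>
<Map><From>U+0392</From><To>U+03B2</To></Map>
"""


def decode(token):
    """Turn 'U+0391' into the character it names; return anything else unchanged."""
    m = re.fullmatch(r"U\+([0-9A-Fa-f]{4,6})", token)
    return chr(int(m.group(1), 16)) if m else token


def build_table(fragment):
    """Build a str.translate table from <Map><From>..</From><To>..</To></Map> entries."""
    pairs = re.findall(r"<Map><From>(.*?)</From><To>(.*?)</To></Map>", fragment)
    return {ord(decode(src)): decode(dst) for src, dst in pairs}


TABLE = build_table(DD_FRAGMENT)

# Greek capital alpha and beta are folded to lower case; "!" becomes a space.
mapped = "\u0391\u0392!".translate(TABLE)
print(mapped == "\u03b1\u03b2 ")
```

Applying the same table to both the indexed text and the query is what lets a search for "emile " and "Émile " return the same matches when the dictionary maps the accented capital to the plain lower-case letter.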
            

Finally, here's an example of a full data dictionary.


Regions

XPAT indexes XML regions (via xmlrgn), allowing searching of text within regions, regions including text or other regions, etc. Consider this diagram of the kind of regions XPAT can index.

|<------------------------- region FAMNAME --------------------------->|
|                                                                      |
<FAMNAME SOURCE="lcnaf" ENCODINGANALOG="100">Whittemore Family</FAMNAME>
|               |     |                     |          
|             ->|     |<- region "A-SOURCE" |
|                                           |
|<------------ region "FAMNAME-T" --------->|

There are three kinds of regions:

  1. region FAMNAME - Everything between the beginning and end of a tag pair, including the tags
  2. region "A-SOURCE" - The value of an XML attribute
  3. region "FAMNAME-T" - The contents of the begin tag
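
The three kinds of regions can be pictured as pairs of start and end offsets. The Python sketch below computes such pairs for the FAMNAME example with regular expressions. It is not how xmlrgn works internally; it assumes non-nested elements, double-quoted attribute values, and that an attribute region covers the value without its quotes. The function name regions is hypothetical.

```python
import re

TEXT = '<FAMNAME SOURCE="lcnaf" ENCODINGANALOG="100">Whittemore Family</FAMNAME>'


def regions(text, tag):
    """Return (name, start, end) triples with inclusive offsets for one element type."""
    found = []
    pattern = re.compile(r"<%s\b([^>]*)>.*?</%s>" % (re.escape(tag), re.escape(tag)), re.S)
    for m in pattern.finditer(text):
        # 1. The element region: begin tag through end tag.
        found.append((tag, m.start(), m.end() - 1))
        # 3. The "-T" region: the begin tag only, up to and including its ">".
        found.append((tag + "-T", m.start(), m.end(1)))
        # 2. One "A-" region per attribute value.
        base = m.start(1)
        for a in re.finditer(r'(\w+)="([^"]*)"', m.group(1)):
            found.append(("A-" + a.group(1), base + a.start(2), base + a.end(2) - 1))
    return found


for name, start, end in regions(TEXT, "FAMNAME"):
    print(name, start, end, TEXT[start:end + 1])
```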


Query Language Syntax

Invoking XPAT

To start an interactive session with XPAT, enter xpatu (for UTF-8 data indexing/searching) along with the name of the data dictionary (dd) file:

% xpatu $DLXSROOT/idx/s/sampletc_utf8/sampletc_utf8.dd

Identifying Points

In XPAT, a point is a unique byte offset in the full text where XPAT has indexed a string. Enter a string, or a byte offset in square brackets, and a set of points is returned:

 >> "prince"
1: 134 matches

>> "prince "
2: 123 matches

>> sample
3: 10 matches

>> pr
 539939, ..was said that Prince Alexander of Battenberg had changed into a ..
 957348, ..e only child, Prince Alexander, who came in before we went to ta..
1390470, ..TEM>Bismarck, Prince, and the Austro-German alliance ~ <REF>xxiv..
 552103, ..alliance that Prince Bismarck, in 1879, entered into the very cl..
 208247, .. sceptre d'un prince de religion orthodoxe.</P> <P> <..
1016444, ..n the streets Prince Michael and Teresia, 20 to 30 dinars toward..
 943446, ..ian statue of Prince Michael, whose name and portrait are found ..
 483031, ..la volonté du prince Nicolas, ses résolutions personnelles au su..
1411801, ..udolph, Crown Prince, Popularity of ~ <REF>69</REF> </ITEM..
1141121, ..raged it. The Prince suspected nothing of what was taking place ..

>> "emile "
4 : 9 matches

>> "Émile "
5 : 9 matches

>> [290947]
6 : one match
                  

The first query finds all "semi-infinite strings" that begin with "prince"; the second finds those in which "prince" is a complete word (followed by a space, or by anything that has been mapped to a space). The "Emile" queries demonstrate character mapping and case mapping. The last query finds the string beginning at byte offset 290947.


Identifying Regions

A region in XPAT is a span of text comprising zero or more bytes. xmlrgn or multirgn (discussed in the TextClass Collection Implementation/Indexing Section) handles the creation of these regions.

To find how many of a particular region type exist, enter region plus the name of the region (double quotes are needed if the name contains non-alphanumeric characters).

>> region "DIV1"
1: 38 matches
>> region "A-NODE" 
2: 46 matches

Also see the {ddinfo regionnames} command.
Also see the history command.


Identifying Sets (Named Sets)

Any collection of points or regions can be grouped together in a set. Sets can be combined or split with XPAT's boolean operators. Every set created during a session has a unique numeric identifier. Sets can be given names (name = ), printed out (pr), saved, and exported (useful in the creation of "fabricated regions"). Here are just a few examples:

>> long
1: 244 matches

>> help
2: 54 matches

>> 1 + 2
3: 298 matches

>> "alternate" 
4: 5 matches

>> pr 4
1175485, ..most from the alternate advance and retreat of the Russian and T..
1165090, ..in. Vineyards alternated with fields of barley, oats, and maize;..
 967310, ..men and women alternately; <EPB/> <PB REF="00000208.tif" S..
1313659, ..a and Austria alternately. But, when able to repel aggression, s..
1303571, .. each country alternately. It should be composed of three secti..

>> mysearch = "pair"
5: mysearch = 3 matches

>> pr *mysearch
1170568, ..and a half; a pair of buffaloes, 600 francs (£24).</P> <P>B..
 848085, ..s dress was a pair of large Turkish trousers of white wool, a sh..
1085132, ..nd thick; two pairs of oxen drew it by means of a pole which was.. 

Also see the subset command.
Also see the {sortorder} setting.
Also see other operators and relations.


Viewing Sets

The pr command is the heart of viewing sets. In an interactive XPAT session, it lets you view the results you've searched for. Within the middleware, getting the data back from XPAT is the first step; next there is a small amount of manipulation of the XML that is returned from XPAT queries; finally conversion to HTML is done via XSLT stylesheets.

pr (point-set)
This prints out the members of the point-set, starting with the first, according to the current {sortorder} setting.
pr.X shift.-Y (point-set)
Print the results in the point-set in a string X bytes wide, offset to the left of the matching point by Y bytes. X and Y override the settings of {printlength} and {leftcontext} respectively (which are described below).
pr.region."region-name" (region-set of type "region-name")
Prints the entire span of each of the members in the region set. It seems redundant to have to tell XPAT the "format" of the region you would like to see, when it should already know!
 
 
In interactive mode, the following prints the last set created.
pr

pr %
pr.X shift.-Y
 

Note: The save command is, in a sense, the same as the pr command: pr displays to STDOUT, save outputs (appends) to a file whose name is given by {savefile}. The format of the output is the same.


Using the Operators to Make Sets of Interest

Using some basic XPAT operators, we can build some very specific searches that take advantage of the XML markup. Here is an actual example from the TextClass implementation.

Consider this (edited) XML of a TEI header element and note the highlighted portion:


  <HEADER>
    <FILEDESC>
      <TITLESTMT>
        <TITLE TYPE="245"> The Balkan Peninsula, / by Émile de Laveleye</TITLE>
        <AUTHOR> Laveleye, Emile de,  1822-1892. </AUTHOR>
      </TITLESTMT>
      <PUBLICATIONSTMT>
        <PUBLISHER>DLPS ...</PUBLISHER>
        <IDNO TYPE="dlps">abe5413.0001.001</IDNO>
      </PUBLICATIONSTMT>
      <SOURCEDESC>
        <BIBLFULL>
          <TITLESTMT>
            <TITLE TYPE="main"> The Balkan Peninsula, / by Émile de Laveleye</TITLE>
            <AUTHOR> Laveleye, Emile de,  1822-1892. </AUTHOR>
            <AUTHOR> Thorpe, Mary,  Mrs.,  tr. </AUTHOR>
          </TITLESTMT>
          <PUBLICATIONSTMT>
            <PUBLISHER>G. P. Putnam's sons,</PUBLISHER>
          </PUBLICATIONSTMT>
        </BIBLFULL>
      </SOURCEDESC>
    </FILEDESC>
    <ENCODINGDESC> ... </ENCODINGDESC>
    <PROFILEDESC> ... </PROFILEDESC>
  </HEADER>
  

The following query is actually the basis for the fabricated region called mainauthor in most of our text collections.

>> ((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC)) 
6: 2 matches

>> pr.region.6
   235, ..<AUTHOR> Yriarte, Charles, 1832-1898. </AUTHOR> ..
513768, ..<AUTHOR> Laveleye, Emile de, 1822-1892. </AUTHOR>.. 

Let's say we want a slice of the titles of the chapters in a given volume that contain hits for a user's search for the word prince. We construct the query in stages.

  1. Query to return a PSet consisting of hits on a user-entered search term:
     >> hitssearch = ("prince " + "prince<")
      1: hitssearch = one match
  2. Query for the DIV1 regions (chapters) that contain the hits:
     >> chapters = (region DIV1 incl (region "A-TYPE" incl "chapter")) incl *hitssearch
      2: chapters = 14 matches
  3. Query for the regions that contain HEAD elements that are not chapter heads:
     >> excludedheadregions = (region LIST) + (region FIGURE) + (region DIV2)
      3: excludedheadregions = 25 matches
  4. Query for the HEAD elements in the DIV1 regions that contain the hits without the HEAD elements we don't want:
     >> chapterheads = (region HEAD within *chapters) not within *excludedheadregions
      4: chapterheads = 14 matches
  5. Query for the particular volume we're interested in:
     >> volume = region main incl (region HEADER incl (region IDNO incl "abe5413.0001.001"))
      5: volume = one match
  6. Query for the chapter heads just in that volume:
     >> volumechapterheads = *chapterheads within *volume
      6: volumechapterheads = 11 matches
  7. Query for a slice of those chapter heads:
     >> volumechapterheadsslice = subset.1.5 *volumechapterheads
      7: volumechapterheadsslice = 5 matches
  8. At last! Print them out:
     >> pr.region.HEAD *volumechapterheadsslice
    
       523428, ..<HEAD>INTRODUCTORY  CHAPTER.<LB/>THE PRESENT POSITION OF BULGARIAN AFFAIRS.</HEAD>..
       557986, ..<HEAD>CHAPTER I.<LB/>VIENNA—MINISTERS AND FEDERALISM.</HEAD>..
       600631, ..<HEAD>CHAPTER II.<LB/>BISHOP STROSSMAYER.</HEAD>..
       707081, ..<HEAD>CHAPTER III.<LB/>HISTORY AND RURAL ECONOMY OF BOSNIA.</HEAD>..
       819018, ..<HEAD>CHAPTER IV.<LB/>BOSNIA—ITS SOURCES OF WEALTH, ITS INHABITANTS, AND RECENT PROGRESS.</HEAD>..
    


Introduction to fabricated regions

A fabricated region is a "virtual" region that has been indexed. You can use any valid XPAT query to create a result set. Then, with the {export} command, you can have XPAT create a binary index of the points in the result.

There are two basic reasons to do this:

Once the fabricated regions are created and indexed, they can be searched for and printed just like any other region.

We've actually already seen an example of a region that could be made into a fabricated region in the last section. Recall these two named regions:

 >> excludedheadregions = (region LIST) + (region FIGURE) + (region DIV2)
  3: excludedheadregions = 25 matches
 >> chapterheads = (region HEAD within *chapters) not within *excludedheadregions
  4: chapterheads = 14 matches

We could make the named query chapterheads into a fabricated region with the {export} and {exportfile} commands as follows:

{exportfile "/l1/idx/s/sample/chapterheads.rgn"}; export *chapterheads; ~sync "chapterheads";

Another example of an important fabricated region in TextClass and FindaidClass is maindate.

>> region maindate
1: 2 matches

>> pr.region.maindate region maindate
     1181, ..<DATE>1876.</DATE>..
   514996, ..<DATE>1887.</DATE>..

For more examples and discussion of fabricated regions, see: Fabricated Regions.


Debugging Complex Queries

The most likely queries you may need to debug are those involving fabregions because those will be queries you construct yourself as opposed to the hard-coded queries in the middleware. Nonetheless, this technique is useful when debugging any involved query.

The idea is simple. Start XPAT at the command line and submit the sub-queries of the full query until you find one that does not return the result you expect. To see the queries submitted to XPAT, append ;debug=search to the end of your URL and copy and paste the query strings into the XPAT command line prompt, submitting named queries before the queries that refer to them. Here's an example from the sampletc_utf8 collection.



Additional Details (not covered explicitly during the course of this workshop)

For more information about all XPAT commands, see the regular DLXS documentation about XPAT.

{settings}

Settings control certain behaviors of XPAT during a search session. DLXS middleware explicitly uses the {quieton} command. A full list of XPAT commands, which includes the { } settings, can be found at: http://quod.lib.umich.edu/sgml/pat/pat50manual.html
 
{printlength #}
This setting controls the default print window size for point sets, how many total bytes are given when a point set result is printed. See the discussion of pr above. Default is 64.

{leftcontext #}
This setting controls how many characters before the matching text will be given when a point set is printed. If there are 100 characters of {printlength}, and 14 of {leftcontext}, then the point where the matching text starts will be the 15th character. See the discussion of pr above. Default is 14.
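
The interplay of the two settings is simple arithmetic, sketched here in Python. The function print_window is a made-up name, and clamping the window at the start of the file is my assumption about what happens when a match lies closer to the beginning than {leftcontext} bytes.

```python
def print_window(data: bytes, point: int, printlength: int = 64, leftcontext: int = 14) -> bytes:
    """Return the bytes a point-set print would show around the match at offset `point`."""
    start = max(0, point - leftcontext)  # assumption: never reach before the start of the file
    return data[start:start + printlength]


# 50 filler bytes, then the match, then more filler.
DATA = b"x" * 50 + b"prince" + b"y" * 200

window = print_window(DATA, point=50, printlength=100, leftcontext=14)
print(len(window))              # 100 bytes in total
print(window.index(b"prince"))  # 14, i.e. the match starts at the 15th byte
```

With the defaults (64 and 14) this reproduces the shape of the pr output shown earlier, where each snippet carries 14 bytes of context before the matching point.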
 
{sortorder <order>}
This determines in what order a given set of results is sorted by XPAT. There are other modes, but DLXS middleware always uses {sortorder occur}, which is to say that results are returned in the byte order in which they occur in the source text.

{savefile "file"}
Changes the default save file name.
When the save command is given, results are appended to the file.
 
{exportfile "file"}
Changes the default export file name. When the export command is given, results are appended to the file.

Miscellaneous and Useful Commands

{ddinfo regionnames}
Lists all the currently defined regions in the .idx, .rgn and even fabricated region .rgn files. A very useful command for document analysis.

 
history
List of results sets from previously issued searches and the commands that created them.
 
subset.X.Y A
Make a new set that consists of Y members of A, starting at the Xth member of A. Members of A start numbering at 1. Note: This command is used in the middleware to get results in slices.
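
As a rough model of the 1-based numbering, the Python sketch below mimics subset on a list and builds the query strings a caller might send to page through a named set. Both function names are hypothetical, and how XPAT itself responds when Y reaches past the end of the set is not covered by this document, so the last slice here simply follows Python's slicing behavior.

```python
def subset(members, x, y):
    """Return y members starting at the x-th (counting from 1), like subset.X.Y A."""
    if x < 1 or y < 0:
        raise ValueError("x counts from 1 and y must not be negative")
    return members[x - 1:x - 1 + y]


def slice_queries(set_name, total, per_slice):
    """Build one subset query string per slice of a named set."""
    return ["subset.%d.%d *%s" % (first, per_slice, set_name)
            for first in range(1, total + 1, per_slice)]


heads = ["head%d" % n for n in range(1, 12)]  # stand-in for 11 chapter heads
print(subset(heads, 1, 5))
print(slice_queries("volumechapterheads", 11, 5))
```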
 
~sync "string"
A very useful command; basically an echo sort of command. This is used in the Middleware to signal when XPAT is done sending results. In any of the {quieton} modes, this returns:
<Sync>string</Sync>
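
A program reading XPAT's output can treat the Sync element as an end-of-results marker. The Python sketch below shows the idea on an in-memory stream; it is not the DLXS Perl code. The function read_until_sync is a made-up name, and io.StringIO stands in for the pipe to a running XPAT process.

```python
import io


def read_until_sync(stream, marker):
    """Collect output up to (not including) <Sync>marker</Sync>."""
    sentinel = "<Sync>%s</Sync>" % marker
    chunks = []
    for line in stream:
        if sentinel in line:
            # Keep anything that preceded the sentinel on the same line.
            chunks.append(line.split(sentinel, 1)[0])
            break
        chunks.append(line)
    return "".join(chunks)


# Simulated output of: a query, then ~sync "done", then a later query.
fake_output = io.StringIO("<SSize>14</SSize>\n<Sync>done</Sync>\n<SSize>3</SSize>\n")
print(read_until_sync(fake_output, "done"))
```

Using a different marker string for each request is one way to make sure a reader never mistakes the end of an earlier result for the end of the current one.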

Operators and Relations

These are the operators most used in the Middleware.
A ^ B
the "and" or "intersection" operator: A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have the exact same start offsets.
A + B + C + ...
the "or" or "union" operator: A, B, C... are sets. The resulting set (which is a point set if at least one of the sets being combined is a point set) consists of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) are removed from the resulting set.
A incl B

A not incl B
A is a region set, B is either a point or region set. The result is a region set of all members of A that contain at least one member of B, containment meaning that a given B has a start offset within the inclusive range of a given A's start and end offsets. The not form returns all members of A that do not contain any member of B.
A within B

A not within B
In many ways the complement to incl: A is a point or region set, B is a region set, the resulting set is all members of A that are contained (by the start offset rule as under incl) in any B. This also takes the not operator to return all A's that are not within any B.
A near B
A and B are either points or regions. The result is all A's whose start offsets are within # number of bytes of the start offset of any B (# is either explicitly stated (with near.#), or taken from the {proximity} setting). The not form returns all A's whose start offsets are not within the specified number of bytes from the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B
This is like the near operator, except that an A must be followed within the specified number of bytes by a B to be in the result set. This can also take the not operator.
not
This reverses the sense of the expression it modifies, usable with incl, within, near, and fby.
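
The containment and proximity rules above all come down to comparisons of start offsets, which the following Python sketch spells out. It models a region as a (start, end) tuple and a point as an integer; these representations, the function names, and the exact boundary treatment in fby (B strictly after A) are assumptions made for the illustration. The union and intersection operators are left out.

```python
def start(member):
    """Start offset of a point (int) or a region ((start, end) tuple)."""
    return member[0] if isinstance(member, tuple) else member


def incl(a, b):
    """Regions of a that contain the start offset of at least one member of b."""
    return [r for r in a if any(r[0] <= start(x) <= r[1] for x in b)]


def within(a, b):
    """Members of a whose start offset falls inside some region of b."""
    return [x for x in a if any(r[0] <= start(x) <= r[1] for r in b)]


def not_within(a, b):
    """Members of a whose start offset falls inside no region of b."""
    return [x for x in a if not any(r[0] <= start(x) <= r[1] for r in b)]


def near(a, b, n):
    """Members of a that start within n bytes of the start of some member of b."""
    return [x for x in a if any(abs(start(x) - start(y)) <= n for y in b)]


def fby(a, b, n):
    """Members of a that are followed within n bytes by the start of some member of b."""
    return [x for x in a if any(0 < start(y) - start(x) <= n for y in b)]


# A toy layout shaped like the TEI header used for the mainauthor query.
filedesc = [(10, 900)]
titlestmt = [(20, 300), (400, 700)]
sourcedesc = [(350, 800)]
author = [(235, 280), (450, 500), (520, 560)]

mainauthor = not_within(within(author, within(titlestmt, filedesc)), sourcedesc)
print(mainauthor)  # only the AUTHOR outside SOURCEDESC remains
```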
 

“quieton raw” mode

The default mode, in an interactive XPAT session, is "quietoff". This gives the results messages you have seen so far: numbered sets, byte offsets followed by snippets of SGML with ".." on either end, etc. Another mode, and the most useful for interacting with XPAT programmatically, is "quieton raw". Nothing seems to happen when one enters:

>> {quieton raw}

However, entering queries now produces results that are tagged in a way that is easily parsable from within a program. First enter an earlier point search:

firstsearch = ("Branivoj " + "Branivoj<")
<SSize>1</SSize>

pr
<PSet><Start>313615</Start><Raw><Size>64</Size>res du nom de Branivoj s'emparent du territoire qu'ils gouvernen</Raw></PSet>

Now enter an earlier region search:

((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))

<SSize>4</SSize>

pr.region.AUTHOR
<RSet><Start>143</Start><End>178</End><Raw><Size>36</Size> <AUTHOR>Holbach, Maude M. </AUTHOR></Raw><Start>298344</Start> <End>298391</End><Raw><Size>48</Size><AUTHOR>Yriarte, Charles, 1832-1898. </AUTHOR></Raw> <Start>792438</Start><End>792487</End><Raw><Size>50</Size> <AUTHOR>Laveleye, Emile de, 1822-1892. </AUTHOR></Raw><Start>1689410</Start> <End>1689486</End><Raw><Size>77</Size> <AUTHOR>Sebright, Georgina Mary Muir (Mackenzie), Lady, d. 1874- </AUTHOR></Raw></RSet>

Some of these tags are self-explanatory (e.g., SSize = set size). But some may need a bit of explanation.

PSet
These tags surround an entire set of point results.
RSet
These tags surround an entire set of region results.
Start
Byte offset of beginning of one result, either point or region.
End
Byte offset of end of one result, either point set string or region.
Raw
The "raw" information of one particular result.
Size
Size in bytes of the text of one result. In the examples above it is 64 for the point result (the default {printlength}) and End - Start + 1 for each region result.
text following the </Size> tag
Actual retrieved text of result.
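
To show why this tagging is convenient for programs, here is a Python sketch that pulls the pieces out of a PSet or RSet. It is not the DLXS middleware's parser. It assumes that exactly Size bytes of text follow each Size element, which holds for the point result shown above, and it uses Size rather than an XML parser because the retrieved text can itself contain markup. The names parse_results and set_size are made up.

```python
import re

# One result header: Start, an optional End (regions only), then Size.
HEADER = re.compile(rb"<Start>(\d+)</Start>\s*(?:<End>(\d+)</End>\s*)?<Raw><Size>(\d+)</Size>")


def parse_results(raw: bytes):
    """Return one dict per result found in a PSet or RSet."""
    results = []
    pos = 0
    while True:
        m = HEADER.search(raw, pos)
        if m is None:
            break
        size = int(m.group(3))
        body = raw[m.end():m.end() + size]  # assumption: Size counts the bytes that follow
        results.append({
            "start": int(m.group(1)),
            "end": int(m.group(2)) if m.group(2) else None,
            "text": body.decode("utf-8", errors="replace"),
        })
        pos = m.end() + size
    return results


def set_size(raw: bytes):
    """Return the number in <SSize>...</SSize>, or None if it is absent."""
    m = re.search(rb"<SSize>(\d+)</SSize>", raw)
    return int(m.group(1)) if m else None


PSET = (b"<PSet><Start>313615</Start><Raw><Size>64</Size>"
        b"res du nom de Branivoj s'emparent du territoire qu'ils gouvernen</Raw></PSet>")

for hit in parse_results(PSET):
    print(hit["start"], hit["text"])
```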

Programming XPAT queries in Perl

XPAT's ability to return results with tags allows a program to parse the results into pieces. In the DLXS Middleware this is done by a group of DLXS Perl modules. These modules have methods to let the CGI program interact with XPAT (an XPAT process is forked off by the CGI program and queries can be made of it at any time). The main object the code uses is the xpat object. It has methods for making queries in different ways and for interacting with the forked off XPAT process.

Here is some code (from TextClass.pm) that illustrates how the middleware uses a method of the Perl-based XPAT object (created in an earlier part of the code).

...
my $query = qq{(region mainheader incl ( $idnorgn incl "$idno" ) );};
my ( $error, $result) = $xpat->GetSimpleResultsFromQuery( $query );
if ( $error )
{
     &DlpsUtils::errorBail( qq{Query error in FindXPATContainingIdno: $result} );
}

&DlpsUtils::StripAllRSetCruft( \$result );
$result =~ m,<SSize>(\d+)</SSize>,;
my $hit = $1;
if ( $hit > 0 )
{
    $returnXpat = $xpat; last;
}
... 

While some code, such as this, makes a query via a simple method, most queries in the middleware are actually made by other means, through other objects and their methods. Once data has been prepared according to the DLXS Class DTDs, in terms of searching, the middleware can be thought of as an engine that simply "runs" the data.
