Query Language Details

Some documentation can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html

Some additional material from past workshops concerning Query Language is available.

We're going to start out in user mode, and shift into what are called the {quieton} and {quieton raw} modes once some basic concepts have been demonstrated. In the first few sections, I'm going to nonchalantly introduce items that I won't fully explain until the section on operators and relations.

We're going to start with just the regions that the sgmlrgn step gives us, because the stunts we learn here will make the fabricated regions step a lot easier to think about.

Invoking pat50

We've come all this way in order to type a command like:
% pat50 /l1/idx/b/bosnia.bosnia.dd

Identifying Points

In XPat a point is some unique byte offset in the full text under consideration, corresponding to those places where XPat was told to begin sistrings. We get a set (see below for a more formal discussion of sets) of points by performing a search for a string or a particular offset:
>> "mulberry"
>> "mulberry "
>> mulberry
>> [118312]
The first finds all sistrings that begin with mulberry, the second finds those that are "mulberry" exactly (the trailing space ends the word), the third is the same as the first, and the fourth finds the point at byte offset 118312.
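
The idea of searching sistrings by prefix can be sketched in a few lines of Python. This is only an illustration of the concept, not how XPat is implemented: build_index and search are hypothetical names, the choice of word starts as index points is an assumption, and so is the case folding (suggested by "vardar " matching "Vardar" later in this document).

```python
import bisect

def build_index(text):
    """Collect the offsets where sistrings begin (assumed here: word starts)
    and sort them by the text that begins at each offset."""
    starts = [i for i in range(len(text))
              if text[i].isalnum() and (i == 0 or not text[i - 1].isalnum())]
    return sorted(starts, key=lambda i: text[i:].lower())

def search(text, index, query):
    """Return, in occurrence order, the offsets whose sistring begins with query."""
    q = query.lower()
    # Truncating sorted sistrings to len(q) keeps them in sorted order,
    # so a binary search for the exact prefix finds the matching run.
    keys = [text[i:i + len(q)].lower() for i in index]
    lo = bisect.bisect_left(keys, q)
    hi = bisect.bisect_right(keys, q)
    return sorted(index[lo:hi])

text = "The mulberry tree. Mulberry-colored silk; mulberries too."
index = build_index(text)
print(search(text, index, "mulberry"))   # prefix match
print(search(text, index, "mulberry "))  # trailing space: the exact word only
```

The trailing space in the second query is what restricts the match to the whole word, just as in the "mulberry " search above.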

Searching for something that doesn't exist in the database gets you zero results, or a point set with zero members:

>> "syzygy"
  4: no match
NOTE: XPat supports the history command, through which you could get a list of all sets created in the session, and the command that created them.

    >> history

Identifying Regions

A region in pat50 is a span of text comprising zero or more bytes. Some regions are created for us by default (please see the sgmlrgn step discussion), the sgmlrgn step creates DTD-determined regions, and new "fabricated" regions can also be defined.

The {ddinfo regionnames} command will list all the currently-defined regions, and from this list we can find out about particular regions:

>> region DIV1
  1: 113 matches
>> region "A-NODE"
  2: 125 matches
That is, the region command followed by the name of some region (remembering that regions with non-alphanumerics in their names must be double-quoted) evaluates to the set of all regions of that name and reports how many members it has. There are 113 DIV1 regions and 125 A-NODE regions above.

We'll say more about the definitions of regions like A-NODE and DIV-T in the sgmlrgn step discussion.

Looking for a region that is not defined gets you a result set of "-1" and an error message.

See the discussion of errors under {quieton} below.

Identifying Sets (Numbered and Named Sets)

Any collection of zero or more points or regions can be grouped together in a set. Sets can be combined or split with pat50's boolean operators. Every set created during a session has a unique numeric identifier and can also be given a name. Sets can be printed out, saved, exported and imported. I gloss over some operators and commands here until the next section.
>> "long "
  1: 352 matches
>> region "DIV1" incl "long "
  2: 76 matches
>> "help "
  3: 100 matches
>> 2 + 3
  4: 176 matches
>> region "TEXT" incl 1
  5: 4 matches
>> vsearch = "vardar "
  6: vsearch = 10 matches
>> vsearchnext = 6
  7: vsearchnext = 10 matches
>> pr *vsearchnext
  1886962, ..l crosses the Vardar, and to this end a bridge is in process of ..
  2058198, ..fall into the Vardar (Axius); and two—the Lab and the Sitn..
   683365, .. région où le Vardar et la Morava prennent leur source, que pass..
  1818056, ..s; and on the Vardar Gate and Arch of Constantine<NOTE>"The Egna..
  2023056, .. of the river Vardar. Our host was a grumbling old man, who asto..
  1902124, ..side into the Vardar plain. The plain in its purple distance mel.
The vsearchnext = 6 line is interesting: 6 is a number, but it might also be a character you want to search for. So is XPat looking for a numbered set in this session, or for the string 6? This is another reason to always put search terms in quotes. Try a command like vsearchnext = 243 where the 243 is a number that is larger than the set number for the last-created set...

There are additionally two special commands to create subsets of sets:

subset.X.Y A
Make a new set that consists of Y members of A, starting at the Xth member of A. Members of A start numbering at 1.
This command is used to get result content in slices.
sample.X A
Make a new set that consists of X members of A: if A has Y members, every (Y/X)th member of A is placed in the new set.
How does XPat know which member of a set is the "first", "second", and so on? This is set with the {sortorder} setting. TextClass uses only {sortorder occur}, which is to say that results are returned in the byte order in which they occur in the source text: the byte offset of a member of a set is <= the byte offset of the next member, if any. TextClass as it stands orders results for display to the user by occurrence order, and any ordering other than that is accomplished outside XPat. That is not to say that the other {sortorder} settings shouldn't be used, just that nothing designed so far does. See {settings} below for more on the different settings.
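
The slicing behaviour of subset and sample can be modelled on a plain Python list kept in occurrence order. This is a sketch under assumptions: the function names are hypothetical, and exactly which member sample picks from each stride is a guess, since the description above only says "every (Y/X)th".

```python
def subset(members, x, y):
    """Model of subset.X.Y A: Y members of A, starting at the Xth (1-based)."""
    return members[x - 1 : x - 1 + y]

def sample(members, x):
    """Model of sample.X A: X members spread evenly over the Y members of A.
    Assumption: the first member of each stride of Y/X is the one kept."""
    y = len(members)
    if x <= 0 or y == 0:
        return []
    if x >= y:
        return list(members)
    step = y / x
    return [members[int(k * step)] for k in range(x)]

# Byte offsets taken from the "vardar " printout above, in {sortorder occur} order.
hits = sorted([1886962, 2058198, 683365, 1818056, 2023056, 1902124])
print(subset(hits, 1, 2))   # the first slice of two results
print(subset(hits, 5, 25))  # asking past the end just returns what is left
print(sample(hits, 3))      # every second member
```

Asking for a slice longer than the set, as in subset.1.25 on a one-member set later in this document, simply returns the members that exist in this model.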

Operators and Relations

After tantalizing with various operators, we'll now actually define the ones we use most, in the form in which they typically occur. You may wish to refer to an even more detailed discussion.
A ^ B
the "and" or "intersection" operator: A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have the exact same start offsets.
A + B + C + ...
the "or" or "union" operator: A, B, C... are sets, the resulting set is a point set if at least one of the sets being combined is a point set, consisting of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) will be removed from the resultant set.
We saw an earlier example of the "+" operator.
A incl B

A not incl B
A is a region set, B is either point or region, the result is a region set of all members of A that contain at least one member of B, containment meaning that a given B has a start offset within the inclusive range of a given A's start and end offsets. The not form returns all members of A that contain no member of B.
A within B

A not within B
In many ways the complement to incl: A is a point or region set, B is a region set, the resulting set is all members of A that are contained (by the start offset rule as under incl) in any B. This also takes the not operator to return all A's that are not within any B.
A near B
A and B are either points or regions, and # is a number of bytes that is either explicitly stated or taken from the {proximity} setting (see about {settings} below). The result is all A's whose start offsets are within the specified number of bytes of the start offset of any B. The not form returns all A's whose start offsets are not within the specified number of bytes from the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B
This is just as the near operator, save that an A must be followed within the specified number of bytes by a B to be in the result set. This also takes the not operator.
not
This reverses the sense of the expression it modifies; it is usable with incl, within, near, and fby.
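
The start-offset rules in the definitions above can be modelled in Python, with points as integers and regions as (start, end) tuples. This is a rough sketch for reasoning about the operators, not XPat's implementation: all function names are hypothetical, the byte distance is passed explicitly as n, and treating fby as "strictly after" is an assumption.

```python
def start(m):
    """Start offset of a point (int) or a region ((start, end) tuple)."""
    return m[0] if isinstance(m, tuple) else m

def contains(region, member):
    """Containment by the start offset rule: the member's start lies in the
    inclusive range of the region's start and end offsets."""
    s, e = region
    return s <= start(member) <= e

def incl(a, b, negate=False):
    """A incl B / A not incl B: regions of A that contain (no) member of B."""
    return [r for r in a if any(contains(r, m) for m in b) != negate]

def within(a, b, negate=False):
    """A within B / A not within B: members of A contained in (no) region of B."""
    return [m for m in a if any(contains(r, m) for r in b) != negate]

def near(a, b, n, negate=False):
    """A near B: members of A starting within n bytes of some B, either side."""
    return [m for m in a
            if any(abs(start(m) - start(o)) <= n for o in b) != negate]

def fby(a, b, n, negate=False):
    """A fby B: members of A followed within n bytes by some B."""
    return [m for m in a
            if any(0 < start(o) - start(m) <= n for o in b) != negate]

def intersect(a, b):
    """A ^ B: members of A whose start offset is also a start offset in B."""
    starts_b = {start(m) for m in b}
    return [m for m in a if start(m) in starts_b]

div1 = [(0, 99), (100, 199), (200, 299)]   # made-up region set
long_hits, help_hits = [10, 150], [160, 250]  # made-up point sets
print(incl(div1, long_hits))               # regions holding a "long " hit
print(incl(div1, long_hits, negate=True))  # regions holding none
print(fby(long_hits, help_hits, 20))       # "long " followed by "help "
```

Because every operator returns a list of the same kind of members it was given as A, calls nest the same way the parenthesized XPat expressions do.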

Using the Operators to Make Sets of Interest

Now that we have our basic concepts and operators, let's get in there and do some searching and document analysis: the process by which we figure out what is there and how the DTD was applied to this SGML, and what of it we can use. When developing an online system, this is the most important step. Some important commands will be introduced in this experimentation.

What a set of interest is is entirely up to the user, and the notion of user ranges from developers to content specialists to the patrons floating around out there. I'm going to walk through some increasingly complicated possible sets of interest here.

We ignore in this section the matter of displaying retrieved results. That comes in a later section.

find all the words that start with "diff", and find all the words that are "different" exactly
>> "diff" 
  8: 354 matches
>> "different "
  9: 134 matches
find every "vardar" that is followed by "gate"
>> "vardar " fby "gate "
  1: one match

Now some actual examples from the TextClass implementation.  This query is actually the basis for the fabricated region called mainauthor in Bosnia and illustrates within:

>> ((region AUTHOR within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC))
  17: 4 matches

The motivation here depends on knowing that the AUTHOR element appears within the TITLESTMT, and that the TITLESTMT element appears both directly within the FILEDESC and, by way of BIBLFULL, indirectly within the SOURCEDESC element:

<!ELEMENT fileDesc - - (titleStmt, ..., (sourceDesc)+)>
<!ELEMENT titleStmt - - ((title)+, (author | editor | respStmt)*)>
<!ELEMENT sourceDesc - - (p | bibl | biblFull)+>
<!ELEMENT biblFull - - (titleStmt, ..., (sourceDesc)*)>

Here is a simplified query using intersection (^) to fetch the regions that are notes in the Bosnia SGML. The full-blown form is the union of DIV2 and P tags in addition to DIV1 tags.

>> (region "DIV1-T" incl "NODE=aas7611.0001.001:11")
 1: one match
>> region DIV1 ^ 1
 2: 7 matches

So we have the DIV1 regions which correspond exactly with DIV1 tags containing the "id" attribute, which is how notes are marked up in Yeats.

Suppose we have constructed a query which has returned a PSet consisting of hits on a term the user has entered to search on, and now we would like to display the immediate context of the hit and also a title from an enclosing division.

The query for the user's search is simply:

>> firstsearch = ("Branivoj " + "Branivoj<")
  2:  firstsearch = one match

To get a division title for the hit we need to build up regions based on the hit:

>> slicesearch = subset.1.25 *firstsearch
  3: slicesearch = one match
>> mainslicesearch = (region DLPSTEXTCLASS incl *slicesearch)
  4: mainslicesearch = one match
>> mainheader = (region HEADER within *mainslicesearch)
  5: mainheader = one match

Finally to view the content of the region we have constructed we do:

>> pr.region.mainheader (region mainheader)

The next section discusses the pr command, which is the heart of viewing sets. Of course, we are not finished at this point. Getting the data back from XPat is just one step. It is followed by filtering operations (perl substitutions using regular expressions) to remove other tags that may be mucking up our content and to change the appearance of the content, e.g. highlighting hits.
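
The kind of filtering described above might look roughly like this. The sketch uses Python's re module as a stand-in for the perl substitutions; the function names and the [[ ]] highlight markers are hypothetical and are not what TextClass actually emits.

```python
import re

# Anything that looks like an SGML tag: "<", some non-">" characters, ">".
TAG = re.compile(r"<[^>]+>")

def strip_tags(raw):
    """Remove tags that would otherwise clutter the displayed context."""
    return TAG.sub("", raw)

def highlight(text, term, before="[[", after="]]"):
    """Wrap every case-insensitive occurrence of term in marker strings."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: before + m.group(0) + after, text)

raw = 'on the Vardar Gate and Arch of Constantine<NOTE>"The Egna'
print(highlight(strip_tags(raw), "vardar"))
```

Stripping tags before highlighting keeps the highlight pattern from ever matching inside a tag name or attribute.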

Viewing the Sets We've Constructed

Now what we've all been waiting for, we have some results or sets of interest, and we want to look at them. The two commands for viewing results are pr and save. In a sense, they are really the same command: pr displays to STDOUT, save displays to {savefile}. Since they behave the same way, I will use pr in my examples.
NOTE: save appends to the current {savefile}.
The kind of text for each result that XPat returns with pr and save is determined by the current {quieton} setting (which see, below under {settings}). There is a big difference between the normal user-sitting-at-the-pat-terminal interaction mode, and the machine-readable modes.
pr (point-set)
This prints out up to {ordersize} (see {settings} below) members of the point-set, starting with the first, according to the current {sortorder} setting.
pr.X shift.-Y (point-set)
For the first results in (point-set), prints a string X bytes wide, starting Y bytes to the left of the matching point. X and Y override the {printlength} and {leftcontext} settings respectively, which see below.
pr.region."region-name" (region-set of type "region-name")
prints the entire span of the members in (region-set). This is a bit of a pain; to have to tell pat50 the "format" of the region you would like to see, when it should already know!

pr %
pr.X shift.-Y
All these are variations on "print the last set created".
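
The window that pr.X shift.-Y cuts around a point can be worked out with simple arithmetic. The following Python sketch assumes the defaults given for {printlength} and {leftcontext} in this document (64 and 14); print_window is a hypothetical name, and clamping at the start of the file is an assumption.

```python
def print_window(data, offset, printlength=64, leftcontext=14):
    """Return the bytes a pr of the point at `offset` would show:
    `printlength` bytes beginning `leftcontext` bytes before the point."""
    begin = max(0, offset - leftcontext)  # assumed: never read before byte 0
    return data[begin : begin + printlength]

data = b"x" * 20 + b"Vardar" + b"y" * 100
default = print_window(data, 20)        # like a plain pr
narrow = print_window(data, 20, 20, 5)  # like pr.20 shift.-5
print(default[14:20], narrow[5:11])     # the hit sits right after the left context
```

With 14 bytes of left context the matching text starts at the 15th byte of the window, which is index 14 when counting from zero.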


Settings

These are settings that control certain behaviors of pat50 during a search session. There is only one setting that our programs use explicitly as a set options command, the {quieton} command. The other settings that are used by TextClass search strategies are made explicit through the commands in which they are relevant, and aren't ever set with a set options type command. I list here the {quieton} variants used in TextClass, and then those {settings} explicitly used in commands but not set per se. There are other {settings} that we don't use.

{quieton}
{quieton raw}
{quieton} and {quieton raw} change the interaction mode of pat50 from whatever it was to one of the quieton modes.  The {quieton} modes have no user command prompt; multiple commands are separated with a ;, and zero or more ;-separated commands are sent to pat50 with a newline. pat50 returns information about sets and prints out results delimited by special tags:
SSize tags surround a number, meaning that the set created by the search corresponding to this SSize has number-many members.
When a region set is printed, all the members of the set printed are surrounded by a pair of RSet tags. In {quieton} mode, each result from the region set consists of two tags:
<Start>#</Start><End>#</End>
referring to the start and end byte offsets for that particular result. It is the responsibility of the programmer to already know or be able to handle how many results there are in this RSet (like, knowing what search generated the set). In {quieton raw} mode we get more information about each region result:
<Start>#</Start><End>#</End><Raw><Size>#</Size>blah blah blah</Raw>
Start and End are byte offsets as before, but Size is the byte length of the text delimited by the close Size tag and the close Raw tag.
When a point set is printed, all the members of the set printed are surrounded by a pair of PSet tags. In {quieton} mode, each result from the point set consists of one tag:
<Start>#</Start>
where the surrounded number refers to the byte offset of the point. In {quieton raw} mode, we get some more information:
<Start>#</Start><Raw><Size>#</Size>flug flug fluggy!</Raw>
Start is a byte offset, Size is a byte size, and the text of the point is delimited by the close Size and close Raw. The dimensions of the text printed depends on the combinations of the {printlength} and {leftcontext} settings, or their explicit definition with the pr command involved.
If some kind of non-fatal error occurred during a search, pat50 will, in lieu of any of the preceding tags, send an error tag with some hopefully helpful error message in it. The SSP CGI platform captures this, but doesn't always do a great job of letting the programmer/user know, and the SSP CGI platform always considers this fatal (i.e., the CGI script tries to bail out with a message whenever it gets this from pat50).
{quietoff} is used to bring the interaction back into the normal user interaction mode.
{printlength #}
This setting controls the default print window size for point sets, how many total bytes are printed when a point set result is printed. See the discussion of pr above. Default is 64.

{leftcontext #}
This setting controls how many characters before the matching text will be printed when a point set is printed. If there are 100 characters of {printlength}, and 14 of {leftcontext}, then the point where the matching text starts will be the 15th character. See the discussion of pr above. Default is 14.

{sortorder <order>}
This determines in what order a given set of results is sorted by pat50. I always use {sortorder occur}, but there are other modes.

{savefile "file"}
Changes the default save file name.

{exportfile "file"}
Changes the default export file name.
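
A program reading {quieton raw} output, as described under {quieton raw} above, has to use the Size value to know where each result's text ends, because the text itself may contain tags. The following Python sketch shows one way to do that. It assumes the results arrive exactly in the Start, optional End, Raw, Size layout shown above, and parse_raw_results is a hypothetical name, not part of the TextClass modules.

```python
import re

# Header of one result: <Start>, an optional <End> (regions only),
# then <Raw><Size>. The text follows immediately after </Size>.
HEAD = re.compile(rb"<Start>(\d+)</Start>(?:<End>(\d+)</End>)?<Raw><Size>(\d+)</Size>")

def parse_raw_results(payload):
    """Return a list of (start, end_or_None, text_bytes) tuples."""
    results = []
    pos = 0
    while True:
        m = HEAD.search(payload, pos)
        if m is None:
            break
        size = int(m.group(3))
        body_start = m.end()
        # Trust Size rather than scanning for </Raw>, since the text may hold tags.
        body = payload[body_start : body_start + size]
        end = int(m.group(2)) if m.group(2) else None
        results.append((int(m.group(1)), end, body))
        pos = body_start + size + len(b"</Raw>")
    return results

points = (b"<PSet><Start>20</Start><Raw><Size>6</Size>Vardar</Raw>"
          b"<Start>90</Start><Raw><Size>10</Size><B>x</B>!!</Raw></PSet>")
for start, end, text in parse_raw_results(points):
    print(start, end, text)
```

Working on bytes rather than decoded text keeps Size, which is a byte length, in step with the slicing.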

Miscellaneous and Useful Commands

{ddinfo regionnames}
Lists all the currently-defined regions. A very useful command for document analysis.

history
Lists the result sets from previously issued searches, along with the commands that created them.

~sync "string"
A fabulously useful command, basically an echo sort of command. We use this in the TextClass perl modules to signal when pat50 is done sending results. In any of the {quieton} modes, this returns:
Returns the number pat50 will assign to the next-created set. This is useful when you're making a lot of sets that don't have names, and you need to keep track of what you have (though our usual strategy is to explicitly name things...). In any of the {quieton} modes, this returns: