Query Language Details

Some documentation can be found at: http://www.hti.umich.edu/sgml/pat/pat50manual.html

Some additional material from past workshops concerning Query Language is available.

We're going to start out in user mode, and shift into what's called {quieton} and {quieton raw} once some basic concepts have been demonstrated. In the first few sections, I'm going to nonchalantly introduce items that I won't fully explain until the section on operators and relations.

We're going to start with just the regions that the sgmlrgn step gives us, because the stunts we learn here will make the fabricated regions step a lot easier to think about.

Invoking `pat50`

We've come all this way so that, in the end, we can type a command like:

% pat50 /l1/idx/b/bosnia.bosnia.dd

Identifying Points

In XPat a point is some unique byte offset in the full text under consideration, corresponding to those places where XPat was told to begin sistrings. We get a set (see below for a more formal discussion of sets) of points by performing a search for a string or a particular offset:

>> "mulberry"
>> "mulberry "
>> mulberry
>> [118312]

The first finds all sistrings that begin with mulberry, the second finds those that begin with "mulberry" followed by a space (that is, the word "mulberry" exactly), the third is the same as the first, and the fourth finds the point at byte offset 118312.
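To make the prefix-versus-exact distinction concrete, here is a minimal Python sketch of searching a set of index points. It is a model for thinking, not how XPat is implemented: the functions sistring_points and search are names I made up, the sample text is invented, and I am assuming that sistrings begin at the start of every word and that the text is plain ASCII (so character offsets equal byte offsets).

```python
import re

def sistring_points(text):
    """Assumed index points: the offset of the start of every word."""
    return [m.start() for m in re.finditer(r"\w+", text)]

def search(text, points, query):
    """Return the index points whose sistring begins with `query`."""
    return [p for p in points if text.startswith(query, p)]

# Invented sample text, ASCII only so offsets are byte offsets.
text = "the mulberry tree, mulberrylike leaves, a mulberry bush"
points = sistring_points(text)

prefix_hits = search(text, points, "mulberry")   # models >> "mulberry"
exact_hits = search(text, points, "mulberry ")   # models >> "mulberry "

print(prefix_hits)  # offsets of every word starting with "mulberry"
print(exact_hits)   # offsets where "mulberry" is followed by a space
```

The trailing space in the second query is what rules out "mulberrylike": the sistring at that point does not continue with a space.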

Searching for something that doesn't exist in the database gets you zero results, or a point set with zero members:

>> "syzygy"
  4: no match

NOTE: XPat supports the history command, which lists all sets created in the session and the commands that created them.

>> history

Identifying Regions

A region in pat50 is a span of text comprising zero or more bytes. Some regions are created for us by default by sgmlrgn50 (please see the sgmlrgn step discussion), DTD-determined regions are created in the sgmlrgn step, and new "fabricated" regions can be created as well.

The {ddinfo regionnames} command will list all the currently-defined regions, and from this list we can find out about particular regions:

>> region DIV1
  1: 113 matches
>> region "A-NODE"
  2: 125 matches

That is, the region command followed by the name of some region (remembering that a region with non-alphanumerics in its name must be double-quoted) evaluates to the set of all regions of that name, and pat50 reports the number of members in that set. There are 113 DIV1 regions and 125 A-NODE regions above.

We'll say more about the definitions of regions like A-NODE and DIV-T in the sgmlrgn step discussion.

Looking for a region that is not defined gets you a result set of "-1" and an error message.

See the discussion of ERROR.

Identifying Sets (Numbered and Named Sets)

Any collection of zero or more points or regions can be grouped together in a set. Sets can be combined or split with pat50's boolean operators. Every set created during a session has a unique numeric identifier, and sets can also be given names. Sets can be printed out, saved, exported, and imported. I gloss over some operators and commands here until the next section.

>> "long "
  1: 352 matches
>> region "DIV1" incl "long "
  2: 76 matches
>> "help "
  3: 100 matches
>> 2 + 3
  4: 176 matches
>> region "TEXT" incl 1
  5: 4 matches
>> vsearch = "vardar "
  6: vsearch = 10 matches
>> vsearchnext = 6
  7: vsearchnext = 10 matches
>> pr *vsearchnext

  1886962, ..l crosses the Vardar, and to this end a bridge is in process of ..
  2058198, ..fall into the Vardar (Axius); and two&mdash;the Lab and the Sitn..
   683365, .. région où le Vardar et la Morava prennent leur source, que pass..
  1818056, ..s; and on the Vardar Gate and Arch of Constantine<NOTE>"The Egna..
  2023056, .. of the river Vardar. Our host was a grumbling old man, who asto..
  1902124, ..side into the Vardar plain. The plain in its purple distance mel.
 
>>

The vsearchnext = 6 line is interesting: 6 is a number, but it might also be a character you want to search for. So is XPat looking for the numbered set 6 from this session, or for the string 6? This is another reason to always put search terms in quotes. Try a command like vsearchnext = 243, where 243 is a number larger than the set number of the last-created set...

There are additionally two special commands to create subsets of sets:

subset.X.Y A: Make a new set that consists of Y members of A, starting at the Xth member of A. Members of A start numbering at 1. This command is used to get result content in slices.
sample.X A: Make a new set that consists of X members of A, selected from A of size Y such that every (Y/X)th member is in the new set.
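Here is a small Python sketch of how I think about these two commands, with a set modelled as a list of offsets in occurrence order. The function names mirror the commands but are otherwise my own. The subset behaviour follows the description above directly; for sample I am assuming that selection starts at the first member and that the step is Y divided by X, rounded down, which the description does not spell out.

```python
def subset(x, y, a):
    """Model of subset.X.Y A: Y members of A, starting at the Xth (1-based)."""
    return a[x - 1 : x - 1 + y]

def sample(x, a):
    """Model of sample.X A: X members of A, taking every (Y/X)th member,
    where Y = len(A). Assumption: start at the first member, step = Y // X."""
    if x <= 0 or not a:
        return []
    step = max(len(a) // x, 1)
    return a[::step][:x]

# Ten made-up byte offsets standing in for a result set.
offsets = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

second_slice = subset(4, 3, offsets)   # members 4, 5 and 6
two_samples = sample(2, offsets)       # every 5th member, two in total

print(second_slice)
print(two_samples)
```

Fetching results a slice at a time is just a matter of calling subset with X advanced by Y on each request.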

How does XPat know which member of a set is the "first", "second", and so on? This is set with the {sortorder} setting. TextClass uses only {sortorder occur}, which is to say that results are returned in the byte order in which they occur in the source text: the byte offset of a member of a set is <= the byte offset of the next member, if any. TextClass as it stands orders results for display to the user by occurrence order, and any ordering other than that is accomplished outside XPat. That is not to say that the other {sortorder} settings shouldn't be used, just that nothing designed so far does. See {settings} below for more on the different settings.

Operators and Relations

After tantalizing with various operators, I'll now actually define the ones we use most, in the form in which they typically occur. You may wish to refer to an even more detailed discussion.

A ^ B: the "and" or "intersection" operator: A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have the exact same start offsets.
A + B + C + ...: the "or" or "union" operator: A, B, C... are sets. The resulting set is a point set if at least one of the sets being combined is a point set, consisting of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) will be removed from the resultant set. We saw an earlier example of the "+" operator.
A incl B: A is a region set, B is either point or region, the result is a region set of all members of A that contain at least one member of B, containment meaning that a given B has a start offset within the inclusive range of a given A's start and end offsets.
A within B: In many ways the complement to incl: A is a point or region set, B is a region set, the resulting set is all members of A that are contained (by the start offset rule as under incl) in any B. This also takes the not operator to return all A's that are not within any B.
A near B: A and B are either points or regions, and the distance # (a number of bytes) is either explicitly stated or taken from the {proximity} setting (see about {settings} below). The result is all A's whose start offsets are within the specified number of bytes of the start offset of any B. The not form returns all A's whose start offsets are not within the specified number of bytes of the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B: This works just like the near operator, save that an A must be followed within the specified number of bytes by a B to be in the result set. This also takes the not operator.
not: This reverses the sense of the expression it modifies, usable with incl, within, near, and fby.
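The start-offset rule is easier to see in code. The following Python sketch models the four operators that take not. It is a reading of the definitions above, not pat50's implementation: a point is an int, a region is a (start, end) tuple with inclusive ends, all the data is made up, and I am assuming that "within # bytes" means a distance of at most # and that fby requires the B to start strictly after the A.

```python
def start(m):
    """Start offset of a point (int) or a region ((start, end) tuple)."""
    return m[0] if isinstance(m, tuple) else m

def incl(a, b, negate=False):
    """A [not] incl B: regions of A containing the start of at least one B."""
    starts = [start(m) for m in b]
    return [r for r in a if any(r[0] <= s <= r[1] for s in starts) != negate]

def within(a, b, negate=False):
    """A [not] within B: members of A whose start lies inside any region of B."""
    return [m for m in a if any(r[0] <= start(m) <= r[1] for r in b) != negate]

def near(a, b, n, negate=False):
    """A [not] near B: members of A starting within n bytes of any B, either side."""
    starts = [start(m) for m in b]
    return [m for m in a if any(abs(start(m) - s) <= n for s in starts) != negate]

def fby(a, b, n, negate=False):
    """A [not] fby B: members of A followed within n bytes by the start of a B."""
    starts = [start(m) for m in b]
    return [m for m in a if any(0 < s - start(m) <= n for s in starts) != negate]

# Made-up data: three DIV1-like regions and two point sets.
div1 = [(0, 99), (100, 199), (200, 299)]
long_pts = [15, 120, 130]
help_pts = [40, 260]

print(incl(div1, long_pts))               # regions containing a "long" point
print(incl(div1, long_pts, negate=True))  # regions containing none
print(within(help_pts, [(0, 99)]))        # "help" points inside the first region
print(near(help_pts, long_pts, 30))       # order does not matter for near
print(fby(help_pts, long_pts, 30))        # but it does for fby
```

The last two lines show the difference between near and fby: the "help" point at 40 has a "long" point 25 bytes before it, which satisfies near but not fby.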

Using the Operators to Make Sets of Interest

Now that we have our basic concepts and operators, let's get in there and do some searching and document analysis: the process by which we figure out what is there and how the DTD was applied to this SGML, and what of it we can use. When developing an online system, this is the most important step. Some important commands will be introduced in this experimentation.

What a set of interest is is entirely up to the user, and the notion of user ranges from developers to content specialists to the patrons floating around out there. I'm going to walk through some increasingly complicated possible sets of interest here.

We ignore in this section the matter of displaying retrieved results. That comes in a later section.

Find all the words that start with "diff", and find all the words that are exactly "different".

Viewing the Sets We've Constructed

Now for what we've all been waiting for: we have some results or sets of interest, and we want to look at them. The two commands for viewing results are pr and save. In a sense, they are really the same command: pr displays to STDOUT, save displays to {savefile}. Since they behave the same way, I will use pr in my examples.

NOTE: save appends to the current {savefile}.

The kind of text for each result that XPat returns with pr and save is determined by the current {quieton} setting (which see, below under {settings}). There is a big difference between the normal user-sitting-at-the-pat-terminal interaction mode, and the machine-readable modes.

pr (point-set): This prints out up to {ordersize} (see {settings} below) members of the point-set, starting with the first, according to the current {sortorder} setting.
pr.X shift.-Y (point-set): For the first results in (point-set), this prints a string X bytes wide that begins Y bytes to the left of the matching point. X and Y override the {settings} of {printlength} and {leftcontext} respectively, which see below.
pr.region."region-name" (region-set of type "region-name"): Prints the entire span of the members in (region-set). It is a bit of a pain to have to tell pat50 the "format" of the region you would like to see, when it should already know!
pr: Given without a set, each of these forms is a variation on "print the last set created".
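The print window for a point is simple arithmetic, sketched here in Python. The function pr_window and the sample sentence are my own inventions for illustration; the defaults of 64 and 14 are the {printlength} and {leftcontext} defaults given under {settings}. I am assuming the window is simply clipped at the start and end of the text, and that the text is ASCII so characters and bytes coincide.

```python
def pr_window(text, point, printlength=64, leftcontext=14):
    """Model of pr.X shift.-Y: X = printlength bytes, starting Y = leftcontext
    bytes to the left of the matching point (clipped at the text boundaries)."""
    begin = max(point - leftcontext, 0)
    return text[begin : begin + printlength]

# Invented sample sentence; the point is the start of "Vardar".
text = "The road at last crosses the Vardar, and a bridge is being built there."
point = text.index("Vardar")

default_window = pr_window(text, point)        # like plain pr
narrow_window = pr_window(text, point, 30, 10) # like pr.30 shift.-10

print(default_window)
print(narrow_window)
```

With 14 bytes of left context, the match starts at the 15th character of the window, which is index 14 when counting from zero.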

`{settings}`

These are settings that control certain behaviors of pat50 during a search session. There is only one setting that our programs use explicitly as a set options command, the {quieton} command. The other settings that are used by TextClass search strategies are made explicit through the commands in which they are relevant, and aren't ever set with a set options type command. I list here the {quieton} variants used in TextClass, and then those {settings} explicitly used in commands but not set per se. There are other {settings} that we don't use.

{quieton}: {quieton} and {quieton raw} change the interaction mode of pat50 from whatever it was to one of the quieton modes. The {quieton} modes have no user command prompt; multiple commands are separated with a ;, and zero or more ;-separated commands are sent to pat50 with a newline. pat50 returns information about sets and prints out results delimited by special tags:
{printlength #}: This setting controls the default print window size for point sets, how many total bytes are printed when a point set result is printed. See the discussion of pr above. Default is 64.
{leftcontext #}: This setting controls how many characters before the matching text will be printed when a point set is printed. If there are 100 characters of {printlength}, and 14 of {leftcontext}, then the point where the matching text starts will be the 15th character. See the discussion of pr above. Default is 14.
{sortorder <order>}: This determines in what order a given set of results is sorted by pat50. I always use {sortorder occur}, but there are other modes.
{savefile "file"}: Changes the default save file name.
{exportfile "file"}: Changes the default export file name.
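Since the {quieton} modes take ;-separated commands terminated by a newline, a driving program mostly needs to assemble one line of text. The Python sketch below shows one way that might look; it is not taken from the TextClass perl modules. The function build_batch and the token "done" are hypothetical, I am assuming no whitespace is needed around the ;, and I append the ~sync command described under the miscellaneous commands so that a reader of the output can tell when pat50 has finished. What pat50 sends back is not modelled here.

```python
def build_batch(commands, sync_token="done"):
    """Join commands with ';' and finish with ~sync so the caller knows
    where the results end. Returns one newline-terminated line."""
    parts = list(commands) + [f'~sync "{sync_token}"']
    return ";".join(parts) + "\n"

# Commands taken from the session shown earlier, using a named set.
batch = build_batch(['vsearch = "vardar "', "pr *vsearch"])

print(batch, end="")
```

Naming the set (vsearch) rather than relying on its number keeps the batch independent of how many sets the session has already created.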

Miscellaneous and Useful Commands

{ddinfo regionnames}: Lists all the currently-defined regions. A very useful command for document analysis.
history: Lists the result sets from previously issued searches.
~sync "string": A fabulously useful command, basically an echo sort of command. We use this in the TextClass perl modules to signal when pat50 is done sending results. In any of the {quieton} modes, this returns:
~qnum: Returns the number pat50 will assign to the next-created set. This is useful when you're making a lot of sets that don't have names, and you need to keep track of what you have (though our usual strategy is to explicitly name things...). In any of the {quieton} modes, this returns: