Query Language Details

These are some additional, highly detailed examples of XPat searching. Note: they are included here for completeness' sake; only some of them are used within the middleware.


Special commands that generate point sets:

"a".."c"
"198"~"199"
This is the range-searching command. The first example finds all "words" that begin with letters ranging from a to c; the second finds all words that begin with the numbers ranging from 198 to 199.
shift.# A
Make a new point set consisting of all the points that are # bytes to the right of the start byte offsets of the members of set A (a negative number shifts to the left). A can be a region set or a point set; the result is always demoted to a point set.
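
The shift behavior can be sketched outside XPat. The following Python is only an illustrative model, not XPat code: representing a point as a bare integer offset and a region as a (start, end) tuple is an assumption made for the example, as are the names start_of and shift.

```python
from typing import Iterable, List, Tuple, Union

Region = Tuple[int, int]      # hypothetical model: (start, end) byte offsets
Member = Union[int, Region]   # a point is modelled as a bare start offset


def start_of(member: Member) -> int:
    """Return the start byte offset of a point or a region."""
    return member if isinstance(member, int) else member[0]


def shift(n: int, members: Iterable[Member]) -> List[int]:
    """Model of `shift.n A`: move each start offset n bytes (left if n < 0).

    The result holds bare offsets only, mirroring the demotion to a point set.
    """
    return sorted({start_of(m) + n for m in members})


regions = [(10, 20), (40, 55)]
print(shift(3, regions))    # offsets 3 bytes right of each region start
print(shift(-2, [5, 9]))    # a negative count shifts left
```

The sketch does not model what happens when a shift runs past either end of the file, since the text above does not say.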

Operators and Relations:

This section discusses the operators in more detail, in particular the types of sets that result
when an operator is applied to region sets, point sets, or a combination of the two. Some of the
extended forms of the operators are also discussed.
A ^ B
The "and" or "intersection" operator. A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have exactly the same start offsets. It makes little sense to "and" together two point sets unless those two point sets contain heterogeneous members. "And"-ing two region sets can be spectacularly useful. "And"-ing together a point set and a region set has the interesting property that the result seems to inherit "regionness".
A - B
The "minus" or "difference" operator. A and B are both sets; the resulting set consists of those members of A that do not share a start offset with any member of B. It behaves much the same as the ^ operator in terms of how appropriate it is to points or regions.
A + B + C + ...
The "or" or "union" operator. A, B, C, ... are sets. If at least one of the sets being combined is a point set, the result is a point set consisting of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) are removed from the resulting set. If we were to + together all the P and B regions in an HTML database, for instance, all the B elements nested inside P elements would be removed, leaving just the P's and the B's not in a P. I question the utility of this behavior, and the search strategies the SSP platform takes will later be seen to avoid it at all costs.
A incl B
A incl.# B
A not incl B
A is a region set and B is either a point set or a region set. The result is a region set of all members of A that contain at least one member of B, where containment means that a given B has a start offset within the inclusive range of a given A's start and end offsets. The alternative form takes a number # giving the minimum number of B's that an A must contain. incl can take the not operator to return all A's that don't contain any B's (without a # argument), or all A's that don't contain # or more B's.
A within B
A not within B
In many ways the complement to incl: A is a point or region set and B is a region set. The resulting set is all members of A that are contained (by the start offset rule described under incl) in any B. This also takes the not operator, returning all A's that are not within any B.
A near B
A near.# B
A not near B
A not near.# B
A and B are either points or regions, and # is either explicitly stated, or taken from the {proximity} setting. The result is all A's whose start offsets are within the specified number of bytes of the start offset of any B. The not form returns all A's whose start offsets are not within the specified number of bytes from the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B
A fby.# B
A not fby B
A not fby.# B
This works just like the near operator, except that an A must be followed within the specified number of bytes by a B to be in the result set. It also takes the not operator.
not
This reverses the sense of the operator it modifies. It is usable with incl, within, near, and fby.
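
The start-offset rules for ^, - and + described above can be modelled in a few lines of Python. This is a rough sketch under stated assumptions, not XPat's implementation: points are bare integers, regions are (start, end) tuples, and the function names are invented for the example. Two details are guesses: that a region wins over a point sharing its start offset when "and"-ing (the "regionness" remark above), and that when two regions share a start offset the longer one counts as the outer region.

```python
from typing import Dict, Iterable, List, Tuple, Union

Region = Tuple[int, int]      # hypothetical model: (start, end) byte offsets
Member = Union[int, Region]   # a point is modelled as a bare start offset


def start_of(member: Member) -> int:
    """Return the start byte offset of a point or a region."""
    return member if isinstance(member, int) else member[0]


def by_start(members: Iterable[Member]) -> Dict[int, Member]:
    """Index members by start offset, preferring a region over a point."""
    index: Dict[int, Member] = {}
    for m in members:
        s = start_of(m)
        if s not in index or isinstance(index[s], int):
            index[s] = m
    return index


def intersect(a: Iterable[Member], b: Iterable[Member]) -> List[Member]:
    """Model of `A ^ B`: keep members whose start offset occurs in both sets."""
    ia, ib = by_start(a), by_start(b)
    shared = sorted(ia.keys() & ib.keys())
    # Assumption: when a point and a region share a start, keep the region.
    return [ib[s] if isinstance(ia[s], int) else ia[s] for s in shared]


def difference(a: Iterable[Member], b: Iterable[Member]) -> List[Member]:
    """Model of `A - B`: members of A whose start offset is not used in B."""
    b_starts = {start_of(m) for m in b}
    return [m for m in a if start_of(m) not in b_starts]


def union(*sets: Iterable[Member]) -> List[Member]:
    """Model of `A + B + ...` with the point/region rules described above."""
    members = [m for s in sets for m in s]
    if any(isinstance(m, int) for m in members):
        # Any point present: the result is demoted to start offsets only.
        return sorted({start_of(m) for m in members})
    regions = sorted(set(members))

    def nested(r: Region) -> bool:
        # r starts inside some other region o; on a shared start the longer
        # region is treated as the outer one (an assumption of this sketch).
        return any(
            o[0] <= r[0] <= o[1] and (o[0] < r[0] or o[1] > r[1])
            for o in regions
        )

    return [r for r in regions if not nested(r)]
```

With P regions [(0, 100), (200, 300)] and B regions [(10, 20), (150, 160)], union drops (10, 20) because it nests inside the first P, which is the P and B case described under the + operator.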

Using the Operators to Make Sets of Interest

Now that we have our basic concepts and operators, let's get in there and do some searching and document analysis: the process by which we figure out what is there and how the DTD was applied to this SGML, and what of it we can use. When developing an online system, this is the most important step. Some important commands will be introduced in this experimentation.
NOTE: All the search and set identification strategies from here on out are heavily influenced by what SSP content-specialists and developers have cared about over the years. I will not claim that this is the only way to go about searching or retrieving text.
What counts as a set of interest is entirely up to the user, and the notion of user ranges from developers to content specialists to the patrons floating around out there. I'm going to walk through some increasingly complicated possible sets of interest here. These examples speak to the concept of fabricated regions, which are more the domain of the content specialist.
find all the words that start with "diff", and find all the words that are "different" exactly
>> "diff" 

  8: 9 matches

>> "different "

  9: 3 matches

>>
find all the places where "pie" follows "apple" within 20 bytes
>> "apple " fby.20 "pie "

  1: 2 matches

>>
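
To make the difference between near and fby concrete, here is a small Python model of the two proximity tests. It is a sketch only: the offsets are invented, the names near and fby are just labels for the example functions, the distance is treated as inclusive, and fby is assumed to require the B to start strictly after the A. None of those boundary details are confirmed by the description above, and the {proximity} default is not modelled.

```python
from typing import Iterable, List, Tuple, Union

Region = Tuple[int, int]      # hypothetical model: (start, end) byte offsets
Member = Union[int, Region]   # a point is modelled as a bare start offset


def start_of(member: Member) -> int:
    """Return the start byte offset of a point or a region."""
    return member if isinstance(member, int) else member[0]


def near(a: Iterable[Member], b: Iterable[Member], n: int,
         negate: bool = False) -> List[Member]:
    """Model of `A near.n B`: B may start before or after A."""
    b_starts = [start_of(m) for m in b]
    return [m for m in a
            if any(abs(s - start_of(m)) <= n for s in b_starts) != negate]


def fby(a: Iterable[Member], b: Iterable[Member], n: int,
        negate: bool = False) -> List[Member]:
    """Model of `A fby.n B`: B must start after A, within n bytes."""
    b_starts = [start_of(m) for m in b]
    return [m for m in a
            if any(0 < s - start_of(m) <= n for s in b_starts) != negate]


apple = [100, 310]   # made-up start offsets of "apple " hits
pie = [106, 300]     # made-up start offsets of "pie " hits
print(near(apple, pie, 20))               # both apples have a pie nearby
print(fby(apple, pie, 20))                # only the apple at 100 is followed
print(fby(apple, pie, 20, negate=True))   # model of `not fby.20`
```
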
find all the single lines/stanzas/poems where "pie" follows "apple" within 20 characters. Here we need to have spent some time with the DTD documentation to know that lines are L elements, stanzas are STANZA elements, and poems themselves are POEM elements in the SGML.
>> region L incl ("apple " fby.20 "pie ")

  2: 2 matches

>> region STANZA incl ("apple " fby.20 "pie ")

  3: 2 matches

>> region POEM incl ("apple " fby.20 "pie ")

  4: one match

>>
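
The containment rule behind incl and within (a member counts as contained when its start offset falls inside the region's inclusive start and end offsets) can be sketched in Python as follows. The tuple representation, the function names and the at_least and negate parameters are assumptions made for this illustration, not part of XPat.

```python
from typing import Iterable, List, Tuple, Union

Region = Tuple[int, int]      # hypothetical model: (start, end) byte offsets
Member = Union[int, Region]   # a point is modelled as a bare start offset


def start_of(member: Member) -> int:
    """Return the start byte offset of a point or a region."""
    return member if isinstance(member, int) else member[0]


def contains(region: Region, member: Member) -> bool:
    """Start-offset rule: the member starts inside the region, inclusive."""
    return region[0] <= start_of(member) <= region[1]


def incl(a: Iterable[Region], b: Iterable[Member], at_least: int = 1,
         negate: bool = False) -> List[Region]:
    """Model of `A incl.# B` (and `A not incl.# B` when negate is True)."""
    b_members = list(b)
    result = []
    for region in a:
        count = sum(1 for m in b_members if contains(region, m))
        if (count >= at_least) != negate:
            result.append(region)
    return result


def within(a: Iterable[Member], b: Iterable[Region],
           negate: bool = False) -> List[Member]:
    """Model of `A within B` (and `A not within B` when negate is True)."""
    b_regions = list(b)
    return [m for m in a
            if any(contains(r, m) for r in b_regions) != negate]


lines = [(0, 30), (31, 60), (61, 90)]   # made-up L regions
hits = [12, 70]                         # made-up match offsets
print(incl(lines, hits))                # lines that contain a hit
print(within(hits, [(0, 30)]))          # hits that fall in the first line
```

Note that incl returns members of A (the regions), while within returns members of A (the hits), which is the same distinction the stanza examples further down rely on.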
find all the places where "orange" appears at the end of a line of poetry
>> "orange</l>"

  5: no match

>> "orange " fby.20 "</l>"

  6: 17 matches

>>
find all stanzas with at least 6 lines
>> region STANZA incl.6 region L

  8: 5801 matches

>>
find all stanzas with exactly 6 lines
>> region STANZA incl.6 region L

  8: 5801 matches

>> region STANZA incl.7 region L

  9: 5297 matches

>> 8 - 9

  10: 504 matches

>>
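
The subtraction works because every stanza with seven or more lines is also in the "six or more" set, so removing it leaves exactly the six-line stanzas; the counts in the session agree (5801 - 5297 = 504). The same "at least n minus at least n+1" idea is sketched below in Python on made-up data. The representation and function names are assumptions for the example, not XPat internals.

```python
from typing import Iterable, List, Tuple

Region = Tuple[int, int]   # hypothetical model: (start, end) byte offsets


def incl(a: Iterable[Region], b: Iterable[Region],
         at_least: int = 1) -> List[Region]:
    """Regions of A containing the start of at least `at_least` members of B."""
    b_regions = list(b)
    return [r for r in a
            if sum(1 for m in b_regions if r[0] <= m[0] <= r[1]) >= at_least]


def difference(a: Iterable[Region], b: Iterable[Region]) -> List[Region]:
    """Members of A whose start offset is not shared by any member of B."""
    b_starts = {m[0] for m in b}
    return [r for r in a if r[0] not in b_starts]


def exactly(a: Iterable[Region], b: Iterable[Region], n: int) -> List[Region]:
    """Regions of A with exactly n members of B: (>= n) minus (>= n + 1)."""
    a_regions, b_regions = list(a), list(b)
    return difference(incl(a_regions, b_regions, n),
                      incl(a_regions, b_regions, n + 1))


def make_lines(start: int, count: int) -> List[Region]:
    """Build `count` made-up 10-byte line regions beginning at `start`."""
    return [(s, s + 9) for s in range(start, start + 10 * count, 10)]


stanzas = [(0, 99), (100, 199), (200, 299)]
lines = make_lines(0, 5) + make_lines(100, 6) + make_lines(200, 7)
print(exactly(stanzas, lines, 6))   # only the stanza holding six lines
```
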
find six-line stanzas with words starting with "the" in them
>> ((region STANZA incl.6 region L) - (region STANZA incl.7 region L)) incl "the "

  14: 429 matches

>>
[a region set of STANZA regions]
find all the words starting with "the" that occur in six-line stanzas
>> "the " within ((region STANZA incl.6 region L) - (region STANZA incl.7 region L))

  15: 1187 matches

>>
[a point set of points where "the" is]
find all the poems that are classified as being written or published between 1801 and 1850. Here we need to have spent time with the DTD documentation to know that the CBEL element contains information about publication and authorship, and that a given CBEL element lives in a POETGRP element, which contains all the poems by a single poet. We also need to know what will show up in the CBEL element (this last bit skips ahead to displaying sets).
>> region POEM within (region POETGRP incl (region CBEL incl "1801-1850"))

  19: 675 matches

>> region POEM within (region POETGRP incl "<CBEL>1801-1850")

  20: 675 matches

>>
[are we absolutely certain that CBEL doesn't ever take attributes?!]
roughly how many pages of text are there in this database? For Chadwyck-Healey databases, an estimate can often be obtained from the number of PB (page break) elements they use.
>> region PB

  22: 5650 matches

>>
[is this the only way they mark pages?]

what typographical renderings are used in this transcription and representation of the original text? Here we need to have found out from the DTD documentation that the R attribute on a lot of different elements holds rendering information.

>> region "A-R"

  31: 155536 matches

>>
We want to search in poems by women for instances of "mother". Here we need to find out how (IF!) gender is attached to poems or poets (in the example here from DAAP, POEM has a required GENDER attribute, which takes "male" or "female" as a value).
>> "mother " within (region POEM incl (region "A-GENDER" incl "female"))

  36: 266 matches

>>