Query Language Details (Extra)

This is some additional, extremely detailed information not covered in the 'details' section.


Special commands that generate point sets:

"a".."c"
"198"~"199"
This is a range-searching command: the first example finds all "words" that begin with letters ranging from a to c, and the second finds all words that begin with the numbers ranging from 198 to 199.
shift.# A
Make a new point set consisting of all the points that are # bytes to the right of the start byte offsets of the members of set A (a negative number shifts to the left). A can be a region or point set; the result set is always demoted to a point set.
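The two commands above can be modeled outside pat50 with a few lines of Python. This is only a sketch of the semantics as described here, not of how pat50 implements them: the word list, the function names range_search and shift, and the sample offsets are all hypothetical, and case folding is ignored.

```python
def range_search(words, low, high):
    """Return words whose leading characters fall between low and high, inclusive."""
    return [w for w in words if low <= w[:len(low)] and w[:len(high)] <= high]


def shift(offsets, n):
    """Model shift.n: move every start offset n bytes right (left if n is negative)."""
    return sorted({start + n for start in offsets})


# Hypothetical indexed "words".
words = ["apple", "banana", "cherry", "date", "1979", "1985", "1991", "2001"]
a_to_c = range_search(words, "a", "c")      # words beginning with a, b, or c
years = range_search(words, "198", "199")   # words beginning with 198 or 199

# Regions are (start, end) pairs; only their start offsets survive the shift,
# which is the "demotion" to a point set.
regions = [(100, 180), (200, 260)]
shifted = shift([start for start, _end in regions], 8)
```

Note that the region end offsets play no part in shift, which is why the result can only ever be a point set.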

Operators and Relations:

This section discusses the operators in more detail, in particular the types of sets
that result when an operator is applied to region sets, point sets, or a combination.
Some of the extended forms of the operators are also discussed.
A ^ B
the "and" or "intersection" operator: A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have the exact same start offsets. It makes little sense to "and" together two point sets, unless those two point sets contain heterogeneous members. "and"-ing two region sets can be spectacularly useful. "and"-ing together a point set and a region set has the interesting property that the result seems to inherit "regionness".
A - B
the "minus" or "difference" operator: A and B are both sets, and the resulting set is those members of A that do not share a start offset with any member of B. This behaves much the same as the ^ operator in terms of how appropriate it is to points or regions.
A + B + C + ...
the "or" or "union" operator: A, B, C... are sets. If at least one of the sets being combined is a point set, the resulting set is a point set consisting of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) will be removed from the resultant set. If we were to + together all the P and B regions in an HTML database, for instance, all the B elements nested inside P elements would be removed, leaving just the P's and the B's not in a P. I question the utility of this behavior, and the search strategies the SSP platform takes will later be seen to avoid it at all costs.
A incl B

A incl.# B
A not incl B
A is a region set and B is either a point or region set. The result is a region set of all members of A that contain at least one member of B, where containment means that a given B has a start offset within the inclusive range of a given A's start and end offsets. The alternative form takes a number # giving the minimum number of B's that an A must contain. incl can take the not operator to return all A's that don't have any B's (without a # argument), or all A's that don't have # or more B's.
A within B

A not within B
In many ways the complement to incl: A is a point or region set, B is a region set, and the resulting set is all members of A that are contained (by the start offset rule as under incl) in any B. This also takes the not operator to return all A's that are not within any B.
A near B

A near.# B
A not near B
A not near.# B
A and B are either points or regions, and # is either explicitly stated, or taken from the {proximity} setting (see the discussion of {settings} below). The result is all A's whose start offsets are within the specified number of bytes of the start offset of any B. The not form returns all A's whose start offsets are not within the specified number of bytes from the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B

A fby.# B
A not fby B
A not fby.# B
This works just like the near operator, except that an A must be followed within the specified number of bytes by a B to be in the result set. This also takes the not operator.
not
This reverses the sense of the expression it modifies, usable with incl, within, near, and fby.
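The start-offset rules for incl, within, near, and fby described above can be sketched in Python. This is a model of the descriptions in this section under stated assumptions, not pat50's implementation: a point is represented as an integer offset, a region as a (start, end) pair, all function names and sample offsets are hypothetical, and fby is assumed to require the B to start strictly after the A.

```python
def start_of(member):
    """A point is an int offset; a region is a (start, end) pair."""
    return member[0] if isinstance(member, tuple) else member


def incl(a_regions, b, at_least=1, negate=False):
    """A incl.# B: regions of A holding at least `at_least` B start offsets."""
    result = []
    for start, end in a_regions:
        count = sum(1 for m in b if start <= start_of(m) <= end)
        if (count >= at_least) != negate:
            result.append((start, end))
    return result


def within(a, b_regions, negate=False):
    """A within B: members of A whose start offset lies inside some B region."""
    def inside(m):
        return any(s <= start_of(m) <= e for s, e in b_regions)
    return [m for m in a if inside(m) != negate]


def near(a, b, distance, negate=False):
    """A near.# B: members of A starting within `distance` bytes of a B, either side."""
    def close(m):
        return any(abs(start_of(x) - start_of(m)) <= distance for x in b)
    return [m for m in a if close(m) != negate]


def fby(a, b, distance, negate=False):
    """A fby.# B: like near, but the B must start after the A (an assumption)."""
    def followed(m):
        return any(0 < start_of(x) - start_of(m) <= distance for x in b)
    return [m for m in a if followed(m) != negate]


# Hypothetical offsets: three stanza regions and two point sets.
stanzas = [(0, 99), (100, 199), (200, 299)]
apple = [10, 150]
pie = [20, 120, 250]

apple_then_pie = fby(apple, pie, 20)              # point set
stanzas_with_both = incl(stanzas, apple_then_pie)  # region set
pies_in_apple_stanzas = within(pie, incl(stanzas, apple))  # point set
```

The result always keeps the type of A: incl returns regions, while within, near, and fby return whatever kind of member A held.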

Using the Operators to Make Sets of Interest

Now that we have our basic concepts and operators, let's get in there and do some searching and document analysis: the process by which we figure out what is there and how the DTD was applied to this SGML, and what of it we can use. When developing an online system, this is the most important step. Some important commands will be introduced in this experimentation.
NOTE: All the search and set identification strategies from here on out are heavily influenced by what SSP content-specialists and developers have cared about over the years. I will not claim that this is the only way to go about searching or retrieving text.
What a set of interest is is entirely up to the user, and the notion of user ranges from developers to content specialists to the patrons floating around out there. I'm going to walk through some increasingly complicated possible sets of interest here. This will all come back to haunt us when we make fabricated regions.

I'm ignoring in this section the matter of displaying retrieved results. That comes in a later section.

find all the words that start with "diff", and find all the words that are "different" exactly
>> "diff" 
  8: 9 matches
>> "different "
  9: 3 matches
>>
find all the places where "pie" follows "apple" within 20 bytes
>> "apple " fby.20 "pie "
  1: 2 matches
>>
find all the single lines/stanzas/poems where "pie" follows "apple" within 20 characters. Here we need to have spent some time with the DTD documentation to know that lines are L elements, stanzas are STANZA elements, and poems themselves are POEM elements in the SGML.
>> region L incl ("apple " fby.20 "pie ")
  2: 2 matches
>> region STANZA incl ("apple " fby.20 "pie ")
  3: 2 matches
>> region POEM incl ("apple " fby.20 "pie ")
  4: one match
>>
find all the places where "orange" appears at the end of a line of poetry
>> "orange</l>"
  5: no match
>> "orange " fby.20 "</l>"
  6: 17 matches
>>
find all stanzas with at least 6 lines
>> region STANZA incl.6 region L
  8: 5801 matches
>>
find all stanzas with exactly 6 lines
>> region STANZA incl.6 region L
  8: 5801 matches
>> region STANZA incl.7 region L
  9: 5297 matches
>> 8 - 9
  10: 504 matches
>>
[cruising for sonnets?]
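The "at least 6, minus at least 7" trick above works because incl.# only expresses a minimum. A small Python model makes the arithmetic visible. The stanza layout and the names incl_at_least, minus, and make_stanza are hypothetical, and the model follows the start-offset rules as described in this document rather than pat50 itself.

```python
def incl_at_least(outer, inner, n):
    """Model `outer incl.n inner`: keep outer regions with >= n inner start offsets."""
    return [
        (start, end)
        for start, end in outer
        if sum(1 for inner_start, _ in inner if start <= inner_start <= end) >= n
    ]


def minus(a, b):
    """Model `a - b`: drop members of a that share a start offset with a member of b."""
    b_starts = {start for start, _ in b}
    return [region for region in a if region[0] not in b_starts]


def make_stanza(start, line_count, line_len=10):
    """Build one hypothetical stanza region and its evenly spaced line regions."""
    lines = [
        (start + 1 + i * line_len, start + (i + 1) * line_len)
        for i in range(line_count)
    ]
    return (start, start + line_count * line_len + 1), lines


stanzas, lines = [], []
for start, count in [(0, 6), (100, 4), (200, 8)]:  # stanzas of 6, 4, and 8 lines
    stanza, stanza_lines = make_stanza(start, count)
    stanzas.append(stanza)
    lines.extend(stanza_lines)

six_or_more = incl_at_least(stanzas, lines, 6)
seven_or_more = incl_at_least(stanzas, lines, 7)
exactly_six = minus(six_or_more, seven_or_more)
```

Only the six-line stanza survives the subtraction, mirroring how set 10 is derived from sets 8 and 9 in the transcript.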
find six line stanzas with words starting with "the" in them
>> ((region STANZA incl.6 region L) - (region STANZA incl.7 region L)) incl "the "
  14: 429 matches
>>
[a region set of STANZA regions]
find all the words starting with "the" that occur in six-line stanzas
>> "the " within ((region STANZA incl.6 region L) - (region STANZA incl.7 region L))
  15: 1187 matches
>>
[a point set of points where "the" is]
find all the poems that are classified as being written or published between 1801 and 1850. Here we need to have spent time with the DTD documentation to know that the CBEL element contains information about publication and authorship, and that a given CBEL element lives in a POETGRP element, which will contain all the poems by a single poet. We also need to know what will show up in the CBEL element (this last bit skips ahead to displaying sets).
>> region POEM within (region POETGRP incl (region CBEL incl "1801-1850"))
  19: 675 matches
>> region POEM within (region POETGRP incl "<CBEL>1801-1850")
  20: 675 matches
>>
[are we absolutely certain that CBEL doesn't ever take attributes?!]
roughly how many pages of text are there in this database? For Chadwyck-Healeys, often an estimate can be obtained from the number of PB (page break) elements they use
>> region PB
  22: 5650 matches
>>
[is this the only way they mark pages?]
what typographical renderings are used in this transcription and representation of the original text? Here we need to have found out from the DTD documentation that the R attribute on a lot of different elements holds rendering information.
>> region "A-R"
  31: 155536 matches
>>
see below for printing this out and doing something intelligible with those thousands of items
We want to search in poems by women for instances of "mother". Here, we need to find out how (IF!) gender is attached to poems or poets (in the example here from DAAP, POEM has a required GENDER attribute, which takes "male" or "female" as a value).
>> "mother " within (region POEM incl (region "A-GENDER" incl "female"))
  36: 266 matches
>> "mother " within (region POEM incl ("gender=female"))
  37: 266 matches
>>
[why does 37 work? what kind of set results here?]
We want to find stanzas that follow stanzas that use the word "pie"


This is a trick question: there is no way, using the commands and SGML region relationships given, to express this relationship within pat50. Siblinghood and immediate parent-child relationships cannot be conclusively established in pat50. One needs to ask some questions of pat50, massage the answers, and then feed them back in as a second layer of questions to get at these kinds of relationships.

One way would be to try something like this:

>> "</stanza" within (region STANZA incl "pie ")
  40: 99 matches
>> 

## print out that point set, add the length of the text
## "/STANZA>" (8) to each offset, and submit
## those back to OT as searches such as (assuming that 26644 was a
## result in the above search set 40):

>> region STANZA ^ [26652]
  41: one match
>>
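The "massage the answers" step in this first approach is simple enough to script. The sketch below assumes the offsets of the point set have already been printed out and collected into a Python list (how that printing is done is not covered in this section). The function name follow_up_queries is hypothetical, and the value 8 is the tag length given in the comments above.

```python
def follow_up_queries(offsets, tag_length=8, region="STANZA"):
    """Turn start offsets from the first pass into second-pass query strings."""
    return [
        f"region {region} ^ [{offset + tag_length}]"
        for offset in sorted(set(offsets))
    ]


# 26644 is the example offset from the text above.
queries = follow_up_queries([26644])
for query in queries:
    print(query)  # region STANZA ^ [26652]
```

Each generated line would then be submitted as its own search, one per offset in the first result set.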
Another way might be something like this:
>> "</stanza" within (region STANZA incl "pie ")
  42: 99 matches
>> shift.8 42
  43: 99 matches
>> region STANZA ^ 43     
  44: 81 matches
>>
Wait just a minute! I thought we said this couldn't be done within pat50! Doesn't this give us the exact same thing as the first suggestion: all STANZA's that follow a STANZA with "pie*"? Yes, it should give us the same thing. I'm going to stick by my claim that there is no way to do this within pat50 because the last stunt above (and even the first stunt, now that I mention it) depends utterly on:

A better strategy might be to modify the first suggestion to:
("</stanza" within (region STANZA incl "pie ")) fby.15 "<STANZA"
And home in on the following stanzas that way. Or at least hope that 15 is a good number.
You might note that I almost always include the " " at the end of a search term. SSP collections as distributed do this. One notable DLPS exception is Making of America, which puts the application of the trailing * in the hands of the user (and don't doubt for a moment that we have a validity check in MoA that zaps *'s that aren't at the end of an input string...). Note that MoA is implemented using OT60, which handles these things a bit differently. Other collections using pat50 simply search for, say, "pie", and the user should expect to get "piedmont" and "piers", etc.

The more complicated these get, the more we want to be able to look at our end results, or look at the building blocks on which complicated results are built.