Query Language Details

Query Language Details (Extra)

This is some additional, extremely detailed information not covered in the 'details' section.

Special commands that generate point sets:

This is a range-searching command: it finds all "words" that begin with letters ranging from a to c in the first example, or all words that begin with numbers ranging from 198 to 199 in the second example.
shift.# A: Makes a new point set consisting of all the points that are # bytes to the right of the start byte offsets of the members of set A (a negative number shifts to the left). A can be a region set or a point set; the result is always demoted to a point set.
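The two point-set commands above can be modeled in a few lines of Python. This is only a sketch of the behavior as described here, not PAT itself: the function names range_search and shift, the whitespace-based definition of a "word", and the sample text are all assumptions made for illustration, and the text is plain ASCII so that character offsets equal byte offsets.

```python
import re
from typing import List, Tuple, Union

Region = Tuple[int, int]  # (start, end) byte offsets, both inclusive

def range_search(text: str, low: str, high: str) -> List[int]:
    """Return start offsets of words whose leading characters fall in low..high."""
    points = []
    for match in re.finditer(r"\S+", text):  # assumption: a word is a run of non-blanks
        word = match.group()
        if low <= word[:len(low)] and word[:len(high)] <= high:
            points.append(match.start())
    return points

def shift(members: List[Union[int, Region]], n: int) -> List[int]:
    """Move every start offset n bytes right (negative n: left); result is a point set."""
    starts = [m[0] if isinstance(m, tuple) else m for m in members]
    return sorted({s + n for s in starts})

text = "apple 1985 dog cat 1990 2001 banana"  # hypothetical sample text
print(range_search(text, "a", "c"))      # offsets of apple, cat, banana
print(range_search(text, "198", "199"))  # offsets of 1985, 1990
print(shift([(6, 9), (19, 22)], 2))      # regions in, points out
```

Note that shift returns bare offsets even when it is handed regions, which mirrors the "always demoted to a point set" rule.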

Operators and Relations:

This section discusses the operators in more detail, focusing on the types of sets that result
when an operator is applied to region sets, point sets, or a combination of the two. Some of
the extended forms of the operators are also discussed.

A ^ B: the "and" or "intersection" operator. A and B are two sets, or expressions that evaluate to sets, and the resulting set includes those points or regions in both A and B that have exactly the same start offsets. It makes little sense to "and" together two point sets unless those two point sets contain heterogeneous members. "And"-ing two region sets can be spectacularly useful. "And"-ing a point set with a region set has the interesting property that the result seems to inherit "regionness".
A - B: the "minus" or "difference" operator. A and B are both sets; the resulting set is those members of A that do not share a start offset with any member of B. It behaves much like the ^ operator in terms of how appropriate it is for points or regions.
A + B + C + ...: the "or" or "union" operator. A, B, C... are sets. If at least one of the sets being combined is a point set, the result is a point set consisting of the start offsets of all the points or regions in the original sets. If all the sets being combined are region sets, then regions that nest inside other listed regions (either entirely or at their start byte offset) are removed from the resulting set. If we were to + together all the P and B regions in an HTML database, for instance, all the B elements nested inside P elements would be removed, leaving just the P's and the B's not in a P. I question the utility of this behavior, and the search strategies the SSP platform uses will later be seen to avoid it at all costs.
A incl B: A is a region set and B is either a point set or a region set. The result is a region set of all members of A that contain at least one member of B, where containment means that a given B has a start offset within the inclusive range of a given A's start and end offsets. An alternative form takes a number #, the minimum number of B's that an A must contain. incl can take the not operator to return all A's that don't have any B's (without a # argument), or all A's that don't have # or more B's.
A within B: In many ways the complement to incl. A is a point or region set and B is a region set; the resulting set is all members of A that are contained (by the start offset rule described under incl) in any B. This also takes the not operator to return all A's that are not within any B.
A near B: A and B are either points or regions, and the number of bytes # is either explicitly stated or taken from the {proximity} setting (see the discussion of {settings} below). The result is all A's whose start offsets are within the specified number of bytes of the start offset of any B. The not form returns all A's whose start offsets are not within the specified number of bytes of the start offset of any B. The nearest B might be earlier or later in the source file.
A fby B: This works just like the near operator, except that an A must be followed within the specified number of bytes by a B to be in the result set. This also takes the not operator.
not: This reverses the sense of the expression it modifies, usable with incl, within, near, and fby.
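To make the point/region behavior of these operators concrete, here is a small Python model of the rules described above. It is a sketch under stated assumptions, not PAT's implementation: the Member type, the function names, and the negate flag standing in for "not" are all invented for illustration. Two details are guesses where the description is silent: intersect keeps the region when a point and a region share a start offset (one reading of "inherits regionness"), and fby requires B to start strictly after A.

```python
from typing import NamedTuple, Optional

class Member(NamedTuple):
    start: int
    end: Optional[int] = None  # None marks a point; an int is a region's inclusive end

def contains(a, b):
    """Start-offset rule: b starts inside a's inclusive start..end range."""
    return a.end is not None and a.start <= b.start <= a.end

def intersect(a_set, b_set):  # A ^ B
    b_by_start = {b.start: b for b in b_set}
    result = []
    for a in a_set:
        b = b_by_start.get(a.start)
        if b is not None:
            # Assumption: prefer the region, so a point/region mix yields regions.
            result.append(a if a.end is not None else b)
    return result

def minus(a_set, b_set):  # A - B
    b_starts = {b.start for b in b_set}
    return [a for a in a_set if a.start not in b_starts]

def union(*sets):  # A + B + C + ...
    members = [m for s in sets for m in s]
    if any(m.end is None for m in members):
        # Any point operand demotes the whole result to start offsets.
        return [Member(s) for s in sorted({m.start for m in members})]
    regions = sorted(set(members))
    def nested(m):
        # m is dropped if a different region that starts earlier (or starts at
        # the same offset and is longer) covers m's start offset.
        return any((o.start < m.start or (o.start == m.start and o.end > m.end))
                   and contains(o, m) for o in regions)
    return [m for m in regions if not nested(m)]

def incl(a_set, b_set, at_least=1, negate=False):  # A incl B
    result = []
    for a in a_set:
        count = sum(1 for b in b_set if contains(a, b))
        if (count >= at_least) != negate:
            result.append(a)
    return result

def within(a_set, b_set, negate=False):  # A within B
    return [a for a in a_set if any(contains(b, a) for b in b_set) != negate]

def near(a_set, b_set, n, negate=False):  # A near B, n bytes either side
    return [a for a in a_set
            if any(abs(a.start - b.start) <= n for b in b_set) != negate]

def fby(a_set, b_set, n, negate=False):  # A fby B, B at most n bytes later
    return [a for a in a_set
            if any(0 < b.start - a.start <= n for b in b_set) != negate]

# Hypothetical offsets: two P regions, two B regions (one nested in a P), some words.
P = [Member(0, 50), Member(60, 90)]
B = [Member(10, 20), Member(100, 110)]
words = [Member(12), Member(15), Member(70), Member(95)]

print(union(P, B))      # the B nested in the first P is dropped
print(union(P, words))  # mixed operands: start offsets only
print(incl(P, words, at_least=2))
print(within(words, P, negate=True))
```

The union(P, B) call reproduces the P and B example from the text: the B region starting at offset 10 sits inside the first P and disappears from the result.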

Using the Operators to Make Sets of Interest

Now that we have our basic concepts and operators, let's get in there and do some searching and document analysis: the process by which we figure out what is there and how the DTD was applied to this SGML, and what of it we can use. When developing an online system, this is the most important step. Some important commands will be introduced in this experimentation.

NOTE: All the search and set identification strategies from here on out are heavily influenced by what SSP content-specialists and developers have cared about over the years. I will not claim that this is the only way to go about searching or retrieving text.

What counts as a set of interest is entirely up to the user, and the notion of "user" ranges from developers to content specialists to the patrons floating around out there. I'm going to walk through some increasingly complicated possible sets of interest here. This will all come back to haunt us when we make fabricated regions.

I'm ignoring in this section the matter of displaying retrieved results. That comes in a later section.

find all the words that start with "diff", and find all the words that are "different" exactly
find all the places where "pie" follows "apple" within 20 bytes
find all the single lines/stanzas/poems where "pie" follows "apple" within 20 characters. Here we need to have spent some time with the DTD documentation to know that lines are L elements, stanzas are STANZA elements, and poems themselves are POEM elements in the SGML.
find all the places where "orange" appears at the end of a line of poetry
find all stanzas with at least 6 lines
find all stanzas with exactly 6 lines
find six line stanzas with words starting with "the" in them
find all the words starting with "the" that occur in six-line stanzas
find all the poems that are classified as being written or published between 1801 and 1850. Here we need to have spent time with the DTD documentation to know that the CBEL element contains information about publication and authorship, and that a given CBEL element lives in a POETGRP element, which will contain all the poems by a single poet. We also need to know what will show up in the CBEL element (this last bit skips ahead to displaying sets).
roughly how many pages of text are there in this database? For Chadwyck-Healey collections, an estimate can often be obtained from the number of PB (page break) elements they use.
what typographical renderings are used in this transcription and representation of the original text? Here we need to have found out from the DTD documentation that the R attribute on a lot of different elements holds rendering information.
We want to search in poems by women for instances of "mother". Here, we need to find out how (IF!) gender is attached to poems or poets (in the example here from DAAP, POEM has a required GENDER attribute, which takes "male" or "female" as a value).
We want to find stanzas that follow stanzas that use the word "pie"
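Several of the sets listed above are compositions of the operators already described. The Python sketch below models four of the stanza questions. It is only an illustration: the helper names, the offsets, and in particular the way "exactly 6 lines" is built (at least 6, minus at least 7) are assumptions, not a record of how SSP phrases these searches. Regions are (start, end) tuples with inclusive offsets, and points are plain start offsets.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end), inclusive byte offsets

def incl(a_regions: List[Region], b_starts: List[int], at_least: int = 1) -> List[Region]:
    """Regions of A holding at least `at_least` members of B (start-offset rule)."""
    return [a for a in a_regions
            if sum(a[0] <= s <= a[1] for s in b_starts) >= at_least]

def minus(a_regions: List[Region], b_regions: List[Region]) -> List[Region]:
    """Members of A that share no start offset with a member of B."""
    b_starts = {b[0] for b in b_regions}
    return [a for a in a_regions if a[0] not in b_starts]

def within(a_starts: List[int], b_regions: List[Region]) -> List[int]:
    """Members of A whose start offset falls inside some member of B."""
    return [s for s in a_starts if any(b[0] <= s <= b[1] for b in b_regions)]

# Hypothetical offsets: three STANZA regions, the start offsets of their L
# elements, and the start offsets of words beginning with "the".
stanzas = [(0, 99), (100, 199), (200, 299)]
lines = [1, 11, 21, 31, 41, 51,              # six lines in the first stanza
         101, 111, 121, 131, 141, 151, 161,  # seven lines in the second
         201, 211]                           # two lines in the third
the_words = [15, 120, 205]

at_least_six = incl(stanzas, lines, 6)
exactly_six = minus(at_least_six, incl(stanzas, lines, 7))
six_line_stanzas_with_the = incl(exactly_six, the_words)  # result: stanzas
the_in_six_line_stanzas = within(the_words, exactly_six)  # result: words

print(at_least_six, exactly_six, six_line_stanzas_with_the, the_in_six_line_stanzas)
```

The last two lines show why the "six-line stanzas with 'the'" and "'the' in six-line stanzas" questions are listed separately: the same two sets are involved, but which side of the relation comes back depends on whether incl or within is used.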

You might note that I almost always put a trailing space (" ") at the end of a search term. SSP collections as distributed do this. One notable DLPS exception is Making of America (MoA), which puts the application of the trailing * in the hands of the user (and don't doubt for a moment that we have a validity check in MoA that zaps *'s that aren't at the end of an input string...). Note that MoA is implemented using OT60, which handles these things a bit differently. Other collections using pat50 simply search for, say, "pie", and the user should expect to get "piedmont" and "piers", etc.
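The difference between searching for "pie" and for "pie" with a trailing space can be sketched in Python. This assumes that the term is matched as a prefix of the text at each word start, and that words are separated by single blanks; the function name prefix_search and the sample text are hypothetical, and the real engine's handling of punctuation and character mapping is ignored.

```python
import re
from typing import List

def prefix_search(text: str, term: str) -> List[int]:
    """Return the word-start offsets at which the text continues with `term`."""
    word_starts = [m.start() for m in re.finditer(r"\S+", text)]
    return [s for s in word_starts if text.startswith(term, s)]

text = "apple pie from piedmont by the piers "  # hypothetical sample text

print(prefix_search(text, "pie"))   # matches pie, piedmont, piers
print(prefix_search(text, "pie "))  # the trailing space keeps only the exact word
```

In this simplified model a word at the very end of the text with no blank after it would not match the trailing-space form, which is why the sample text ends with a space.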

The more complicated these get, the more we want to be able to look at our end results, or look at the building blocks on which complicated results are built.