Query Language Details (Extra)
This is some additional, extremely detailed information not covered in
the 'details' section.
Special commands that generate point sets:
"a".."c"
"198"~"199"
-
This is a range search command. It finds all "words" that begin with letters
ranging from a to c in the first example, or all words that begin with
the numbers ranging from 198 to 199 in the second example.
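One way to picture the range search is the rough Python sketch below. It is not how pat50 does it internally; the function name range_search and the word list are made up for illustration, and treating both bounds as inclusive prefixes is my assumption.

```python
def range_search(words, low, high):
    """Return words whose leading characters sort between low and high.

    Hypothetical helper: only as many leading characters as each bound has
    are compared, so "cat" still counts as beginning with "c".
    """
    return [w for w in words
            if w[:len(low)] >= low and w[:len(high)] <= high]


# Made-up word list standing in for the indexed words of a text.
words = ["apple", "banana", "cat", "dog", "1985", "1990", "2001"]
print(range_search(words, "a", "c"))      # ['apple', 'banana', 'cat']
print(range_search(words, "198", "199"))  # ['1985', '1990']
```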
-
shift.# A
-
Make a new point set consisting of all the points that are # bytes to the
right of all the start byte offsets of set A (a negative number shifts
to the left). A can be a region or point set; the result set is always
demoted to a point set.
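As a minimal sketch of that behavior, assuming set members are modelled as (start, end) pairs and a point set as a list of plain offsets (both are my modelling choices, not pat50's), shift might look like this in Python:

```python
def shift(members, n):
    """Hypothetical model of shift.#: members are (start, end) pairs.

    Only the start offset is used, and the result is always a sorted point
    set (plain offsets), even when the input held regions.
    """
    return sorted({start + n for start, _end in members})


regions = [(100, 180), (200, 260)]   # made-up region offsets
print(shift(regions, 8))    # [108, 208]
print(shift(regions, -5))   # [95, 195]
```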
Operators and Relations:
This section discusses the operators in more detail, concentrating on the
types of sets that result when an operator is applied to region sets, point
sets, or a combination. In addition, some of the extended forms of the
operators are discussed.
-
A ^ B
-
the "and" or "intersection" operator: A and B are two sets, or expressions
that evaluate to sets, and the resulting set includes those points or regions
in both A and B that have the exact same start offsets. It makes little
sense to "and" together two point sets, unless those two point sets contain
heterogeneous members. "and"-ing on two region sets can be spectacularly
useful. "and"-ing together a point and a region set has the interesting
property that it seems to inherit "regionness".
-
A - B
-
the "minus" or "difference" operator. A and B are both sets, the resulting
set is those members of A that do not share the same start offset with
any member of B. This behaves much the same as the ^ operator in terms
of how appropriate it is to points or regions.
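Both ^ and - compare start offsets only, which a small Python model makes concrete. This is a sketch under assumptions: members are (start, end) pairs, a point is written (offset, offset), and the function names intersect and minus are hypothetical. Keeping the region when a point meets a region follows the "regionness" remark above; which member survives when two regions share a start is a guess.

```python
def intersect(a, b):
    """Hypothetical model of A ^ B: keep members that share a start offset."""
    b_by_start = {start: (start, end) for start, end in b}
    result = []
    for start, end in a:
        if start in b_by_start:
            is_point = (end == start)
            # A point matched against a region yields the region.
            result.append(b_by_start[start] if is_point else (start, end))
    return sorted(result)


def minus(a, b):
    """Hypothetical model of A - B: drop members of A whose start is in B."""
    b_starts = {start for start, _end in b}
    return [(start, end) for start, end in a if start not in b_starts]


stanzas = [(100, 180), (200, 260), (300, 390)]  # made-up region set
points = [(200, 200), (305, 305)]               # made-up point set
print(intersect(stanzas, points))  # [(200, 260)]
print(intersect(points, stanzas))  # [(200, 260)]
print(minus(stanzas, points))      # [(100, 180), (300, 390)]
```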
-
A + B + C + ...
-
the "or" or "union" operator: A, B, C... are sets, the resulting set is
a point set if at least one of the sets being combined is a point set,
consisting of the start offsets of all the points or regions in the original
sets. If all the sets being combined are region sets, then regions
that nest inside other listed regions (either entirely or at their start
byte offset) will be removed from the resultant set. If we were to + together
all the P and B regions in an HTML database, for instance, all the B elements
nested inside P elements would be removed, leaving just the P's and B's
not in P. I question the utility of this behavior, and the search strategies
the SSP platform uses will later be seen to avoid it at all costs.
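The two outcomes of + (demotion to points, or removal of nested regions) can be sketched in Python as below. The model is an assumption on my part: members are (start, end) pairs, a point is (offset, offset), a "point set" is any input made up only of points, and the function name union is hypothetical.

```python
def union(*sets):
    """Hypothetical model of A + B + C on (start, end) pairs."""
    members = [m for s in sets for m in s]
    any_point_set = any(s and all(start == end for start, end in s)
                        for s in sets)
    if any_point_set:
        # One point set among the inputs demotes the result to start offsets.
        return sorted({start for start, _end in members})
    kept = set()
    for start, end in members:
        # Nested: the start falls inside another region that begins earlier
        # or, with the same start, ends later.
        nested = any(o_start <= start <= o_end
                     and (o_start < start or o_end > end)
                     for o_start, o_end in members)
        if not nested:
            kept.add((start, end))
    return sorted(kept)


p_regions = [(0, 100), (200, 300)]    # made-up P elements
b_regions = [(10, 20), (150, 160)]    # made-up B elements, one inside a P
print(union(p_regions, b_regions))    # [(0, 100), (150, 160), (200, 300)]
print(union(p_regions, [(5, 5)]))     # [0, 5, 200]
```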
-
A incl B
A incl.# B
A not incl B
-
A is a region set and B is either a point or region set. The result is a
region set of all members of A that contain at least one member of B,
containment meaning that a given B has a start offset within the inclusive
range of a given A's start and end offsets. The alternative form, incl.#,
sets the minimum number # of B's that an A must contain. incl can take
the not operator to return all A's that don't contain any B's (without a #
argument), or all A's that don't contain # or more B's.
-
A within B
A not within B
-
In many ways the complement to incl: A is a point or region set, B is a
region set, the resulting set is all members of A that are contained (by
the start offset rule as under incl) in any B. This also takes the not
operator to return all A's that are not within any B.
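The start offset rule shared by incl and within can be sketched in Python as follows. The (start, end) pair model, the negate flag standing in for the not operator, and the function names are all assumptions made for illustration.

```python
def contains(region, member):
    """True when the member's start offset lies in the region, inclusive."""
    start, end = region
    return start <= member[0] <= end


def incl(a, b, n=1, negate=False):
    """Hypothetical model of A incl.# B (negate=True for A not incl.# B)."""
    return [r for r in a
            if (sum(contains(r, m) for m in b) >= n) != negate]


def within(a, b, negate=False):
    """Hypothetical model of A within B (negate=True for A not within B)."""
    return [m for m in a
            if any(contains(r, m) for r in b) != negate]


stanzas = [(0, 99), (100, 199)]                     # made-up regions
lines = [(0, 9), (10, 19), (20, 29), (100, 109)]    # made-up regions
hits = [(15, 15), (250, 250)]                       # made-up points
print(incl(stanzas, lines, 3))              # [(0, 99)]
print(incl(stanzas, hits, negate=True))     # [(100, 199)]
print(within(hits, stanzas))                # [(15, 15)]
print(within(hits, stanzas, negate=True))   # [(250, 250)]
```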
-
A near B
A near.# B
A not near B
A not near.# B
-
A and B are either point or region sets, and # is either explicitly stated
or taken from the {proximity} setting (see the discussion of {settings}
below). The result is all A's whose start offsets are within the specified
number of bytes of the start offset of any B. The not form returns all
A's whose start offsets are not within the specified number of bytes from
the start offset of any B. The nearest B might be earlier or later in the
source file.
-
A fby B
A fby.# B
A not fby B
A not fby.# B
-
This works just like the near operator, except that an A must be followed
within the specified number of bytes by a B to be in the result set. This
also takes the not operator.
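The difference between near and fby comes down to the direction of the distance test. Here is a hedged Python sketch, with members again modelled as (start, end) pairs and the function names, the negate flag, and the inclusive distance bound all being my assumptions:

```python
def near(a, b, distance, negate=False):
    """Members of a whose start is within distance bytes of any start in b."""
    b_starts = [start for start, _end in b]
    return [(s, e) for s, e in a
            if any(abs(s - t) <= distance for t in b_starts) != negate]


def fby(a, b, distance, negate=False):
    """Like near, but the member of b must start after the member of a."""
    b_starts = [start for start, _end in b]
    return [(s, e) for s, e in a
            if any(0 < t - s <= distance for t in b_starts) != negate]


apples = [(10, 10), (500, 500)]   # made-up offsets of "apple "
pies = [(25, 25), (490, 490)]     # made-up offsets of "pie "
print(near(apples, pies, 20))               # [(10, 10), (500, 500)]
print(fby(apples, pies, 20))                # [(10, 10)]
print(fby(apples, pies, 20, negate=True))   # [(500, 500)]
```

The second "apple " has a "pie " ten bytes before it, so it is near a "pie " but not followed by one.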
-
not
-
This reverses the sense of the expression it modifies, usable with incl,
within, near, and fby.
Using the Operators to Make Sets of Interest
Now that we have our basic concepts and operators, let's get in there and
do some searching and document analysis: the process by which we
figure out what is there and how the DTD was applied to this SGML,
and what of it we can use. When developing an online system, this
is the most important step. Some important commands will be introduced
in this experimentation.
NOTE: All the search and set identification strategies
from here on out are heavily influenced by what SSP content-specialists
and developers have cared about over the years. I will not claim that this
is the only way to go about searching or retrieving text.
What counts as a set of interest is entirely up to the user, and the notion
of user ranges from developers to content specialists to the patrons floating
around out there. I'm going to walk through some increasingly complicated
possible sets of interest here. This will all come back to haunt us when
we make fabricated regions.
I'm ignoring in this section the matter of displaying retrieved results.
That comes in a later section.
-
find all the words that start with "diff", and find all the words that
are "different" exactly
>> "diff"
8: 9 matches
>> "different "
9: 3 matches
>>
-
find all the places where "pie" follows "apple" within 20 bytes
>> "apple " fby.20 "pie "
1: 2 matches
>>
-
find all the single lines/ stanzas/ poems where "pie" follows "apple"
within 20 characters. Here we need to have spent some time with the
DTD documentation to know that lines are L elements, stanzas are STANZA
elements, and that poems themselves are POEM elements in the SGML.
>> region L incl ("apple " fby.20 "pie ")
2: 2 matches
>> region STANZA incl ("apple " fby.20 "pie ")
3: 2 matches
>> region POEM incl ("apple " fby.20 "pie ")
4: one match
>>
-
find all the places where "orange" appears at the end of a line of poetry
>> "orange</l>"
5: no match
>> "orange " fby.20 "</l>"
6: 17 matches
>>
-
find all stanzas with at least 6 lines
>> region STANZA incl.6 region L
8: 5801 matches
>>
-
find all stanzas with exactly 6 lines
>> region STANZA incl.6 region L
8: 5801 matches
>> region STANZA incl.7 region L
9: 5297 matches
>> 8 - 9
10: 504 matches
>>
[cruising for sonnets?]
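The "at least 6, minus at least 7" trick is just arithmetic on counts, which a toy Python model can confirm. The data and the helper name incl_n are made up; this only illustrates the reasoning, not the numbers reported above.

```python
def incl_n(regions, members, n):
    """Regions holding at least n member start offsets (hypothetical model)."""
    return [(s, e) for s, e in regions
            if sum(1 for m in members if s <= m <= e) >= n]


# Made-up data: three stanzas holding 5, 6, and 7 line start offsets.
stanzas = [(0, 99), (100, 199), (200, 299)]
lines = [0, 10, 20, 30, 40,
         100, 110, 120, 130, 140, 150,
         200, 210, 220, 230, 240, 250, 260]

at_least_6 = incl_n(stanzas, lines, 6)
at_least_7 = incl_n(stanzas, lines, 7)
exactly_6 = [r for r in at_least_6 if r not in at_least_7]
print(exactly_6)  # [(100, 199)]
```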
-
find six line stanzas with words starting with "the" in them
>> ((region STANZA incl.6 region L) - (region STANZA incl.7 region L)) incl "the "
14: 429 matches
>>
[a region set of STANZA regions]
-
find all the words starting with "the" that occur in six-line stanzas
>> "the " within ((region STANZA incl.6 region L) - (region STANZA incl.7 region L))
15: 1187 matches
>>
[a point set of points where "the" is]
-
all the poems that are classified as being written or published between
1801 and 1850. Here we need to have spent time with the DTD documentation
to know that the CBEL element contains information about publication and
authorship, and that a given CBEL element lives in a POETGRP element, which
will contain all the poems by a single poet. We also need to know what
will show up in the CBEL element (this last bit skips ahead to displaying
sets).
>> region POEM within (region POETGRP incl (region CBEL incl "1801-1850"))
19: 675 matches
>> region POEM within (region POETGRP incl "<CBEL>1801-1850")
20: 675 matches
>>
[are we absolutely certain that CBEL doesn't ever take attributes?!]
-
roughly how many pages of text are there in this database? For Chadwyck-Healey
databases, an estimate can often be obtained from the number of PB (page
break) elements they use
>> region PB
22: 5650 matches
>>
[is this the only way they mark pages?]
-
what typographical renderings are used in this transcription and representation
of the original text? Here we need to have found out from the DTD documentation
that the R attribute on a lot of different elements holds rendering information.
>> region "A-R"
31: 155536 matches
>>
see below for printing this out and doing something intelligible with those
thousands of items
-
We want to search in poems by women for instances of "mother". Here,
we need to find out how (IF!) gender is attached to poems or poets (in
the example here from DAAP, POEM has a required GENDER attribute, which
takes "male" or "female" as a value)
>> "mother " within (region POEM incl (region "A-GENDER" incl "female"))
36: 266 matches
>> "mother " within (region POEM incl ("gender=female"))
37: 266 matches
>>
[why does 37 work? what kind of set results here?]
-
We want to find stanzas that follow stanzas that use the word "pie".
This is a trick question: there is no way, using the commands and
SGML region relationships given, to express this relationship within
pat50. Siblinghood and immediate parent-child relationships cannot
be conclusively established in pat50. One needs to ask some questions of
pat50, massage the answers, and then feed them back in as a second layer
of questions to get at these kinds of relationships.
One way would be to try something like this:
>> "</stanza" within (region STANZA incl "pie ")
40: 99 matches
>>
## print out that point set, add the length of the text
## "/STANZA>" (8) to each offset, and submit
## those back to OT as searches such as (assuming that 26644 was a
## result in the above search set 40):
>> region STANZA ^ [26652]
41: one match
>>
Another way might be something like this:
>> "</stanza" within (region STANZA incl "pie ")
42: 99 matches
>> shift.8 42
43: 99 matches
>> region STANZA ^ 43
44: 81 matches
>>
Wait just a minute! I thought we said this couldn't be done within pat50!
Doesn't this give us the exact same thing as the first suggestion: all
STANZAs that follow a STANZA with "pie*"? Yes, it should give us the same
thing. I'm going to stick by my claim that there is no way to do this within
pat50 because the last stunt above (and even the first stunt, now that
I mention it) depends utterly on:
-
the knowledge we have about the length of the close tag for STANZA
-
that there is nothing like a PI or PB or some such other intervening nonsense
sibling
A better strategy might be to modify the first suggestion to:
("</stanza" within (region STANZA incl "pie ")) fby.15 "<STANZA"
And home in on the following stanzas that way. Or at least hope that 15
is a good number.
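To see why a tolerance is more forgiving than a fixed shift, here is a rough Python sketch of the same idea run over a toy string. Everything in it is an assumption for illustration: the function name following_stanzas, the sample markup, and the use of character offsets in place of byte offsets (the two agree for plain ASCII).

```python
def following_stanzas(text, word, max_gap=15):
    """Offsets of "<STANZA" tags that open within max_gap characters after
    the "</STANZA" close tag of a stanza containing word (hypothetical)."""
    found = []
    pos = text.find("<STANZA")
    while pos != -1:
        close = text.find("</STANZA", pos)
        if close == -1:
            break
        nxt = text.find("<STANZA", close)
        if word in text[pos:close] and nxt != -1 and nxt - close <= max_gap:
            found.append(nxt)
        pos = nxt
    return found


# Toy markup: a newline separates the first pair of stanzas, and a long
# run of spaces separates the last pair.
text = ("<STANZA><L>apple pie </L></STANZA>\n"
        "<STANZA><L>plain </L></STANZA><PB>"
        "<STANZA><L>more pie </L></STANZA>" + " " * 20 +
        "<STANZA><L>far away </L></STANZA>")

print(following_stanzas(text, "pie "))              # gap of 15 misses the last one
print(following_stanzas(text, "pie ", max_gap=40))  # a wider gap catches it
```

The newline after the first stanza would defeat a fixed shift, while the tolerance absorbs it. The last stanza shows the other side of the bargain: it is only found once the gap is wide enough.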
You might note that I almost always put the " " at the end of a search
term. SSP collections as distributed do this. One notable DLPS exception
is Making of America, which puts the application of the trailing * in the
hands of the user (and don't doubt for a moment that we have a validity check
in MoA that zaps *'s that aren't at the end of an input string...). Note
that MoA is implemented using OT60, which handles these things a bit differently.
Other collections using pat50 simply search for, say, "pie", and the user
should expect to get "piedmont", "piers", etc.
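The effect of the trailing space is easy to mimic in Python. The sketch below assumes that a search matches at word starts and compares the text from that point onward against the term, ignoring case; the function name and sample sentence are made up, and pat50's real indexing may differ.

```python
import re


def word_start_matches(text, term):
    """Words whose text, from the word start onward, begins with term.

    Hypothetical model of prefix matching at word starts, case-insensitive.
    """
    hits = []
    for m in re.finditer(r"\b\w+", text):
        if text[m.start():].lower().startswith(term.lower()):
            hits.append(m.group())
    return hits


text = "A pie from Piedmont sat on the piers "
print(word_start_matches(text, "pie"))   # ['pie', 'Piedmont', 'piers']
print(word_start_matches(text, "pie "))  # ['pie']
```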
The more complicated these get, the more we want to be able to look
at our end results, or look at the building blocks on which complicated
results are built.