Creating Fabricated Regions

Throughout the documents and discussion so far, we have used the concept of fabricated regions while deliberately avoiding any in-depth explanation of it. The commands run by the extra target of the Makefile perform the steps needed to create a set of regions according to a qualification or from-here-to-there argument.

What's a Fabricated Region and Why Do We Care?

As the DLPS platform uses the phrase, a fabricated region is a span of text in an indexed document that is not exactly defined by the SGML region distinctions from sgmlrgn, but instead requires some extra decision-making to identify. Fabricated regions exist as conveniences for speeding up searches and for displaying meaningful search results.

To illustrate the search-speed convenience, consider an SGML DTD that describes books, using a DIV element to describe large divisions in the text such as chapters, prefaces, and appendices. To distinguish different DIVs from each other, the element has a TYPE attribute that takes CDATA. Suppose a given collection of books marked up with this DTD always uses a DIV TYPE="ABSTRACT" tag to surround the abstract of a book, and we know that we will often want to allow searches in the abstracts. We might then define a fabricated region for all the abstracts of the books, so that "region abstract" would be something we could simply refer to (and find out about with {ddinfo regionnames}), rather than having to compute region DIV incl "type=abstract" each time we wanted to refer to abstracts.

To illustrate the meaningful-display convenience, consider an SGML DTD that permits deep hierarchy, and a collection of documents that makes extensive use of that deep hierarchy. Just about any Chadwyck-Healey literature collection you can think of (especially the poetry databases) permits the arbitrary nesting of DIV-like elements, and makes heavy use of that ability. One of the HTI/DLPS philosophies is to provide as much useful context to the user as possible, and in the case of such a possibly deep hierarchy, one take on context is to show as much of the hierarchy as possible when displaying results. We might want search matches to be displayed along with their structural or logical parents, and some indication of just what those parents are.

To further complicate this, the DIVs that nest might have a couple of different ways that we could identify them with text meaningful to the user. We might see in the mark-up:

<DIV ID="347"><HEAD>Chapter the 9th: "Hiya Toots!"</HEAD>

But at the same time:

<DIV ID="15.a" TYPE="chapter" NUM="13">

Both carry identifying information, but in different ways. The first example has a nice HEAD element, whose contents are inscrutable to the program, but which we know represent the title of this section. The second has no HEAD element, but has TYPE and NUM attributes that supply intelligible, human-readable title information. We'd like to make a fabricated region that captures all the "title" type information for each DIV, regardless of what kind of DIV it is. (We also don't have to recompute it each time; it will be indexed like any other region.)

When analysing the documents, and thinking about what kinds of searching and retrieval we want to do on them, we might come up with these kinds of distinctions that we'll want to make often and conveniently. We then consider defining some fabricated regions.

region and Qualification

These forms of the command are the simplest; in fact, we've used them already:

region A + region B
region A - region B
region A ^ region B
(point set) - region B
(point set) ^ region B
region A - (point set)
region A ^ (point set)
region A incl (point or region set)
region A within region B
region A fby (point or region set)
region A near (point or region set)

The above will always evaluate to new region sets.

region and From-Here-To-There

The region command:

region <from> .. <to>

takes a valid point expression as the <from>, and a valid point expression as the <to>. It computes the resulting region set as all those spans of text that start at the start offset of a <from> and run up to and including the start offset of the nearest following <to>, with no other <from> intervening. Whenever a <from> has no following <to>, either because there are none further in the text or because they are all already claimed by some other <from>, the un-mated <from> does not appear in the resulting region set. None of the resulting regions will overlap. For instance, in an HTML database we might:

>> region A
  5: 50 matches
>> region BR
  6: 5 matches
>> region "A-T" .. "<BR>"
  7: 2 matches

Each of the members in 7 starts with the start tag of an A element, and runs up to the next BR element (we get at the BR element in a back-handed way...), without another A start tag intervening. Because of the way the BR and A elements fall in this HTML, it can only make two such regions.
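
The pairing rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this section, not XPat's actual implementation; the function name from_to_regions and the sample offsets are hypothetical, and points are modelled as plain integer offsets.

```python
def from_to_regions(from_offsets, to_offsets):
    """Pair each <from> with the nearest following <to>, dropping any
    <from> that is followed by another <from> before a <to> arrives."""
    # Tag each offset so both lists can be walked in document order.
    # At equal offsets the <to> sorts first (0 < 1), so a <to> never
    # closes a <from> that starts at the same offset (an assumption).
    events = sorted([(off, 1) for off in from_offsets] +
                    [(off, 0) for off in to_offsets])
    regions = []
    pending = None  # start offset of the most recent un-mated <from>
    for offset, is_from in events:
        if is_from:
            pending = offset  # a newer <from> replaces the older one
        elif pending is not None:
            regions.append((pending, offset))
            pending = None  # this <to> is now claimed
    return regions


# Hypothetical offsets: four <from> points and three <to> points.
example = from_to_regions([10, 40, 90, 120], [60, 100, 105])
print(example)
```

In this made-up data, the <from> at 10 is dropped because the <from> at 40 intervenes before the first <to>, and the <from> at 120 is dropped because no <to> follows it, which mirrors how 50 A elements and 5 BR elements can yield only a handful of regions.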

Fabricated Region Examples

I need to reserve all the abstracts for fast searching

We need to know the DTD and usage details as above.
>> region DIV incl "type=abstract"
  1: 15 matches
>> region DIV incl (region "DIV-T" incl "type=abstract")
  2: 12 matches

These examples are given in order of increasing paranoia. The difference between the first and the last is whether strings matching "type=abstract" can occur in more than one kind of place in a DIV (for example, as an attribute on a sub-element of the DIV).
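
To see why the two queries can return different counts, here is a small Python model of incl. It is a sketch under simplifying assumptions: regions are (start, end) offset pairs, a string match is a single offset, and the helper incl and all offsets are hypothetical rather than taken from XPat or from a real collection.

```python
def incl(regions, targets):
    """Keep each region that contains at least one target.
    A target is either a point (int offset) or a (start, end) region."""
    def contains(region, target):
        start, end = region
        if isinstance(target, tuple):
            return start <= target[0] and target[1] <= end
        return start <= target <= end
    return [r for r in regions if any(contains(r, t) for t in targets)]


# Hypothetical data for two DIVs:
#   DIV 0..100, whose start tag (0..25) carries TYPE="ABSTRACT" at offset 5
#   DIV 200..300, whose start tag (200..210) has no TYPE, but which holds
#   a sub-element with type=abstract at offset 250
divs = [(0, 100), (200, 300)]
div_start_tags = [(0, 25), (200, 210)]
type_abstract_points = [5, 250]

# Like: region DIV incl "type=abstract"
loose = incl(divs, type_abstract_points)
# Like: region DIV incl (region "DIV-T" incl "type=abstract")
strict = incl(divs, incl(div_start_tags, type_abstract_points))
print(loose, strict)
```

The loose query keeps both DIVs, because the second one merely contains the string somewhere inside it. The strict query keeps only the DIV whose own start tag contains the match.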

Exporting a Fabricated Region and Using it in the Index

This example is taken from the bosnia.extra.srch file. I want a region that represents the head of a DIV1. I need to find all DIV1s that have a HEAD. But since some DIV1s do not have HEAD regions, for those DIV1s I need to get the DIV1 tag itself. I must remember during all this that the HEAD regions in question must be ones at the DIV1 level, not inside sub-DIVs....

>> ((region "<DIV1".."</HEAD>") ^ (region DIV1 incl (region HEAD not within region DIV2)))
 1: 101 matches
>> ((region "DIV1-T") ^ (region DIV1 not incl (region HEAD not within region DIV2)))
 2: 12 matches
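
The intent behind these two queries (take the HEAD when the DIV1 has one of its own, otherwise fall back to the DIV1 start tag) can be sketched in Python. This models only the logic described in the prose, not how XPat evaluates its operators; the function names and offsets are hypothetical, regions are (start, end) pairs, div1s and div1_start_tags are assumed to be parallel lists, and the exact end offset XPat records for the from-here-to-there span may differ.

```python
def inside(inner, outer):
    """True when region `inner` lies entirely within region `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def div1_title_regions(div1s, div1_start_tags, heads, div2s):
    """For each DIV1, return the span from the DIV1 start to the end of its
    own HEAD, or just the DIV1 start tag when it has no HEAD of its own."""
    # A HEAD inside some DIV2 belongs to a sub-division, not to the DIV1.
    top_heads = [h for h in heads if not any(inside(h, d2) for d2 in div2s)]
    titles = []
    for div1, tag in zip(div1s, div1_start_tags):
        own = [h for h in top_heads if inside(h, div1)]
        if own:
            first = min(own)  # earliest HEAD at the DIV1 level
            titles.append((div1[0], first[1]))
        else:
            titles.append(tag)  # fallback: the DIV1 start tag itself
    return titles


# Hypothetical offsets: the first DIV1 has its own HEAD; the second DIV1
# has a HEAD only inside a nested DIV2, so it falls back to its start tag.
titles = div1_title_regions(
    div1s=[(0, 500), (600, 900)],
    div1_start_tags=[(0, 20), (600, 615)],
    heads=[(25, 60), (700, 730)],
    div2s=[(650, 800)],
)
print(titles)
```

Every DIV1 ends up with exactly one title region, which is why the two match counts in the session simply add up when the sets are combined.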


Once we've constructed a search that identifies the regions that we care about, we export that region to a file, and incorporate that file into the main index, so that the region we've identified is loaded when the database is run:

>> 1 + 2
 3: 113 matches
>> {exportfile "./div1headtest.rgn"}
>> export

Now there is a new file ./div1headtest.rgn that is a binary index file of this new 'virtual' region. We can make XPat aware of it by adding it to the Regions area of the .dd file.

The Makefile contains a command that takes commands like these and uses them to create a .rgn file and a text file, coll.extra.dd, which is then incorporated into the main .dd file with the Perl script.

We can then refer to the fabricated region with "region div1headtest", and see "div1headtest" as one of the available regions from {ddinfo regionnames}.