This chapter describes how to build region indices on tagged or encoded text files. Tagged text files are files in which structures are marked by <tagname> and </tagname> delimiters. These delimiters are common in structured text, typically SGML or XML.
The Data Dictionary control file discussed in the preceding chapter is the main control file for every XPAT database. Two additional control files play a role in the index-building process: the Region Tagnames file and the Document Type Definition (DTD). The Region Tagnames file employs a fairly simple method of specifying the data regions for which indices must be built, and can be used for well-formed (rather than valid) XML and for similarly constructed encoded text. The Region Tagnames file is discussed in Section 2.2 of this chapter. A DTD may be used to index all elements and attributes in valid SGML and XML that use 7- and 8-bit character encodings, and is especially encouraged in the case of fully-encoded Text Class documents. The DTD is discussed in Section 2.3.
The Region Tagnames file (or Tagnames file for short) is created by the XPAT database manager to specify the data regions for which indices are to be built using the multirgn utility. Tagged or encoded files have tags in the data to indicate a hierarchy, content type, or features for specific portions of text. XPAT refers to these as "regions." While regions are similar in concept to fields, the special terminology helps make clear that these units of information can exist in complex relationships to each other, including nested relationships.
The XPAT software builds indices only on the regions that the XPAT database manager lists in the Tagnames file. For instance, using the patent application example, specifying the region "inventor" tells XPAT to build an index for every data region in the source files that is surrounded by the start and end tags <inventor> and </inventor>. The Tagnames file need not contain every unique tagname in a database; however, DLXS implementers often find that a complete or nearly complete list is useful, especially in early experiments with the data. The Tagnames file typically has the suffix '.tag' and usually uses the same prefix as the Data Dictionary file.
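To make the idea of a region concrete, the following Python sketch locates every span delimited by a given start and end tag and reports its character offsets. It is only an illustration of the concept: it is not how multirgn works internally, the function name find_regions is hypothetical, and the sample text is invented for the example.

```python
import re

def find_regions(text, tagname):
    """Return sorted (start, end) offsets of each <tagname>...</tagname> span.

    Each span runs from the first character of the start tag to the last
    character of the end tag. Nested spans of the same name are paired
    with a stack, so the innermost start tag closes first.
    """
    # Match the start tag (optionally with attributes) or the end tag.
    pattern = re.compile(r"<(/?)%s(?:\s[^>]*)?>" % re.escape(tagname))
    open_starts = []
    regions = []
    for match in pattern.finditer(text):
        if match.group(1) == "":
            # Start tag: remember where the region begins.
            open_starts.append(match.start())
        elif open_starts:
            # End tag: close the most recently opened region.
            regions.append((open_starts.pop(), match.end()))
    return sorted(regions)

if __name__ == "__main__":
    # Invented sample data, loosely following the patent example.
    sample = "<patent><inventor>Ada Lovelace</inventor><title>Engine</title></patent>"
    for start, end in find_regions(sample, "inventor"):
        print(start, end, sample[start:end])
```

Because nested regions are paired from the inside out, an outer and an inner region of the same name are both reported, which mirrors the nested relationships described above.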
Entries in the Tagnames file use a special 'tagged' format. The tagged format provides the ability to distinguish between three types of region information:
Please note that all elements, attributes, and tags in your document(s) must be in a consistent case (e.g., all upper or lower case, or in the same mixed case form), and that they should be declared in this form in your Tagnames file. Refer to that man page for further details.
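A quick pre-flight check can catch case inconsistencies before indexing. The sketch below is a hypothetical helper, not part of XPAT: it scans the text for element names and reports any name that occurs in more than one case form. It assumes simple SGML/XML-style tags and checks element names only, not attribute names.

```python
import re
from collections import defaultdict

def inconsistent_tag_case(text):
    """Return {lower-cased name: set of spellings} for element names
    that appear in more than one case form in the text."""
    spellings = defaultdict(set)
    # Capture the name that follows "<" or "</"; declarations, comments,
    # and processing instructions do not start with a letter and are skipped.
    for name in re.findall(r"</?([A-Za-z][\w.\-]*)", text):
        spellings[name.lower()].add(name)
    return {key: forms for key, forms in spellings.items() if len(forms) > 1}

if __name__ == "__main__":
    # Invented sample: "Title" and "title" disagree in case.
    sample = "<TEXT><Title>Example</title><p>Body</p></TEXT>"
    print(inconsistent_tag_case(sample))
```

An empty result means every element name is spelled in a single case form, which is the form that should then be declared in the Tagnames file.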
The Document Type Definition (DTD) file is used by the XPAT database manager to create data regions for each element, attribute, and tag name in the encoded text. A DTD is used with valid SGML and XML (currently only in 7- and 8-bit character encodings), and indices are built using the sgmlrgn and xmlrgn utilities. As noted above, encoded files have tags in the data to indicate a hierarchy, content type, or features for specific portions of text, and XPAT refers to these as "regions." Unlike the Tagnames file, which results in building indices only on regions specified by the XPAT database manager, use of a DTD and sgmlrgn or xmlrgn will result in regions being created for all elements, attributes, and tag names in the file.
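To get a feel for how many regions the "index everything" approach produces, the following Python sketch lists every element name and attribute name in a small XML sample. It uses Python's standard xml.etree.ElementTree parser, which checks well-formedness only; it does not read a DTD or validate, and it is not a substitute for sgmlrgn or xmlrgn. The function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

def element_and_attribute_names(xml_text):
    """Return (sorted element names, sorted attribute names) found in xml_text."""
    root = ET.fromstring(xml_text)
    elements = set()
    attributes = set()
    # iter() visits the root element and every descendant element.
    for node in root.iter():
        elements.add(node.tag)
        attributes.update(node.attrib.keys())
    return sorted(elements), sorted(attributes)

if __name__ == "__main__":
    # Invented sample data, loosely following the patent example.
    sample = (
        '<patent id="p1">'
        '<inventor role="lead">Ada Lovelace</inventor>'
        "<title>Engine</title>"
        "</patent>"
    )
    names, attrs = element_and_attribute_names(sample)
    print("elements:", names)
    print("attributes:", attrs)
```

Comparing this full list against a hand-picked Tagnames list gives a rough sense of the difference in scope between the two approaches.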
Several features of using a DTD and sgmlrgn/xmlrgn are notable:
Regions can be built in two different ways. The first method uses multirgn and a Tagnames file, is designed for speed and simplicity of indexing, and allows the XPAT database manager to choose the regions that should be indexed. The second method uses a DTD (along with sgmlrgn or xmlrgn) and builds indices on every element, attribute, and tag name in the file. The DTD approach is typically necessary for fully-encoded Text Class documents.