|Last updated||2002-07-08 12:16:20 EDT|
|Doc Title||Page Image Access Mechanisms|
|Author 1||Powell, Chris|
|CVS Revision||$Revision: 1.8 $|
This document describes the mechanisms and programs used by DLXS for accessing and viewing images of pages that correspond to pages in TextClass documents (and possibly other classes). It also explains the particular metadata requirements that exist for this functionality to be possible and shows a sample pageview.dat file. The pageview.dat mechanisms are now deprecated, but still useful for importing information into the Pageview table (see instructions below in Populating the Pageview Table).
For collections where the middleware delivers page images instead of, or in addition to, the text content of the pages, the main mechanism for viewing the pages is the pageviewer-idx CGI program. To link from the SGML/XML text to the corresponding image, this CGI expects two things. First, page images must be stored in directories based on (1) the DLXSROOT value, (2) the object directory recorded in the collection manager, and (3) the unique identifier assigned to the SGML/XML text and stored in the IDNO element. Second, the document must contain page break elements that reference the images. In addition, the dlxs metadata database must have a Pageview table containing a row for each page image, listing the image file name, its sequence in the SGML/XML text, the page number (if any) printed on the page, the OCR confidence value (if available), and a three-letter code for any special features of the page (the default value for no special feature is UNS; see below for more information).
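The directory convention described above can be sketched as follows. This is a minimal illustration, not the code pageviewer-idx uses: the function name `page_image_dir` is hypothetical, and the rule that the first three characters of the identifier become nested subdirectories is an assumption inferred from example paths such as /l1/obj/a/j/l/ajl7777.0001.001 elsewhere in this document.

```python
import posixpath


def page_image_dir(dlxsroot: str, object_dir: str, idno: str) -> str:
    """Build a page-image directory from DLXSROOT, object directory, and IDNO.

    Assumption (inferred from sample paths, not from the DLXS source):
    each of the first three characters of the identifier becomes one
    nested directory level, followed by the full identifier.
    """
    if len(idno) < 3:
        raise ValueError("identifier too short to derive subdirectories")
    # object_dir is treated as relative to DLXSROOT (also an assumption).
    return posixpath.join(
        dlxsroot, object_dir.strip("/"), idno[0], idno[1], idno[2], idno
    )


if __name__ == "__main__":
    print(page_image_dir("/l1", "obj", "ajl7777.0001.001"))
```

The image file named in a page break element would then be looked up inside the returned directory.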
pageviewer-idx connects to the database and retrieves the name and location of the page image file from the Pageview table. pageviewer-idx then decides how to deliver the page. If the stored file format is different from the requested format as recorded in the collection manager (e.g., stored as tiff and requested as gif), a separate program, tif2web, is started to convert the image on the fly.
The PB tag in the SGML data, representing a page break, has this form in Text Class:
<PB REF="00000009.tif" SEQ="00000009" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="856" N="iiii">
The attributes are:
The information in this PB tag allows the Text Class middleware to create a URL to call the pageviewer-idx program with the parameters necessary to retrieve and display the corresponding page image. pageviewer-idx uses the Pageview table of the dlxs metadata database to do so.
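To make the role of the PB attributes concrete, the following sketch pulls them out of a single tag into a dictionary. It is only an illustration in Python: the function name `parse_pb` is hypothetical, the regular expression handles only double-quoted attribute values, and the Text Class middleware itself does not necessarily work this way.

```python
import re

# Matches NAME="value" pairs; assumes values are always double-quoted.
ATTR_RE = re.compile(r'([A-Za-z]+)\s*=\s*"([^"]*)"')


def parse_pb(tag: str) -> dict:
    """Return the attributes of one PB tag as a dict keyed by attribute name."""
    if not re.match(r"\s*<PB\b", tag, re.IGNORECASE):
        raise ValueError("not a PB tag: %r" % tag)
    return {name.upper(): value for name, value in ATTR_RE.findall(tag)}


if __name__ == "__main__":
    tag = ('<PB REF="00000009.tif" SEQ="00000009" RES="600dpi" '
           'FMT="TIFF6.0" FTR="TPG" CNF="856" N="iiii">')
    for name, value in parse_pb(tag).items():
        print(name, value)
```

The REF and SEQ values are the ones that identify the image file and its position in the text; the remaining attributes describe the image and the page.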
The following are examples of feature codes and their expanded definitions that have been used in various collections mounted by DLPS. You may redefine these or use other codes, but you will need to add or change the values in the PageView.cfg file found in the Text Class cgi directory. To see these codes in use in the Text Class interface, view a page from The Use of the Barometer in the Making of America and note the variety of features in the pull-down menu labeled "go to" in the upper right. If UNS is the sole feature recorded, no special features will be listed in this menu.
In DLXS releases prior to CD-ROM 8, pageview.dat, a tab-delimited ASCII file used to locate the page images associated with a text, was stored in the directory with the page images for a particular collection. If you have created pageview.dat files and would like to migrate them to the Pageview table, see the instructions in Populating the Pageview Table. Otherwise, metadata about page images for a collection should be entered directly into the Pageview table.
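As a rough illustration of what such a migration involves, the sketch below reads the data lines of a pageview.dat file and inserts them into a table. Everything about the database side is an assumption made for the example: it uses Python's built-in sqlite3 module as a stand-in, and the table name `pageview_demo` and its columns are hypothetical, not the real Pageview schema. The five-field order (filename, seq, pagenum, confid, feature) follows the header line of the sample pageview.dat file shown later in this document.

```python
import sqlite3


def read_pageview_dat(lines):
    """Return one 5-tuple per data line, skipping blank and '#' comment lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError("expected 5 tab-separated fields: %r" % line)
        rows.append(tuple(fields))
    return rows


def load_rows(conn, idno, rows):
    """Insert rows into a hypothetical demo table, tagged with the text's IDNO."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pageview_demo "
        "(idno TEXT, filename TEXT, seq TEXT, pagenum TEXT, "
        "confid TEXT, feature TEXT)"
    )
    conn.executemany(
        "INSERT INTO pageview_demo VALUES (?, ?, ?, ?, ?, ?)",
        [(idno,) + row for row in rows],
    )
    conn.commit()


if __name__ == "__main__":
    sample = [
        "#filename\tseq\tpagenum\tconfid\tfeature\n",
        "00000001.tif\t00000001\t00000001\t852\tTPG\n",
    ]
    conn = sqlite3.connect(":memory:")
    load_rows(conn, "bab3633.0001.001", read_pageview_dat(sample))
    print(conn.execute("SELECT * FROM pageview_demo").fetchall())
```

For a real migration, follow the documented DLXS procedure; this sketch only shows the shape of the data being moved.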
The pageview.dat file for a particular SGML/XML text can be generated automatically, provided that the required metadata is stored as attributes in the page break (PB) elements in the text. On the distribution CD-ROM, in the directory /l1/bin/t/text/, you will find a Perl script named makepageviewdata.pl. When run with a directory path as its sole argument, it works through the subdirectories, creating pageview.dat files for all files with a .sgm* extension. (For XML files, you will need to edit lines 27 and 51 to point the script to files with the extension .xml.) For example,
will run through all the subdirectories below /l1/obj/a/ and report on the files it finds and the work it is doing:
Working on xml files in directory: /l1/obj/a/j/l/ajl7777.0001.001
Working on file: /l1/obj/a/j/l/ajl7777.0001.001/ajl7777.0001.001.xml
Working on PB tag for sequence: 0001
Working on PB tag for sequence: 0002
Working on PB tag for sequence: 0003
Working on PB tag for sequence: 0004
Working with a document containing these four page break tags:
<PB REF="00000001.tif" SEQ="00000001" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="852" N="1"/>
<PB REF="00000002.tif" SEQ="00000002" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="100" N="2"/>
<PB REF="00000003.tif" SEQ="00000003" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="884" N="3"/>
<PB REF="00000004.tif" SEQ="00000004" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="872" N="4"/>
would result in a pageview.dat file that contained this data:
## File: /l1/obj/b/a/b/bab3633.0001.001/pageview.dat
## Created: Mon Aug 6 11:32:55 EDT 2001
##
#filename	seq	pagenum	confid	feature
00000001.tif	00000001	00000001	852	TPG
00000002.tif	00000002	00000002	100	UNS
00000003.tif	00000003	00000003	884	UNS
00000004.tif	00000004	00000004	872	UNS
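The mapping from PB attributes to pageview.dat fields can be sketched as below. This is written in Python for illustration and is not a port of makepageviewdata.pl. Two rules are assumptions inferred only from the sample above: numeric page numbers are zero-padded to eight digits, and the feature code is cut to its first three letters (so UNSPEC becomes UNS). The function name `pb_to_row` is hypothetical, and the actual script may behave differently.

```python
import re

# Matches NAME="value" pairs; assumes values are always double-quoted.
ATTR_RE = re.compile(r'([A-Za-z]+)\s*=\s*"([^"]*)"')


def pb_to_row(tag: str) -> str:
    """Turn one PB tag into a tab-delimited pageview.dat data line."""
    attrs = dict(ATTR_RE.findall(tag))
    pagenum = attrs.get("N", "")
    if pagenum.isdigit():
        # Assumption: numeric page numbers are padded to eight digits.
        pagenum = pagenum.zfill(8)
    # Assumption: feature codes are stored as their first three letters.
    feature = attrs.get("FTR", "UNS")[:3].upper()
    return "\t".join(
        [attrs["REF"], attrs["SEQ"], pagenum, attrs.get("CNF", ""), feature]
    )


if __name__ == "__main__":
    tags = [
        '<PB REF="00000001.tif" SEQ="00000001" RES="600dpi" '
        'FMT="TIFF6.0" FTR="TPG" CNF="852" N="1"/>',
        '<PB REF="00000002.tif" SEQ="00000002" RES="600dpi" '
        'FMT="TIFF6.0" FTR="UNSPEC" CNF="100" N="2"/>',
    ]
    print("#filename\tseq\tpagenum\tconfid\tfeature")
    for tag in tags:
        print(pb_to_row(tag))
```

Non-numeric page numbers, such as roman numerals, are passed through unchanged in this sketch.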