Last updated	2002-07-08 12:16:20 EDT
Doc Title	Page Image Access Mechanisms
Author 1	Powell, Chris
CVS Revision	$Revision: 1.8 $

Page Image Access Mechanisms

Introduction

This document describes the mechanisms and programs DLXS uses to access and display images of pages that correspond to pages in TextClass documents (and possibly other classes). It also explains the metadata this functionality requires and shows a sample pageview.dat file. The pageview.dat mechanism is now deprecated, but it is still useful for importing information into the Pageview table (see instructions below in Populating the Pageview Table).

General Information

For collections where the middleware delivers page images instead of, or in addition to, the text content of the pages, the main mechanism for viewing the pages is the pageviewer-idx CGI program. To link from the SGML/XML text to the corresponding image, this CGI expects page images to be stored in directories based on (1) the DLXSROOT value, (2) the object directory recorded in the collection manager, and (3) the unique identifier assigned to the SGML/XML text and stored in the IDNO element. It also expects page break elements in the document that reference the images. In addition, the dlxs metadata database must contain a Pageview table with a row for each page image, listing the image file name, its sequence in the SGML/XML text, the page number (if any) printed on the page, the OCR confidence value (if available), and a three-letter code for any special feature of the page (the default value for no special feature is UNS; see below for more information).
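The directory convention described above can be sketched as follows. This is only an illustration: the function name is hypothetical, and the rule that the first three characters of the IDNO become nested one-character directories is an assumption inferred from the sample paths later in this document (for example /l1/obj/a/j/l/ajl7777.0001.001), not a statement of how pageviewer-idx is implemented.

```python
from pathlib import PurePosixPath


def page_image_dir(dlxsroot: str, object_dir: str, idno: str) -> PurePosixPath:
    """Return the assumed directory holding the page images for one text.

    Hypothetical helper: the three-level fan-out on the first characters
    of the IDNO is inferred from the sample paths in this document.
    """
    if len(idno) < 3:
        raise ValueError("IDNO too short to build a directory path: %r" % idno)
    # Assumption: DLXSROOT / object directory / i / d / n / idno
    return PurePosixPath(dlxsroot, object_dir, idno[0], idno[1], idno[2], idno)


print(page_image_dir("/l1", "obj", "ajl7777.0001.001"))
```

Under these assumptions, an individual page image would sit inside that directory under the file name recorded for it in the Pageview table.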

pageviewer-idx connects to the database and retrieves the name and location of the page image file from the Pageview table. It then decides how to deliver the page. If the stored file format differs from the requested format recorded in the collection manager (e.g., stored as TIFF and requested as GIF), a separate program, tif2web, is started to convert the image on the fly.
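The delivery decision can be pictured with a minimal sketch like the one below. The function name and the returned strings are hypothetical, and the case-insensitive comparison is an assumption; the sketch only shows the comparison of stored and requested formats and does not invoke tif2web.

```python
def delivery_plan(stored_format: str, requested_format: str) -> str:
    """Return 'convert' when the stored image must be converted, else 'serve'.

    Hypothetical helper: formats are compared case-insensitively, which is
    an assumption rather than documented pageviewer-idx behavior.
    """
    stored = stored_format.strip().lower()
    requested = requested_format.strip().lower()
    return "serve" if stored == requested else "convert"


print(delivery_plan("tiff", "gif"))  # stored as TIFF, requested as GIF
print(delivery_plan("gif", "GIF"))   # same format, no conversion needed
```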

The Page Break Element and Page Image Metadata

The PB tag in the SGML data, representing a page break, has this form in Text Class:

<PB REF="00000009.tif" SEQ="00000009" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="856" N="iiii">

The attributes are:

REF: file name of page image
SEQ: the sequence number of the page in the series, from start to finish, of all the pages in the document.
RES: the resolution of the page image.
FMT: the file format of the page image.
FTR: the feature of the page, given as a three letter code. Possible values are listed below.
CNF: the confidence value of the OCR for the page, given by the OCR software.
N: the page number as printed on the page (e.g., 3, 96, ix), not its position in the sequence. The value may be left blank, but the attribute cannot be omitted.
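As an illustration of how these attributes might be read from a PB tag, here is a minimal sketch using a regular expression. The function names are hypothetical, the sketch assumes every attribute value is enclosed in double quotes, and it is not the parsing code used by the Text Class middleware.

```python
import re

# Matches NAME="value" pairs; assumes double-quoted attribute values.
PB_ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_pb(tag: str) -> dict:
    """Return the attributes of one PB tag as a dict keyed by upper-case name."""
    return {name.upper(): value for name, value in PB_ATTRIBUTE.findall(tag)}


def check_pb(attrs: dict) -> None:
    """Raise if N is absent; an empty N is allowed, an omitted one is not."""
    if "N" not in attrs:
        raise ValueError("PB tag is missing the N attribute")


tag = ('<PB REF="00000009.tif" SEQ="00000009" RES="600dpi" '
       'FMT="TIFF6.0" FTR="TPG" CNF="856" N="iiii">')
attrs = parse_pb(tag)
check_pb(attrs)
print(attrs["REF"], attrs["SEQ"], attrs["FTR"], attrs["N"])
```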

The information in this PB tag allows the Text Class middleware to create a URL to call the pageviewer-idx program with the parameters necessary to retrieve and display the corresponding page image. pageviewer-idx uses the Pageview table of the dlxs metadata database to do so.

The following are examples of feature codes and their expanded definitions that have been used in various collections mounted by DLPS. You may redefine these or use other codes, but you will need to add or change the values in the PageView.cfg file found in the Text Class cgi directory. To see these codes in use in the Text Class interface, view a page from The Use of the Barometer in the Making of America and note the variety of features in the upper right-hand pull-down menu labeled "go to." If UNS is the sole feature recorded, no special features will be listed in this menu.

'1stpg' => 'First Page'
'ADV' => 'Advertisement'
'BIB' => 'Bibliography'
'BLP' => 'Blank Pages'
'CTP' => 'Cover Title Page'
'ERR' => 'Errata'
'IND' => 'Comprehensive Index'
'LOI' => 'List of Illustrations'
'LOT' => 'List of Tables'
'PNI' => 'Author or Name Index'
'PNT' => 'Production Note'
'REF' => 'References'
'SPI' => 'Special Index'
'SUI' => 'Subject Index'
'TOC' => 'Table of Contents'
'TPG' => 'Title Page'
'UNS' => 'Unspecified'
'VES' => 'Volume End Sheets'
'VLI' => 'Volume List of Illus'
'VOI' => 'Volume Index'
'VTP' => 'Volume Title Page'
'VTV' => 'Volume Title Page Verso'
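The way such codes could feed the "go to" menu can be sketched as below. This is a hedged illustration only: the function name, the (sequence, code) input shape, and the fallback of showing an unknown code as-is are assumptions, and only a few of the codes listed above are reproduced in the lookup table.

```python
# A subset of the feature codes listed above.
FEATURE_LABELS = {
    "TOC": "Table of Contents",
    "TPG": "Title Page",
    "IND": "Comprehensive Index",
    "UNS": "Unspecified",
}


def goto_menu_entries(pages):
    """Return (sequence, label) pairs for pages that carry a special feature.

    `pages` is an iterable of (sequence, feature_code) pairs. Pages coded
    UNS are skipped, so a document with only UNS pages yields no entries.
    """
    entries = []
    for seq, code in pages:
        if code == "UNS":
            continue
        # Assumption: an unrecognized code is shown unchanged.
        entries.append((seq, FEATURE_LABELS.get(code, code)))
    return entries


pages = [("00000001", "TPG"), ("00000002", "UNS"), ("00000003", "TOC")]
print(goto_menu_entries(pages))
```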

Populating the Pageview Table

In DLXS releases prior to CD-ROM 8, pageview.dat, a tab-delimited ASCII file used to locate the page images associated with a text, was stored in the directory with the page images for that text. If you have created pageview.dat files and would like to migrate them to the Pageview table, see the migration instructions. Otherwise, metadata about page images for a collection should be entered directly into the Pageview table.

Creating pageview.dat Files (For Information Only)

The pageview.dat file for a particular SGML/XML text can be generated automatically, provided that the required metadata is stored as attributes in the page break (PB) elements in the text. On the distribution CD-ROM, in the directory /l1/bin/t/text/, you will find a Perl script named makepageviewdata.pl. When run with a directory path as its sole argument, it works through the subdirectories, creating pageview.dat files for all files with a .sgm* extension. (For XML files, you will need to edit lines 27 and 51 of the script to point it to files with the extension .xml.) For example,

/l1/bin/t/text/makepageviewdata.pl /l1/obj/a/

will run through all the subdirectories below /l1/obj/a/ and report on the files it finds and the work it is doing:

Working on xml files in directory: /l1/obj/a/j/l/ajl7777.0001.001
Working on file: /l1/obj/a/j/l/ajl7777.0001.001/ajl7777.0001.001.xml
Working on PB tag for sequence: 0001
Working on PB tag for sequence: 0002
Working on PB tag for sequence: 0003
Working on PB tag for sequence: 0004

Working with a document containing these four page break tags:

<PB REF="00000001.tif" SEQ="00000001" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="852" N="1"/>
<PB REF="00000002.tif" SEQ="00000002" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="100" N="2"/>
<PB REF="00000003.tif" SEQ="00000003" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="884" N="3"/>
<PB REF="00000004.tif" SEQ="00000004" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="872" N="4"/>

would result in a pageview.dat file that contained this data:

## File:        /l1/obj/b/a/b/bab3633.0001.001/pageview.dat
## Created:     Mon Aug  6 11:32:55 EDT 2001
##
#filename       seq       pagenum confid  feature
00000001.tif    00000001        00000001        852     TPG
00000002.tif    00000002        00000002        100     UNS
00000003.tif    00000003        00000003        884     UNS
00000004.tif    00000004        00000004        872     UNS
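When preparing pageview.dat data for import into the Pageview table, the file can first be read into structured rows. The sketch below assumes the file is strictly tab-delimited, as described above, with five fields per data line and comment lines starting with "#". The names PageRow and parse_pageview_dat are hypothetical, and the sketch stops short of any database step because the Pageview table's column names are not given here.

```python
from typing import List, NamedTuple


class PageRow(NamedTuple):
    filename: str
    seq: str
    pagenum: str
    confid: str
    feature: str


def parse_pageview_dat(text: str) -> List[PageRow]:
    """Parse pageview.dat content into rows, skipping comments and blanks."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 5:
            raise ValueError("line %d: expected 5 tab-separated fields, got %d"
                             % (lineno, len(fields)))
        rows.append(PageRow(*fields))
    return rows


SAMPLE = (
    "#filename\tseq\tpagenum\tconfid\tfeature\n"
    "00000001.tif\t00000001\t00000001\t852\tTPG\n"
    "00000002.tif\t00000002\t00000002\t100\tUNS\n"
)
for row in parse_pageview_dat(SAMPLE):
    print(row.filename, row.seq, row.pagenum, row.confid, row.feature)
```

Splitting on tabs rather than on arbitrary whitespace keeps a blank page number as an empty field instead of shifting the later columns.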