Newspaper Clip Image Access Mechanisms

Introduction

This document describes the mechanisms and programs DLXS uses to access and view images of newspaper clips, and the page images that contain them, corresponding to pages in TextClass documents. It also explains the metadata this functionality requires. As of this writing, the mechanism is still a prototype and will continue to change as the defining XML text data evolves. It is currently based on XML designed by the Apex Corporation as a demonstration for the British Library.

General Information

Newspaper collections are typically accompanied by page and article clip images in addition to the text content of the papers. The mechanism for viewing these images is the pageviewer-idx CGI program, which calls upon subclasses to handle the specifics of newspapers. To link from the XML text to the corresponding page or article clip image, this CGI expects page images to be stored in directories based on

  1. the DLXSROOT value
  2. the object directory recorded in the collection manager
  3. the unique identifier assigned to the XML text and stored in the IDNO element

In these respects, the Clipviewer is identical to the Pageviewer. In addition, however, the DIV1 XML elements that wrap articles are associated with their containing pages through rows in the ArticleClips database table. Note that this table is a prototype and is not normalized. Each row lists the clip image file name, the clip's sequence in the XML text, the page number (if any) specified on the containing page, the sequence of the containing page in the XML text, and the page image file name.
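
As a rough illustration of the row layout and directory components described above, the following Python sketch models one ArticleClips row and joins the three path components. It is not the DLXS schema or code: the field names are hypothetical (the actual column names are not given here), and the assumption that the components are simply joined in the listed order may not match the real directory layout.

```python
import posixpath
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleClipRow:
    """One row of the ArticleClips table (field names are hypothetical)."""
    clip_file: str               # clip image file name
    clip_seq: int                # sequence of the clip in the XML text
    page_number: Optional[str]   # page number on the containing page, if any
    page_seq: int                # sequence of the containing page in the XML text
    page_file: str               # page image file name


def image_dir(dlxsroot: str, object_dir: str, idno: str) -> str:
    """Join DLXSROOT, the object directory, and the IDNO value.

    Assumption: the three components are concatenated in this order;
    the real layout beneath the object directory may differ.
    """
    return posixpath.join(dlxsroot, object_dir.strip("/"), idno)
```

Because the table is not normalized, the page-level values (page number, page sequence, page image file name) would be repeated on every row whose clip lies on the same page.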

pageviewer-idx connects to the database and retrieves the name and location of the clip image file from the ArticleClips table. It then decides how to deliver the image. If the stored file format differs from the requested format recorded in the collection manager (e.g., stored as tiff and requested as gif), a separate program, tif2web, is started to convert the image on the fly. For more information about how pageviewer-idx decides how to deliver the page image, see Itemviewer Image Conversion.
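
The format check described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the logic pageviewer-idx actually uses; the function names and the alias table that treats "tiff" and "tif" as the same format are assumptions.

```python
# Hypothetical alias table: spellings treated as the same format.
_ALIASES = {"tiff": "tif", "jpeg": "jpg"}


def normalize_format(fmt: str) -> str:
    """Lower-case a format name, drop a leading dot, and fold aliases."""
    name = fmt.strip().lower().lstrip(".")
    return _ALIASES.get(name, name)


def needs_conversion(stored_format: str, requested_format: str) -> bool:
    """Return True when the stored image would have to be converted
    (for example by a converter such as tif2web) before delivery."""
    return normalize_format(stored_format) != normalize_format(requested_format)
```

For the example in the text, needs_conversion("tiff", "gif") is true, so the image would be converted on the fly; when the two formats match, the stored file could be delivered as is.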

The DIV1 and DIV2 Elements and Clip Image Metadata

The DIV1 element in the XML data represents an article. It has this form in Text Class:

<DIV1 NODE="0FFO.1711.1012:1" TYPE="News" ID="0FFO-1711-OCT12-001-001">

The attributes are:

<DIV2 NODE="0FFO.1711.1012:1.1" TYPE="clip" REF="0FFO-1711-OCT12-001-001-001" PGREF="0FFO-1711-OCT12-001" PGSEQ="1" SEQ="1" N="1">

Note that an article can consist of more than one clip and that the clips for a given article may span more than one page. The attributes are:

The information in these tags allows the Text Class middleware to build a URL that calls the pageviewer-idx program with the parameters needed to retrieve and display the corresponding page image. When viewing clips and their pages, pageviewer-idx uses the ArticleClips table of the dlxs metadata database to do so.
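
To make the URL-building step concrete, here is a minimal Python sketch that turns the DIV2 attributes shown earlier into a query string. The script path, the query parameter names (c, idno, clip, page, pgseq, seq), and the IDNO value are all hypothetical; the parameters that pageviewer-idx really accepts are not documented here.

```python
from urllib.parse import urlencode


def clip_url(base_url: str, collection: str, idno: str, div2_attrs: dict) -> str:
    """Build a viewer URL from DIV2 attributes.

    All query parameter names below are illustrative assumptions,
    not the real pageviewer-idx interface.
    """
    params = {
        "c": collection,
        "idno": idno,
        "clip": div2_attrs["REF"],      # clip identifier
        "page": div2_attrs["PGREF"],    # containing page identifier
        "pgseq": div2_attrs["PGSEQ"],   # sequence of the containing page
        "seq": div2_attrs["SEQ"],       # sequence of the clip
    }
    return base_url + "?" + urlencode(params)


# Attribute values copied from the DIV2 example in this document.
example_attrs = {
    "REF": "0FFO-1711-OCT12-001-001-001",
    "PGREF": "0FFO-1711-OCT12-001",
    "PGSEQ": "1",
    "SEQ": "1",
}
url = clip_url("/cgi/t/text/pageviewer-idx", "bldemo", "EXAMPLE-IDNO", example_attrs)
```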

Populating the ArticleClips Table

The ArticleClips table rows for a particular XML newspaper text can be populated automatically, provided that the required metadata is stored as attributes in the DIV1 and DIV2 elements described above. On the distribution CD-ROM, in the directory DLXSROOT/bin/t/text/, you will find a Perl script named newsdtd2mysql.pl. When run at the command line with the name of the XML file, it populates the ArticleClips table. For example,

DLXSROOT/bin/t/text/newsdtd2mysql.pl DLXSROOT/obj/b/bldemo/bldemo.xml
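
The kind of extraction such a script has to perform can be sketched as follows. This Python example is not a port of newsdtd2mysql.pl, whose internals are not described here; it only shows, under the assumption that the input is well-formed XML with DIV2 clip elements nested inside DIV1 article elements, how the attributes could be gathered into one record per clip. The TEXT wrapper element, the function name, and the ARTICLE_ID key are hypothetical, and the database insert itself is left out.

```python
import xml.etree.ElementTree as ET

# Small sample built from the DIV1 and DIV2 examples in this document;
# the enclosing TEXT element is a hypothetical wrapper.
SAMPLE = """<TEXT>
<DIV1 NODE="0FFO.1711.1012:1" TYPE="News" ID="0FFO-1711-OCT12-001-001">
<DIV2 NODE="0FFO.1711.1012:1.1" TYPE="clip" REF="0FFO-1711-OCT12-001-001-001" PGREF="0FFO-1711-OCT12-001" PGSEQ="1" SEQ="1" N="1"></DIV2>
</DIV1>
</TEXT>"""

CLIP_ATTRS = ("REF", "PGREF", "PGSEQ", "SEQ", "N")


def clip_rows(xml_text: str) -> list:
    """Return one dict per DIV2 clip, tagged with its article's ID."""
    root = ET.fromstring(xml_text)
    rows = []
    for div1 in root.iter("DIV1"):
        for div2 in div1.iter("DIV2"):
            if div2.get("TYPE") != "clip":
                continue  # skip DIV2 elements that are not clips
            row = {"ARTICLE_ID": div1.get("ID")}
            # Copy the raw attribute values without interpreting them.
            row.update({name: div2.get(name) for name in CLIP_ATTRS})
            rows.append(row)
    return rows


rows = clip_rows(SAMPLE)
```

Each resulting record could then be written to the ArticleClips table with a parameterized INSERT statement.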