DLXS Programmers & Technical Workshop

Wordwheel Program Overview and Data Preparation

Wordwheel is the name we give to a browsing capability available in TextClass for a given collection or across multiple collections.

If the user enters a word or word-beginning, wordwheel presents the user with a list of all words that appear in a collection together with the number of occurrences. The user can select from the list and perform a simple search on the selected words.
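
To make the browsing behaviour concrete, here is a minimal Python sketch of a prefix lookup over a sorted word list. It is not how ww2-idx works (the CGI is discussed later and queries indexed Wordwheel data). The function name `wordwheel_window`, the window size, the sample words and the choice to start the list at the first word at or after the user's input are all assumptions made for illustration.

```python
from bisect import bisect_left

def wordwheel_window(entries, prefix, size=5):
    """Return up to `size` (word, count) pairs, starting at the first
    word that sorts at or after `prefix`. `entries` must be sorted by word."""
    words = [word for word, _ in entries]
    start = bisect_left(words, prefix)   # binary search for the insertion point
    return entries[start:start + size]

# Hypothetical words and occurrence counts, not taken from a real collection.
entries = sorted([("bridge", 12), ("border", 40), ("bosnia", 310),
                  ("book", 7), ("zone", 3)])

for word, count in wordwheel_window(entries, "bo", size=3):
    print(f"{word}\t{count}")
```

With the sample data, the input "bo" yields book, border and bosnia with their counts, which is the kind of list the user would then select from.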

The CGI that implements this functionality is discussed later in Program Architecture - ww2-idx.

Here we discuss the creation of the data that supports the wordwheel for a given collection.

History

The wordwheel was previously implemented as a shell script that found words in the collection SGML and partitioned the result into files words_A through words_Z. The CGI then executed a grep over one or more of these files, using the user's input as the pattern, and extracted a window of lines around the match. The problems were slowness, the difficulty of cross-collection searching, and the extraction of windows spanning files.

Directory Structure

obj/lib/sgml/
    wordwheel.dtd
idx/c/coll/WW/
    coll.ww.dd etc.
    makeWordWheelFiles.cfg
    Makefile
bin/WW/
    Makefile
    makeWordWheelFiles.pl
    makeWordWheelFiles.sample.cfg
    sample.ww.blank.dd
    sample.ww.inp

Overview of wordwheel data creation

We first run the Makefile for WW, which runs makeWordWheelFiles.pl to create the derived wordwheel SGML. This Makefile has mostly the same targets as the one used for indexing the collection SGML. It does not create fabricated regions, since they have no use in the wordwheel.

Work with Content Specialist / encoders on realms

Looking at the makeWordWheelFiles.cfg file, note that what are called realms must be defined for each collection. A realm is a region in the collection SGML from which we would like to extract words. We work with the content specialist to determine which region definitions best define the realms for a given collection.

The makeWordWheelFiles.pl Program

makeWordWheelFiles.pl reads the realm definitions from makeWordWheelFiles.cfg and constructs XPat queries against the collection SGML. Note, therefore, that the collection must already be indexed before we can make the Wordwheel data. makeWordWheelFiles.pl takes the query results, breaks them into the individual words of interest, stores each word and keeps a count of its occurrences. It wraps each word in an <E> tag and sets the N, O and L attributes (more on these later). Words are further categorized as alpha, numeric or other, and a higher-level tag, e.g. <ALPHA>, wraps the word tags in each category. Finally, the <REALMNAME> tag is added and everything is wrapped in a <REALM> tag. We can see this structure in wordwheel.dtd.
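
The counting and wrapping steps just described can be sketched in a few lines of Python. This is an illustration only, not the Perl program itself. The tokenizing pattern, the function names, the <NUMERIC> and <OTHER> tag names, and the reading of N as a running sequence number and O as the occurrence count are assumptions. Only <REALM>, <REALMNAME>, <ALPHA> and <E> are named in this document. L is simply set to the word here.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape, quoteattr

# Assumed tokenizer: runs of characters other than whitespace and common punctuation.
TOKEN = re.compile(r"[^\s.,;:!?\"()]+")

def categorize(word):
    """Classify a word by its first character (assumed rule)."""
    if word[0].isalpha():
        return "ALPHA"
    if word[0].isdigit():
        return "NUMERIC"   # hypothetical tag name
    return "OTHER"         # hypothetical tag name

def build_realm(realm_name, texts):
    """Count words in `texts` (stand-ins for query results) and emit realm SGML."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))

    groups = {"ALPHA": [], "NUMERIC": [], "OTHER": []}
    for word in sorted(counts):
        groups[categorize(word)].append(word)

    lines = ["<REALM>", f"<REALMNAME>{escape(realm_name)}</REALMNAME>"]
    n = 0  # assumed: N is a running sequence number across the realm
    for tag, words in groups.items():
        if not words:
            continue
        lines.append(f"<{tag}>")
        for word in words:
            lines.append(f'<E N="{n}" O="{counts[word]}" '
                         f'L={quoteattr(word)}>{escape(word)} </E>')
            n += 1
        lines.append(f"</{tag}>")
    lines.append("</REALM>")
    return "\n".join(lines)

print(build_realm("full text", ["A piece of a-piece.", "a 1995 report"]))
```

The real program obtains its input from XPat queries built from the realm definitions. The two strings passed to `build_realm` stand in for those query results.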

Here is an SGML fragment:

<REALM>
<REALMNAME>full text</REALMNAME>
     <ALPHA>
        <E N="0" O="8205" L="a">a </E>
        <E N="1" O="1690" L="a">à </E>
        <E N="2" O="1" L="a'district">a'district </E>
        <E N="3" O="1" L="a-piece">a-piece </E>

After the Wordwheel SGML is created, it is normalized and indexed using the same commands as for the collection SGML itself, as described in XPat Indexing.

A word about the L attribute. L is short for LEMMA. This attribute stores 8-bit characters as their 7-bit equivalents, so that à is stored as a. This allows the wordwheel CGI (ww2-idx) to build cross-collection wordwheels sorted alphabetically across multiple collections that may span multiple languages; otherwise à would sort after z. The flag that controls this is the $gLocale variable in makeWordWheelFiles.cfg. If $gLocale is set to a value other than 'c', lemma attributes are created.
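
The effect of sorting on a folded lemma rather than on the raw word can be shown with a short Python sketch. The folding method used here (Unicode decomposition followed by dropping combining marks) is an assumption chosen for illustration. makeWordWheelFiles.pl may fold characters differently, and the function name `lemma` is hypothetical.

```python
import unicodedata

def lemma(word):
    """Fold accented characters to their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks (accents) left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["zebra", "à", "b"]

print(sorted(words))             # raw code-point order puts "à" after "zebra"
print(sorted(words, key=lemma))  # lemma order puts "à" first, next to "a"
```

Characters that have no decomposition, such as ß or ø, pass through this sketch unchanged, so a complete 8-bit to 7-bit mapping would need an explicit table for them.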

Make Wordwheel Data: Hands On

To get more of a feel for the process, we'll use the bosnia Wordwheel Makefile to make the Wordwheel SGML, normalize it, and then index it. Earlier in the workshop we edited the correct paths into makeWordWheelFiles.cfg, the input to makeWordWheelFiles.pl, which is invoked by the first target.

% cd $HOME/dlxs/idx/b/bosnia/WW
% make wordwheel
% make norm
% make dd
% make sgmlrgn
% make finish