Wordwheel Program Overview and Data Preparation
Wordwheel is the name we give to a browsing capability available in TextClass for
a given collection or across multiple collections.
If the user enters a word or word-beginning, wordwheel presents the
user with a list of all words that appear in a collection together with
the number of occurrences. The user can select from the list and
perform a simple search on the selected words.
The CGI that implements this functionality is discussed later in Program
Architecture - ww2-idx.
Here we discuss the creation of the data that supports
the wordwheel for a given collection.
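To make the browsing behavior concrete, here is a minimal Python sketch of looking up a word-beginning in a sorted list of words and counts. It is not the ww2-idx implementation: the function name wordwheel_window, the window size, the sample data, and the choice to return a window starting at the first matching position are all assumptions made for illustration.

```python
from bisect import bisect_left

def wordwheel_window(entries, prefix, size=5):
    """Return up to `size` (word, count) pairs starting at the first
    word that sorts at or after `prefix`.

    `entries` must already be sorted by word.
    """
    words = [word for word, _ in entries]
    start = bisect_left(words, prefix)
    return entries[start:start + size]

# Hypothetical sample data: (word, number of occurrences), sorted by word.
entries = [
    ("a", 8205),
    ("a-piece", 1),
    ("abandon", 12),
    ("able", 40),
    ("about", 310),
]

window = wordwheel_window(entries, "ab", size=2)
print(window)
```

A binary search keeps the lookup cheap even for large word lists, which is one way to avoid the line-by-line scanning of a grep-based approach.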
History
Wordwheel was previously implemented as a shell script that found words in the collection
SGML and partitioned the result into files words_A through words_Z.
The CGI then ran grep over one or more of these files, using the user's input as the pattern,
and extracted a window of lines around the match.
The problems with this approach were slowness, the difficulty of cross-collection searching,
and the difficulty of extracting windows that span files.
Directory Structure
obj/lib/sgml/
    wordwheel.dtd
idx/c/coll/WW/
    coll.ww.dd etc.
    makeWordWheelFiles.cfg
    Makefile
bin/WW/
    Makefile
    makeWordWheelFiles.pl
    makeWordWheelFiles.sample.cfg
    sample.ww.blank.dd
    sample.ww.inp
Overview of wordwheel data creation
We first run the Makefile
for WW, which runs makeWordWheelFiles.pl to
create the wordwheel-derived SGML. This Makefile has mostly the same
targets as the one used for indexing the collection SGML. It does not
create fabricated regions, because they have no use in wordwheel.
Work with Content Specialist / encoders on realms
Looking at the makeWordWheelFiles.cfg
file, note that what are called realms must be defined for
each collection. We work with the content specialist to determine
which region definitions best define the realms in question for a given
collection. A realm is a region in the
collection SGML from which we would like to extract words.
makeWordWheelFiles.pl reads realm definitions from the makeWordWheelFiles.cfg
file and constructs XPat queries against the collection SGML. Note, therefore,
that the collection must already be indexed before we can make the Wordwheel
data. makeWordWheelFiles.pl takes the query results,
processes them into individual words of interest, stores each word and
keeps a count of its occurrences. It wraps each word in an <E> tag
and sets the N, O and L attributes (more on these later). Words are
further categorized into alpha, numeric and other. A higher-level
tag, e.g. <ALPHA>, wraps the word tags. Finally the <REALMNAME>
tag is added and all are wrapped by a <REALM> tag. We can see
this in the wordwheel.dtd.
Here is an SGML fragment:
<REALM>
<REALMNAME>full text</REALMNAME>
<ALPHA>
<E N="0" O="8205" L="a">a </E>
<E N="1" O="1690" L="a">à </E>
<E N="2" O="1" L="a'district">a'district </E>
<E N="3" O="1" L="a-piece">a-piece </E>
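The counting and categorizing steps behind a fragment like the one above can be sketched in Python. This is an illustration only, not the logic of makeWordWheelFiles.pl: splitting on whitespace, lower-casing, the rule that decides the category, and the category names NUMERIC and OTHER are all assumptions (only ALPHA appears in the fragment).

```python
from collections import Counter

def categorize(word):
    """Assign a word to one of three categories (assumed rule)."""
    if word.isdigit():
        return "NUMERIC"
    if word[0].isalpha():
        # Words such as a'district and a-piece start with a letter,
        # so they land in ALPHA, as in the fragment.
        return "ALPHA"
    return "OTHER"

def count_words(text):
    """Split realm text on whitespace and count each lower-cased word."""
    return Counter(text.lower().split())

def group_by_category(counts):
    """Partition a word -> count mapping by category."""
    groups = {"ALPHA": {}, "NUMERIC": {}, "OTHER": {}}
    for word, occurrences in counts.items():
        groups[categorize(word)][word] = occurrences
    return groups

# Hypothetical realm text standing in for XPat query results.
sample = "A piece of a-piece 1995 a 42 1995 &c"
groups = group_by_category(count_words(sample))

for category, words in groups.items():
    print(category, words)
```

Each group would then be written out under its own higher-level tag inside the <REALM> element.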
After the Wordwheel SGML is created, it is normalized and indexed
using the same commands as for the collection SGML itself, as described
in XPat Indexing.
A word about the L attribute. L is short for LEMMA.
This attribute stores 8-bit characters as their 7-bit equivalents,
so that à is stored as a. This feature allows
the wordwheel CGI (ww2-idx) to
build cross-collection wordwheels sorted alphabetically across multiple
collections that may span multiple languages. Otherwise à
would sort after z. The flag that enables this is the
$gLocale variable in makeWordWheelFiles.cfg.
If $gLocale is set to a value other than 'c', lemma attributes are created.
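The effect of the lemma on sort order can be shown with a short Python sketch. The folding here uses Unicode decomposition, which is an assumption: the actual character mapping in makeWordWheelFiles.pl is not shown in this document, and this approach does not fold characters that have no decomposition (for example ø). Reading N as a sequence number and O as the occurrence count is inferred from the SGML fragment, and the count for zebra is made up.

```python
import unicodedata

def to_lemma(word):
    """Fold accented characters to their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def render_entries(counts):
    """Sort words by lemma and emit one <E> element per word."""
    ordered = sorted(counts, key=lambda w: (to_lemma(w), w))
    return [
        f'<E N="{n}" O="{counts[word]}" L="{to_lemma(word)}">{word} </E>'
        for n, word in enumerate(ordered)
    ]

# Counts taken from the fragment, plus a hypothetical entry for zebra.
counts = {"a-piece": 1, "à": 1690, "zebra": 3, "a": 8205, "a'district": 1}

# Sorting on the raw words places à after zebra.
print(sorted(counts))

# Sorting on the lemma keeps à next to a.
for line in render_entries(counts):
    print(line)
```

Because the sort key is the lemma rather than the raw word, words from collections in different languages interleave in a single alphabetical list.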
Make Wordwheel Data: Hands On
To get more of a feel for the process, we'll use the bosnia Wordwheel Makefile
to make the Wordwheel SGML, normalize it and then index it.
Earlier in the workshop we edited the correct paths into makeWordWheelFiles.cfg,
the input to makeWordWheelFiles.pl, which is invoked by the first
target.
% cd $HOME/dlxs/idx/b/bosnia/WW
% make wordwheel
% make norm
% make dd
% make sgmlrgn
% make finish