Normalization

Normalization of Data

Normalizing SGML brings several benefits:

tag names, attribute names, and some attribute value types are normalized into upper case
record ends are normalized in a consistent fashion based on element content models
optional or minimized tags are made explicit (makes programmatic parsing much easier)
most normalizers put attributes in the order in which they are declared in the DTD (though this is not part of the formal definition of normalization)

Our favorite normalizer is sgmlnorm from James Clark's SP.

command: sgmlnorm doctype_file sgml_file > output_sgml_file
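If the normalization step has to be driven from a script rather than typed at the prompt, the same command can be wrapped in a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes sgmlnorm is on the PATH and reports failure through a non-zero exit status.

```python
import subprocess

def sgmlnorm_argv(doctype_file, sgml_file):
    """Build the argument list for: sgmlnorm doctype_file sgml_file"""
    return ["sgmlnorm", str(doctype_file), str(sgml_file)]

def normalize(doctype_file, sgml_file, output_file, run=subprocess.run):
    """Run sgmlnorm and write its standard output to output_file.

    Equivalent to the shell redirection "> output_sgml_file".
    """
    with open(output_file, "wb") as out:
        # check=True raises CalledProcessError on a non-zero exit status.
        return run(sgmlnorm_argv(doctype_file, sgml_file), stdout=out, check=True)

# Example call (file names are placeholders, as in the command above):
# normalize("doctype_file", "sgml_file", "output_sgml_file")
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.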

Here is an example of how normalization might change an SGML document, with some detail on how this eases parsing.
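To make the parsing benefit concrete, the sketch below walks a small fragment that is assumed to be already normalized: names are in uppercase and every element except those declared EMPTY has an explicit end tag. The fragment, the element names, and the N attribute on <PB> are invented for illustration and do not come from a real DTD. Because nothing has to be inferred from the DTD, one regular expression and a stack are enough to recover the element structure.

```python
import re

# Hypothetical fragment as it might look after normalization.
NORMALIZED = (
    '<DIV1 TYPE="chapter"><HEAD>Intro</HEAD>'
    '<P>First.<PB N="2"></P><P>Second.</P></DIV1>'
)

# One pattern covers every tag because start and end tags are explicit.
# Assumption: attribute values never contain a ">" character.
TAG = re.compile(r"<(/?)([A-Z][A-Z0-9.-]*)[^>]*>")

def element_paths(sgml, empty=frozenset({"PB"})):
    """Return the path of each element (e.g. 'DIV1/P') in document order.

    `empty` lists elements assumed to be declared EMPTY in the DTD;
    these have no end tag in SGML, even after normalization.
    """
    stack, paths = [], []
    for match in TAG.finditer(sgml):
        closing, name = match.group(1), match.group(2)
        if closing:
            # An end tag must close the element that is currently open.
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected end tag </{name}>")
            stack.pop()
        elif name in empty:
            # EMPTY element: record it, but do not wait for an end tag.
            paths.append("/".join(stack + [name]))
        else:
            stack.append(name)
            paths.append("/".join(stack))
    if stack:
        raise ValueError(f"unclosed element <{stack[-1]}>")
    return paths

print(element_paths(NORMALIZED))
# → ['DIV1', 'DIV1/HEAD', 'DIV1/P', 'DIV1/P/PB', 'DIV1/P']
```

The same fragment with its end tags omitted could not be handled this way; the program would need the DTD's content models to work out where each element ends.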

Do look at all of James Clark's SGML/XML tools.

Normalization: Hands On

To get more of a feel for the process, we'll use the bosnia Makefile to do the necessary normalization (sgmlnorm) step. Before the data can be normalized, however, it must be transformed: the <PB> (pagebreak) tags are processed, and their attributes and values are changed to conform to the expectations of the Page Viewer. After the <PB> tags have been "munged", we will also use the Makefile to check that the SGML is valid before normalizing. This validation step runs nsgmls, James Clark's parser.

% cd $HOME/dlxs/idx/b/bosnia
% make noded
% make validate
% make norm
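The order of these steps matters: there is little point in normalizing data that failed validation. As a rough sketch of how the same sequence could be scripted so that it stops at the first failing step, the following assumes only the directory and make targets shown above; the function name and its return convention are invented for illustration.

```python
import os
import subprocess

# Targets named in the steps above, in the order they must run.
TARGETS = ["noded", "validate", "norm"]

def run_targets(targets, workdir, run=subprocess.run):
    """Run each make target in workdir, stopping at the first failure.

    Returns (completed_targets, failed_target); failed_target is None
    when every target exits with status 0.
    """
    completed = []
    for target in targets:
        result = run(["make", target], cwd=workdir)
        if result.returncode != 0:
            return completed, target
        completed.append(target)
    return completed, None

if __name__ == "__main__":
    workdir = os.path.expanduser("~/dlxs/idx/b/bosnia")
    done, failed = run_targets(TARGETS, workdir)
    if failed:
        print(f"make {failed} failed; later steps were skipped")
    else:
        print("completed:", ", ".join(done))
```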