Normalization

Normalization of Data

Consider this DTD fragment:

<!ELEMENT E - o ( #PCDATA )   >
<!ATTLIST E
                                N   NUMBER    #REQUIRED
                                O   NUMBER    #REQUIRED
                                L   CDATA     #IMPLIED >

And this un-normalized SGML fragment:

<e o="1" n="3" l="emile">emile </e>
<e n="4" o="23">georgina
<e n="5" o="1" l="holbach">holbach </e>

After normalization we would expect:

<E N="3" O="1" L="emile">emile </E>
<E N="4" O="23">georgina </E>
<E N="5" O="1" L="holbach">holbach </E>

So we see that:

The "E" tag name and the "N", "O" and "L" attribute names are now upper case.
Record ends are normalized in a consistent fashion, based on element content models.
The end tag of "E", which the DTD allows to be omitted (minimized), has been supplied explicitly (</E>).
The attributes have been reordered into a fixed order (N, O, L).

After normalization, a single Perl regular expression can, for example, capture
the attribute values and the content like this:

$tag =~ m,<E N=\"([^\"]*?)\" O=\"([^\"]*?)\"( L=\"([^\"]*?)\")?>([^<]*?)</E>, ;
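
As a minimal sketch of how that pattern might be used, the snippet below runs it over the three normalized lines shown earlier and collects the captures. The names @normalized and @records are illustrative assumptions, not part of the original. Note that the optional L attribute occupies two capture groups: $3 is the whole attribute and $4 is its value, which is undefined when L is absent.

```perl
use strict;
use warnings;

# The three normalized lines from the example above.
my @normalized = (
    '<E N="3" O="1" L="emile">emile </E>',
    '<E N="4" O="23">georgina </E>',
    '<E N="5" O="1" L="holbach">holbach </E>',
);

my @records;    # hypothetical container for the captured values
for my $tag (@normalized) {
    # Same pattern as above: $1 = N, $2 = O, $4 = L (optional), $5 = content.
    if ($tag =~ m,<E N=\"([^\"]*?)\" O=\"([^\"]*?)\"( L=\"([^\"]*?)\")?>([^<]*?)</E>,) {
        my ($n, $o, $l, $content) = ($1, $2, $4, $5);
        push @records, { N => $n, O => $o, L => $l, content => $content };
        printf "N=%s O=%s L=%s content=[%s]\n",
            $n, $o, defined $l ? $l : '(none)', $content;
    }
}
```

The pattern relies on the fixed case, fixed attribute order and explicit end tags that normalization provides. It is not expected to match the un-normalized fragment.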

To match on the un-normalized SGML, we would need to handle:

Case: What if we had mixed-case tag names such as ABC, Abc, ABc, etc.?
Order: All permutations of attribute ordering.
Missing end tag: An element whose end tag is omitted ends only where the next start tag begins, so we would have to write a regular expression to match any begin tag allowed by the DTD.
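
To give a feel for the extra work, here is a rough sketch of matching the un-normalized fragment directly, with O of the first element written as the digit 1, as in the normalized output. It is not a general SGML parser. It assumes that E is the only element in play (so "any begin tag" reduces to another <e), that attribute values are always double-quoted, and that no value contains ">". The names $sgml, @elements and %attr are illustrative.

```perl
use strict;
use warnings;

# The un-normalized fragment from the example above.
my $sgml = <<'END';
<e o="1" n="3" l="emile">emile </e>
<e n="4" o="23">georgina
<e n="5" o="1" l="holbach">holbach </e>
END

my @elements;
# Case: the /i flag accepts <e ...> as well as <E ...>.
# Missing end tag: content stops at </e>, at the next start tag, or at end of input.
while ($sgml =~ m{<e\b([^>]*)>([^<]*)(?:</e\s*>|(?=<e\b)|\z)}gi) {
    my ($attr_text, $content) = ($1, $2);

    # Order: collect attributes into a hash keyed by upper-cased name,
    # so the order in which they were written no longer matters.
    my %attr;
    while ($attr_text =~ m{(\w+)\s*=\s*"([^"]*)"}g) {
        $attr{uc $1} = $2;
    }

    # Simplification: strip trailing blanks and record ends from the content.
    $content =~ s/\s+\z//;

    my %record = (%attr, content => $content);
    push @elements, \%record;
}

for my $e (@elements) {
    printf "N=%s O=%s L=%s content=[%s]\n",
        $e->{N}, $e->{O}, defined $e->{L} ? $e->{L} : '(none)', $e->{content};
}
```

Even for this one-element DTD the matching takes two nested patterns and a lookahead. With a realistic DTD, the lookahead would have to list every start tag that can legally close the element, which is the work normalization removes.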