Normalization of Data

Consider this DTD fragment:

<!ELEMENT E  - o  ( #PCDATA )   >
<!ATTLIST E
                                N   NUMBER    #REQUIRED
                                O   NUMBER    #REQUIRED
                                L   CDATA     #IMPLIED >

And this un-normalized SGML fragment:

<e o="1" n="3" l="emile">emile </e>
<e n="4" o="23">georgina
<e n="5" o="1" l="holbach">holbach </e>

After normalization we would expect:

<E N="3" O="1" L="emile">emile </E>
<E N="4" O="23">georgina </E>
<E N="5" O="1" L="holbach">holbach </E>
 

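So we see that normalization folds the element and attribute names to upper
case, writes the attributes in the order the DTD declares them, and supplies
the omitted end tag.

After normalization we can write a Perl regular expression that matches, for
example, the attribute values and content like this: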

$tag =~ m,<E N=\"([^\"]*?)\" O=\"([^\"]*?)\"( L=\"([^\"]*?)\")?>([^<]*?)</E>, ;
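
As a minimal sketch of how that expression might be used, the script below
applies it to the three normalized lines from the example. The helper name
parse_e and the hash keys it returns are hypothetical, chosen only for
illustration. The quotes are written without backslashes, which Perl treats
the same way inside a comma-delimited pattern.

```perl
use strict;
use warnings;

# The three normalized lines from the example above.
my @normalized = (
    '<E N="3" O="1" L="emile">emile </E>',
    '<E N="4" O="23">georgina </E>',
    '<E N="5" O="1" L="holbach">holbach </E>',
);

# Hypothetical helper (not part of the original text): returns a hash
# reference holding the captured fields, or undef when the line does
# not match the normalized form.
sub parse_e {
    my ($tag) = @_;
    if ($tag =~ m,<E N="([^"]*?)" O="([^"]*?)"( L="([^"]*?)")?>([^<]*?)</E>,) {
        # $3 is the whole optional ' L="..."' group; $4 is the L value
        # and stays undef when the attribute is absent.
        return { n => $1, o => $2, l => $4, content => $5 };
    }
    return undef;
}

for my $line (@normalized) {
    my $e = parse_e($line) or next;
    printf "N=%s O=%s L=%s content=[%s]\n",
        $e->{n}, $e->{o},
        defined $e->{l} ? $e->{l} : '(none)',
        $e->{content};
}
```

Run on the normalized lines, this prints one line per element, with
"(none)" standing in for the absent L attribute on the second element. The
same expression does not match the un-normalized line
<e n="4" o="23">georgina, because the names are in lower case and the end tag
is missing.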

To match on the un-normalized SGML, we would need to handle: