UM Provider (OAI-PMH 2.0 data provider)

Overview, design and installation

As with our decisions regarding the new nameresolver2, we wanted to create a new OAI data provider that would move away from broker20 and Bibliographic Class. We were also creating the new MBooks environment, which used metadata directly from our online library catalog (Mirlyn). It seemed counterintuitive to put Mirlyn metadata (essentially MARC21 data) into Bibliographic Class just to make it work in broker20, and our method for crosswalking TEI to Text Class to Bibliographic Class had always seemed substandard. We also had to handle a new rights environment for MBooks (some volumes were public domain and others were restricted), and there was no clean method to connect the rights database with broker20.

Consequently, we created UMProvider to hold and provide access to all our OAI metadata: the MBooks metadata as well as the DLPS/DLXS metadata from our Text Class and Image Class collections. We also decided to make it reusable: a single Perl module that connects to any relational database (e.g., MySQL) and has no requirements other than common Perl modules (e.g., XML::LibXML, CGI, DBI).

We'll provide a brief overview of OAI before we show UMProvider.

Getting started

Steps for getting started:

  1. Get the UMProvider module
  2. Set up database tables
  3. Create or modify example CGI script
  4. Edit UMProvider config
  5. Load your data

1) Get the UMProvider module

The UMProvider will be included in DLXS release 14 ($DLXSROOT/bin/o/oai/, $DLXSROOT/cgi/o/oai/) and is available now on SourceForge (non-DLXS enabled): http://www.sourceforge.net/projects/umoaitoolkit/. The existing OAI provider (broker20) will continue to be distributed with DLXS. However, we encourage you to start using the UMProvider, as it is simpler to manage and conforms to the OAI-PMH specification, which broker20 never did completely.

The UM OAI Toolkit (umoaitoolkit/) available from sourceforge contains the OAI-PMH harvesting scripts as well.

2) Set up database tables

The first MySQL table is mandatory and stores all of the required data for the UMProvider. The second table can be used if you would like to organize your records into sets. Sets, in OAI-PMH, are used for organizing the data for selective harvesting of the content. Both tables are created as they appear below when you install DLXS release 14.

First table (oai):
      +-----------+--------------+------+-----+-------------------+
      | Field     | Type         | Null | Key | Default           |
      +-----------+--------------+------+-----+-------------------+
      | id        | varchar(150) | NO   | PRI |                   |
      | timestamp | timestamp    | NO   | MUL | CURRENT_TIMESTAMP |
      | oai_dc    | mediumblob   | YES  |     | NULL              |
      | marc21    | mediumblob   | YES  |     | NULL              |
      | mods      | mediumblob   | YES  |     | NULL              |
      +-----------+--------------+------+-----+-------------------+
    
Second table (oaisets), optional:
      +-----------+--------------+------+-----+---------+
      | Field     | Type         | Null | Key | Default |
      +-----------+--------------+------+-----+---------+
      | id        | varchar(150) | NO   | PRI |         |
      | oaiset    | varchar(32)  | NO   | PRI |         |
      +-----------+--------------+------+-----+---------+
    
The equivalent CREATE TABLE statements:

      CREATE TABLE oai (
          id        VARCHAR(150) PRIMARY KEY,
          timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
          oai_dc    MEDIUMBLOB,
          marc21    MEDIUMBLOB,
          mods      MEDIUMBLOB,
          KEY `timestamp` (timestamp) );

      CREATE TABLE oaisets (id VARCHAR(150), oaiset VARCHAR(32), PRIMARY KEY (`id`,`oaiset`), KEY `oaiset` (oaiset));
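
To see how the two tables work together for selective harvesting, here is a minimal sketch in Python. It is not UMProvider code: it uses an in-memory SQLite database as a stand-in for MySQL (so the column types are simplified), and the record ids, the helper name ids_in_set and the query itself are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; types are simplified accordingly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE oai (
        id        TEXT PRIMARY KEY,
        timestamp TEXT DEFAULT CURRENT_TIMESTAMP,
        oai_dc    BLOB,
        marc21    BLOB,
        mods      BLOB
    );
    CREATE TABLE oaisets (
        id     TEXT,
        oaiset TEXT,
        PRIMARY KEY (id, oaiset)
    );
""")

# Two made-up records; only the first one is assigned to a set.
conn.executemany(
    "INSERT INTO oai (id, oai_dc) VALUES (?, ?)",
    [("rec-1", b"<oai_dc:dc>...</oai_dc:dc>"),
     ("rec-2", b"<oai_dc:dc>...</oai_dc:dc>")],
)
conn.execute("INSERT INTO oaisets (id, oaiset) VALUES (?, ?)", ("rec-1", "dlps"))


def ids_in_set(connection, set_spec, column="oai_dc"):
    """Return ids that belong to set_spec and have metadata in the given column."""
    if column not in ("oai_dc", "marc21", "mods"):
        raise ValueError("unknown metadata column: " + column)
    query = (
        "SELECT oai.id FROM oai JOIN oaisets ON oai.id = oaisets.id "
        "WHERE oaisets.oaiset = ? AND oai." + column + " IS NOT NULL "
        "ORDER BY oai.id"
    )
    return [row[0] for row in connection.execute(query, (set_spec,))]


print(ids_in_set(conn, "dlps"))
```

Each metadata format lives in its own column, so a record only appears for a given metadataPrefix when that column is filled in; set membership is a separate lookup in oaisets.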
    

3) Create or modify example CGI script

First, log onto pilsner with your workshop ID.

The only thing that needs to be changed for the CGI script ($DLXSROOT/cgi/o/oai/oai) is the information needed to connect to the database. Other than that, the sample script should work out of the box.

NOTE: If you are NOT using the UMProvider with DLXS, you can either add the database connection information directly to the CGI script ($DLXSROOT/cgi/o/oai/oai) or load a config file as we do in DLXS.

4) Edit UMProvider config

The UMProvider configuration contains information about the repository for the Identify, ListSets and ListMetadataFormats OAI-PMH verbs. Because this data rarely changes, it is stored in an XML configuration file.

  1. # cd $DLXSROOT/cgi/o/oai/
  2. # cp sample_config.xml oai_conf.xml
  3. Edit oai_conf.xml (the copy you just created)

Change:

  1. <repositoryName>
  2. <baseURL> [userX.ws.umdl.umich.edu]
  3. <adminEmail>
  4. <repositoryIdentifier> [userX.ws.umdl.umich.edu]
  5. <sampleIdentifier> [oai:userX.ws.umdl.umich.edu:MIU01-000053324]
  6. the list of sets and possible metadata formats
  7. Let's add some dlps sets for use later on. Add these under <ListSets>

Test the configuration with a few OAI requests. As before, replace "userX" with your workshop user ID.

      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=Identify
      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListSets
      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListMetadataFormats

        [ there should be one DC record in the table by default ]
      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc
      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:userX.ws.umdl.umich.edu:MIU01-12345678
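
If you want to script these checks rather than paste URLs into a browser, the request URLs can be assembled programmatically. The following is a small sketch in Python, not part of the toolkit; the helper name oai_request_url is hypothetical, and the base URL is the workshop placeholder used above.

```python
from urllib.parse import urlencode

# Placeholder base URL from the workshop; replace userX with your own ID.
BASE_URL = "http://userX.ws.umdl.umich.edu/cgi/o/oai/oai"


def oai_request_url(verb, **arguments):
    """Build an OAI-PMH request URL; the verb goes first, then other arguments."""
    params = [("verb", verb)] + list(arguments.items())
    # safe=":" keeps identifiers and setSpecs readable (oai:host:id, dlps:collid).
    return BASE_URL + "?" + urlencode(params, safe=":")


print(oai_request_url("Identify"))
print(oai_request_url("ListRecords", metadataPrefix="oai_dc", set="dlps"))
```

The sketch only builds the URL; fetching it (for example with urllib.request.urlopen) is left out so that it runs without network access.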
    

5) Load your data into the database

In this step we are going to load already formatted metadata (oai_dc first) using the loadOai.pl script. The data that is fed to this script for loading needs to be wrapped in a <records> element. Also, mirroring the OAI-PMH format, a <header> (containing the unique identifier) and a <metadata> element are required for each record.

Here is an example of that data:

    <?xml version="1.0" encoding="UTF-8"?>
    <records>
      <record>
        <header>
          <identifier>MIU01-000053324</identifier>
          <setSpec>mbooks:pd</setSpec>
        </header>
        <metadata>
          <oai_dc:dc> 
            [ YOUR oai_dc DATA HERE ]
          </oai_dc:dc>
        </metadata>
      </record>
      [ MORE RECORDS HERE ]
    </records>
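
If your metadata does not yet exist in this wrapper, a short script can generate it. The sketch below, in Python, builds the <records> structure shown above for a list of made-up records. The function name build_records and the sample title are assumptions; the two namespace URIs are the standard ones for oai_dc and Dublin Core, but we have not checked here which namespace declarations loadOai.pl itself requires.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for the oai_dc format.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def build_records(items):
    """Wrap (identifier, set_spec, title) tuples in the <records> load format."""
    records = ET.Element("records")
    for identifier, set_spec, title in items:
        record = ET.SubElement(records, "record")
        header = ET.SubElement(record, "header")
        ET.SubElement(header, "identifier").text = identifier
        ET.SubElement(header, "setSpec").text = set_spec
        metadata = ET.SubElement(record, "metadata")
        dc = ET.SubElement(metadata, "{%s}dc" % OAI_DC_NS)
        ET.SubElement(dc, "{%s}title" % DC_NS).text = title
    return records


root = build_records([("MIU01-000053324", "mbooks:pd", "An example title")])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```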
    
  1. Take a look at the sample data: $DLXSROOT/prep/s/sampleoai/oai_dc_samples/oai_sample.xml
  2. # cd $DLXSROOT/bin/o/oai/
  3. # ./loadOai.pl -d $DLXSROOT/prep/s/sampleoai/oai_dc_samples/
  4. Test to see if you have oai_dc records:
      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc
    
Here are more metadata format (marc21 and mods) examples:
      # ./loadOai.pl -d $DLXSROOT/prep/s/sampleoai/marc21_samples/
      # ./loadOai.pl -d $DLXSROOT/prep/s/sampleoai/mods_samples/

      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=marc21

      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=mods

    

loadOai.pl also allows you to force the records at the time of loading into a specified set.

      # ./loadOai.pl -d $DLXSROOT/prep/s/sampleoai/oai_dc_samples/ -s dlps

      http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlps

    

Tip: If your data is already in broker20, you can use the OAI harvester to collect it. Then change $recordXpath (shown below) so that the loader can read any OAI-PMH ListRecords response saved to a file.

      ## optional config -- xpath to find records
      my $recordXpath = "/OAI-PMH/ListRecords/record";
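
To illustrate what that path selects, here is a hedged sketch in Python that pulls the records out of a harvested ListRecords response. The response text is made up and trimmed, and the function name list_identifiers is hypothetical. Note that OAI-PMH responses are namespaced, so this sketch uses a namespace-aware lookup; how loadOai.pl itself deals with the namespace is not shown here.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# A trimmed, made-up ListRecords response of the kind a harvester would save.
RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:rec-1</identifier>
        <setSpec>dlps</setSpec>
      </header>
      <metadata/>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:rec-2</identifier>
      </header>
      <metadata/>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_identifiers(xml_text):
    """Return the header identifier of every record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    # Namespace-aware counterpart of /OAI-PMH/ListRecords/record; the root
    # element is <OAI-PMH> itself, so the path starts one level below it.
    records = root.findall("oai:ListRecords/oai:record", OAI_NS)
    return [rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
            for rec in records]


print(list_identifiers(RESPONSE))
```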
    

Converting data

The first slide of the UMProvider: Process Flows and Examples presentation shows the complete process flow for transforming a DLXS Text Class collection to oai_dc data and making it available from UMProvider. In this hands-on activity, we will cover the three main scripts necessary for the transformation. Those scripts are circled in red on the slide.

Instructions for DLXS Text Class to oai_dc Transformation Hands On Activity

  1. Enter: cd $DLXSROOT/bin/o/oai/provider

  2. View the contents of the collection configuration XML file by typing: more exampleColls.xml. Here you can see that we are going to process three collections: alajournals, conraditc, and emerson. These are the collection IDs (collids) used by collmgr. This configuration file is passed to every script to indicate which collections should be processed.

  3. Run: ./ExtractHeaders.pl -c exampleColls.xml. This will extract the header files from xpat. This allows us to capture only the metadata for each record instead of parsing the full text. NOTE: If there are problems running this script, you can copy the extracted header files by doing cp /l/l1-workshop/workshop-samples/kludewig/*headers.xml .

  4. Run: ls *headers.xml. You will see that there are three files: alajournals-headers.xml, conraditc-headers.xml, and emerson-headers.xml.

  5. Do: more alajournals-headers.xml to see the header data extracted from xpat.

  6. Do: more logs/log-2008-8-6.txt. This shows you the xpat commands used to create the header xml files. For example, for alajournals, you can see that we used the following 3 xpat commands:
          started xpat: $DLXSROOT/idx/a/alajournals/alajournals.dd
          executing:  pr.region.HEADER
          region HEADER
          executing: Stop
        
    If you had trouble running the ExtractHeaders.pl script, there will be no output in the log.
  7. Move these headers to the prep directory:
          mv *headers.xml $DLXSROOT/prep/o/oai/headers
        
  8. Next we're going to convert the records from Text Class format to oai_dc. At the command line, run:
          ./ConvertToDc.pl -c exampleColls.xml  -d $DLXSROOT/prep/o/oai/headers
        
  9. Do: more logs/log-2008-8-6.txt again. Use the spacebar to scroll down to the section titled "Results from ConvertToDc execution". This shows you the XSLT stylesheets invoked by the script. The snippet below shows that the alajournals-headers.xml file, the collection type (DLPS), the collid, and the language were passed to the textClassToDc.xsl stylesheet, which was processed by the XSLT processor xsltproc. This produced the output file alajournals-dc.xml.
          parsing dynamic collection alajournals
          executing xsltproc -o $DLXSROOT/prep/o/oai/provider/alajournals-dc.xml
          --param collid "'alajournals'" --param lang "'eng'" --param type
          "'DLPS'" textClassToDc.xsl
          $DLXSROOT/prep/o/oai/headers/alajournals-headers.xml
        

    Below is some example XSLT code from textClassToDc.xsl that maps the title from Text Class to the dc:title field.

        <xsl:for-each select="FILEDESC/SOURCEDESC/BIBLFULL/TITLESTMT/TITLE">
          <xsl:if test="normalize-space(.)">
            <dc:title>
              <xsl:apply-templates select="."/>
            </dc:title>
            <xsl:call-template name="lineBreak"/>
          </xsl:if>
        </xsl:for-each>
    	

  10. Do: more $DLXSROOT/prep/o/oai/provider/alajournals-dc.xml to see the transformed data.

  11. Now it's time to load the database. Run:
        ./LoadDB.pl -d $DLXSROOT/prep/o/oai/provider -c exampleColls.xml -p
        
  12. Now test the records in your dev space:
          http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListSets
        
    to see the list of sets in the repository. You'll see that there are 8 sets: dlps, dlpstext, dlps:collid (3), and dlpstext:collid (3). This set structure is optional. We chose to organize our sets this way so that a harvester could request all dlps collections or only the images or only the texts.

  13. You can view the records by collection:

          http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlps:alajournals
          http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlpstext:emerson
          http://userX.ws.umdl.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc&set=dlpstext:conraditc
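
For readers less familiar with XSLT, the title mapping shown in step 9 can be sketched in Python. This is only an illustration of the same idea, not a replacement for textClassToDc.xsl: the header fragment is made up (only the element path comes from the stylesheet), the function name titles_to_dc is hypothetical, and unlike the stylesheet this sketch also collapses runs of whitespace in the output.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Made-up header fragment; only the element path is taken from the stylesheet.
HEADER = """<HEADER>
  <FILEDESC>
    <SOURCEDESC>
      <BIBLFULL>
        <TITLESTMT>
          <TITLE>  Journal of   Examples </TITLE>
          <TITLE>   </TITLE>
        </TITLESTMT>
      </BIBLFULL>
    </SOURCEDESC>
  </FILEDESC>
</HEADER>"""


def titles_to_dc(header_xml):
    """Return one dc:title element per non-blank TITLE in the header."""
    header = ET.fromstring(header_xml)
    dc_titles = []
    for title in header.findall("FILEDESC/SOURCEDESC/BIBLFULL/TITLESTMT/TITLE"):
        # Gather all nested text and collapse whitespace.
        text = " ".join("".join(title.itertext()).split())
        if text:  # plays the role of test="normalize-space(.)"
            element = ET.Element("{%s}title" % DC_NS)
            element.text = text
            dc_titles.append(element)
    return dc_titles


print([t.text for t in titles_to_dc(HEADER)])
```

The blank second TITLE is skipped, which is the point of the normalize-space test in the stylesheet: empty source fields should not produce empty dc:title elements.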
        

Further Documentation

The UMProvider: Process Flows and Examples presentation contains a process flow diagram and examples of the oai_dc transformations. Slides 2 - 4 of the presentation provide examples of transforming other types of records (serial text, Bib Class, and Image Class) to oai_dc. See $DLXSROOT/bin/o/oai/provider/README.txt for detailed instructions. Additionally, each script has a -h option which can be used to display a usage message for that particular script.

Challenges Encountered

  1. Formatting of serial text collections

    To obtain the article-level metadata for the serial collections, we used the whole Text Class file in addition to the header files from xpat. We also needed to account for exceptions in how the volume and article data are organized so that our oai_dc data was cleanly formatted.

  2. Different identifiers between nameresolver and Text Class

    Our OAI identifiers cannot contain colons in the local part, because the colon is the delimiter in the oai:repositoryIdentifier:localIdentifier scheme. Some of our serial collections use colons in article identifiers (e.g., 0522508.0001.001:1), so we replaced them with dashes to be OAI-PMH compatible. There were also some instances where different identifier types were used (acc.no vs. dlps).

  3. UTF8 encoding/SGML characters

    Some of the older, static collections are coded in SGML instead of XML. Since these collections are not modified often, we used the Bib Class files for the transformation instead of the Text Class files.

  4. Sub-collections (e.g. LLMC)

    We have one collection with the Scholarly Publishing Office that has 150 sub-collections. Rather than list all of the sub-collections in the configuration XML file (exampleColls.xml in our demo), we list only the base collection with the collid llmc. The script will then process all of the sub-collections within the llmc directory, e.g. $DLXSROOT/obj/l/llmc/subcoll1, $DLXSROOT/obj/l/llmc/subcoll2, etc.

  5. Image record titles almost identical

    For some Image Class collections, the title, subject, and description are identical and the IDs are similar. To distinguish the records, we appended the view (e.g., front, back, side) to the title. The collection scltinteric is one such example.
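
The identifier cleanup described in challenge 2 can be sketched as follows. This is a minimal Python illustration, not the code used in our scripts: the function names are hypothetical and the repository name is the workshop placeholder. The second helper is our own addition for the sketch; it checks for a risk of this approach, namely that two different source identifiers could map onto the same cleaned identifier.

```python
def to_oai_identifier(local_id, repository="userX.ws.umdl.umich.edu"):
    """Replace colons in a local identifier with dashes and add the oai: prefix."""
    safe_id = local_id.replace(":", "-")
    return "oai:%s:%s" % (repository, safe_id)


def find_collisions(local_ids):
    """Return cleaned ids that more than one original id maps onto."""
    seen = {}
    for local_id in local_ids:
        seen.setdefault(local_id.replace(":", "-"), set()).add(local_id)
    return sorted(key for key, originals in seen.items() if len(originals) > 1)


print(to_oai_identifier("0522508.0001.001:1"))
print(find_collisions(["a:1", "a-1", "b:2"]))
```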

Other

Tips for Automation:

Look at: $DLXSROOT/bin/o/oai/provider/oai_update.pl
  1. Make a copy of the tables (otherwise the table is constantly changing and resumption tokens become invalid)
  2. Automate harvesting or converting data
  3. Automate loading data into the copied table
  4. Make a backup of the existing MySQL tables before replacing them with the updated copies
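
The copy, load and swap pattern in the steps above can be sketched in a few lines. This is an illustration only and does not reproduce oai_update.pl: it uses an in-memory SQLite database in place of MySQL, a reduced two-column oai table, and hypothetical table names (oai_new, oai_backup) and function name (refresh). In MySQL, the final swap could be done with a single RENAME TABLE statement instead of two ALTER TABLE statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oai (id TEXT PRIMARY KEY, oai_dc BLOB)")
conn.execute("INSERT INTO oai VALUES ('rec-1', 'old')")


def refresh(connection, new_rows):
    """Load new_rows into a copy of oai, then swap the copy into place."""
    cur = connection.cursor()
    cur.execute("DROP TABLE IF EXISTS oai_new")
    cur.execute("DROP TABLE IF EXISTS oai_backup")
    # 1. Work on a copy so the live table stays stable for harvesters.
    cur.execute("CREATE TABLE oai_new (id TEXT PRIMARY KEY, oai_dc BLOB)")
    cur.execute("INSERT INTO oai_new SELECT * FROM oai")
    # 2-3. Load the converted or harvested records into the copy.
    cur.executemany("INSERT OR REPLACE INTO oai_new VALUES (?, ?)", new_rows)
    # 4. Keep the old table as a backup, then put the copy in its place.
    cur.execute("ALTER TABLE oai RENAME TO oai_backup")
    cur.execute("ALTER TABLE oai_new RENAME TO oai")
    connection.commit()


refresh(conn, [("rec-1", "updated"), ("rec-2", "new")])
print(conn.execute("SELECT id, oai_dc FROM oai ORDER BY id").fetchall())
```

Because harvesters only ever read the live table, a long-running load does not shift records underneath a harvest that is paging through results with resumption tokens.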

Example Reports

The UMProvider: Process Flows and Examples presentation contains flow diagrams of the weekly automated processes for checking for updated records and new collections on slides 5 and 6.

View an example of the oai update report. The Perl script that generates the content of the report is at $DLXSROOT/bin/o/oai/provider/GenerateReport.pl. To change the email addresses to which the report is sent, you must edit $DLXSROOT/bin/o/oai/provider/text_ic_oai_cron.pl.

View an example of the new collection report. The Perl script containing the content and email addresses for this report is at $DLXSROOT/bin/o/oai/provider/GetNewCollections.pl.