OAI Harvesting System

Overview

DLXS hosts the OAIster project, as well as other OAI test portals (MODS, Aquifer, DLF). This document describes how to run the harvester that gathers the records used in these portals, and how to transform the harvested records into Bibliographic Class (BibClass) format.

For more information on OAI and the Protocol for Metadata Harvesting (OAI-PMH), see the official site. For best practices related to OAI, see the Best Practices wiki.

To download just the harvester and transform engine, see the OAI Tools Package.

Harvester (UMHarvester)

N.B.: To use the harvester in your system, you may have to make changes to the Global Parameters located at the beginning of the UMHarvester script.

To start the harvester, run ./UMHarvester from within /$DLXSROOT/bin/o/oaister/scripts/.

These flags let you perform harvesting:

The Batch_UMHarvest file is used to run automated incremental harvests on repositories. See the /$DLXSROOT/bin/o/oaister/scripts/Batch_UMHarvest_sample file for an example.

# Each entry: [repository id, set, metadata format, run specification]
my @Monday = (
    ['uiucimages', 'ALA', 'oai_dc', 'dr'],
);

For each repository you wish to batch harvest, add an entry with your own repository id, set, metadata format, and run specification (r to run OAITransform, dr to skip it). Batch_UMHarvest performs an incremental harvest starting from your last harvest, as recorded in the .log file for that repository id.

To use it, rename Batch_UMHarvest_sample to Batch_UMHarvest. To start a batch harvest, run
./Batch_UMHarvest -d M &
from within /$DLXSROOT/bin/o/oaister/scripts/. This harvests every repository id in the "M" (Monday) batch harvest group.

Transform engine (OAITransform)

OAITransform creates a concatenated BibClass file of all oai_dc records for each repository. To start the transform tool, use ./oaitransform/OAITransform [repository_id] from within /$DLXSROOT/bin/o/oaister/oaitransform/

Replace [repository_id] with the id of the repository you want to transform. This id is taken from repository_table.txt, which you will build using repository_table.sample.txt as your starting point. For example:
./oaitransform/OAITransform celebration
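
If several repositories need transforming, the single-id invocation shown above can be wrapped in a small shell loop. The following is only a sketch: transform_all is a hypothetical helper that is not part of the DLXS distribution, and it assumes OAITransform accepts one repository id per invocation, as in the example above. It is written as a dry run that prints each command instead of executing it.

```shell
#!/bin/sh
# Dry-run sketch: print one OAITransform command per repository id.
# transform_all is a hypothetical helper, not part of the DLXS distribution.
transform_all() {
    for repo in "$@"; do
        # Remove the leading "echo" to actually run the transform.
        echo ./oaitransform/OAITransform "$repo"
    done
}

# Example ids taken from this document.
transform_all celebration bristol
```

Keeping the echo in place makes it easy to check the generated command lines against repository_table.txt before anything is run.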

The transform program processes your harvested oai_dc files in two steps: it first concatenates them into raw files, and then transforms the raw files into BibClass files. The /$DLXSROOT/bin/o/oaister/oaitransform/oai-bibclass3.xsl file performs the mapping from oai_dc to BibClass.

The repository report printed at the end of the transform summarizes the results: record counts, the success rate, and any conditioning, normalization, or parse problems. For example:

Repository Report: bristol
        records with URLs       = 818
        records without URLs    = 5
        repository records      = 823
        success rate            = 99.39%
        ------------------------
        data conditioning msgs? = YES!
        deleted records (.del)  = 0
        normalization errors    = 2
        raw parse failures      = 0
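
The figures in this sample report are consistent with the success rate being the share of repository records that have URLs (818 of 823), and with the two record counts summing to the total (818 + 5 = 823). That reading is inferred from the numbers above rather than from the OAITransform source, so treat the following as a sketch for checking a report by hand; success_rate is a hypothetical helper name.

```shell
#!/bin/sh
# success_rate is a hypothetical helper: percentage of records that have URLs.
# Usage: success_rate RECORDS_WITH_URLS TOTAL_RECORDS
success_rate() {
    awk -v with_urls="$1" -v total="$2" 'BEGIN {
        # Guard against an empty repository to avoid dividing by zero.
        if (total <= 0) { print "n/a"; exit }
        printf "%.2f%%\n", 100 * with_urls / total
    }'
}

# Figures from the bristol report above; prints 99.39%
success_rate 818 823
```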

For questions on how to transform MODS records, please contact Kat Hagedorn at khage at umich dot edu.