NIHMS ETL Utilities

NIHMS ETL

License

License

Categories

Categories

Data
GroupId

GroupId

org.dataconservancy.pass
ArtifactId

ArtifactId

nihms-etl-util
Last Version

Last Version

1.4.1
Release Date

Release Date

Type

Type

jar
Description

Description

NIHMS ETL Utilities
NIHMS ETL

Download nihms-etl-util

How to add to project

<!-- https://jarcasting.com/artifacts/org.dataconservancy.pass/nihms-etl-util/ -->
<dependency>
    <groupId>org.dataconservancy.pass</groupId>
    <artifactId>nihms-etl-util</artifactId>
    <version>1.4.1</version>
</dependency>
// https://jarcasting.com/artifacts/org.dataconservancy.pass/nihms-etl-util/
implementation 'org.dataconservancy.pass:nihms-etl-util:1.4.1'
// https://jarcasting.com/artifacts/org.dataconservancy.pass/nihms-etl-util/
implementation ("org.dataconservancy.pass:nihms-etl-util:1.4.1")
'org.dataconservancy.pass:nihms-etl-util:jar:1.4.1'
<dependency org="org.dataconservancy.pass" name="nihms-etl-util" rev="1.4.1">
  <artifact name="nihms-etl-util" type="jar" />
</dependency>
@Grapes(
@Grab(group='org.dataconservancy.pass', module='nihms-etl-util', version='1.4.1')
)
libraryDependencies += "org.dataconservancy.pass" % "nihms-etl-util" % "1.4.1"
[org.dataconservancy.pass/nihms-etl-util "1.4.1"]

Dependencies

compile (2)

Group / Artifact Type Version
org.dataconservancy.pass : pass-model jar 0.8.0
org.slf4j : slf4j-api jar 1.7.25

test (2)

Group / Artifact Type Version
junit : junit jar 4.12
ch.qos.logback : logback-classic jar 1.2.3

Project Modules

There are no modules declared in this project.

PASS NIHMS Submission ETL

The NIHMS Submission ETL contains the components required to download, transform and load Submission information from NIHMS to PASS. The project includes two command line tools. The first uses the NIH API to download the CSV(s) containing compliant, non-compliant, and in-process publication information. The second tool reads those files, transforms the data to the PASS data model and then loads them to PASS.

For background information on PACM, see the user guide. Limited information on using the PACM API can be found here.

NIHMS Data Harvest CLI

The NIHMS Data Harvest CLI uses the NIH API to download the PACM data.

Pre-requisites

The following are required to run this tool:

  • Java 8 or later
  • Download the latest nihms-data-harvest-cli-{version}-shaded.jar from the releases page and place in a folder on the machine where the application will run.
  • Get an account for the NIH PACM website, and obtain an API key.
  • Create a data folder that files will be downloaded to.

Data Harvest Configuration

There are several ways to configure the Data Harvest CLI. You can use a configuration file, environment variables, system variables, or a combination of these. The configuration file will set system properties. In the absence of a config file, system properties will be used, and in the absence of those, environment variables will be used. Note that to use environment variables, the system property name must be converted to upper case, and the periods replaced with underscores. For example, to define nihmsetl.data.dir as an environment variable, use NIHMSETL_DATA_DIR instead.

Note: URL parameters specified using system properties named nihmsetl.api.url.param.<param name> cannot be specified as environment variables. They may only be specified as system properties or in the nihms-harvest.properties file.

By default, the application will look for a configuration file named nihms-harvest.properties in the folder containing the java application. You can override the location of the properties file by defining an environment variable for nihmsetl.harvester.configfile e.g.

> java -Dnihmsetl.harvester.configfile=/path/to/configfile.properties -jar nihms-data-harvest-cli-1.0.0-SNAPSHOT-shaded.jar 

The configuration file should look like this:

# Example properties file for nihms-data-harvest-cli
#
# Defines a folder to download CSV files to. If it doesn’t exist, it will create it for you.
# Will default to ./data in the folder the Java app runs in
# nihmsetl.data.dir=data
#

# NIH API hostname
nihmsetl.api.host = www.ncbi.nlm.nih.gov

# HTTP scheme
nihmsetl.api.scheme = https

# Currently the port is not used
# nihmsetl.api.port = 

# NIH API URL path
nihmsetl.api.path = /pmc/utils/pacm/

# Allow 30 seconds for a request to be read before timing out
nihmsetl.http.read-timeout-ms = 30000

# Allow 30 seconds for establishing connections before timing out
nihmsetl.http.connect-timeout-ms = 30000


# URL Parameters
#   Additional parameters may be added, and they will be included in the API URL as request parameters
#   Parameters may be added as 'nihmsetl.api.url.param.<parameter name>' where '<parameter name>' is the
#   URL request parameter

# Format ought to be CSV, otherwise the loader won't be able to process the saved files
nihmsetl.api.url.param.format = csv

# Institution name, unclear as to how it is used
nihmsetl.api.url.param.inst = JOHNS HOPKINS UNIVERSITY

#  IPF (Institutional Profile File) number, the unique ID assigned to a grantee organization in the eRA system. 
nihmsetl.api.url.param.ipf = 4134401

# The API token retrieved from the PACM website.  These expire every three months.
nihmsetl.api.url.param.api-token = XXXXXXX-XXXX-XXXX-XXXXXX

# Date in MM/YYYY format that the PACM data should start from (may be set using the `-s` harvester command line option).  By default this date will be set to the current month, one year ago 
# nihmsetl.api.url.param.pdf = 07/2018

# Date in MM/YYYY format that the PACM data should end at (leave commented to default to the current month)
# nihmsetl.api.url.param.pdt = 07/2019

# Undocumented, not used
# nihmsetl.api.url.param.rd =

# Undocumented, not used
# nihmsetl.api.url.param.filter =

Running the Data Harvester

Once the Data Harvest CLI has been configured, there are a few additional options you can add when running from the command line.

By default all 3 publication statuses - compliant, non-compliant, and in-process CSVs will be downloaded. To download one or two of them, you can add them individually at the command line:

  • -c, -compliant, --compliant - Download compliant publication CSV.
  • -p, -inprocess, --inprocess - Download in-process publication CSV.
  • -n, -noncompliant, --noncompliant - Download non-compliant publication CSV.

You can also specify a start date, by default the PACM system sets the start date to 1 year prior to the date of the download. You can change this by adding a start date parameter:

  • -s, -startDate --startDate - This will return all records published since the date provided. The syntax is mm-yyyy.

So, for example, to download the compliant publications published since December 2012, you would do the following:

> java -jar nihms-data-harvest-cli-1.0.0-SNAPSHOT-shaded.jar -s 12-2012 -c

On running this command, files will be downloaded and renamed with a prefix according to their status ("compliant", "noncompliant", or "inprocess") and a timestamp integer e.g. noncompliant_nihmspubs_20180507104323.csv.

NIHMS Data Transform-Load CLI

The NIHMS Data Transform-Load CLI reads data in from CSVs that were downloaded from the PACM system, converts them to PASS compliant data and loads them into the PASS database.

Pre-requisites

The following is required to run this tool:

  • Java 8 or later
  • Download latest nihms-data-transform-load-cli-{version}-shaded.jar from the releases page and place in a folder on the machine where the application will run.

Data Transform-Load Configuration

There are several ways to configure the Data Transform-Load CLI. You can use a configuration file, environment variables, system variables, or a combination of these. The configuration file will set system properties. In the absence of a config file, system properties will be used, and in the absence of those, environment variables will be used.

By default, the application will look for a configuration file named nihms-loader.properties in the folder containing the java application. You can override the location of the properties file by defining an environment variable for nihmsetl.harvester.configfile e.g.

> java -Dnihmsetl.loader.configfile=/path/to/configfile.properties -jar nihms-data-transform-load-cli-1.0.0-SNAPSHOT-shaded.jar 

The configuration file should look like this:

nihmsetl.data.dir=/path/to/pass/loaders/data
nihmsetl.loader.cachepath=/path/to/pass/loaders/cache/compliant-cache.data
nihmsetl.repository.uri=https://example:8080/fcrepo/rest/repositories/aaa/bbb/ccc
nihmsetl.pmcurl.template=https://www.ncbi.nlm.nih.gov/pmc/articles/%s/
pass.fedora.baseurl=http://localhost:8080/fcrepo/rest/
pass.fedora.user=admin
pass.fedora.password=password
pass.elasticsearch.url=http://localhost:9200/ (default value)
pass.elasticsearcg.indices=pass (default value)
pass.elasticsearch.limit=200
  • nihmsetl.data.dir is the path that the CSV files will be read from. If a path is not defined, the app will look for a /data folder in the folder containing the java app.
  • nihmsetl.loader.cachepath designates a path to a file that will be used to store a cache of completed compliant data so that it is not reprocessed. Note that this file can be deleted to force a complete recheck of the data. If a path is not defined, this will default to a file at /cache/compliant-cache.data in the folder containing the java app.
  • nihmsetl.repository.uri the URI for the Repository resource in PASS that represents the PMC repository.
  • nihmsetl.pmcurl.template is the template URL used to construct the RepositoryCopy.accessUrl. The article PMC is passed into this URL.
  • pass.fedora.baseurl - Base URL for Fedora
  • pass.fedora.user - User name for Fedora access
  • pass.fedora.password - Password for Fedora access
  • pass.elasticsearch.url - Base URL for elasticsearch host
  • pass.elasticsearch.indices - Index target for elasticsearch
  • pass.elasticsearch.limit - Maximum number of results to return in an Elasticsearch query. This is optional, it defaults to 200 and typically will not need to be overridden.

Running the Data Transform-Load

Once the Data Transform-Load CLI has been configured, there are a few additional options you can add when running from the command line.

By default all 3 publication statuses - compliant, non-compliant, and in-process CSVs will be downloaded. To download one or two of them, you can add them individually at the command line:

  • -c, -compliant, --compliant - Download compliant publication CSV.
  • -p, -inprocess, --inprocess - Download in-process publication CSV.
  • -n, -noncompliant, --noncompliant - Download non-compliant publication CSV.

So, for example, to process non-compliant spreadsheets only:

> java -jar nihms-data-transform-load-cli-1.0.0-SNAPSHOT-shaded.jar -n

When run, each row will be loaded into the application and new Publications, Submissions, and RepositoryCopies will be created in PASS as needed. The application will also update any Deposit.repositoryCopy links where a new one is discovered. Once a CSV file has been processed, it will be renamed with a suffix of ".done" e.g. noncompliant_nihmspubs_20180507104323.csv.done. To re-process the file, simply rename it to remove the .done suffix and re-run the application.

org.dataconservancy.pass

Versions

Version
1.4.1
1.4.0
1.3.1