NIHMS ETL Model

NIHMS ETL

License	The Apache Software License, Version 2.0
Categories	Data
GroupId	org.dataconservancy.pass
ArtifactId	nihms-etl-model
Last Version	1.4.1
Release Date	Apr 15, 2021
Type	jar
Description	NIHMS ETL Model NIHMS ETL

Download nihms-etl-model

Filename	Size
nihms-etl-model-1.4.1.pom
nihms-etl-model-1.4.1.jar	5 KB
nihms-etl-model-1.4.1-sources.jar	5 KB
nihms-etl-model-1.4.1-javadoc.jar	33 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.dataconservancy.pass/nihms-etl-model/ -->
<dependency>
    <groupId>org.dataconservancy.pass</groupId>
    <artifactId>nihms-etl-model</artifactId>
    <version>1.4.1</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.dataconservancy.pass/nihms-etl-model/
implementation 'org.dataconservancy.pass:nihms-etl-model:1.4.1'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.dataconservancy.pass/nihms-etl-model/
implementation ("org.dataconservancy.pass:nihms-etl-model:1.4.1")

Apache Buildr

'org.dataconservancy.pass:nihms-etl-model:jar:1.4.1'

Apache Ivy

<dependency org="org.dataconservancy.pass" name="nihms-etl-model" rev="1.4.1">
  <artifact name="nihms-etl-model" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.dataconservancy.pass', module='nihms-etl-model', version='1.4.1')
)

Scala SBT

libraryDependencies += "org.dataconservancy.pass" % "nihms-etl-model" % "1.4.1"

Leiningen

[org.dataconservancy.pass/nihms-etl-model "1.4.1"]

Dependencies

test (1)

Group / Artifact	Type	Version
junit : junit	jar	4.12

Project Modules

There are no modules declared in this project.

PASS NIHMS Submission ETL

The NIHMS Submission ETL contains the components required to download, transform, and load Submission information from NIHMS to PASS. The project includes two command line tools. The first uses the NIH API to download the CSV(s) containing compliant, non-compliant, and in-process publication information. The second reads those files, transforms the data to the PASS data model, and then loads it into PASS.

For background information on PACM, see the user guide. Limited information on using the PACM API can be found here.

NIHMS Data Harvest CLI

The NIHMS Data Harvest CLI uses the NIH API to download the PACM data.

Pre-requisites

The following are required to run this tool:

Java 8 or later
Download the latest nihms-data-harvest-cli-{version}-shaded.jar from the releases page and place it in a folder on the machine where the application will run.
Get an account for the NIH PACM website, and obtain an API key.
Create a data folder that files will be downloaded to.

Data Harvest Configuration

There are several ways to configure the Data Harvest CLI. You can use a configuration file, system properties, environment variables, or a combination of these. The configuration file sets system properties. In the absence of a config file, system properties will be used, and in the absence of those, environment variables will be used. Note that to use an environment variable, the system property name must be converted to upper case and its periods replaced with underscores. For example, to define nihmsetl.data.dir as an environment variable, use NIHMSETL_DATA_DIR.

Note: URL parameters specified using system properties named nihmsetl.api.url.param.<param name> cannot be specified as environment variables. They may only be specified as system properties or in the nihms-harvest.properties file.
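The naming rule and fallback order described above can be sketched in Java as follows. This is a minimal illustration, not the project's code: the class ConfigLookup and its methods are hypothetical names, and the default value passed in is chosen by the caller.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; class and method names are illustrative only.
final class ConfigLookup {

    // Convert a property key such as "nihmsetl.data.dir" to "NIHMSETL_DATA_DIR".
    static String toEnvName(String propertyKey) {
        return propertyKey.toUpperCase(Locale.ROOT).replace('.', '_');
    }

    // Prefer the system property, then the environment variable, then the default.
    static String resolve(String propertyKey, Map<String, String> env, String defaultValue) {
        String value = System.getProperty(propertyKey);
        if (value == null) {
            value = env.get(toEnvName(propertyKey));
        }
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) {
        System.out.println(toEnvName("nihmsetl.data.dir"));
        System.out.println(resolve("nihmsetl.data.dir", System.getenv(), "data"));
    }
}
```

Passing the environment in as a map keeps the lookup easy to exercise without changing the real process environment.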

By default, the application will look for a configuration file named nihms-harvest.properties in the folder containing the Java application. You can override the location of the properties file by defining the system property nihmsetl.harvester.configfile, e.g.

> java -Dnihmsetl.harvester.configfile=/path/to/configfile.properties -jar nihms-data-harvest-cli-1.0.0-SNAPSHOT-shaded.jar
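As a rough sketch of that lookup, the Java below picks the override path if the property is set, otherwise the default file name, and copies the file's entries into a target Properties object. The class HarvestConfigSketch is hypothetical, and the assumption that file entries overwrite existing values is an inference from the text, not confirmed behaviour.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical sketch; not the harvester's actual configuration loader.
final class HarvestConfigSketch {
    static final String CONFIG_KEY = "nihmsetl.harvester.configfile";
    static final String DEFAULT_NAME = "nihms-harvest.properties";

    // Use the -D override when present, otherwise the default file in the app folder.
    static Path configPath(Path appDir) {
        String override = System.getProperty(CONFIG_KEY);
        return override != null ? Paths.get(override) : appDir.resolve(DEFAULT_NAME);
    }

    // Copy every entry of the file into the target; do nothing if the file is absent.
    static void loadInto(Path file, Properties target) {
        if (!Files.isReadable(file)) {
            return;
        }
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            Properties fromFile = new Properties();
            fromFile.load(in);
            for (String key : fromFile.stringPropertyNames()) {
                target.setProperty(key, fromFile.getProperty(key));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Passing System.getProperties() makes the file entries visible as system properties.
        loadInto(configPath(Paths.get(".")), System.getProperties());
        System.out.println(System.getProperty("nihmsetl.api.host"));
    }
}
```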

The configuration file should look like this:

# Example properties file for nihms-data-harvest-cli
#
# Defines a folder to download CSV files to. If it doesn’t exist, it will create it for you.
# Will default to ./data in the folder the Java app runs in
# nihmsetl.data.dir=data
#

# NIH API hostname
nihmsetl.api.host = www.ncbi.nlm.nih.gov

# HTTP scheme
nihmsetl.api.scheme = https

# Currently the port is not used
# nihmsetl.api.port = 

# NIH API URL path
nihmsetl.api.path = /pmc/utils/pacm/

# Allow 30 seconds for a request to be read before timing out
nihmsetl.http.read-timeout-ms = 30000

# Allow 30 seconds for establishing connections before timing out
nihmsetl.http.connect-timeout-ms = 30000


# URL Parameters
#   Additional parameters may be added, and they will be included in the API URL as request parameters
#   Parameters may be added as 'nihmsetl.api.url.param.<parameter name>' where '<parameter name>' is the
#   URL request parameter

# Format ought to be CSV, otherwise the loader won't be able to process the saved files
nihmsetl.api.url.param.format = csv

# Institution name, unclear as to how it is used
nihmsetl.api.url.param.inst = JOHNS HOPKINS UNIVERSITY

#  IPF (Institutional Profile File) number, the unique ID assigned to a grantee organization in the eRA system. 
nihmsetl.api.url.param.ipf = 4134401

# The API token retrieved from the PACM website.  These expire every three months.
nihmsetl.api.url.param.api-token = XXXXXXX-XXXX-XXXX-XXXXXX

# Date in MM/YYYY format that the PACM data should start from (may be set using the `-s` harvester command line option).  By default this date will be set to the current month, one year ago 
# nihmsetl.api.url.param.pdf = 07/2018

# Date in MM/YYYY format that the PACM data should end at (leave commented to default to the current month)
# nihmsetl.api.url.param.pdt = 07/2019

# Undocumented, not used
# nihmsetl.api.url.param.rd =

# Undocumented, not used
# nihmsetl.api.url.param.filter =
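The properties above describe a scheme, host, path, and a set of nihmsetl.api.url.param.* request parameters. One way such a URL could be assembled is sketched below in Java. The class PacmUrlSketch is hypothetical, the scheme, host, and path are assumed to be present, and the harvester's real URL construction and parameter ordering may differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch of turning the configuration into a request URL.
final class PacmUrlSketch {
    static final String PARAM_PREFIX = "nihmsetl.api.url.param.";

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static String buildUrl(Properties props) {
        // Collect the request parameters; a TreeMap gives a stable, sorted order.
        Map<String, String> params = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PARAM_PREFIX)) {
                params.put(key.substring(PARAM_PREFIX.length()), props.getProperty(key));
            }
        }
        StringBuilder url = new StringBuilder()
            .append(props.getProperty("nihmsetl.api.scheme")).append("://")
            .append(props.getProperty("nihmsetl.api.host"))
            .append(props.getProperty("nihmsetl.api.path"));
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator).append(encode(p.getKey())).append('=').append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("nihmsetl.api.scheme", "https");
        props.setProperty("nihmsetl.api.host", "www.ncbi.nlm.nih.gov");
        props.setProperty("nihmsetl.api.path", "/pmc/utils/pacm/");
        props.setProperty("nihmsetl.api.url.param.format", "csv");
        props.setProperty("nihmsetl.api.url.param.inst", "JOHNS HOPKINS UNIVERSITY");
        System.out.println(buildUrl(props));
    }
}
```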

Running the Data Harvester

Once the Data Harvest CLI has been configured, there are a few additional options you can add when running from the command line.

By default, CSVs for all three publication statuses (compliant, non-compliant, and in-process) will be downloaded. To download only one or two of them, add the corresponding options at the command line:

-c, -compliant, --compliant - Download compliant publication CSV.
-p, -inprocess, --inprocess - Download in-process publication CSV.
-n, -noncompliant, --noncompliant - Download non-compliant publication CSV.

You can also specify a start date. By default, the PACM system sets the start date to one year prior to the date of the download. You can change this by adding a start date parameter:

-s, -startDate, --startDate - Return all records published since the date provided. The syntax is mm-yyyy.
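The start date handling can be illustrated with a short Java sketch. It parses the mm-yyyy value, falls back to one year before a given month when no value is supplied, and formats the result in the MM/YYYY form that the configuration file uses for nihmsetl.api.url.param.pdf. The class StartDateSketch is hypothetical, and the conversion between the two formats is an assumption about how the option maps to the parameter.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch; not the harvester's actual option parsing.
final class StartDateSketch {
    private static final DateTimeFormatter CLI_FORMAT = DateTimeFormatter.ofPattern("MM-uuuu");
    private static final DateTimeFormatter API_FORMAT = DateTimeFormatter.ofPattern("MM/uuuu");

    // Parse the -s value, or default to one year before the given month.
    static YearMonth resolveStartDate(String cliValue, YearMonth currentMonth) {
        if (cliValue == null || cliValue.isEmpty()) {
            return currentMonth.minusYears(1);
        }
        return YearMonth.parse(cliValue, CLI_FORMAT);
    }

    // Format a month the way the pdf/pdt properties are written in the config file.
    static String toApiParam(YearMonth month) {
        return month.format(API_FORMAT);
    }

    public static void main(String[] args) {
        YearMonth start = resolveStartDate(args.length > 0 ? args[0] : null, YearMonth.now());
        System.out.println(toApiParam(start));
    }
}
```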

So, for example, to download the compliant publications published since December 2012, you would do the following:

> java -jar nihms-data-harvest-cli-1.0.0-SNAPSHOT-shaded.jar -s 12-2012 -c

On running this command, files will be downloaded and renamed with a prefix according to their status ("compliant", "noncompliant", or "inprocess") and a timestamp integer e.g. noncompliant_nihmspubs_20180507104323.csv.
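The naming pattern can be reproduced with a few lines of Java. The class HarvestFileName is hypothetical, and the yyyyMMddHHmmss timestamp layout is inferred from the 14-digit example above rather than taken from the harvester's source.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of the downloaded file naming scheme.
final class HarvestFileName {
    // Assumed layout, inferred from the example 20180507104323.
    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // status is one of "compliant", "noncompliant", or "inprocess".
    static String build(String status, LocalDateTime downloadedAt) {
        return status + "_nihmspubs_" + downloadedAt.format(STAMP) + ".csv";
    }

    public static void main(String[] args) {
        System.out.println(build("noncompliant", LocalDateTime.of(2018, 5, 7, 10, 43, 23)));
    }
}
```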

NIHMS Data Transform-Load CLI

The NIHMS Data Transform-Load CLI reads data from the CSVs that were downloaded from the PACM system, converts it to PASS-compliant data, and loads it into the PASS database.

Pre-requisites

The following are required to run this tool:

Java 8 or later
Download the latest nihms-data-transform-load-cli-{version}-shaded.jar from the releases page and place it in a folder on the machine where the application will run.

Data Transform-Load Configuration

There are several ways to configure the Data Transform-Load CLI. You can use a configuration file, system properties, environment variables, or a combination of these. The configuration file sets system properties. In the absence of a config file, system properties will be used, and in the absence of those, environment variables will be used.

By default, the application will look for a configuration file named nihms-loader.properties in the folder containing the Java application. You can override the location of the properties file by defining the system property nihmsetl.loader.configfile, e.g.

> java -Dnihmsetl.loader.configfile=/path/to/configfile.properties -jar nihms-data-transform-load-cli-1.0.0-SNAPSHOT-shaded.jar

The configuration file should look like this:

nihmsetl.data.dir=/path/to/pass/loaders/data
nihmsetl.loader.cachepath=/path/to/pass/loaders/cache/compliant-cache.data
nihmsetl.repository.uri=https://example:8080/fcrepo/rest/repositories/aaa/bbb/ccc
nihmsetl.pmcurl.template=https://www.ncbi.nlm.nih.gov/pmc/articles/%s/
pass.fedora.baseurl=http://localhost:8080/fcrepo/rest/
pass.fedora.user=admin
pass.fedora.password=password
# Default value: http://localhost:9200/
pass.elasticsearch.url=http://localhost:9200/
# Default value: pass
pass.elasticsearch.indices=pass
pass.elasticsearch.limit=200

nihmsetl.data.dir is the path that the CSV files will be read from. If a path is not defined, the app will look for a /data folder in the folder containing the java app.
nihmsetl.loader.cachepath designates a path to a file that will be used to store a cache of completed compliant data so that it is not reprocessed. Note that this file can be deleted to force a complete recheck of the data. If a path is not defined, this will default to a file at /cache/compliant-cache.data in the folder containing the java app.
nihmsetl.repository.uri is the URI for the Repository resource in PASS that represents the PMC repository.
nihmsetl.pmcurl.template is the template URL used to construct the RepositoryCopy.accessUrl. The article's PMC ID is substituted for the %s placeholder in this URL.
pass.fedora.baseurl - Base URL for Fedora
pass.fedora.user - User name for Fedora access
pass.fedora.password - Password for Fedora access
pass.elasticsearch.url - Base URL for elasticsearch host
pass.elasticsearch.indices - Index target for elasticsearch
pass.elasticsearch.limit - Maximum number of results to return in an Elasticsearch query. This is optional; it defaults to 200 and typically will not need to be overridden.
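To illustrate the nihmsetl.pmcurl.template setting, the Java sketch below fills the %s placeholder with an identifier. The class PmcUrlSketch is hypothetical, and PMC1234567 is a made-up identifier used only for demonstration.

```java
// Hypothetical sketch of filling the access URL template.
final class PmcUrlSketch {

    // Substitute the identifier for the single %s placeholder in the template.
    static String accessUrl(String template, String pmcId) {
        return String.format(template, pmcId);
    }

    public static void main(String[] args) {
        String template = "https://www.ncbi.nlm.nih.gov/pmc/articles/%s/";
        // "PMC1234567" is an invented example value.
        System.out.println(accessUrl(template, "PMC1234567"));
    }
}
```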

Running the Data Transform-Load

Once the Data Transform-Load CLI has been configured, there are a few additional options you can add when running from the command line.

By default, CSVs for all three publication statuses (compliant, non-compliant, and in-process) will be processed. To process only one or two of them, add the corresponding options at the command line:

-c, -compliant, --compliant - Process compliant publication CSV.
-p, -inprocess, --inprocess - Process in-process publication CSV.
-n, -noncompliant, --noncompliant - Process non-compliant publication CSV.

So, for example, to process non-compliant spreadsheets only:

> java -jar nihms-data-transform-load-cli-1.0.0-SNAPSHOT-shaded.jar -n

When run, each row will be loaded into the application and new Publications, Submissions, and RepositoryCopies will be created in PASS as needed. The application will also update any Deposit.repositoryCopy links where a new one is discovered. Once a CSV file has been processed, it will be renamed with a suffix of ".done" e.g. noncompliant_nihmspubs_20180507104323.csv.done. To re-process the file, simply rename it to remove the .done suffix and re-run the application.
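If several files need to be re-processed, the renaming can be scripted. The Java sketch below strips the .done suffix from every *.csv.done file in a data folder. The class ResetDoneFiles is a hypothetical helper and is not part of the CLI.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for re-queuing processed CSV files.
final class ResetDoneFiles {
    private static final String SUFFIX = ".done";

    // Rename each "*.csv.done" file back to "*.csv"; returns the number renamed.
    static int reset(Path dataDir) {
        try {
            // Collect matches first so the directory is not modified while it is being listed.
            List<Path> doneFiles = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dataDir, "*.csv.done")) {
                for (Path file : stream) {
                    doneFiles.add(file);
                }
            }
            for (Path file : doneFiles) {
                String name = file.getFileName().toString();
                String restored = name.substring(0, name.length() - SUFFIX.length());
                Files.move(file, file.resolveSibling(restored));
            }
            return doneFiles.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "data");
        System.out.println(reset(dataDir) + " file(s) renamed");
    }
}
```

Files.move fails if a file with the restored name already exists, which avoids silently overwriting a newer download.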

Versions

Version
1.4.1 Apr 15, 2021
1.4.0 Aug 5, 2019
1.3.1 May 1, 2019
