bioresources
Data resources from the biomedical domain
Information for developers
Extending grounding resource files
The src/main/resources/org/clulab/reach/kb
folder contains a number of tab separated value (TSV) files which contain grounding entries. Several of these files have corresponding automated update scripts in the scripts
folder. If an update script exists, the corresponding TSV file should not be manually edited, rather, changes should be integrated by changing and running the script.
Most TSV files contain primary grounding entries from a given source. Additionally, the NER-Grounding-Override.tsv
file contains manually curated groundings that are used to apply overrides.
Note that the files are version-controlled in a gzipped form and therefore need to be decompressed for editing and then compressed again for checking in to version control.
Note also that if a new TSV file or a new entity type needs to be added. it requires corresponding changes in several places in the Reach code base. For an example of changes that needed to be made in Reach when adding mesh_disease.tsv
to bioresources, see https://github.com/clulab/reach/pull/686/files.
Updating the NER files
Once edits have been made to one or more files in the kb
folder, the NER files need to be regenerated. For this, reach
needs to be cloned in the same parent folder in which the bioresources
repo was cloned. Then, the ner_kb.sh
script needs to be run which converts the KBs in org.clulab.reach.kb
into the format expected by the BioNLPProcessor NER. Please re-run this script everytime a grounding file changes, or when the tokenization algorithm changes in BioNLPProcessor.
The ner_kb.sh
script uses ner_kb.config
as a configuration input. If only a small number of KBs were modified, edit the file and keep only the modified KBs to avoid re-generating all KBs. The config file also controls what organisms' gene/protein synonyms should be included in the NER resources. By default, only human proteins are included but additional organism names can be listed in ner_kb.config
to extend NER to these organisms.
Testing bioresources updated with Reach
To test changes in bioresources, first, bioresources need to be built using sbt
as follows:
sbt publishLocal
then check version.sbt
to get the current published version of bioresources, typically something like x.x.x-SNAPSHOT
. Then navigate to the reach repo, edit processors/build.sbt
and change the bioresources version to the one published. This will result in Reach using the locally published bioresources.
It is also possible to automatically build a custom branch of bioresources and Reach, and then run Reach tests using Docker. This is documented here: https://github.com/clulab/reach/tree/master/docker
.