Terrier Spark (Scala)

A Spark wrapper for Terrier

GroupId: org.terrier
ArtifactId: terrier-spark
Last Version: 0.0.1
Type: jar
Description: Terrier Spark (Scala), a Spark wrapper for Terrier
Project URL: https://github.com/terrier-org/terrier-spark/
Project Organization: University of Glasgow
Source Code Management: https://github.com/terrier-org/terrier-spark


How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/org.terrier/terrier-spark/ -->
<dependency>
    <groupId>org.terrier</groupId>
    <artifactId>terrier-spark</artifactId>
    <version>0.0.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/org.terrier/terrier-spark/
implementation 'org.terrier:terrier-spark:0.0.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/org.terrier/terrier-spark/
implementation ("org.terrier:terrier-spark:0.0.1")

Buildr:

'org.terrier:terrier-spark:jar:0.0.1'

Ivy:

<dependency org="org.terrier" name="terrier-spark" rev="0.0.1">
  <artifact name="terrier-spark" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
    @Grab(group='org.terrier', module='terrier-spark', version='0.0.1')
)

sbt:

libraryDependencies += "org.terrier" % "terrier-spark" % "0.0.1"

Leiningen:

[org.terrier/terrier-spark "0.0.1"]

Dependencies

compile (8)

Group                        Artifact              Type  Version
org.apache.commons           commons-collections4  jar   4.1
org.apache.spark             spark-mllib_2.11      jar   2.1.0
org.terrier                  terrier-rest-client   jar   5.0
org.terrier                  terrier-core          jar   5.0
org.terrier                  terrier-learning      jar   5.0
org.apache.commons           commons-math3         jar   3.4.1
com.github.bruneli.scalaopt  scalaopt-core_2.11    jar   0.2
org.terrier                  terrier-concurrent    jar   5.0

test (1)

Group          Artifact        Type  Version
org.scalatest  scalatest_2.11  jar   2.2.1

Project Modules

There are no modules declared in this project.

Terrier-Spark

Terrier-Spark is a Scala library for Apache Spark that allows the Terrier.org information retrieval platform to be used from within Spark.

To use it within a Jupyter notebook, Apache Toree must be installed and working.

Requirements:

  • Terrier 5.0
  • Apache Spark version 2.0 or newer
  • Jupyter & Apache Toree (optional)

Functionality

  • Retrieving a run from a Terrier index (local or remote)
  • Evaluating a run
  • Optimising the parameter of a retrieval run on a local index
  • Grid searching the parameter of a retrieval run on a local index
  • Learning a model using learning-to-rank

For known improvements/issues, see TODO.md
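To illustrate the idea behind grid searching a retrieval parameter, here is a minimal sketch in plain Scala. It does not use terrier-spark's own classes: `GridSearchSketch`, `gridSearch` and `linearGrid` are hypothetical names introduced for this example, and the `evaluate` function stands in for "run retrieval with this parameter value and return an effectiveness score".

```scala
object GridSearchSketch {
  // Score every candidate value and return the best one together with its score.
  // `evaluate` is an assumption: in practice it would run retrieval and evaluation.
  def gridSearch[P](candidates: Seq[P])(evaluate: P => Double): (P, Double) = {
    require(candidates.nonEmpty, "grid must contain at least one candidate")
    candidates.map(p => p -> evaluate(p)).maxBy(_._2)
  }

  // Evenly spaced grid from lo to hi (inclusive), split into `steps` intervals.
  def linearGrid(lo: Double, hi: Double, steps: Int): Seq[Double] = {
    require(steps > 0, "steps must be positive")
    (0 to steps).map(i => lo + (hi - lo) * i / steps)
  }
}
```

With a toy objective such as `b => -math.pow(b - 0.4, 2)`, calling `GridSearchSketch.gridSearch(GridSearchSketch.linearGrid(0.0, 1.0, 10))(...)` selects the grid point closest to 0.4. A real experiment would replace the toy objective with a per-setting retrieval run followed by evaluation.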

Example

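// Imports for the Terrier and terrier-spark classes are omitted here.
// terrierHome, topicsFile, qrelsFile and model are placeholders that must be defined beforehand.
// toDF needs spark.implicits._, which spark-shell imports automatically.
val indexref = IndexRef.of("/path/to/index/data.properties")

val props = Map(
    "terrier.home" -> terrierHome)

TopicSource.configureTerrier(props)
// read the TREC topics into a dataframe with columns qid and query
val topics = TopicSource.extractTRECTopics(topicsFile)
    .toList.toDF("qid", "query")

val queryTransform = new QueryingTransformer()
    .setTerrierProperties(props)
    .setIndexReference(indexref)
    .setSampleModel(model)

val r1 = queryTransform.transform(topics)
// r1 is a dataframe with results for the queries in topics
val qrelTransform = new QrelTransformer()
    .setQrelsFile(qrelsFile)

val r2 = qrelTransform.transform(r1)
// r2 is a dataframe like r1, but also includes a label column
// compute NDCG at rank cutoff 20 for each query
val ndcg = new RankingEvaluator(Measure.NDCG, 20).evaluateByQuery(r2).toList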
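The last two steps of the example attach relevance labels to the results and then compute NDCG per query. As a rough sketch of what those steps calculate, the following plain Scala code labels a run from TREC-style qrels lines ("qid iteration docno relevance") and computes NDCG at a cutoff. It is independent of terrier-spark: `NdcgSketch`, `Result`, `parseQrels`, `dcgAt` and `ndcgByQuery` are hypothetical names, and the formulation (linear gain, log2(rank + 1) discount, unjudged documents treated as label 0, non-negative labels) is one common variant that may differ in detail from what `RankingEvaluator` implements.

```scala
object NdcgSketch {
  // One row of a run: a document retrieved for a query with a retrieval score.
  final case class Result(qid: String, docno: String, score: Double)

  // Parse qrels lines of the form "qid iteration docno relevance"; malformed lines are skipped.
  def parseQrels(lines: Seq[String]): Map[(String, String), Int] =
    lines.flatMap { line =>
      line.trim.split("\\s+") match {
        case Array(qid, _, docno, rel) => Some((qid, docno) -> rel.toInt)
        case _                         => None
      }
    }.toMap

  // DCG over the first k labels: linear gain, discount log2(rank + 1) with rank starting at 1.
  def dcgAt(labels: Seq[Int], k: Int): Double =
    labels.take(k).zipWithIndex.map { case (rel, i) =>
      rel / (math.log(i + 2) / math.log(2))
    }.sum

  // NDCG@k per query: DCG of the ranking divided by DCG of the ideal ordering of judged documents.
  def ndcgByQuery(run: Seq[Result], qrels: Map[(String, String), Int], k: Int): Map[String, Double] =
    run.groupBy(_.qid).map { case (qid, rows) =>
      // Label each retrieved document; unjudged documents get label 0 (an assumption).
      val labels = rows.sortBy(-_.score).map(r => qrels.getOrElse((qid, r.docno), 0))
      val ideal = qrels.collect { case ((q, _), rel) if q == qid => rel }.toSeq.sortBy(-_)
      val idcg = dcgAt(ideal, k)
      qid -> (if (idcg == 0.0) 0.0 else dcgAt(labels, k) / idcg)
    }
}
```

For example, with qrels lines `"q1 0 d1 2"`, `"q1 0 d2 0"` and `"q1 0 d3 1"`, a run that ranks d1, d3, d2 in that order is an ideal ranking, so `ndcgByQuery(run, qrels, 20)("q1")` is 1.0; ranking the non-relevant d2 first lowers the value.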

More examples are provided in the example notebooks, or in our SIGIR 2018 demo paper [1].

Use from the Spark Shell

$ spark-shell --packages org.terrier:terrier-spark:0.0.1-SNAPSHOT

Use within a Jupyter Notebook

Firstly, make sure you have a working installation of Toree. Next, import Terrier and terrier-spark using some %AddDeps "magic":

%AddDeps org.terrier terrier-core 5.0 --transitive --exclude org.slf4j:slf4j-log4j12  
%AddDeps org.terrier terrier-spark 0.0.1-SNAPSHOT --repository file:/home/user/.m2/repository --transitive

You can then use the terrier-spark code directly in your Scala notebooks.

Several example notebooks are provided with the project.

Bibliography

If you use this software, please cite one of:

  1. Combining Terrier with Apache Spark to create agile experimental information retrieval pipelines. Craig Macdonald. In Proceedings of SIGIR 2018.

  2. Agile Information Retrieval Experimentation with Terrier Notebooks. Craig Macdonald, Richard McCreadie, Iadh Ounis. In Proceedings of DESIRES 2018.

Credits

Developed by Craig Macdonald, University of Glasgow


Versions

Version
0.0.1