spark-stringmetric


License

License

MIT
GroupId

GroupId

com.github.mrpowers
ArtifactId

ArtifactId

spark-stringmetric_2.11
Last Version

Last Version

0.3.0
Release Date

Release Date

Type

Type

jar
Description

Description

spark-stringmetric
spark-stringmetric
Project URL

Project URL

https://github.com/MrPowers/spark-stringmetric
Project Organization

Project Organization

com.github.mrpowers
Source Code Management

Source Code Management

https://github.com/MrPowers/spark-stringmetric

Download spark-stringmetric_2.11

How to add to project

<!-- https://jarcasting.com/artifacts/com.github.mrpowers/spark-stringmetric_2.11/ -->
<dependency>
    <groupId>com.github.mrpowers</groupId>
    <artifactId>spark-stringmetric_2.11</artifactId>
    <version>0.3.0</version>
</dependency>
// https://jarcasting.com/artifacts/com.github.mrpowers/spark-stringmetric_2.11/
implementation 'com.github.mrpowers:spark-stringmetric_2.11:0.3.0'
// https://jarcasting.com/artifacts/com.github.mrpowers/spark-stringmetric_2.11/
implementation ("com.github.mrpowers:spark-stringmetric_2.11:0.3.0")
'com.github.mrpowers:spark-stringmetric_2.11:jar:0.3.0'
<dependency org="com.github.mrpowers" name="spark-stringmetric_2.11" rev="0.3.0">
  <artifact name="spark-stringmetric_2.11" type="jar" />
</dependency>
@Grapes(
@Grab(group='com.github.mrpowers', module='spark-stringmetric_2.11', version='0.3.0')
)
libraryDependencies += "com.github.mrpowers" % "spark-stringmetric_2.11" % "0.3.0"
[com.github.mrpowers/spark-stringmetric_2.11 "0.3.0"]

Dependencies

compile (1)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.11.12

provided (2)

Group / Artifact Type Version
org.apache.commons : commons-text jar 1.1
org.apache.spark : spark-sql_2.11 jar 2.4.4

test (3)

Group / Artifact Type Version
com.github.mrpowers : spark-fast-tests_2.11 jar 0.21.3
com.github.mrpowers : spark-daria_2.11 jar 0.38.2
org.scalatest : scalatest_2.11 jar 3.0.1

Project Modules

There are no modules declared in this project.

spark-stringmetric

String similarity functions and phonetic algorithms for Spark.

See ceja if you're using the PySpark API.

Project Setup

Update your build.sbt file to import the libraries.

libraryDependencies += "org.apache.commons" % "commons-text" % "1.1"
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.3.0"

You can find the spark-daria Scala 2.11 versions here and the Scala 2.12 versions here.

You should generally use Scala 2.11 with Spark 2 and Scala 2.12 with Spark 3.

SimilarityFunctions

  • cosine_distance
  • fuzzy_score
  • hamming
  • jaccard_similarity
  • jaro_winkler

How to import the functions.

import com.github.mrpowers.spark.stringmetric.SimilarityFunctions._

Here's an example on how to use the jaccard_similarity function.

Suppose we have the following sourceDF:

+-------+-------+
|  word1|  word2|
+-------+-------+
|  night|  nacht|
|context|contact|
|   null|  nacht|
|   null|   null|
+-------+-------+

Let's run the jaccard_similarity function.

val actualDF = sourceDF.withColumn(
  "w1_w2_jaccard",
  jaccard_similarity(col("word1"), col("word2"))
)

We can run actualDF.show() to view the w1_w2_jaccard column that's been appended to the DataFrame.

+-------+-------+-------------+
|  word1|  word2|w1_w2_jaccard|
+-------+-------+-------------+
|  night|  nacht|         0.43|
|context|contact|         0.57|
|   null|  nacht|         null|
|   null|   null|         null|
+-------+-------+-------------+

PhoneticAlgorithms

  • double_metaphone
  • nysiis
  • refined_soundex

How to import the functions.

import com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._

Here's an example on how to use the refined_soundex function.

Suppose we have the following sourceDF:

+-----+
|word1|
+-----+
|night|
|  cat|
| null|
+-----+

Let's run the refined_soundex function.

val actualDF = sourceDF.withColumn(
  "word1_refined_soundex",
  refined_soundex(col("word1"))
)

We can run actualDF.show() to view the word1_refined_soundex column that's been appended to the DataFrame.

+-----+---------------------+
|word1|word1_refined_soundex|
+-----+---------------------+
|night|               N80406|
|  cat|                 C306|
| null|                 null|
+-----+---------------------+

API Documentation

Here is the latest API documentation.

Release

  1. Create GitHub tag

  2. Build documentation with sbt ghpagesPushSite

  3. Publish JAR

Run sbt to open the SBT console.

Run > ; + publishSigned; sonatypeBundleRelease to create the JAR files and release them to Maven. These commands are made available by the sbt-sonatype plugin.

When the release command is run, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the ~/.sbt/sonatype_credentials file in this format:

realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD

Microsite

The microsite for this project is available here. It's built with the sbt-microsite plugin.

The sbt publishMicrosite command creates a new microsite based on the Markdown files stored in the docs/src/main/mdoc folder.

Versions

Version
0.3.0