xskipper-regex-plugin

xskipper-regex-plugin: A sample plugin for Xskipper

License	Apache-2.0
GroupId	io.xskipper
ArtifactId	xskipper-regex-plugin_2.12
Last Version	0.1.0
Release Date	Feb 9, 2021
Type	jar
Description	xskipper-regex-plugin: A sample plugin for Xskipper
Project URL	https://github.com/xskipper-io/xskipper-regex-plugin
Project Organization	xskipper
Source Code Management	https://github.com/xskipper-io/xskipper-regex-plugin

Download xskipper-regex-plugin_2.12

Filename	Size
xskipper-regex-plugin_2.12-0.1.0.pom
xskipper-regex-plugin_2.12-0.1.0.jar	54 KB
xskipper-regex-plugin_2.12-0.1.0-sources.jar	10 KB
xskipper-regex-plugin_2.12-0.1.0-javadoc.jar	1 MB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/io.xskipper/xskipper-regex-plugin_2.12/ -->
<dependency>
    <groupId>io.xskipper</groupId>
    <artifactId>xskipper-regex-plugin_2.12</artifactId>
    <version>0.1.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/io.xskipper/xskipper-regex-plugin_2.12/
implementation 'io.xskipper:xskipper-regex-plugin_2.12:0.1.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/io.xskipper/xskipper-regex-plugin_2.12/
implementation ("io.xskipper:xskipper-regex-plugin_2.12:0.1.0")

Apache Buildr

'io.xskipper:xskipper-regex-plugin_2.12:jar:0.1.0'

Apache Ivy

<dependency org="io.xskipper" name="xskipper-regex-plugin_2.12" rev="0.1.0">
  <artifact name="xskipper-regex-plugin_2.12" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='io.xskipper', module='xskipper-regex-plugin_2.12', version='0.1.0')
)

Scala SBT

libraryDependencies += "io.xskipper" % "xskipper-regex-plugin_2.12" % "0.1.0"

Leiningen

[io.xskipper/xskipper-regex-plugin_2.12 "0.1.0"]

Dependencies

compile (1)

Group / Artifact	Type	Version
org.scala-lang : scala-library	jar	2.12.8

test (6)

Group / Artifact	Type	Version
org.apache.spark : spark-core_2.12	jar	3.0.1
io.xskipper : xskipper-core_2.12	jar	1.2.0
org.apache.spark : spark-hive_2.12	jar	3.0.1
org.apache.spark : spark-sql_2.12	jar	3.0.1
org.apache.spark : spark-catalyst_2.12	jar	3.0.1
org.scalatest : scalatest_2.12	jar	3.0.5

Project Modules

There are no modules declared in this project.

Xskipper Regex Plugin

A sample plugin for Xskipper. See Xskipper documentation for more details on how to create a plugin.

The plugin lets you index a column by specifying a list of regex patterns; the groups matched by those patterns are saved as a value list.

For example, given the following dataset:

application_name,log_line
batch job,20/12/29 18:04:39 INFO FileSourceStrategy: Pruning directories with:
batch job,20/12/29 18:04:40 INFO DAGScheduler: ResultStage 22 (collect at ParquetMetadataHandle.scala:324) finished in 0.011 s

and the regex pattern ".* .* .* (.*): .*"

The metadata that will be saved is List("FileSourceStrategy", "DAGScheduler").
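
To make the extraction step concrete, here is a minimal plain-Python sketch that applies the pattern above to the two sample log lines and collects the first matching group. It only illustrates the idea; it is not the plugin's implementation, and the helper name `extract_value_list` is hypothetical.

```python
import re

# Pattern and sample log lines copied from the example above.
PATTERN = re.compile(r".* .* .* (.*): .*")

LOG_LINES = [
    "20/12/29 18:04:39 INFO FileSourceStrategy: Pruning directories with:",
    "20/12/29 18:04:40 INFO DAGScheduler: ResultStage 22 "
    "(collect at ParquetMetadataHandle.scala:324) finished in 0.011 s",
]


def extract_value_list(lines, pattern):
    """Collect the distinct first-group matches in first-seen order.

    Hypothetical helper for illustration only; not part of the plugin API.
    """
    values = []
    for line in lines:
        match = pattern.search(line)
        if match and match.group(1) not in values:
            values.append(match.group(1))
    return values


print(extract_value_list(LOG_LINES, PATTERN))
# prints ['FileSourceStrategy', 'DAGScheduler']
```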

Then the following query will benefit from the regex index:

SELECT * 
FROM tbl 
WHERE 
regexp_extract(log_line, '.* .* .* (.*): .*', 1) = 'MemoryStore'
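
The reason such a query can benefit is that an object whose saved value list does not contain the queried literal cannot hold a matching row, so it need not be read. The following is a conceptual sketch of that decision, assuming one value list per object; the object names, the second value list, and the function `objects_to_scan` are hypothetical and do not reflect Xskipper's internal API.

```python
# Hypothetical per-object metadata: object name -> regex value list.
# The first list is the one from the example above; the second is made up.
METADATA = {
    "part-0.csv": ["FileSourceStrategy", "DAGScheduler"],
    "part-1.csv": ["MemoryStore", "BlockManager"],
}


def objects_to_scan(metadata, literal):
    """Keep only the objects whose value list contains the queried literal."""
    return [name for name, values in metadata.items() if literal in values]


# For regexp_extract(...) = 'MemoryStore', only part-1.csv has to be read.
print(objects_to_scan(METADATA, "MemoryStore"))
# prints ['part-1.csv']
```

In this sketch, the dataset from the example above (whose value list holds only FileSourceStrategy and DAGScheduler) would be skipped entirely for the 'MemoryStore' query.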

Run as a project

To build a project using the Xskipper binaries from the Maven Central Repository, use the following Maven coordinates:

Maven

Include the Xskipper Regex plugin in a Maven project by adding it as a dependency in the project's POM file. The plugin should be compiled with Scala 2.12.

<dependency>
  <groupId>io.xskipper</groupId>
  <artifactId>xskipper-regex-plugin_2.12</artifactId>
  <version>0.1.0</version>
</dependency>

SBT

Include the plugin in an SBT project by adding the following line to its build.sbt file:

libraryDependencies += "io.xskipper" %% "xskipper-regex-plugin" % "0.1.0"

Usage Example

The following shows a simple usage example:

Python

from xskipper import Xskipper
from xskipper import Registration

metadata_location = "src/test/resources/metadata"  

conf = dict([
            ('io.xskipper.parquet.mdlocation', metadata_location),
            ("io.xskipper.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")])
Xskipper.setConf(spark, conf)
# Register the needed classes
# Add MetadataFilterFactory
Registration.addMetadataFilterFactory(spark, 'io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory')
# Add IndexFactory
Registration.addIndexFactory(spark, 'io.xskipper.plugins.regex.index.RegexIndexFactory')
# Add MetaDataTranslator
Registration.addMetaDataTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListMetaDataTranslator')
# Add ClauseTranslator
Registration.addClauseTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListClauseTranslator')

dataset_location = "src/test/resources/sample_dataset"
reader = spark.read.format("csv").option("inferSchema", "true").option("header", "true")

xskipper = Xskipper(spark, dataset_location)

# index the log_line column with the regex value list index, using the custom index API
xskipper.indexBuilder() \
    .addCustomIndex("io.xskipper.plugins.regex.index.RegexValueListIndex", ["log_line"],
                    {"io.xskipper.plugins.regex.pattern.r0": ".* .* .* (.*): .*"}) \
    .build(reader) \
    .show(10, False)

Xskipper.enable(spark)

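# expose the dataset as the table `tbl` that the query below refers to
reader.load(dataset_location).createOrReplaceTempView("tbl")

spark.sql("SELECT * FROM tbl WHERE regexp_extract(log_line,'.* .* .* (.*): .*', 1) = 'MemoryStore'").show()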

xskipper.getLatestQueryAggregatedStats(spark).show(10, False)

Scala

import io.xskipper._
import io.xskipper.implicits._
import io.xskipper.plugins.regex.implicits._
import io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory
import io.xskipper.plugins.regex.index.RegexIndexFactory
import io.xskipper.plugins.regex.parquet.{RegexValueListClauseTranslator, RegexValueListMetaDataTranslator}
 
// Register the plugin classes
Registration.addIndexFactory(RegexIndexFactory)
Registration.addMetadataFilterFactory(RegexValueListMetaDataFilterFactory)
Registration.addClauseTranslator(RegexValueListClauseTranslator)
Registration.addMetaDataTranslator(RegexValueListMetaDataTranslator)

val metadata_location = "src/test/resources/metadata" 

// Set JVM-wide parameters
val conf = Map(
  "io.xskipper.parquet.mdlocation" -> metadata_location,
  "io.xskipper.parquet.mdlocation.type" -> "EXPLICIT_BASE_PATH_LOCATION")
Xskipper.setConf(conf)
  
val dataset_location = "src/test/resources/sample_dataset"
val reader = spark.read.format("csv").option("inferSchema", "true").option("header", "true")

// index the dataset
val xskipper = new Xskipper(spark, dataset_location)

// remove existing index if needed
if (xskipper.isIndexed()) {
  xskipper.dropIndex()
}

xskipper
      .indexBuilder()
      .addRegexValueListIndex("log_line", Seq(".* .* .* (.*): .*"))
      .build(reader).show(false)

// enable xskipper
spark.enableXskipper()

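// expose the dataset as the table `tbl` that the query below refers to
reader.load(dataset_location).createOrReplaceTempView("tbl")

spark.sql("SELECT * FROM tbl WHERE regexp_extract(log_line," +
      "'.* .* .* (.*): .*', 1) = 'MemoryStore'")
      .show(false)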

// show data skipping stats
Xskipper.getLatestQueryAggregatedStats(spark).show(false)

Building

xskipper-regex-plugin is compiled using SBT.

To compile, run

build/sbt compile

To generate artifacts, run

build/sbt package

To execute tests, run

build/sbt test

Refer to SBT docs for more commands.

Collaboration

xskipper-regex-plugin tracks issues in GitHub and prefers to receive contributions as pull requests.

Compatibility

xskipper-regex-plugin is compatible with Xskipper 1.2.0 and requires Apache Spark 3.0.0.

License

Apache License 2.0, see LICENSE.

xskipper-io

An Extensible Data Skipping Framework

Versions

Version
0.1.0 Feb 9, 2021
