xskipper-regex-plugin

xskipper-regex-plugin: A sample plugin for Xskipper

License

License

GroupId

GroupId

io.xskipper
ArtifactId

ArtifactId

xskipper-regex-plugin_2.12
Last Version

Last Version

0.1.0
Release Date

Release Date

Type

Type

jar
Description

Description

xskipper-regex-plugin
xskipper-regex-plugin: A sample plugin for Xskipper
Project URL

Project URL

https://github.com/xskipper-io/xskipper-regex-plugin
Project Organization

Project Organization

xskipper
Source Code Management

Source Code Management

https://github.com/xskipper-io/xskipper-regex-plugin

Download xskipper-regex-plugin_2.12

How to add to project

<!-- https://jarcasting.com/artifacts/io.xskipper/xskipper-regex-plugin_2.12/ -->
<dependency>
    <groupId>io.xskipper</groupId>
    <artifactId>xskipper-regex-plugin_2.12</artifactId>
    <version>0.1.0</version>
</dependency>
// https://jarcasting.com/artifacts/io.xskipper/xskipper-regex-plugin_2.12/
implementation 'io.xskipper:xskipper-regex-plugin_2.12:0.1.0'
// https://jarcasting.com/artifacts/io.xskipper/xskipper-regex-plugin_2.12/
implementation ("io.xskipper:xskipper-regex-plugin_2.12:0.1.0")
'io.xskipper:xskipper-regex-plugin_2.12:jar:0.1.0'
<dependency org="io.xskipper" name="xskipper-regex-plugin_2.12" rev="0.1.0">
  <artifact name="xskipper-regex-plugin_2.12" type="jar" />
</dependency>
@Grapes(
@Grab(group='io.xskipper', module='xskipper-regex-plugin_2.12', version='0.1.0')
)
libraryDependencies += "io.xskipper" % "xskipper-regex-plugin_2.12" % "0.1.0"
[io.xskipper/xskipper-regex-plugin_2.12 "0.1.0"]

Dependencies

compile (1)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.12.8

test (6)

Group / Artifact Type Version
org.apache.spark : spark-core_2.12 jar 3.0.1
io.xskipper : xskipper-core_2.12 jar 1.2.0
org.apache.spark : spark-hive_2.12 jar 3.0.1
org.apache.spark : spark-sql_2.12 jar 3.0.1
org.apache.spark : spark-catalyst_2.12 jar 3.0.1
org.scalatest : scalatest_2.12 jar 3.0.5

Project Modules

There are no modules declared in this project.

Xskipper Regex Plugin

Build Status

A sample plugin for Xskipper. See Xskipper documentation for more details on how to create a plugin.

The plugin enable to index a column by specifying a list of patterns and saving the matching groups as a value list.

For example, given the following dataset:

application_name,log_line
batch job,20/12/29 18:04:39 INFO FileSourceStrategy: Pruning directories with:
batch job,20/12/29 18:04:40 INFO DAGScheduler: ResultStage 22 (collect at ParquetMetadataHandle.scala:324) finished in 0.011 s

and the regex pattern ".* .* .* (.*): .*"

The metadata that will be saved is List("FileSourceStrategy", "DAGScheduler").

Then the following query will benefit from the regex index:

SELECT * 
FROM tbl 
WHERE 
regexp_extract(log_line, '.* .* .* (.*): .*', 1) = 'MemoryStore'

Run as a project

To build a project using the Xskipper binaries from the Maven Central Repository, use the following Maven coordinates:

Maven

Include Xskipper Regex plugin in a Maven project by adding it as a dependency in the project's POM file along. The plugin should be compiled with Scala 2.12.

<dependency>
  <groupId>io.xskipper</groupId>
  <artifactId>xskipper-regex-plugin_2.12</artifactId>
  <version>0.1.0</version>
</dependency>

SBT

Include the plugin in an SBT project by adding the following line to its build.sbt file:

libraryDependencies += "io.xskipper" %% "xskipper-regex-plugin" % "0.1.0"

Usage Example

The following shows a simple usage example

Python
from xskipper import Xskipper
from xskipper import Registration

metadata_location = "src/test/resources/metadata"  

conf = dict([
            ('io.xskipper.parquet.mdlocation', metadata_location),
            ("io.xskipper.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")])
Xskipper.setConf(spark, conf)
# Register the needed classes
# Add MetadataFilterFactor
Registration.addMetadataFilterFactory(spark, 'io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory')
# Add IndexFactory
Registration.addIndexFactory(spark, 'io.xskipper.plugins.regex.index.RegexIndexFactory')
# Add MetaDataTranslator
Registration.addMetaDataTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListMetaDataTranslator')
# Add ClauseTranslator
Registration.addClauseTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListClauseTranslator')

dataset_location = "src/test/resources/sample_dataset"
reader = spark.read.format("csv").option("inferSchema", "true").option("header", "true")

xskipper = Xskipper(spark, dataset_location)

# test adding all index types including using the custom index API
xskipper.indexBuilder() \
    .addCustomIndex("io.xskipper.plugins.regex.index.RegexValueListIndex", ["log_line"],
                    {"io.xskipper.plugins.regex.pattern.r0": ".* .* .* (.*): .*"}) \
    .build(reader) \
    .show(10, False)

Xskipper.enable(spark)

spark.sql("SELECT * FROM tbl WHERE regexp_extract(log_line,'.* .* .* (.*): .*', 1) = 'MemoryStore'").show()

xskipper.getLatestQueryAggregatedStats(spark).show(10, False)
Scala
import io.xskipper._
import io.xskipper.implicits._
import io.xskipper.plugins.regex.implicits._
import io.xskipper.plugins.regex.implicits._
import io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory
import io.xskipper.plugins.regex.index.RegexIndexFactory
import io.xskipper.plugins.regex.parquet.{RegexValueListClauseTranslator, RegexValueListMetaDataTranslator}
 
// Register the plugin classes
Registration.addIndexFactory(RegexIndexFactory)
Registration.addMetadataFilterFactory(RegexValueListMetaDataFilterFactory)
Registration.addClauseTranslator(RegexValueListClauseTranslator)
Registration.addMetaDataTranslator(RegexValueListMetaDataTranslator)

val metadata_location = "src/test/resources/metadata" 

// Set JVM Wide parameters
val conf = Map(
  "io.xskipper.parquet.mdlocation" -> metadata_location,
  "io.xskipper.parquet.mdlocation.type" -> "EXPLICIT_BASE_PATH_LOCATION")
Xskipper.setConf(conf)
  
val dataset_location = "src/test/resources/sample_dataset"
val reader = spark.read.format("csv").option("inferSchema", "true").option("header", "true")

// index the dataset
val xskipper = new Xskipper(spark, dataset_location)

// remove existing index if needed
if (xskipper.isIndexed()) {
  xskipper.dropIndex()
}

xskipper
      .indexBuilder()
      .addRegexValueListIndex("log_line", Seq(".* .* .* (.*): .*"))
      .build(reader).show(false)

// enable xskipper
spark.enableXskipper()

spark.sql("SELECT * FROM tbl WHERE regexp_extract(log_line," +
      "'.* .* .* (.*): .*', 1) = 'MemoryStore'")
      .show(false)

// show data skipping stats
Xskipper.getLatestQueryAggregatedStats(spark).show(false)

Building

xskipper-regex-plugin is compiled using SBT.

To compile, run

build/sbt compile

To generate artifacts, run

build/sbt package

To execute tests, run

build/sbt test

Refer to SBT docs for more commands.

Collaboration

xskipper-regex-plugin tracks issues in GitHub and prefers to receive contributions as pull requests.

Compatibility

xskipper-regex-plugin is compatible with Xskipper 1.2.0 and requires Apache Spark 3.0.0

License

Apache License 2.0, see LICENSE.

io.xskipper

xskipper-io

An Extensible Data Skipping Framework

Versions

Version
0.1.0