spark-hugefs

Walk big and deep filesystem fast and efficiently from Spark

License

License

GroupId

GroupId

com.github.salva
ArtifactId

ArtifactId

spark-hugefs_2.11
Last Version

Last Version

0.13
Release Date

Release Date

Type

Type

jar
Description

Description

spark-hugefs
Walk big and deep filesystem fast and efficiently from Spark
Project URL

Project URL

https://github.com/salva/spark-hugefs
Project Organization

Project Organization

com.github.salva
Source Code Management

Source Code Management

https://github.com/salva/spark-hugefs

Download spark-hugefs_2.11

How to add to project

<!-- https://jarcasting.com/artifacts/com.github.salva/spark-hugefs_2.11/ -->
<dependency>
    <groupId>com.github.salva</groupId>
    <artifactId>spark-hugefs_2.11</artifactId>
    <version>0.13</version>
</dependency>
// https://jarcasting.com/artifacts/com.github.salva/spark-hugefs_2.11/
implementation 'com.github.salva:spark-hugefs_2.11:0.13'
// https://jarcasting.com/artifacts/com.github.salva/spark-hugefs_2.11/
implementation ("com.github.salva:spark-hugefs_2.11:0.13")
'com.github.salva:spark-hugefs_2.11:jar:0.13'
<dependency org="com.github.salva" name="spark-hugefs_2.11" rev="0.13">
  <artifact name="spark-hugefs_2.11" type="jar" />
</dependency>
@Grapes(
@Grab(group='com.github.salva', module='spark-hugefs_2.11', version='0.13')
)
libraryDependencies += "com.github.salva" % "spark-hugefs_2.11" % "0.13"
[com.github.salva/spark-hugefs_2.11 "0.13"]

Dependencies

compile (3)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.11.12
com.databricks : dbutils-api_2.11 jar 0.0.4
com.github.salva : scala-glob_2.11 jar 0.0.3

provided (1)

Group / Artifact Type Version
org.apache.spark : spark-sql_2.11 jar 2.4.5

Project Modules

There are no modules declared in this project.

spark-hugefs

Spark is commonly used with remote file systems that are relatively slow, specially for operations related to walking the file structure.

This package provides methods that allow one to retrieve the file structure fast and efficiently, taking advantage of Spark parallelization infrastructure.

Example

import com.github.salva.spark.hugefs._

val df = Walker.walk("/dbfs/mnt/fs1", Good.isFile.glob("**/images/*.jpg"), spark)
println("%s JPEG images found in directory", df.count)

Documentation

This package uses three main abstractions:

  • Walkers: The classes that can walk the file system and retrieve its structure using different strategies.

Currently, two walker classes are available:

  • Walker: walks the file system using the Spark engine, distributing the process over all the available nodes.

  • LocalWalker: walks the file system using only the local thread.

  • Restrictions: A family of classes that allow one to prune the search space limiting the file system walk to those directories or files that comply with the declared restrictions.

For instance, it is possible to pick files by its type (regular file, directory, symbolic link), extension, filename, globing patterns, etc.

  • File systems: The classes that implement the low level access to the filesystem. Currently, backends are available for the Databricks file system (DBFS), and the Java native one.

Probably only users interested in adding support for additional file systems should be concerned with these classes. Needless to say that contributions in this area are very welcome!

Latest version

The latest version of spark-hugefs is 0.14, soon to be available from Maven Central.

libraryDependencies += "com.github.salva.spark-hugefs" %% "spark-hugefs" % "0.14"

Copying

Copyright 2020 Salvador Fandiño ([email protected])

Licensed under the Apache License, Version 2.0 (the "License"); you may not use the files in this package except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Versions

Version
0.13
0.10
0.5
0.4
0.3