spark-excel

License	Apache License, Version 2.0
GroupId	com.crealytics
ArtifactId	spark-excel_2.11
Last Version	0.13.7
Release Date	Feb 22, 2021
Type	jar
Description	spark-excel
Project URL	https://github.com/crealytics/spark-excel
Project Organization	com.crealytics
Source Code Management	https://github.com/crealytics/spark-excel

Download spark-excel_2.11

Filename	Size
spark-excel_2.11-0.13.7.pom
spark-excel_2.11-0.13.7.jar	6 MB
spark-excel_2.11-0.13.7-unshaded.jar	245 KB
spark-excel_2.11-0.13.7-sources.jar	13 KB
spark-excel_2.11-0.13.7-javadoc.jar	445 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.crealytics/spark-excel_2.11/ -->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.11</artifactId>
    <version>0.13.7</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.crealytics/spark-excel_2.11/
implementation 'com.crealytics:spark-excel_2.11:0.13.7'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.crealytics/spark-excel_2.11/
implementation ("com.crealytics:spark-excel_2.11:0.13.7")

Apache Buildr

'com.crealytics:spark-excel_2.11:jar:0.13.7'

Apache Ivy

<dependency org="com.crealytics" name="spark-excel_2.11" rev="0.13.7">
  <artifact name="spark-excel_2.11" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.crealytics', module='spark-excel_2.11', version='0.13.7')
)

Scala SBT

libraryDependencies += "com.crealytics" % "spark-excel_2.11" % "0.13.7"

Leiningen

[com.crealytics/spark-excel_2.11 "0.13.7"]

Dependencies

compile (7)

Group / Artifact	Type	Version
org.scala-lang : scala-library	jar	2.11.12
org.apache.poi : poi	jar	4.1.2
org.apache.poi : poi-ooxml	jar	4.1.2
com.norbitltd : spoiwo_2.11	jar	1.8.0
com.github.pjfanning : excel-streaming-reader	jar	2.3.6
com.github.pjfanning : poi-shared-strings	jar	1.0.4
org.apache.commons : commons-compress	jar	1.20

provided (4)

Group / Artifact	Type	Version
org.apache.spark : spark-core_2.11	jar	2.4.7
org.apache.spark : spark-sql_2.11	jar	2.4.7
org.apache.spark : spark-hive_2.11	jar	2.4.7
org.slf4j : slf4j-api	jar	1.7.30

test (7)

Group / Artifact	Type	Version
org.typelevel : cats-core_2.11	jar	2.0.0
org.scalatest : scalatest_2.11	jar	3.2.4
org.scalatestplus : scalatestplus-scalacheck_2.11	jar	3.1.0.0-RC2
org.scalacheck : scalacheck_2.11	jar	1.15.2
com.github.alexarchambault : scalacheck-shapeless_1.14_2.11	jar	1.2.5
com.github.nightscape : spark-testing-base_2.11	jar	c2bc44caf4
org.scalamock : scalamock-scalatest-support_2.11	jar	3.6.0

Project Modules

There are no modules declared in this project.

Spark Excel Library

A library for querying Excel files with Apache Spark, for Spark SQL and DataFrames.

Co-maintainers wanted

Due to personal and professional constraints, the development of this library has been rather slow. If you find value in this library, please consider stepping up as a co-maintainer by leaving a comment here. Help is very welcome e.g. in the following areas:

Additional features
Code improvements and reviews
Bug analysis and fixing
Documentation improvements
Build / test infrastructure

Requirements

This library requires Spark 2.0 or later.

Linking

You can link against this library in your program at the following coordinates:

Scala 2.12

groupId: com.crealytics
artifactId: spark-excel_2.12
version: 0.13.1

Scala 2.11

groupId: com.crealytics
artifactId: spark-excel_2.11
version: 0.13.1
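
With sbt, the Scala suffix can be left to the %% operator, which appends the project's Scala binary version to the artifact name. The fragment below is a minimal sketch of a build.sbt, assuming a Scala 2.12 project; the scalaVersion value is illustrative, and the Spark version and provided scope mirror the dependency table earlier on this page.

```scala
// build.sbt (fragment)
// Assumption: any 2.12.x release works here; 2.12.10 is only an example.
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // %% resolves to spark-excel_2.12 for a Scala 2.12 build
  "com.crealytics" %% "spark-excel" % "0.13.1",
  // Spark itself is expected to be supplied by the cluster at runtime
  "org.apache.spark" %% "spark-sql" % "2.4.7" % Provided
)
```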

Using with Spark shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

Spark compiled with Scala 2.12

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1

Spark compiled with Scala 2.11

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:0.13.1

Features

This package allows querying Excel spreadsheets as Spark DataFrames.

Scala API

Spark 2.0+:

Create a DataFrame from an Excel file

import org.apache.spark.sql._

val spark: SparkSession = ???
val df = spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'My Sheet'!B3:C35") // Optional, default: "A1"
    .option("header", "true") // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
    .option("usePlainNumberFormat", "false") // Optional, default: false. If true, format the cells without rounding and scientific notations
    .option("inferSchema", "false") // Optional, default: false
    .option("addColorColumns", "true") // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
    .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    .option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
    .schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
    .load("Worktime.xlsx")

For convenience, there is an implicit that wraps the DataFrameReader returned by spark.read and provides a .excel method which accepts all possible options and provides default values:

import org.apache.spark.sql._
import com.crealytics.spark.excel._

val spark: SparkSession = ???
val df = spark.read.excel(
    header = true,  // Required
    dataAddress = "'My Sheet'!B3:C35", // Optional, default: "A1"
    treatEmptyValuesAsNulls = false,  // Optional, default: true
    setErrorCellsToFallbackValues = false, // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
    usePlainNumberFormat = false,  // Optional, default: false. If true, format the cells without rounding and scientific notations
    inferSchema = false,  // Optional, default: false
    addColorColumns = true,  // Optional, default: false
    timestampFormat = "MM-dd-yyyy HH:mm:ss",  // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    maxRowsInMemory = 20,  // Optional, default None. If set, uses a streaming reader which can help with big files
    excerptSize = 10,  // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
    workbookPassword = "pass"  // Optional, default None. Requires unlimited strength JCE for older JVMs
).schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
 .load("Worktime.xlsx")

If the sheet name is unavailable, it is possible to pass in an index:

val df = spark.read.excel(
  header = true,
  dataAddress = "0!B3:C35"
).load("Worktime.xlsx")

or to read in the names dynamically:

val sheetNames = WorkbookReader(
  Map("path" -> "Worktime.xlsx"),
  spark.sparkContext.hadoopConfiguration
).sheetNames
val df = spark.read.excel(
  header = true,
  dataAddress = sheetNames(0)
).load("Worktime.xlsx") // load is needed to turn the reader into a DataFrame
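
Building on the dynamic sheet lookup, the following sketch reads every sheet of a workbook into its own DataFrame. The AllSheets object and its sheetAddress and readAll helpers are hypothetical names and not part of spark-excel. The sketch assumes that each sheet has a header row with data starting at A1, and that sheet names contain no apostrophes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import com.crealytics.spark.excel._

object AllSheets {
  // Hypothetical helper: builds an address such as 'My Sheet'!A1.
  // Assumes the sheet name contains no apostrophes.
  def sheetAddress(sheetName: String, startCell: String = "A1"): String =
    s"'$sheetName'!$startCell"

  // Reads each sheet into a DataFrame, keyed by sheet name.
  def readAll(spark: SparkSession, path: String): Map[String, DataFrame] = {
    val sheetNames = WorkbookReader(
      Map("path" -> path),
      spark.sparkContext.hadoopConfiguration
    ).sheetNames

    sheetNames.map { name =>
      name -> spark.read
        .format("com.crealytics.spark.excel")
        .option("dataAddress", sheetAddress(name))
        .option("header", "true")
        .load(path)
    }.toMap
  }
}
```

A call such as AllSheets.readAll(spark, "Worktime.xlsx") would then give one DataFrame per sheet, which may be convenient when the sheet layout is uniform.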

Create a DataFrame from an Excel file using custom schema

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val peopleSchema = StructType(Array(
    StructField("Name", StringType, nullable = false),
    StructField("Age", DoubleType, nullable = false),
    StructField("Occupation", StringType, nullable = false),
    StructField("Date of birth", StringType, nullable = false)))

val spark: SparkSession = ???
val df = spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'Info'!A1") // sheet is selected through dataAddress, as in the other examples
    .option("header", "true")
    .schema(peopleSchema)
    .load("People.xlsx")

Write a DataFrame to an Excel file

import org.apache.spark.sql._

val df: DataFrame = ???
df.write
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'My Sheet'!B3:C35")
  .option("header", "true")
  .option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
  .option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
  .mode("append") // Optional, default: overwrite.
  .save("Worktime2.xlsx")

Data Addresses

As you can see in the examples above, the location of data to read or write can be specified with the dataAddress option. Currently the following address styles are supported:

B3: Start cell of the data. Reading will return all rows below and all columns to the right. Writing will start here and use as many columns and rows as required.
B3:F35: Cell range of data. Reading will return only rows and columns in the specified range. Writing will start in the first cell (B3 in this example) and use only the specified columns and rows. If there are more rows or columns in the DataFrame to write, they will be truncated. Make sure this is what you want.
'My Sheet'!B3:F35: Same as above, but with a specific sheet.
MyTable[#All]: Table of data. Reading will return all rows and columns in this table. Writing will only write within the current range of the table. No growing of the table will be performed. PRs to change this are welcome.
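
Because writing to a cell range truncates anything that does not fit, it can help to compute how many rows and columns a range holds before writing. The sketch below does this for plain ranges such as B3:F35. RangeCapacity is a hypothetical helper written for illustration, not part of spark-excel, and it does not handle sheet prefixes or table references.

```scala
// Hypothetical helper, not part of spark-excel.
object RangeCapacity {
  private val RangePattern = """([A-Z]+)([0-9]+):([A-Z]+)([0-9]+)""".r

  // Converts column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27.
  def columnIndex(letters: String): Int =
    letters.foldLeft(0)((acc, c) => acc * 26 + (c - 'A' + 1))

  // Returns Some((rows, columns)) for a plain range such as "B3:F35",
  // or None if the input is not a well-formed range.
  def capacity(range: String): Option[(Int, Int)] = range match {
    case RangePattern(startCol, startRow, endCol, endRow) =>
      val rows = endRow.toInt - startRow.toInt + 1
      val cols = columnIndex(endCol) - columnIndex(startCol) + 1
      if (rows > 0 && cols > 0) Some((rows, cols)) else None
    case _ => None
  }
}
```

For B3:F35 this yields 33 rows and 5 columns. Comparing those numbers with df.columns.length and the row count of the DataFrame shows up front whether anything would be cut off.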

Building From Source

This library is built with SBT. To build a JAR file simply run sbt assembly from the project root. The build configuration includes support for Scala 2.12 and 2.11.

crealytics GmbH

Versions

Version
0.13.7 Feb 22, 2021
0.13.6 Dec 10, 2020
0.13.5 Aug 8, 2020
0.13.4 Aug 5, 2020
0.13.1 Mar 1, 2020
0.13.0 Feb 24, 2020
0.12.5 Jan 31, 2020
0.12.4 Dec 30, 2019
0.12.3 Oct 12, 2019
0.12.2 Oct 5, 2019
0.12.1 Oct 3, 2019
0.12.0 Jul 2, 2019
0.11.1 Jan 31, 2019
0.11.0 Nov 29, 2018
0.11.0-beta3 Nov 20, 2018
0.11.0-beta2 Nov 10, 2018
0.11.0-beta1 Oct 31, 2018
0.10.2 Nov 24, 2018
0.10.1 Nov 10, 2018
0.10.0 Oct 11, 2018
0.9.18 Sep 27, 2018
0.9.17 Jul 10, 2018
0.9.16 Jul 9, 2018
0.9.15 Apr 9, 2018
0.9.14 Feb 12, 2018
0.9.12 Jan 31, 2018
0.9.11 Jan 17, 2018
0.9.10 Jan 17, 2018
0.9.9 Dec 15, 2017
0.9.8 Nov 17, 2017
0.9.7 Nov 16, 2017
0.9.6 Nov 10, 2017
0.9.5 Sep 26, 2017
0.9.4 Sep 21, 2017
0.9.3 Sep 11, 2017
0.9.2 Sep 8, 2017
0.9.1 Sep 5, 2017
0.9.0 Aug 16, 2017
0.8.6 Aug 16, 2017
0.8.5 Aug 16, 2017
0.8.4 Jul 24, 2017
0.8.3 Mar 23, 2017
0.8.2 Oct 25, 2016