CSVInputFormat

Hadoop2 InputFormat for reading multiline CSV files

License

Categories

Ant Build Tools CSV Data Data Formats
GroupId

in.ashwanthkumar
ArtifactId

hadoop2-csv
Last Version

2.0
Release Date

Type

jar
Description

CSVInputFormat
Hadoop2 InputFormat for reading multiline CSV files
Project URL

https://github.com/ashwanthkumar/hadoop2-csv
Source Code Management

https://github.com/ashwanthkumar/hadoop2-csv

How to add to project

Maven

<!-- https://jarcasting.com/artifacts/in.ashwanthkumar/hadoop2-csv/ -->
<dependency>
    <groupId>in.ashwanthkumar</groupId>
    <artifactId>hadoop2-csv</artifactId>
    <version>2.0</version>
</dependency>

Gradle (Groovy)

// https://jarcasting.com/artifacts/in.ashwanthkumar/hadoop2-csv/
implementation 'in.ashwanthkumar:hadoop2-csv:2.0'

Gradle (Kotlin)

// https://jarcasting.com/artifacts/in.ashwanthkumar/hadoop2-csv/
implementation ("in.ashwanthkumar:hadoop2-csv:2.0")

Buildr

'in.ashwanthkumar:hadoop2-csv:jar:2.0'

Ivy

<dependency org="in.ashwanthkumar" name="hadoop2-csv" rev="2.0">
  <artifact name="hadoop2-csv" type="jar" />
</dependency>

Grape

@Grapes(
    @Grab(group='in.ashwanthkumar', module='hadoop2-csv', version='2.0')
)

SBT

libraryDependencies += "in.ashwanthkumar" % "hadoop2-csv" % "2.0"

Leiningen

[in.ashwanthkumar/hadoop2-csv "2.0"]

Dependencies

compile (2)

Group / Artifact                     Type   Version
log4j : log4j                        jar    1.2.14
org.apache.hadoop : hadoop-client    jar    2.2.0

test (1)

Group / Artifact    Type   Version
junit : junit       jar    4.10

Project Modules

There are no modules declared in this project.

hadoop2-csv

An input format for Hadoop that can read multiline CSVs.

Run BasicTest.java to see it working. See src/test/resource/test.csv for a multiline demo file.

The key returned is the file position where the record starts, and the value is a List of that record's column values.
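
The key/value shape described above can be illustrated with a small stand-alone sketch. This is not the library's record reader and uses no Hadoop classes. The class name MultilineCsvSketch, the CsvRecord holder and the parse method are hypothetical names chosen for illustration, and the sketch assumes comma-separated fields, double-quote quoting, and character offsets (which equal byte offsets only for ASCII input).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the hadoop2-csv implementation. */
final class MultilineCsvSketch {

    /** One parsed record: the offset where it starts plus its column values. */
    static final class CsvRecord {
        final long offset;
        final List<String> columns;

        CsvRecord(long offset, List<String> columns) {
            this.offset = offset;
            this.columns = columns;
        }
    }

    /** Splits CSV text into records, keeping newlines that sit inside double quotes. */
    static List<CsvRecord> parse(String text) {
        List<CsvRecord> records = new ArrayList<>();
        List<String> columns = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        long recordStart = 0; // character offset where the current record began

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    field.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                columns.add(field.toString()); // end of a column
                field.setLength(0);
            } else if (c == '\n' && !inQuotes) {
                columns.add(field.toString()); // end of a record
                field.setLength(0);
                records.add(new CsvRecord(recordStart, columns));
                columns = new ArrayList<>();
                recordStart = i + 1;
            } else if (c == '\r' && !inQuotes) {
                // ignore carriage returns outside quotes (CRLF line endings)
            } else {
                field.append(c); // ordinary character, or a newline inside quotes
            }
        }
        // Emit a final record when the input does not end with a newline.
        if (field.length() > 0 || !columns.isEmpty()) {
            columns.add(field.toString());
            records.add(new CsvRecord(recordStart, columns));
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the first record contains a quoted newline.
        String csv = "a,\"b\nc\",d\ne,f,g\n";
        for (CsvRecord r : parse(csv)) {
            System.out.println("key=" + r.offset + " columns=" + r.columns.size());
        }
    }
}
```

With the sample string in main, the second record starts at offset 10 because the newline inside the quoted field does not end the first record.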

Zip files are supported.

More ideas to improve this are welcome.

Example:

If we read this CSV (note that the first and third records each span several lines):

Joe Demo,"2 Demo Street,
Demoville,
Australia. 2615",[email protected]
Jim Sample,"3 Sample Street, Sampleville, Australia. 2615",[email protected]
Jack Example,"1 Example Street, Exampleville, Australia.
2615",[email protected]

The output is as follows:

==> TestMapper
==> key=0
==> val[0] = Joe Demo
==> val[1] = 2 Demo Street, 
Demoville, 
Australia. 261
==> val[2] = [email protected]

==> TestMapper
==> key=73
==> val[0] = Jim Sample
==> val[1] = 
==> val[2] = [email protected]

==> TestMapper
==> key=10
==> val[0] = Jack Example
==> val[1] = 1 Example Street, Exampleville, Australia. 261
==> val[2] = [email protected]

License

https://www.apache.org/licenses/LICENSE-2.0.html

Credits

This is a personal fork of CSVInputFormat, built against Hadoop 2. Please report issues to the original project.

Versions

Version
2.0