hadoop2-csv
A Hadoop input format that can read multiline CSV files.
Run BasicTest.java to see it working. See src/test/resource/test.csv for a multiline demo file.
The returned key is the file position where the record starts, and the value is a List of the column values.
Zip files are supported.
More ideas to improve this are welcome.
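To make the key/value contract described above concrete, here is a minimal plain-Java sketch of how a multiline CSV can be split into records keyed by their start position. It is not this project's reader: the class name MultilineCsvSketch and the method parse are hypothetical. It also assumes ASCII input with "\n" line endings, so character offsets stand in for byte positions, and it does not depend on Hadoop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT the library's record reader.
class MultilineCsvSketch {

    // Parses CSV text into records keyed by the character offset where each record starts.
    // Assumption: ASCII text and '\n' line endings, so offsets match file positions.
    static Map<Long, List<String>> parse(String csv) {
        Map<Long, List<String>> records = new LinkedHashMap<>();
        List<String> columns = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        long recordStart = 0;

        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < csv.length() && csv.charAt(i + 1) == '"') {
                        field.append('"'); // a doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas and newlines inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                columns.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                // An unquoted newline ends the current record.
                columns.add(field.toString());
                field.setLength(0);
                records.put(recordStart, columns);
                columns = new ArrayList<>();
                recordStart = i + 1;
            } else {
                field.append(c);
            }
        }
        // Flush the last record when the input has no trailing newline.
        if (field.length() > 0 || !columns.isEmpty()) {
            columns.add(field.toString());
            records.put(recordStart, columns);
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "Joe Demo,\"2 Demo Street,\nDemoville\",joe\n"
                   + "Jim Sample,\"\",jim\n";
        parse(csv).forEach((key, val) -> System.out.println("key=" + key + " val=" + val));
    }
}
```

The sketch tracks a single inQuotes flag so that a newline only ends a record when it appears outside quotes. That is the property that lets a quoted address spread over several physical lines and still arrive as one column value.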
Example:
If we read this CSV (note that the first and third records span multiple lines):
Joe Demo,"2 Demo Street,
Demoville,
Australia. 2615",[email protected]
Jim Sample,"3 Sample Street, Sampleville, Australia. 2615",[email protected]
Jack Example,"1 Example Street, Exampleville, Australia.
2615",[email protected]
The output is as follows:
==> TestMapper
==> key=0
==> val[0] = Joe Demo
==> val[1] = 2 Demo Street,
Demoville,
Australia. 261
==> val[2] = [email protected]
==> TestMapper
==> key=73
==> val[0] = Jim Sample
==> val[1] =
==> val[2] = [email protected]
==> TestMapper
==> key=10
==> val[0] = Jack Example
==> val[1] = 1 Example Street, Exampleville, Australia. 261
==> val[2] = [email protected]
License
Apache License, Version 2.0: https://www.apache.org/licenses/LICENSE-2.0.html
Credits
This is a personal fork of CSVInputFormat, built against Hadoop 2. Please report issues to the original project.