PDF Extractor

Extract data and metadata from PDF files in a hierarchical JSON format.

License	The Apache License, Version 2.0
Categories	PDF Data
GroupId	com.beehyv
ArtifactId	pdf-extractor
Last Version	1.0
Release Date	Jun 12, 2020
Type	jar
Description	PDF Extractor: extract data and metadata from PDF files in a hierarchical JSON format.
Project Organization	BeeHyv Software Solutions Pvt Ltd

Download pdf-extractor

Filename	Size
pdf-extractor-1.0.pom
pdf-extractor-1.0.jar	223 KB
pdf-extractor-1.0-sources.jar	182 KB
pdf-extractor-1.0-javadoc.jar	398 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/ -->
<dependency>
    <groupId>com.beehyv</groupId>
    <artifactId>pdf-extractor</artifactId>
    <version>1.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
implementation 'com.beehyv:pdf-extractor:1.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
implementation("com.beehyv:pdf-extractor:1.0")

Apache Buildr

'com.beehyv:pdf-extractor:jar:1.0'

Apache Ivy

<dependency org="com.beehyv" name="pdf-extractor" rev="1.0">
  <artifact name="pdf-extractor" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.beehyv', module='pdf-extractor', version='1.0')
)

Scala SBT

libraryDependencies += "com.beehyv" % "pdf-extractor" % "1.0"

Leiningen

[com.beehyv/pdf-extractor "1.0"]

Dependencies

compile (12)

Group / Artifact	Type	Version
commons-collections : commons-collections	jar	3.2.2
org.apache.commons : commons-lang3	jar	3.9
org.apache.pdfbox : pdfbox	jar	2.0.5
org.slf4j : slf4j-api	jar	2.0.0-alpha1
org.apache.httpcomponents : httpcore	jar	4.4.13
commons-io : commons-io	jar	2.6
org.codehaus.jackson : jackson-mapper-asl	jar	1.9.13
com.google.apis : google-api-services-vision	jar	v1-rev30-1.22.0
commons-configuration : commons-configuration	jar	1.10
com.google.guava : guava	jar	18.0
technology.tabula : tabula	jar	1.0.0
com.beehyv.munchbot : ingestion-model	jar	0.0.1

test (1)

Group / Artifact	Type	Version
junit : junit	jar	4.12

Project Modules

There are no modules declared in this project.

PDF Extractor

Powered by BeeHyv Software Solutions Pvt Ltd and distributed under the Apache 2.0 License

Overview

Extracting meaningful data from documents is a long-standing problem, and the many attempts made to date have had only partial success. The goal of this project is to extract data and metadata from any given PDF document in a structured manner.

Features

Table of contents: A table of contents (TOC) generally provides an overview of the content within the document. A PDF may or may not have one. This code extracts a TOC from a PDF that doesn't have one, using a heuristic-based approach.
Text: The entire text, or the text from a particular page.
Sections: Splitting PDF content (text, images, tables) into sections helps to extract more relevant content.
Font information: Color, font type, font weight, font size, etc.
Tables: Table heading, rows, cells.
Images: Image files, text inside images.
Metadata of the PDF: Author info, creation date, size, etc.
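
The heuristic used for TOC extraction is not spelled out above. As a purely illustrative sketch (not the library's actual algorithm), one simple heuristic treats any line whose font size is noticeably larger than the most frequent (body) font size as a heading. The HeadingHeuristic class, its TextLine type, and the 1.2 ratio are hypothetical and are not part of pdf-extractor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a font-size heuristic for inferring headings.
final class HeadingHeuristic {
    // Assumed threshold: headings are at least 20% larger than body text.
    static final double HEADING_RATIO = 1.2;

    // Hypothetical stand-in for a line of text plus its font size.
    static final class TextLine {
        final String text;
        final double fontSize;

        TextLine(String text, double fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Returns the most frequent font size; ties go to the smaller size.
    static double bodyFontSize(List<TextLine> lines) {
        Map<Double, Integer> counts = new HashMap<>();
        for (TextLine line : lines) {
            counts.merge(line.fontSize, 1, Integer::sum);
        }
        double best = 0.0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : counts.entrySet()) {
            boolean moreFrequent = e.getValue() > bestCount;
            boolean tieButSmaller = e.getValue() == bestCount && e.getKey() < best;
            if (moreFrequent || tieButSmaller) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Collects, in document order, the lines that look like headings.
    static List<String> inferToc(List<TextLine> lines) {
        List<String> toc = new ArrayList<>();
        if (lines.isEmpty()) {
            return toc;
        }
        double threshold = bodyFontSize(lines) * HEADING_RATIO;
        for (TextLine line : lines) {
            if (line.fontSize >= threshold) {
                toc.add(line.text);
            }
        }
        return toc;
    }

    public static void main(String[] args) {
        List<TextLine> lines = Arrays.asList(
                new TextLine("1 Introduction", 16.0),
                new TextLine("Some body text.", 10.0),
                new TextLine("More body text.", 10.0),
                new TextLine("2 Methods", 16.0));
        inferToc(lines).forEach(System.out::println);
    }
}
```

A real heuristic would likely also weigh font weight, numbering patterns, and position on the page; font size alone is only the simplest signal.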

Technologies Used

This library uses the Apache PDFBox® library's PDF content stream engine to stream the PDF file.
Tabula 1.2.1 (an open-source library) is used for table extraction.

Installation Instructions

Pre-requisites

Java (>1.6)
Maven

Installation and Setup

From Source

Clone the project.
Add an environment variable for the Tabula jar (used for table extraction and unit tests):
TABULA_JAR_LOCATION={Project-dir}/lib/tabula/tabula-0.9.1-jar-with-dependencies.jar
Run mvn clean install to install it in your local environment. This may take some time (~15 minutes) because the project contains ~400 unit tests. To skip the tests, run with -DskipTests.
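
Because both the build and table extraction depend on TABULA_JAR_LOCATION, it can help to confirm the variable is set and points to a real file before starting a ~15 minute build. The following is a minimal sketch using only the Java standard library; the TabulaJarCheck class and its findProblem method are hypothetical helpers, not part of pdf-extractor.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;

// Hypothetical pre-build check for the TABULA_JAR_LOCATION variable.
final class TabulaJarCheck {
    static final String ENV_NAME = "TABULA_JAR_LOCATION";

    // Returns a description of the problem, or empty if the setting looks valid.
    static Optional<String> findProblem(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null || value.trim().isEmpty()) {
            return Optional.of(ENV_NAME + " is not set");
        }
        if (!Files.isRegularFile(Paths.get(value))) {
            return Optional.of(ENV_NAME + " does not point to a file: " + value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> issue = findProblem(System.getenv());
        System.out.println(issue.orElse(ENV_NAME + " looks valid"));
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside the check, keeps the helper easy to exercise with made-up values.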

Import the pdf-extractor dependency to your project

Adding the maven dependency

 <dependency>
    <groupId>com.beehyv</groupId>
    <artifactId>pdf-extractor</artifactId>
    <version>1.0</version>
 </dependency>

Adding the jar to the classpath
The pdf-extractor.jar file for the project can be found under {Project-dir}/target.

Run Extraction

Create a document object

HolmesPdfDocument pdfDocument = new HolmesPdfDocument(file);

Create an extractor instance

PdfBoxExtractor pdfBoxExtractor = new PdfBoxExtractor();

Extract text

String text = pdfBoxExtractor.getText(pdfDocument, startPage, endPage);

Extract images

pdfBoxExtractor.getImages(pdfDocument, startPage, endPage);

Extract tables

pdfBoxExtractor.getTabularData(pdfDocument, startPage, endPage);

Extract Structured Text

This feature extracts data in a structured manner. The data is extracted in sections, with the hierarchy of the sections kept intact. All text, images, tables, and paragraphs are assigned to their respective sections, which gives the extracted data a structure and makes it more meaningful.

All this information resides in an InfoNode model.

InfoNode infoNode = pdfBoxExtractor.getStructuredText(pdfDocument);
- Sections: infoNode.getSections()
- Paragraphs: infoNode.getParagraphs()
- Content: infoNode.getContent()
- Section heading: infoNode.getHeading()
- Section images: infoNode.getImageSections()
- Lines: infoNode.getContentLineObjects()
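
Since sections nest inside sections, a recursive walk is a natural way to consume the result. The sketch below prints an indented outline of headings. The return types of the InfoNode accessors are not stated here, so the Node class is a hypothetical stand-in that mirrors only getHeading() and getSections(), assumed to return a String and a list of child nodes; the same traversal should carry over to InfoNode if those assumptions hold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: walks a section tree and builds an indented outline.
final class SectionOutline {
    // Hypothetical stand-in for InfoNode; only the two accessors used below.
    static final class Node {
        private final String heading;
        private final List<Node> sections;

        Node(String heading, Node... sections) {
            this.heading = heading;
            this.sections = Arrays.asList(sections);
        }

        String getHeading() {
            return heading;
        }

        List<Node> getSections() {
            return sections;
        }
    }

    // Returns one line per section, indented two spaces per nesting level.
    static List<String> outline(Node root) {
        List<String> lines = new ArrayList<>();
        collect(root, 0, lines);
        return lines;
    }

    // Depth-first, pre-order traversal: a section first, then its children.
    private static void collect(Node node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        line.append(node.getHeading());
        out.add(line.toString());
        for (Node child : node.getSections()) {
            collect(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        Node doc = new Node("Report",
                new Node("Introduction"),
                new Node("Results", new Node("Tables")));
        outline(doc).forEach(System.out::println);
    }
}
```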

Feature Request

For new feature requests, please use the GitHub Issues page to raise tickets for bugs as well as enhancements. The community can then take up the functionality as needed.

BeeHyv Software Solutions Pvt Ltd

Versions

Version
1.0 Jun 12, 2020
