PDF Extractor

Extract data and metadata from PDF files in a hierarchical JSON format.

License: Apache 2.0
Categories: PDF Data
GroupId: com.beehyv
ArtifactId: pdf-extractor
Last Version: 1.0
Type: jar
Description: Extract data and metadata from PDF files in a hierarchical JSON format.
Project Organization: BeeHyv Software Solutions Pvt Ltd

How to add to project

Maven

<!-- https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/ -->
<dependency>
    <groupId>com.beehyv</groupId>
    <artifactId>pdf-extractor</artifactId>
    <version>1.0</version>
</dependency>

Gradle (Groovy)

// https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
implementation 'com.beehyv:pdf-extractor:1.0'

Gradle (Kotlin)

// https://jarcasting.com/artifacts/com.beehyv/pdf-extractor/
implementation ("com.beehyv:pdf-extractor:1.0")

Buildr

'com.beehyv:pdf-extractor:jar:1.0'

Ivy

<dependency org="com.beehyv" name="pdf-extractor" rev="1.0">
  <artifact name="pdf-extractor" type="jar" />
</dependency>

Grape

@Grapes(
    @Grab(group='com.beehyv', module='pdf-extractor', version='1.0')
)

SBT

libraryDependencies += "com.beehyv" % "pdf-extractor" % "1.0"

Leiningen

[com.beehyv/pdf-extractor "1.0"]

Dependencies

compile (12)

Group / Artifact Type Version
commons-collections : commons-collections jar 3.2.2
org.apache.commons : commons-lang3 jar 3.9
org.apache.pdfbox : pdfbox jar 2.0.5
org.slf4j : slf4j-api jar 2.0.0-alpha1
org.apache.httpcomponents : httpcore jar 4.4.13
commons-io : commons-io jar 2.6
org.codehaus.jackson : jackson-mapper-asl jar 1.9.13
com.google.apis : google-api-services-vision jar v1-rev30-1.22.0
commons-configuration : commons-configuration jar 1.10
com.google.guava : guava jar 18.0
technology.tabula : tabula jar 1.0.0
com.beehyv.munchbot : ingestion-model jar 0.0.1

test (1)

Group / Artifact Type Version
junit : junit jar 4.12

Project Modules

There are no modules declared in this project.

PDF Extractor

Powered by BeeHyv Software Solutions Pvt Ltd and distributed under the Apache 2.0 License.

Overview

Extracting meaningful data from documents is a long-standing problem, and the many attempts made to date have had only partial success. The goal of this project is to extract data and metadata in a structured manner from any given PDF document.

Features

  • Table of contents: A TOC generally provides an overview of the content within the document. A PDF may or may not have a table of contents. This code extracts a TOC from a PDF that doesn't have one, using a heuristic-based approach.
  • Text: Entire text, or text from a particular page.
  • Sections: Splitting PDF content (text, images, tables) into sections can help extract more relevant content.
  • Font information: Color, font type, font weight, font size, etc.
  • Tables: Table heading, rows, cells.
  • Images: Image files, text inside images.
  • Metadata of PDF: Author info, creation date, size, etc.
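
The heuristic behind TOC extraction is not described here. Purely as an illustration of what a heuristic-based approach can look like, the sketch below treats any line set in a font larger than the most common (body) font size as a heading candidate. The `TocHeuristicSketch` class and its `Line` type are hypothetical names invented for this example; they are not part of pdf-extractor, and the library's actual heuristic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of pdf-extractor's API.
class TocHeuristicSketch {

    // A line of text together with the font size it was rendered in.
    static final class Line {
        final String text;
        final float fontSize;

        Line(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Returns the most frequent font size, which is assumed to be body text.
    static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> counts = new HashMap<>();
        float best = 0f;
        int bestCount = 0;
        for (Line line : lines) {
            int count = counts.merge(line.fontSize, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = line.fontSize;
            }
        }
        return best;
    }

    // Lines set strictly larger than the body size become heading candidates.
    static List<String> headingCandidates(List<Line> lines) {
        float body = bodyFontSize(lines);
        List<String> headings = new ArrayList<>();
        for (Line line : lines) {
            if (line.fontSize > body) {
                headings.add(line.text);
            }
        }
        return headings;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
            new Line("1 Introduction", 16f),
            new Line("PDF files store positioned glyphs.", 10f),
            new Line("Structure must be inferred.", 10f),
            new Line("2 Method", 16f),
            new Line("Font size is one signal.", 10f));
        System.out.println(headingCandidates(lines)); // [1 Introduction, 2 Method]
    }
}
```

A real implementation would likely combine several signals (font weight, numbering patterns, spacing); font size alone is used here only to keep the sketch short.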

Technologies Used

  • This library uses the Apache PDFBox® library's PDF content stream engine to stream the PDF file.
  • Tabula 1.2.1 (an open-source library) is used for table extraction.

Installation Instructions

Pre-requisites

  • Java (>1.6)
  • Maven

Installation and Setup

From Source

  • Clone the project
  • Add an environment variable for the Tabula jar (used for table extraction and unit tests):
    TABULA_JAR_LOCATION={Project-dir}/lib/tabula/tabula-0.9.1-jar-with-dependencies.jar
  • Run mvn clean install to install it in your local environment. This might take some time (~15 mins) as there are ~400 unit tests within the project. To skip the tests, run with -DskipTests.
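
Before running the build, it can help to confirm that TABULA_JAR_LOCATION really points at a readable jar. The check below is a minimal sketch that uses only the Java standard library; the class name `TabulaEnvCheck` and the method `isUsableJar` are made up for this example and are not part of the project.

```java
import java.io.File;

// Hypothetical helper: not part of pdf-extractor.
class TabulaEnvCheck {

    // Returns true when the given value names an existing, readable .jar file.
    static boolean isUsableJar(String location) {
        if (location == null || location.trim().isEmpty()) {
            return false;
        }
        File jar = new File(location.trim());
        return jar.isFile() && jar.canRead() && jar.getName().endsWith(".jar");
    }

    public static void main(String[] args) {
        String location = System.getenv("TABULA_JAR_LOCATION");
        if (isUsableJar(location)) {
            System.out.println("Tabula jar found: " + location);
        } else {
            System.err.println("TABULA_JAR_LOCATION is unset or does not point to a readable jar");
        }
    }
}
```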

Import the pdf-extractor dependency to your project

  • Add the Maven dependency:
    <dependency>
        <groupId>com.beehyv</groupId>
        <artifactId>pdf-extractor</artifactId>
        <version>1.0</version>
    </dependency>
  • Or add the jar to the classpath:
    The pdf-extractor.jar file for the project can be found under {Project-dir}/target

Run Extraction

  • Create a document object

    HolmesPdfDocument pdfDocument = new HolmesPdfDocument(file);
  • Create an extractor instance

    PdfBoxExtractor pdfBoxExtractor = new PdfBoxExtractor();
  • Extract text

    String text = pdfBoxExtractor.getText(pdfDocument, startPage, endPage);
  • Extract images

    pdfBoxExtractor.getImages(pdfDocument, startPage, endPage);
  • Extract tables

    pdfBoxExtractor.getTabularData(pdfDocument, startPage, endPage);
  • Extract Structured Text

    With this feature you can extract data in a structured manner. The data is extracted in sections, with the section hierarchy kept intact. All text, images, tables, and paragraphs are assigned to their respective sections, which gives the extracted data a structure and makes it more meaningful.

    All this information resides in an InfoNode model.

    InfoNode infoNode = pdfBoxExtractor.getStructuredText(pdfDocument);

    • Sections: infoNode.getSections()
    • Paragraphs: infoNode.getParagraphs()
    • Content: infoNode.getContent()
    • Section Heading: infoNode.getHeading()
    • Section Images: infoNode.getImageSections()
    • Lines: infoNode.getContentLineObjects()
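
Because sections nest inside sections, a recursive walk is a natural way to consume the result. The sketch below shows one such walk that prints an indented outline. It deliberately uses a small stand-in `Node` class rather than the library's InfoNode, since the return types of InfoNode's accessors are not documented here; only the accessor names `getHeading()` and `getSections()` are borrowed from the list above, and everything else (`OutlineSketch`, `Node`, `add`, `outline`) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a section tree; NOT the library's InfoNode class.
class OutlineSketch {

    static final class Node {
        private final String heading;
        private final List<Node> sections = new ArrayList<>();

        Node(String heading) {
            this.heading = heading;
        }

        String getHeading() {
            return heading;
        }

        List<Node> getSections() {
            return sections;
        }

        // Appends a child section and returns this node to allow chaining.
        Node add(Node child) {
            sections.add(child);
            return this;
        }
    }

    // Depth-first walk that indents each heading by its nesting level.
    static void collect(Node node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        out.add(line.append(node.getHeading()).toString());
        for (Node child : node.getSections()) {
            collect(child, depth + 1, out);
        }
    }

    static List<String> outline(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, 0, out);
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node("Report")
            .add(new Node("Introduction"))
            .add(new Node("Results").add(new Node("Tables")));
        for (String line : outline(root)) {
            System.out.println(line);
        }
    }
}
```

With the real InfoNode, the same pattern would presumably apply once the actual return types of getSections() and getHeading() are checked against the library's source.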

Feature Request

For new feature requests, please use the GitHub Issues page to raise tickets for bugs as well as enhancements. The community can then take up the functionality as needed.

Versions

Version
1.0