pdfocr

Library that shells to Tesseract to make PDFs searchable

License	AGPLv3
Categories	PDF Data
GroupId	org.overviewproject
ArtifactId	pdfocr_2.12
Last Version	0.0.12
Release Date	Jan 25, 2019
Type	jar
Description	pdfocr: Library that shells to Tesseract to make PDFs searchable
Project URL	https://github.com/overview/pdfocr
Project Organization	Overview Services Inc.
Source Code Management	https://github.com/overview/pdfocr

Download pdfocr_2.12

Filename	Size
pdfocr_2.12-0.0.12.pom
pdfocr_2.12-0.0.12.jar	204 KB
pdfocr_2.12-0.0.12-sources.jar	136 KB
pdfocr_2.12-0.0.12-javadoc.jar	725 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.overviewproject/pdfocr_2.12/ -->
<dependency>
    <groupId>org.overviewproject</groupId>
    <artifactId>pdfocr_2.12</artifactId>
    <version>0.0.12</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.overviewproject/pdfocr_2.12/
implementation 'org.overviewproject:pdfocr_2.12:0.0.12'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.overviewproject/pdfocr_2.12/
implementation ("org.overviewproject:pdfocr_2.12:0.0.12")

Apache Buildr

'org.overviewproject:pdfocr_2.12:jar:0.0.12'

Apache Ivy

<dependency org="org.overviewproject" name="pdfocr_2.12" rev="0.0.12">
  <artifact name="pdfocr_2.12" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.overviewproject', module='pdfocr_2.12', version='0.0.12')
)

Scala SBT

libraryDependencies += "org.overviewproject" % "pdfocr_2.12" % "0.0.12"

Leiningen

[org.overviewproject/pdfocr_2.12 "0.0.12"]

Dependencies

compile (8)

Group / Artifact	Type	Version
org.scala-lang : scala-library	jar	2.12.8
org.bouncycastle : bcmail-jdk15on	jar	1.60
org.bouncycastle : bcprov-jdk15on	jar	1.60
org.bouncycastle : bcpkix-jdk15on	jar	1.60
com.github.jai-imageio : jai-imageio-core	jar	1.4.0
com.github.jai-imageio : jai-imageio-jpeg2000	jar	1.3.0
org.apache.pdfbox : jbig2-imageio	jar	3.0.0
org.apache.pdfbox : pdfbox	jar	2.0.13

test (4)

Group / Artifact	Type	Version
org.scalatest : scalatest_2.12	jar	3.0.5
org.mockito : mockito-core	jar	2.18.3
org.slf4j : jcl-over-slf4j	jar	1.7.25
org.slf4j : slf4j-simple	jar	1.7.25

Project Modules

There are no modules declared in this project.

Use Tesseract to make a PDF searchable.

Installation

Install Tesseract v3.0.5. This library shells out to it.

Then install this package. Maven-style:

<dependency>
  <groupId>org.overviewproject</groupId>
  <artifactId>pdfocr_2.12</artifactId>
  <version>0.0.10</version>
</dependency>

Sbt-style:

libraryDependencies += "org.overviewproject" %% "pdfocr" % "0.0.10"
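
The Maven snippet names the artifact pdfocr_2.12 while the sbt snippet names it pdfocr. That is because sbt's %% operator appends the Scala binary version to the artifact name. A minimal build.sbt sketch, assuming a Scala 2.12.x project (the scalaVersion line is an assumption chosen to match the scala-library dependency listed above):

```scala
// build.sbt (sketch; scalaVersion is an assumption, not taken from the README)
scalaVersion := "2.12.8"

// With a 2.12.x scalaVersion, %% expands "pdfocr" to "pdfocr_2.12".
libraryDependencies += "org.overviewproject" %% "pdfocr" % "0.0.10"

// Equivalent explicit form; use one or the other, not both.
// libraryDependencies += "org.overviewproject" % "pdfocr_2.12" % "0.0.10"
```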

Usage

You've got to use Scala. Code something like this:

import java.nio.file.{Path, Paths}
import java.util.Locale
import org.overviewproject.pdfocr.{PdfOcr,PdfOcrProgress,PdfOcrResult}
import org.overviewproject.pdfocr.exceptions._
import scala.concurrent.ExecutionContext.Implicits.global // .map and .recover on a Future need an ExecutionContext
import scala.concurrent.Future

val pdfOcr = new PdfOcr()                                 // default settings: finds tesseract in your $PATH
val inPdf: Path = Paths.get("/path/to/needs-ocr.pdf")     // exists
val outPdf: Path = Paths.get("/path/to/ocr-finished.pdf") // doesn't exist; will be deleted if it does
val process = pdfOcr.makePdfSearchable(inPdf, outPdf, Seq(new Locale("en")))

process.progress // Future[PdfOcrProgress]
  .map { progress =>
    // It's a Future because we don't know how many pages there are until
    // we begin parsing the PDF, which takes time.

    progress.value       // 0.0 ... 1.0
    progress.currentPage // 1 .. nPages
    progress.nPages      // n
  }

process.result // Future[PdfTextResult]
  .map { result =>
    // do something with outPdf now...

    // Also, since the data is handy and would otherwise take a long time
    // to compute, PdfOcr returns the text, in pages.
    val text = result.pages.map(_.text).mkString("\n")
  }
  .recover {
    // outPdf is guaranteed not to exist

    // Each case rethrows as a placeholder; replace with your own handling.
    case e: TesseractMissingException => throw e
    case e: TesseractLanguageMissingException => throw e
    case e: EncryptedPdfException => throw e
    case e: InvalidPdfException => throw e
    // Other errors may happen -- PDFBox bugs, Tesseract bugs,
    // out-of-memory.... You shouldn't catch those.
  }

// Or if you got impatient, you could:
process.cancel // Future[Unit]
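
In a command-line script you may prefer to block until the work is done rather than chain callbacks. The sketch below shows one way to do that with the standard library's Await. It does not call pdfocr: fakeOcr is a hypothetical stand-in for process.result, and waitFor is a helper invented for this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object AwaitSketch {
  // Hypothetical stand-in for process.result; it only joins page texts.
  def fakeOcr(pages: Seq[String]): Future[String] =
    Future(pages.mkString("\n"))

  // Block the calling thread until the Future completes or the timeout passes.
  // Failures (including a timeout) come back as a Failure instead of being thrown.
  def waitFor[A](future: Future[A], timeout: FiniteDuration): Try[A] =
    Try(Await.result(future, timeout))
}
```

For example, AwaitSketch.waitFor(AwaitSketch.fakeOcr(Seq("page one", "page two")), 5.seconds) yields a Success holding the two page texts joined by a newline. Blocking like this is reasonable in a script, but in a server you would normally keep composing Futures instead.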

How PdfOcr behaves

PdfOcr processes one page at a time.
PdfOcr sends Tesseract any page that's missing fonts or has fewer than 100 characters of text.
PdfOcr's progress reports are page-by-page. If one page needs OCR and nine don't, the progress report will be unintuitive.
PdfOcr communicates with Tesseract via stdin and stdout.
For any method that will block on I/O, PdfOcr returns a Future. In other words: blocking methods are asynchronous.
PdfOcr does heavy computations (especially in PdfPage) which are slow. These are non-blocking and synchronous.
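
The page-selection rule and the page-by-page progress report can be sketched as follows. This is not pdfocr's code: PageInfo, needsOcr and progress are hypothetical names, and the rule is read here as "fonts are missing, or the page has fewer than 100 characters of text".

```scala
// Hypothetical model of a page; this is not pdfocr's PdfPage class.
final case class PageInfo(hasMissingFonts: Boolean, text: String)

object OcrHeuristicSketch {
  // Assumed threshold, taken from the "100 characters" rule above.
  val MinTextLength = 100

  // A page goes to Tesseract if fonts are missing or it has too little text.
  def needsOcr(page: PageInfo): Boolean =
    page.hasMissingFonts || page.text.length < MinTextLength

  // Page-by-page progress: the fraction of pages finished, whatever each page cost.
  def progress(pagesDone: Int, nPages: Int): Double =
    if (nPages == 0) 1.0 else pagesDone.toDouble / nPages
}
```

With ten pages where only the first needs OCR, progress(1, 10) is 0.1 even though nearly all of the running time may already be spent. That is why the progress report can look unintuitive.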

Developing

First, install sbt.

After that,

1. Run sbt ~test to run unit tests in the background.
2. Edit files in src/test until a test fails.
3. Edit files in src/main until the test passes.
4. Return to step 2.
5. Commit to a git branch, push it to GitHub, and submit a pull request.

Publishing

We use sbt-sonatype for all this (see https://github.com/xerial/sbt-sonatype for more details).

Setup: using the sbt-sonatype instructions, ensure you've done these things:

Created an account at https://oss.sonatype.org and gotten access to this project.
Created ~/.sbt/1.0/sonatype.sbt with your credentials.
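
As a rough guide, the credentials file usually looks like the sketch below, which follows the layout shown in the sbt-sonatype README. The user name and password are placeholders; check the sbt-sonatype instructions for the realm and host your account needs.

```scala
// ~/.sbt/1.0/sonatype.sbt (sketch; the last two values are placeholders)
credentials += Credentials(
  "Sonatype Nexus Repository Manager", // realm
  "oss.sonatype.org",                  // host
  "(Sonatype user name)",
  "(Sonatype password)"
)
```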

Then, every new version:

Run sbt publishSigned to deploy to staging.
Run sbt sonatypeRelease to close and promote it.

If the version ends in -SNAPSHOT, you won't be able to release it.

License

Overview Docs

The open source document mining platform

Versions

Version
0.0.12 Jan 25, 2019
0.0.11 Jan 25, 2019
0.0.10 May 10, 2018
0.0.9 Apr 16, 2018
0.0.8 Apr 16, 2018
0.0.7 Apr 16, 2018
0.0.6 Nov 22, 2017
0.0.5 Jul 19, 2017
0.0.3 Jul 19, 2017
