pdfocr

Library that shells to Tesseract to make PDFs searchable

License

License

Categories

Categories

PDF Data
GroupId

GroupId

org.overviewproject
ArtifactId

ArtifactId

pdfocr_2.12
Last Version

Last Version

0.0.12
Release Date

Release Date

Type

Type

jar
Description

Description

pdfocr
Library that shells to Tesseract to make PDFs searchable
Project URL

Project URL

https://github.com/overview/pdfocr
Project Organization

Project Organization

Overview Services Inc.
Source Code Management

Source Code Management

https://github.com/overview/pdfocr

Download pdfocr_2.12

How to add to project

<!-- https://jarcasting.com/artifacts/org.overviewproject/pdfocr_2.12/ -->
<dependency>
    <groupId>org.overviewproject</groupId>
    <artifactId>pdfocr_2.12</artifactId>
    <version>0.0.12</version>
</dependency>
// https://jarcasting.com/artifacts/org.overviewproject/pdfocr_2.12/
implementation 'org.overviewproject:pdfocr_2.12:0.0.12'
// https://jarcasting.com/artifacts/org.overviewproject/pdfocr_2.12/
implementation ("org.overviewproject:pdfocr_2.12:0.0.12")
'org.overviewproject:pdfocr_2.12:jar:0.0.12'
<dependency org="org.overviewproject" name="pdfocr_2.12" rev="0.0.12">
  <artifact name="pdfocr_2.12" type="jar" />
</dependency>
@Grapes(
@Grab(group='org.overviewproject', module='pdfocr_2.12', version='0.0.12')
)
libraryDependencies += "org.overviewproject" % "pdfocr_2.12" % "0.0.12"
[org.overviewproject/pdfocr_2.12 "0.0.12"]

Dependencies

compile (8)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.12.8
org.bouncycastle : bcmail-jdk15on jar 1.60
org.bouncycastle : bcprov-jdk15on jar 1.60
org.bouncycastle : bcpkix-jdk15on jar 1.60
com.github.jai-imageio : jai-imageio-core jar 1.4.0
com.github.jai-imageio : jai-imageio-jpeg2000 jar 1.3.0
org.apache.pdfbox : jbig2-imageio jar 3.0.0
org.apache.pdfbox : pdfbox jar 2.0.13

test (4)

Group / Artifact Type Version
org.scalatest : scalatest_2.12 jar 3.0.5
org.mockito : mockito-core jar 2.18.3
org.slf4j : jcl-over-slf4j jar 1.7.25
org.slf4j : slf4j-simple jar 1.7.25

Project Modules

There are no modules declared in this project.

Use Tesseract to make a PDF searchable.

Installation

Install Tesseract v3.0.5. This library shells out to it.

Then install this package. Maven-style:

<dependency>
  <groupId>org.overviewproject</groupId>
  <artifactId>pdfocr_2.12</artifactId>
  <version>0.0.10</version>
</dependency>

Sbt-style:

dependencies += "org.overviewproject" %% "pdfocr" % "0.0.10"

Usage

You've got to use Scala. Code something like this:

import java.nio.file.Path
import java.util.Locale
import org.overviewproject.pdfocr.{PdfOcr,PdfOcrProgress,PdfOcrResult}
import org.overviewproject.pdfocr.exceptions._
import scala.concurrent.Future

val pdfOcr = new PdfOcr()                          // default settings: finds tesseract in your $PATH
val inPdf = new Path("/path/to/needs-ocr.pdf")     // exists
val outPdf = new Path("/path/to/ocr-finished.pdf") // doesn't exist; will be deleted if it does
val process = PdfOcr.makePdfSearchable(inPdf, outPdf, Seq(Locale("en")))

process.progress // Future[PdfOcrProgress]
  .map { progress =>
    // It's a Future because we don't know how many pages there are until
    // we begin parsing the PDF, which takes time.

    progress.value       // 0.0 ... 1.0
    progress.currentPage // 1 .. nPages
    progress.nPages      // n
  }

process.result // Future[PdfTextResult]
  .map { result =>
    // do something with outPdf now...

    // Also, since the data is handy and would otherwise take a long time
    // to compute, PdfOcr returns the text, in pages.
    val text = result.pages.map(_.text).mkString("\n")
  }
  .recover {
    // outPdf is guaranteed not to exist

    case TesseractMissingException => throw
    case TesseractLanguageMissingException => throw
    case EncryptedPdfException => throw
    case InvalidPdfException => throw
    // Other errors may happen -- PDFBox bugs, Tesseract bugs,
    // out-of-memory.... You shouldn't catch those.
  }

// Or if you got impatient, you could:
process.cancel // Future[Unit]

How PdfOcr behaves

  • PdfOcr processes one page at a time.
  • PdfOcr sends Tesseract any page that's missing fonts or missing 100 characters of text.
  • PdfOcr's progress reports are page-by-page. If one page needs OCR and nine don't, the progress report will be unintuitive.
  • PdfOcr communicates with Tesseract via stdin and stdout.
  • For any method that will block on I/O, PdfOcr returns a Future. In other words: blocking methods are asynchronous.
  • PdfOcr does heavy computations (especially in PdfPage) which are slow. These are non-blocking and synchronous.

Developing

First, Install sbt.

After that,

  1. Run sbt ~test to run unit tests in the background.
  2. Edit files in src/test until a test fails.
  3. Edit files in src/main until the test passes.
  4. Return to step 2.
  5. Commit to a git branch, push it to GitHub, and submit a pull request.

Publishing

We use [sbt-sonatype](https://github.com/xerial/sbt-sonatype for more details) for all this.

Setup: using the sbt-sonatype instructions, ensure you've done these things:

  • Created an account at https://oss.sonatype.org and get access to this project.
  • Created ~/.sbt/1.0/sonatype.sbt with your credentials.

Then, every new version:

  1. sbt publishSigned to deploy to staging
  2. sbt sonatypeRelease to close and promote it

If the version ends in -SNAPSHOT, you won't be able to release it.

License

This software is Copyright 2011-2018 The Associated Press, and distributed under the terms of the GNU Affero General Public License. See the LICENSE file for details.

org.overviewproject

Overview Docs

The open source document mining platform

Versions

Version
0.0.12
0.0.11
0.0.10
0.0.9
0.0.8
0.0.7
0.0.6
0.0.5
0.0.3