Use Tesseract to make a PDF searchable.
Installation
Install Tesseract v3.0.5. This library shells out to it.
Then install this package. Maven-style:
<dependency>
<groupId>org.overviewproject</groupId>
<artifactId>pdfocr_2.12</artifactId>
<version>0.0.10</version>
</dependency>
Sbt-style:
dependencies += "org.overviewproject" %% "pdfocr" % "0.0.10"
Usage
You've got to use Scala. Code something like this:
import java.nio.file.Path
import java.util.Locale
import org.overviewproject.pdfocr.{PdfOcr,PdfOcrProgress,PdfOcrResult}
import org.overviewproject.pdfocr.exceptions._
import scala.concurrent.Future
val pdfOcr = new PdfOcr() // default settings: finds tesseract in your $PATH
val inPdf = new Path("/path/to/needs-ocr.pdf") // exists
val outPdf = new Path("/path/to/ocr-finished.pdf") // doesn't exist; will be deleted if it does
val process = PdfOcr.makePdfSearchable(inPdf, outPdf, Seq(Locale("en")))
process.progress // Future[PdfOcrProgress]
.map { progress =>
// It's a Future because we don't know how many pages there are until
// we begin parsing the PDF, which takes time.
progress.value // 0.0 ... 1.0
progress.currentPage // 1 .. nPages
progress.nPages // n
}
process.result // Future[PdfTextResult]
.map { result =>
// do something with outPdf now...
// Also, since the data is handy and would otherwise take a long time
// to compute, PdfOcr returns the text, in pages.
val text = result.pages.map(_.text).mkString("\n")
}
.recover {
// outPdf is guaranteed not to exist
case TesseractMissingException => throw
case TesseractLanguageMissingException => throw
case EncryptedPdfException => throw
case InvalidPdfException => throw
// Other errors may happen -- PDFBox bugs, Tesseract bugs,
// out-of-memory.... You shouldn't catch those.
}
// Or if you got impatient, you could:
process.cancel // Future[Unit]
How PdfOcr behaves
- PdfOcr processes one page at a time.
- PdfOcr sends Tesseract any page that's missing fonts or missing 100 characters of text.
- PdfOcr's progress reports are page-by-page. If one page needs OCR and nine don't, the progress report will be unintuitive.
- PdfOcr communicates with Tesseract via stdin and stdout.
- For any method that will block on I/O, PdfOcr returns a Future. In other words: blocking methods are asynchronous.
- PdfOcr does heavy computations (especially in
PdfPage
) which are slow. These are non-blocking and synchronous.
Developing
First, Install sbt.
After that,
- Run
sbt ~test
to run unit tests in the background. - Edit files in
src/test
until a test fails. - Edit files in
src/main
until the test passes. - Return to step 2.
- Commit to a git branch, push it to GitHub, and submit a pull request.
Publishing
We use [sbt-sonatype](https://github.com/xerial/sbt-sonatype for more details) for all this.
Setup: using the sbt-sonatype instructions, ensure you've done these things:
- Created an account at https://oss.sonatype.org and get access to this project.
- Created
~/.sbt/1.0/sonatype.sbt
with your credentials.
Then, every new version:
sbt publishSigned
to deploy to stagingsbt sonatypeRelease
to close and promote it
If the version ends in -SNAPSHOT
, you won't be able to release it.
License
This software is Copyright 2011-2018 The Associated Press, and distributed under the terms of the GNU Affero General Public License. See the LICENSE file for details.