License	The Apache License, Version 2.0
Categories	Net
GroupId	net.dankito.text.extraction
ArtifactId	finereader-hotfolder-text-extractor
Last Version	0.6.0
Release Date	Nov 6, 2020
Type	module
Description	finereader-hotfolder-text-extractor: A framework for extracting text from different types of files, e.g. PDFs, images, office documents, text files, ...
Project URL	https://github.com/dankito/TextExtraction
Source Code Management	https://github.com/dankito/TextExtraction

Filename	Size
finereader-hotfolder-text-extractor-0.6.0.pom
finereader-hotfolder-text-extractor-0.6.0.module	4 KB
finereader-hotfolder-text-extractor-0.6.0-sources.jar	3 KB
finereader-hotfolder-text-extractor-0.6.0-javadoc.jar	261 bytes

Group / Artifact	Type	Version
net.dankito.text.extraction : text-extractor-common	jar	0.6.0

Group / Artifact	Type	Version
org.jsoup : jsoup	jar	1.13.1

Text Extraction

A modular framework for extracting text from many different sources (websites, PDFs, images).

Text Extractors comparison

PDF

There are two types of PDF:

"Image only" PDFs: They just embed (scanned) images and contain no selectable, and therefore no extractable, text. To get at the text in the images, the images first have to be extracted from the PDF and then run through OCR. See section Images.
Searchable PDFs: If you open them in a PDF viewer, you can select their text or search for it. The following libraries help to extract text from this type of PDF:

Searchable PDFs

Extractor	Permissive License	Runs on Android	Advantages	Disadvantages
pdftotext	✔️	❌	Best PDF extraction result so far	User has to install Poppler Utils; does not run on Android
iText 2	✔️	✔️	Also works with PDFs with disordered layouts; best PDF extraction result of any Java library I found; works on older Android versions (at least on Android 4.1); almost the same text extraction quality as the newer (and non-free) iText 7
iText	❌	✔️	Also works with PDFs with disordered layouts; best PDF extraction result of any Java library I found; works on older Android versions (at least on Android 4.1)	Not free / commercial (AGPL / commercial license)
OpenPDF	✔️	( ✔️ )	Free; quite good and fast	Does not work on PDFs with disordered layouts; does not run on older Android versions (uses Java 8 features (Optional); works on Android 6 but not on Android 4.1, others not tested)
PDFBox (not added yet)	✔️	❌
PdfBox-Android (not added yet)	✔️	✔️
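
To make the pdftotext row more concrete, here is a minimal sketch of calling the Poppler command line tool from Java. It is not the framework's own extractor class; the class and method names (PdftotextSketch, buildCommand, extractText) are made up for illustration. It assumes that pdftotext is installed and on the PATH, and it uses the documented options -layout and -enc plus "-" as the output file to write to standard output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the TextExtraction framework.
class PdftotextSketch {

    // "-layout" keeps the physical page layout, "-enc UTF-8" sets the output
    // encoding, and "-" as the text file makes pdftotext write to stdout.
    static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-layout", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Runs pdftotext and returns everything it printed to stdout.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        // Read stdout completely before waiting, so the process cannot block on a full pipe.
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdftotextSketch <file.pdf>");
            return;
        }
        System.out.print(extractText(Path.of(args[0])));
    }
}
```

Starting the process throws an IOException if pdftotext is not installed, which is the "user has to install Poppler Utils" disadvantage from the table in practice.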

iText 2 and iText 7

iText 2 is the older, permissively licensed version of iText, which later turned commercial. Because the last free iText version, 2.1.7, has security flaws, I used version 2.1.7.js7 from JasperReports, which fixes these security issues. It's slower than iText 7, but in terms of text extraction quality I cannot see any difference between iText 7 and iText 2.

OpenPdf

OpenPdf took the last permissively licensed commit of iText and developed it further. In my experience, however, its text extraction capability is worse than that of iText 7 and iText 2.

Do not add OpenPdfPdfTextExtractor and iText2PdfTextExtractor to the class path at the same time: the underlying libraries have the same package and class names but different method and class signatures, so one of them will crash when used.
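
One way to notice such a clash early is to ask the class loader how many class path entries provide the same class. The following is only a sketch: the class ClasspathConflictSketch and its methods are hypothetical, and the class name com.lowagie.text.pdf.PdfReader is my assumption about a class that both iText 2 and OpenPdf ship, so verify it against the versions you actually use.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of the TextExtraction framework.
class ClasspathConflictSketch {

    // Converts "a.b.C" into the resource path "a/b/C.class".
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns every location on the class path that provides the given class.
    // More than one location means two jars ship the same class name.
    static List<URL> findLocations(String className) throws IOException {
        ClassLoader loader = ClasspathConflictSketch.class.getClassLoader();
        return Collections.list(loader.getResources(toResourcePath(className)));
    }

    public static void main(String[] args) throws IOException {
        // Assumed class name, see the note above the code.
        List<URL> locations = findLocations("com.lowagie.text.pdf.PdfReader");
        if (locations.size() > 1) {
            System.err.println("Conflicting PDF libraries on the class path: " + locations);
        }
    }
}
```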

(Very opinionated) Recommendation

If you can use pdftotext (Poppler), use pdftotext. It yields the best results both in terms of text extraction quality and speed.

Otherwise use the version of iText 2 that has the security issues fixed. It's slower than the commercial (and really amazingly good) iText 7, but in terms of text extraction quality I cannot see any difference between iText 2 and iText 7.

I don't know why, but OpenPdf cannot extract any text at all from some PDFs.
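
The recommendation in this section boils down to a preference order: try pdftotext first and fall back to iText 2 if it is not installed. A minimal sketch of that selection logic follows. The Candidate type and the method names are hypothetical stand-ins and not the framework's interfaces; canStart only checks whether the operating system can launch a program at all.

```java
import java.io.IOException;
import java.util.List;
import java.util.Optional;

// Hypothetical selection logic, not the framework's own API.
class ExtractorPreferenceSketch {

    // Stand-in for an extractor: a display name plus whether it can be used here.
    static final class Candidate {
        final String name;
        final boolean available;

        Candidate(String name, boolean available) {
            this.name = name;
            this.available = available;
        }
    }

    // Returns the first usable candidate; the list is expected in preference order.
    static Optional<Candidate> pickFirstAvailable(List<Candidate> inPreferenceOrder) {
        return inPreferenceOrder.stream().filter(c -> c.available).findFirst();
    }

    // True if the program could be started; start() throws IOException if it is not found.
    static boolean canStart(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            process.waitFor();
            return true;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("pdftotext", canStart("pdftotext", "-v")),
                new Candidate("iText 2", true)); // pure Java, assumed to be on the class path
        pickFirstAvailable(candidates).ifPresent(c -> System.out.println("Using " + c.name));
    }
}
```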

How to distinguish between Searchable and "Image only" PDFs?

Kurt Pfeifle gave a superb hint (https://stackoverflow.com/a/3108531): Check how many fonts a PDF uses. If it uses fonts, it contains searchable text. If it uses no fonts at all, it contains only images.

I added IPdfTypeDetector implementations for Poppler / pdffonts and ...
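
The font-counting idea can be sketched with Poppler's pdffonts tool as follows. This is not the framework's IPdfTypeDetector implementation; the class and method names are invented for illustration. It relies on pdffonts printing a two-line table header (column names and a line of dashes) followed by one line per font, and it assumes pdffonts is installed and on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical sketch, not the framework's IPdfTypeDetector.
class PdfTypeSketch {

    // Counts the font rows in pdffonts output by skipping the two header lines.
    static int countFonts(String pdffontsOutput) {
        String[] lines = pdffontsOutput.split("\\R");
        int count = 0;
        for (int i = 2; i < lines.length; i++) {
            if (!lines[i].isBlank()) {
                count++;
            }
        }
        return count;
    }

    // A PDF that uses at least one font is treated as searchable.
    static boolean isSearchable(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdf.toString())
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException("pdffonts failed for " + pdf);
        }
        return countFonts(output) > 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfTypeSketch <file.pdf>");
            return;
        }
        System.out.println(isSearchable(Path.of(args[0])) ? "searchable" : "image only");
    }
}
```

Note that this is a heuristic: a scanned PDF with an OCR text layer also embeds fonts and would therefore count as searchable.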

Images

(All variants with Tesseract 4 have the same extraction quality, which is quite good but not the best.)

Extractor	Advantages	Disadvantages
tess4j	Uses Tesseract 4	User has to install Tesseract; extraction result depends a lot on image quality; does not run on Android
Tesseract 4 over JNI (e.g. from Bytedeco)	Uses Tesseract 4	If there's an exception in native code, the whole application crashes (JNI); user has to install Tesseract; extraction result depends a lot on image quality; does not run on Android
Tesseract4Android	Uses Tesseract 4	Very slow, took 2 minutes to recognize a single image (0.5 MB); extraction result depends a lot on image quality
Tess4Android	Uses Tesseract 4	Couldn't get it to compile
TextFairy (not added yet)		Uses Tesseract 3; quite slow; extraction result depends a lot on image quality
Microsoft Cloud Computer Vision API OCR (not implemented yet)	Best image extraction result I found so far	Requires registration (credit card required; every single user has to do this themselves); costs $1.50 per 1000 images; data protection insanity, stores all your images and recognized text for years
Google Cloud Vision OCR (neither implemented nor tested yet)		Requires registration (credit card required; every single user has to do this themselves); 1000 images per month are free, have to pay for more; data protection insanity, stores all your images and recognized text for years
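
For "Image only" PDFs, the two steps described in the PDF section (extract the images, then apply OCR) can be sketched with command line tools instead of the libraries in the table. This is a deliberately simplified stand-in, not how the framework's extractors work: it assumes Poppler's pdfimages and the tesseract executable are installed, and all class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical two-step pipeline: pdfimages extracts the images, tesseract recognizes them.
class ImageOnlyPdfSketch {

    // "pdfimages -png <pdf> <prefix>" writes the embedded images as <prefix>-000.png, ...
    static List<String> extractImagesCommand(Path pdf, Path outputPrefix) {
        return List.of("pdfimages", "-png", pdf.toString(), outputPrefix.toString());
    }

    // "tesseract <image> stdout -l <lang>" prints the recognized text to stdout.
    static List<String> ocrCommand(Path image, String language) {
        return List.of("tesseract", image.toString(), "stdout", "-l", language);
    }

    // Runs a command and returns its stdout; fails on a non-zero exit code.
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (process.waitFor() != 0) {
            throw new IOException(command.get(0) + " failed");
        }
        return output;
    }

    // Extracts all images into a temporary directory and runs OCR on each, in file name order.
    // The temporary directory is not cleaned up in this sketch.
    static String extractText(Path pdf, String language) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("pdf-images");
        run(extractImagesCommand(pdf, workDir.resolve("image")));
        List<Path> images;
        try (Stream<Path> files = Files.list(workDir)) {
            images = files.sorted().collect(Collectors.toList());
        }
        StringBuilder text = new StringBuilder();
        for (Path image : images) {
            text.append(run(ocrCommand(image, language)));
        }
        return text.toString();
    }
}
```

As the table notes for all Tesseract variants, the result depends a lot on the quality of the extracted images.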

License

If not stated otherwise, all code is licensed under the Apache License, Version 2.0.

Notice: Some libraries, like iText, have different, partially commercial licenses.
