Text Extraction
A modular framework for extracting text from many different sources (websites, PDFs, images).
Text Extractors comparison
There are two types of PDF:
- "Image only" PDFs merely embed (scanned) images and contain no selectable, and therefore no extractable, text. To get at the text in the images, the images first have to be extracted from the PDF and then run through OCR. See the section Images.
- Searchable PDFs: If you open them in a PDF viewer, you can select their text or search for it. The following libraries help to extract text from this type of PDF:
Searchable PDFs
Extractor | Permissive License | Runs on Android | Advantages | Disadvantages |
---|---|---|---|---|
pdftotext | | | | |
iText 2 | | | | |
iText | | | | |
OpenPDF | | | | |
PDFBox (not added yet) | | | | |
PdfBox-Android (not added yet) | | | | |
iText 2 and iText 7
iText 2 is the older, permissively licensed version of iText, which has since turned commercial. Because the last free iText version, 2.1.7, has security flaws, I used version 2.1.7.js7 from JasperReports, which fixes these security issues. It is slower than iText 7, but in terms of text extraction quality I cannot see any difference between iText 7 and iText 2.
OpenPdf
OpenPdf took the last iText commit with a permissive license and developed it further. In my experience, however, its text extraction capability is worse than that of iText 7 and iText 2.
Do not add OpenPdfPdfTextExtractor and iText2PdfTextExtractor to the class path at the same time: both libraries use the same package and class names but have different method and class signatures, so one of them will crash when used.
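One way to notice such a clash early is to ask the class loader how many copies of a shared class it can see. The following is only a minimal sketch and not part of this framework: the class name `ClasspathDuplicateCheck` and the method `findCopies` are made up for illustration, and it assumes both libraries ship `com/lowagie/text/pdf/PdfReader.class`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of this framework.
final class ClasspathDuplicateCheck {

    /** Returns every location from which the class loader can load the given resource. */
    static List<URL> findCopies(String resourceName) {
        ClassLoader loader = ClasspathDuplicateCheck.class.getClassLoader();
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: iText 2 and OpenPdf both contain this class file.
        List<URL> copies = findCopies("com/lowagie/text/pdf/PdfReader.class");
        if (copies.size() > 1) {
            System.err.println("Conflicting PDF libraries on the class path: " + copies);
        }
    }
}
```

If more than one location is reported, two jars provide the same class and one of the two extractors should be removed from the build.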
(Very opinionated) Recommendation
If you can use pdftotext (Poppler), use pdftotext. It yields the best results both in terms of text extraction quality and speed.
Otherwise use the version of iText 2 with the security issues fixed. It's slower than the commercial (and really amazingly good) iText 7, but in terms of text extraction quality I cannot see any difference between iText 2 and iText 7.
I don't know why, but for some PDFs OpenPdf cannot extract any text at all.
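For the pdftotext route, a rough sketch of calling the Poppler command line tool from Java could look like the following. It is not the framework's own extractor: the class `PdfToTextRunner` and its methods are hypothetical, and it assumes `pdftotext` is installed and on the `PATH`. Passing `-` as the output file makes pdftotext write the text to standard output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the pdftotext CLI, not part of this framework.
final class PdfToTextRunner {

    /** Builds the command line: UTF-8 output, text written to stdout ("-"). */
    static List<String> buildCommand(Path pdf) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs pdftotext and returns the extracted text. */
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the text
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }
}
```

Calling `PdfToTextRunner.extractText(Paths.get("document.pdf"))` would then return the text of a searchable PDF, or an (almost) empty string for an "Image only" one.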
How to distinguish between Searchable and "Image only" PDFs?
Kurt Pfeifle gave a superb hint (https://stackoverflow.com/a/3108531): Check how many fonts a PDF uses. If it uses fonts, it contains searchable text. If it uses no fonts at all, it contains only images.
I added IPdfTypeDetector implementations for Poppler / pdffonts and ...
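The font check can be sketched roughly as follows. This is an illustration under assumptions, not the actual IPdfTypeDetector implementation: the class `PdfFontCheck` and its methods are hypothetical, `pdffonts` must be on the `PATH`, and the parser assumes that pdffonts prints a header row, then a separator row of dashes, then one row per font.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper, not the framework's IPdfTypeDetector implementation.
final class PdfFontCheck {

    /** Counts the font rows that follow the dashed separator line in pdffonts output. */
    static int countFonts(String pdffontsOutput) {
        boolean pastHeader = false;
        int fonts = 0;
        for (String line : pdffontsOutput.split("\\R")) {
            if (!pastHeader) {
                // Assumption: the separator row starts with dashes.
                pastHeader = line.startsWith("---");
            } else if (!line.trim().isEmpty()) {
                fonts++;
            }
        }
        return fonts;
    }

    /** Runs pdffonts on the given file and returns what it printed to stdout. */
    static String runPdfFonts(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdf.toString())
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdffonts exited with code " + exitCode);
        }
        return output;
    }

    /** A PDF that uses at least one font is treated as searchable. */
    static boolean isSearchable(Path pdf) throws IOException, InterruptedException {
        return countFonts(runPdfFonts(pdf)) > 0;
    }
}
```

Keeping the parsing in `countFonts` separate from the process call makes the heuristic easy to check against captured pdffonts output without having Poppler installed.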
Images
(All variants with Tesseract 4 have the same extraction quality, which is quite good but not the best.)
Extractor | Advantages | Disadvantages |
---|---|---|
tess4j | | |
Tesseract 4 over JNI (e.g. from Bytedeco) | | |
Tesseract4Android | | |
Tess4Android | | |
TextFairy (not added yet) | | |
Microsoft Cloud Computer Vision API OCR (not implemented yet) | | |
Google Cloud Vision OCR (neither implemented nor tested yet) | | |
License
Unless stated otherwise, all code is licensed under the Apache License, Version 2.0.
Notice: Some libraries, like iText, have different, partially commercial licenses.