This simple tutorial shows how to install the latest Tesseract OCR engine in all current Ubuntu releases (Ubuntu 24.04, Ubuntu 22.04, and Ubuntu 20.04) via PPA.
Tesseract is one of the most accurate open-source OCR engines. It reads a wide variety of image formats and converts them to text in more than 100 languages. Tesseract 5.0.0 was officially released a few days ago, and it features:
- Faster training and OCR performance with lower memory usage via ‘fast floats’.
- Support for the latest macOS and Apple Silicon.
- Better ARM/ARM64 support.
- API improvements and more.
How to Install Tesseract OCR in Ubuntu:
The optical character recognition engine is available in the Ubuntu repositories, though the packaged version is usually outdated.
Alexander Pozdnyakov, the maintainer of Tesseract OCR in the official Debian/Ubuntu repositories, also maintains a few PPAs with the latest packages. Most CPU architectures (amd64, i386, arm64/armhf, ppc64el, s390x) are supported.
Option 1: Add Tesseract 4.x PPA
For the latest release of Tesseract OCR 4 (v4.1.3 so far), the stable PPA contains the packages for Ubuntu 18.04, Ubuntu 20.04, Ubuntu 21.10, and old Ubuntu 16.04/14.04.
Press Ctrl+Alt+T on the keyboard to open a terminal. When it opens, run the command below to add the PPA:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
Type your user password when prompted (there is no visual feedback while typing) and hit Enter to continue.
Option 2: Add Tesseract 5 PPA
The new 5.x release series is available in another PPA for Ubuntu 24.04, Ubuntu 22.04, and Ubuntu 20.04.
Again, press Ctrl+Alt+T to open a terminal and run the command:
sudo add-apt-repository ppa:alex-p/tesseract-ocr5
NOTE: installing the OCR engine from this PPA will override the old 4.x packages, though it’s not 100% API compatible with v4.0.
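As a minimal sketch of the two options above, the helper below maps a Tesseract major version to the matching PPA and prints the command instead of running it. The function name ppa_for_major is hypothetical and introduced only for illustration; the PPA names are the ones given in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper: map a Tesseract major version to the PPA
# named in this tutorial.
ppa_for_major() {
    case "$1" in
        4) echo "ppa:alex-p/tesseract-ocr" ;;
        5) echo "ppa:alex-p/tesseract-ocr5" ;;
        *) echo "unsupported major version: $1" >&2; return 1 ;;
    esac
}

# Print the command rather than running it, so it can be reviewed first.
echo "sudo add-apt-repository $(ppa_for_major 5)"
```

Printing the command first makes it easy to double-check which series you are about to add, since the 5.x PPA replaces any 4.x packages.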
Option 3: Add Tesseract repository for Debian:
For Debian Stretch, Buster, Bullseye, and Sid, there are apt repositories for both Tesseract v4 and v5. Debian users, as well as Ubuntu 21.10 users, may follow the link button below to add the repository:
Update and Install Tesseract:
After adding a PPA or repository from one of the previous options, run the command below in a terminal to refresh the system package cache, in case you’re still running the old Ubuntu 18.04 or earlier:
sudo apt update
Finally, install the OCR engine with the command:
sudo apt install tesseract-ocr
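To confirm which version ended up installed, something like the following may help. It is only a sketch: it assumes the first line of the tesseract --version output looks like "tesseract 5.0.0", and the helper name tesseract_major is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `tesseract --version` output on stdin and
# print the major version found on the first line.
# Assumes the first line looks like "tesseract 5.0.0".
tesseract_major() {
    sed -n '1s/^tesseract \([0-9][0-9]*\).*/\1/p'
}

if command -v tesseract >/dev/null 2>&1; then
    # Some older builds print the version to stderr, so merge both streams.
    major=$(tesseract --version 2>&1 | tesseract_major)
    echo "Tesseract major version: ${major:-unknown}"
    # List the installed language data files.
    tesseract --list-langs || echo "could not list languages"
else
    echo "tesseract is not on PATH"
fi
```

If the language list is empty or the command reports a missing data file, the language data packages may not have been installed alongside the engine.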
Or, upgrade the package using Software Updater.
How to Remove PPAs & Uninstall Tesseract OCR:
To remove the PPAs, either run the previous add-apt-repository command with the --remove flag, or use the Software & Updates utility under the ‘Other Software’ tab.
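For the command-line route, the removal commands can be sketched as below. The function name print_ppa_removal is hypothetical; it only prints the commands for the PPAs named in this tutorial and does not execute anything.

```shell
#!/bin/sh
# Hypothetical helper: print the removal command for each PPA passed
# as an argument. Nothing is executed; copy the lines you need.
print_ppa_removal() {
    for ppa in "$@"; do
        echo "sudo add-apt-repository --remove $ppa"
    done
}

print_ppa_removal ppa:alex-p/tesseract-ocr ppa:alex-p/tesseract-ocr5
```

Run only the line for the PPA you actually added.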
To remove the OCR engine, use the command:
sudo apt remove --autoremove tesseract-ocr 'tesseract-ocr-*'
You may also remove the libtesseract* package, though doing so will also remove other app packages (e.g., gImageReader) that depend on it.
Ji, do you use this? What APIs are out there to use with it?
I already had Tesseract 4 installed on my Ubuntu 20.04 system. The install of 5 went a bit differently than it did on yours. It installed tesseract-ocr as an upgrade, and did not install tesseract-ocr-osd. I think it saw the 4.0 version as satisfying the dependency, but the files are in the wrong place for 5.0.
When I first tried to run 5.0, it failed with:
Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata
I found the simple fix on Stack Overflow, under the title “Tesseract running error”: you just need to download eng.traineddata from GitHub and copy it to where it belongs.
(I’m leaving out URLs because they freak out some blogging software’s spam filters).
Thanks for doing this guide: I wasn’t even sure that 5.0 was released, and you let me know that it was and saved me the hassle of finding the PPA to upgrade.
Ran
When trying to add your tesseract 5 repository to a bionic system, I am getting the following error:
Executing: /tmp/apt-key-gpghome.uMfjMffagF/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-keys 8529B1E0F8BF7F65C12FABB0A4BCBD87CEF9E52D
gpg: keyserver receive failed: Invalid argument