Mihail Radu Solcan

  Tesseract OCR

2008-12-15; 2009-10-31 note on online OCR

Tesseract has filled a gap in my GNU/Linux toolbox. I am "text-dependent" and I enjoy LaTeX and DjVu. I enjoy Vim because it is so text-centric. I like to write and read texts on the computer screen, but I had no working open-source tool for Optical Character Recognition (OCR).

The only OCR program that I used (rarely) was a relic from Win98 times. It runs under Wine. I bought it as "bundled software" and it has two major defects. First, it has only a GUI and I cannot access it from the command line. Thus I cannot integrate it with other text-processing operations. Second, it does not recognize characters specific to the Romanian language.

The open-source OCRs that I knew were interesting experiments. I did not know, however, how to make them work together with other text-processing tools.

I discovered Tesseract by accident. I was looking for a solution to the problem of inserting text behind a DjVu image.

Brief History

According to Wikipedia, Tesseract was developed by Hewlett-Packard until 1995. The University of Nevada at Las Vegas was also involved. After 10 years, HP lost all interest in the OCR software.

Enthusiastic programmers from Google took the Tesseract source and adapted it to the world of open source. Tesseract was released under the Apache License.

The tesseractTrainer.py program, created by Cătălin Frâncu, is released under the GPL. It is part of the Tesseract OCR tool set.

Installation

Tesseract is an OCR engine. The Tesseract Project is located on Google Code.

On the Tesseract Project's page you can find a list of downloadable files. Look there and choose what you need.

When I first noticed Tesseract, I found out that Ubuntu can install it. I looked at the Fedora 7 RPMs and found a Tesseract package there.

I have tested the building of RPMs for the installation of Tesseract on Fedora 7. The building went without problems for tesseract-2.01.tar.gz and tesseract-2.03.tar.gz. I used the spec file from the Fedora 7 source RPM, making the obvious adaptations for the newer source package on Google Code. 

Now comes the great advantage of Tesseract, for me at least: the data for the Romanian language. I have downloaded language data for English (you need it when you build Tesseract!), Dutch, Spanish, German, Italian and French from the download page. But if you have a look in the file section of the Tesseract Project, you find tessdata_ron.tar.bz2. This is a file with data for Romanian characters. It is not difficult to repackage the archive in the format used for the standard languages and, with a modified spec, build the corresponding RPM for Romanian data. In the spec file you must make some trivial changes (see the figure).

Figure: Spec file for language data including Romanian

I also recommend downloading Cătălin Frâncu's excellent program, tesseractTrainer.py. You need Python for this program. This is not a problem under Ubuntu or Fedora. In principle, it should work under Windows too, but you have to install Python first.

Use

Tesseract needs TIFF files. Let us say that you have a folder with TIFF files. First, open a console in that folder. The basic command is

tesseract file-name.tif file-name

Tesseract automatically adds the txt extension to the output.
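
The basic command handles one file at a time. For a whole folder of TIFF files, a small shell loop is enough. What follows is only a sketch: the function name ocr_folder and the DRY_RUN variable are hypothetical names chosen for illustration, not part of Tesseract. It assumes that the files end in .tif and that tesseract is on the path.

```shell
# Hypothetical helper: run the basic tesseract command on every .tif
# file in a folder. With DRY_RUN=1 the commands are only printed.
ocr_folder() {
    dir=${1:-.}
    for img in "$dir"/*.tif; do
        # Skip the unexpanded pattern when the folder has no .tif files
        [ -e "$img" ] || continue
        base=${img%.tif}    # tesseract appends the txt extension itself
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "tesseract $img $base"
        else
            tesseract "$img" "$base"
        fi
    done
}
```

Calling DRY_RUN=1 ocr_folder scans would list the commands for a folder named scans without running them, which is a cheap way to check the file names before a long OCR run.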

Options are inserted at the end of the command. For example:

tesseract file-name.tif file-name batch

tesseract file-name.tif file-name batch.nochop

If you add just batch, this tells tesseract that "all defaults are correct" (cf. the text of the tesseract source). You need this option, however, in order to add makebox (see below) or box.train (see the section on training).

In my tests batch.nochop seemed to work only with images at higher resolutions. The batch option, by contrast, works even with lower-resolution images, but the results are not very good.

Tesseract does not work with screen captures. The resolution of the screen is too low. The resolution of the scanned pages must be at least 200x200 pixels per inch. Intuitively, this means that a scanned A4 page looks "pretty big" on the screen, with letters which seem enlarged (as in the Tesseract Box Editor figure below). See the figure below for a result obtained with the test file from the FreeOCR distribution from Softi Software.

Figure: Test with the file from Softi Software
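
To make the 200 pixels-per-inch requirement concrete, one can work out how many pixels a scanned page needs. The sketch below assumes an A4 page of 210 x 297 mm and 25.4 mm to the inch; the function name min_pixels is hypothetical.

```shell
# Hypothetical helper: pixels needed to cover a length given in
# millimetres at a given resolution (pixels per inch), rounded to nearest.
# Integer arithmetic only: 1 inch = 25.4 mm, so pixels = mm * ppi / 25.4.
min_pixels() {
    mm=$1
    ppi=$2
    echo $(( (mm * ppi * 10 + 127) / 254 ))
}

# An A4 page (210 x 297 mm) scanned at 200 pixels per inch
echo "A4 at 200 ppi: $(min_pixels 210 200) x $(min_pixels 297 200) pixels"
```

A full A4 page at the minimum resolution thus comes to about 1654 x 2339 pixels, which is more than a screen capture of a whole page typically provides.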

The language option is inserted like this:

tesseract file-name.tif file-name -l ron

This is the option for Romanian. "Language" does not mean that tesseract understands the language; tesseract is an OCR: it recognizes characters. Each language has its specific characters and the language option tells that to the program.

The command

tesseract file-name.tif file-name-box batch makebox

generates the file-name-box.txt file. This is a list of the positions of the characters on the page. Generate it with the same options as the text file, if you want to work with both files.
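
Since the box file is plain text, it can be inspected with standard tools. The sketch below assumes the layout described in the Tesseract training documentation for this generation of the program: one character per line, followed by four integer coordinates (left, bottom, right, top). Later versions may append further fields, which the sketch ignores. The function name box_summary is hypothetical.

```shell
# Hypothetical helper: count how often each character occurs in a box file.
# Assumes one recognized character per line, with the character itself
# in the first whitespace-separated field and the coordinates after it.
box_summary() {
    awk 'NF >= 5 { count[$1]++ }
         END { for (c in count) print count[c], c }' "$1" |
        sort -k1,1nr -k2,2    # most frequent characters first
}
```

Run as box_summary file-name-box.txt, this prints one line per distinct character with its count. Characters that appear only once or twice and do not belong to the language of the page are often a hint of recognition errors.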

Tesseract training

It is possible to train tesseract to recognize previously unknown characters. The training is described on the Tesseract Project's page.

The tesseractTrainer.py program is used in the process of training. It needs a pair of files: file-name.tif and file-name.box. The program is used as a box editor. In the image I have used a page from a paper of mine.

Figure: Tesseract box editor
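
Because the box editor expects the image and the box file to share a name, it helps to check a folder for incomplete pairs before starting. The following is a sketch under that assumption only; the function name missing_boxes is hypothetical and not part of the Tesseract tool set.

```shell
# Hypothetical helper: list the .tif files in a folder that have no
# matching .box file next to them (same name, different extension).
missing_boxes() {
    dir=${1:-.}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue    # the folder has no .tif files
        box=${img%.tif}.box
        [ -e "$box" ] || echo "$img"
    done
}
```

Note that the makebox command shown earlier writes its output with the txt extension, so the file has to be renamed to complete the pair, for example with mv file-name-box.txt file-name.box.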

Tesseract as engine in freeware

Unlike tesseractTrainer.py, which is released under GPL, Tesseract is released under the Apache license. This means it can be included in non-open-source software. 

Softi Software offers a freeware application called FreeOCR. This application works under Windows (2000, XP or Vista). It requires .NET Framework 2.0.

Tesseract in Free Online OCR

Using the Tesseract engine, Alexey Ryabukhin has built a Free Online OCR service.

The Free Online OCR has several advantages. First, you do not have to bother with intricate command-lines, write your own scripts or install software. Second, it can process several languages, including my native Romanian. Third, it supports layout analysis. 

All that you need in this case is an adequate image of the text that you want to process.