In the note on Tesseract OCR, I talked about tesseractTrainer.py, a program created by Cătălin Frâncu. You can download the program from Google Code. Thanks, Cătălin Frâncu!
In this note, I will show how Frâncu's program can be used as a sort of Swiss Army knife. What for? Getting the coordinates of rectangles in an image of a text. I need this for inserting text into a DjVu file or writing an image map for an HTML file.
Frâncu's program is written in Python and it is released under the GPL license. This is very important. First, you can see the Python source. Second, you can study it and adapt it to your own needs.
Could a Python program be offered as an executable (“exe” file under Windows)? Yes. This short article tells how you can do this. There are certain technical limits to the operation, but their discussion is beyond the topic of this note. The problem with exe files is rather of an ethical nature, even if you do not have to pay (money!) for them. The freeware is “available at zero price but only as an executable, which is a mysterious bunch of numbers. What it does is secret. You can’t study it; you can’t change it; and you certainly can’t publish it in your own modified version.”(Richard Stallman)
I am going to call my modified version of Frâncu's tesseractTrainer “Tesseract Box Editor”. This is the “internal” name, the title which appears on the program's window. There are two main reasons for using this name. First, the author of tesseractTrainer is not responsible for any mistakes in the modifications. Second, the uses of the Tesseract Box Editor that I had in mind when I modified the software are not connected with Tesseract's training (for the recognition of new characters).
First, you should get the modifications. They are in this diff file. Copy tesseractTrainer.py into an empty folder. Rename it tesseractBoxEditor.py. Open a terminal and change the directory to the folder with tesseractBoxEditor.py. Apply the modifications with the command:
patch tesseractBoxEditor.py box.diff
Alternatively, you may download the archive with the modified version.
I will explain the main changes.
First, you may call the Box Editor from the command line. For example:
In this example I supposed that the Box Editor is in the current directory. Pay attention! You also need the corresponding box file (see the note on Tesseract OCR).
Second, the name of the tiff file (the image file) appears as the title of the main window. This is useful when you have many tiff files around.
Third, the coordinates of the boxes are shown in two styles. Frâncu's program uses the convention for tiff and other image files. The top-left corner has the coordinates (0, 0). The bottom-right corner has the maximal values for the x and y coordinates.
The modified version also shows the Cartesian coordinates, with the origin in the bottom-left corner. The label “l-down” stands for the left-down y value. The label “r-up” stands for the right-up y value. This is the style used in PostScript and DjVu files.
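The relation between the two styles is a simple flip of the y axis. Here is a minimal sketch in Python, not taken from the Box Editor itself. The function name and the example values are hypothetical, and I assume that y values are measured as distances from the edge of the image, so that no off-by-one correction is needed.

```python
def top_left_to_bottom_left(left, top, right, bottom, image_height):
    """Convert a box from image style (origin top-left, y grows downward)
    to Cartesian style (origin bottom-left, y grows upward).

    x values are unchanged; each y value is subtracted from the image height.
    """
    y_low = image_height - bottom   # lower edge of the box, the "l-down" y value
    y_high = image_height - top     # upper edge of the box, the "r-up" y value
    return left, y_low, right, y_high


# A box in an image that is 1000 pixels high (hypothetical values).
print(top_left_to_bottom_left(10, 20, 110, 60, 1000))  # (10, 940, 110, 980)
```

Note that the x values stay the same and the vertical order of the two y values is swapped: the edge that was the top in image style has the larger y in Cartesian style.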
Fourth, the Commands menu offers the possibility to copy (Ctrl+C) the DjVu style coordinates. This is useful for the manual insertion of the hidden text layer in DjVu files. In the case of DjVu annotations, use Ctrl+A to copy the coordinates of the left-down point together with the width and height (for rectangular and oval areas).
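To show where such coordinates end up, here is a small Python sketch. It is an illustration only: the function names are hypothetical, and I assume the s-expression syntax understood by djvused, namely (word xmin ymin xmax ymax "text") for the hidden text layer and (rect xmin ymin width height) for an annotation area. The exact text that the Box Editor puts on the clipboard may differ.

```python
def djvu_word(xmin, ymin, xmax, ymax, text):
    """Format one word of a hidden text layer (assumed djvused syntax).

    Coordinates are Cartesian, with the origin in the bottom-left corner.
    """
    # Escape backslashes and double quotes inside the quoted string.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '(word %d %d %d %d "%s")' % (xmin, ymin, xmax, ymax, escaped)


def djvu_rect(xmin, ymin, xmax, ymax):
    """Format a rectangular annotation area: left-down point, width, height."""
    return "(rect %d %d %d %d)" % (xmin, ymin, xmax - xmin, ymax - ymin)


print(djvu_word(10, 940, 110, 980, "note"))  # (word 10 940 110 980 "note")
print(djvu_rect(10, 940, 110, 980))          # (rect 10 940 100 40)
```

The same box gives two different sets of numbers: the text layer wants two corners, while the annotation wants one corner plus the size.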
Fifth, the Commands menu offers the possibility to copy (Ctrl+M) the coordinates in the style of HTML image maps. This is useful for creating an HTML file with clickable areas on the images.
Paste the values of the coordinates with a text editor.
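For the HTML case, the pasted values go into the coords attribute of an area element. The sketch below is only an illustration with hypothetical names and values. It relies on the fact that an HTML image map uses the image style, with the origin in the top-left corner, and that a rectangle is written as left,top,right,bottom.

```python
from html import escape


def html_area(left, top, right, bottom, href, alt=""):
    """Build an <area> element for a rectangular clickable region.

    Coordinates are in image style (origin top-left), as HTML expects.
    """
    coords = "%d,%d,%d,%d" % (left, top, right, bottom)
    return '<area shape="rect" coords="%s" href="%s" alt="%s">' % (
        coords,
        escape(href, quote=True),  # protect quotes and ampersands in the URL
        escape(alt, quote=True),
    )


print(html_area(10, 20, 110, 60, "page2.html", "Next page"))
# <area shape="rect" coords="10,20,110,60" href="page2.html" alt="Next page">
```

The area elements are then placed inside a map element, which the img element refers to through its usemap attribute.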
In my home directory I have created the folders share/bin.
I have put tesseractBoxEditor.py in that bin folder.
I do not have ~/share/bin in my PATH. If you want to call the Box Editor from the command line, you might put the .py file in a more suitable location.
In GNOME, I call the Box Editor from the context menu. The trick is to put a Python script in ~/.gnome2/nautilus-scripts.
I have adapted a Nautilus Python script which extracts archives. The Nautilus Python script gets the URL of the tiff file and invokes the Box Editor. You may download the script from here.
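For readers who prefer to write their own, here is a minimal sketch of such a Nautilus script. It is not the script offered for download. It makes several assumptions: the editor is in ~/share/bin, it accepts the image path as its first argument, the interpreter is called python, and Nautilus passes the selected local files in the environment variable NAUTILUS_SCRIPT_SELECTED_FILE_PATHS, one path per line.

```python
#!/usr/bin/env python
import os
import subprocess

# Assumed location of the editor; adjust it to your own setup.
EDITOR = os.path.expanduser("~/share/bin/tesseractBoxEditor.py")


def selected_tiffs(environ):
    """Return the selected paths that look like tiff files."""
    raw = environ.get("NAUTILUS_SCRIPT_SELECTED_FILE_PATHS", "")
    paths = [p for p in raw.splitlines() if p]
    return [p for p in paths if p.lower().endswith((".tif", ".tiff"))]


def main():
    # Open one editor window per selected tiff file.
    for path in selected_tiffs(os.environ):
        subprocess.Popen(["python", EDITOR, path])


if __name__ == "__main__":
    main()
```

The script must be made executable (chmod +x) before Nautilus shows it in the Scripts submenu of the context menu.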
Note that I do not use the “Open image…” command from the File menu of the Box Editor. I open each image file from the context menu.
First, you need tiff files. They might be created by your scanner. Otherwise, use convert (from ImageMagick):
convert wsh.png wsh.tif
or some other program for converting images.
You can convert pdf files to tiff. I find Fred Smith's script pdf2tif an excellent idea. Smith modified the pdf2ps script. The essential command is:
exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
The script calls GhostScript and creates separate tiff files for the pages of the pdf.
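If you would rather drive GhostScript from Python than from a shell script, the same options can be assembled as an argument list. This is a sketch only, not Smith's script. The function name and the output pattern page-%03d.tif are my own assumptions; the pattern uses GhostScript's %d substitution, which writes one numbered file per page.

```python
import subprocess


def gs_tiff_command(pdf_path, out_pattern="page-%03d.tif", dpi=300):
    """Build the GhostScript command line that renders a pdf to tiff pages."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-r%dx%d" % (dpi, dpi),          # resolution, 300x300 as in the note
        "-sDEVICE=tiffg3",               # black and white fax-style tiff
        "-sOutputFile=" + out_pattern,   # %03d becomes the page number
        "-c", "save", "pop",
        "-f", pdf_path,
    ]


def pdf_to_tiff(pdf_path):
    """Run GhostScript; requires gs to be installed and on the PATH."""
    return subprocess.call(gs_tiff_command(pdf_path))


print(" ".join(gs_tiff_command("ebook.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.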
In the case of DjVu, you may use a command like this:
ddjvu -format=tiff ebook.djvu ebook.tiff
Extract the pages from the tiff e-book with:
The program tiffsplit is included in the libtiff package.
Second, you need a box file for the tiff file. Tesseract is able to generate such a file.
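A box file is plain text with one box per line. The following Python sketch reads such a line. It assumes the layout "symbol left bottom right top", optionally followed by a page number, with the y values counted from the bottom of the image, as in the box files that Tesseract writes. The function name and the sample line are hypothetical.

```python
def parse_box_line(line):
    """Split one box file line into (symbol, left, bottom, right, top, page)."""
    parts = line.split()
    if len(parts) < 5:
        raise ValueError("not a box line: %r" % line)
    symbol = parts[0]
    left, bottom, right, top = (int(value) for value in parts[1:5])
    # Older box files have no page column; treat them as page 0.
    page = int(parts[5]) if len(parts) > 5 else 0
    return symbol, left, bottom, right, top, page


print(parse_box_line("T 36 92 53 116 0"))  # ('T', 36, 92, 53, 116, 0)
```

With the boxes in this form, it is easy to write a small script that prints all of them in DjVu or HTML style at once.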
The internal commands of the Box Editor are quite simple. A box is enlarged with:
Ctrl + arrow (left, up, down, right)
A box is shrunk with:
Ctrl + Shift + arrow
Why use the arrow keys? Why not the mouse? I find the arrows much more precise. When you use the boxes for the image of a text, each pixel is important. With the mouse it would also be difficult to mark the lines of text in the image exactly.
To move from one box to another, use the following keys:
Tab moves the cursor to the next box.
Shift + Tab moves the cursor to the previous box.
The up and down arrows move the cursor to the box above and the box below, respectively.
Click on a box to move the cursor there.
The rest of the commands should be easily guessed from the menus.