Mihail Radu Solcan

Insert OCRed text and annotations in DjVu (manual method)

2008-12-29

The method described in this note is pretty similar to the procedures presented in the notes on the insertion of hidden text and annotations behind the image of a handwritten text in a DjVu file. The main difference is the use this time of the Tesseract OCR program. In this case, the boxes are created not by the gridMaker, but by the OCR program.

Before I start, I should add the usual words of caution: work with copies of the files; work in a directory that does not contain any precious data. Take responsability for anything you do!

Generate the box files

For the sake of an example, let us say that you want to transform an article about Champollion from the 1825 Westminster Review into a DjVu. The entire volume is digitized by Google and you may download it from Google Books.

The massive 27.3 MB pdf you get from Google Books proved to be a stumbling block. I tried to extract tiffs with Fred Smith's script from Groklaw, but it did not work. However, Smith's basic idea is to mimic pdf2ps is good. I used pdftk (you find it on GNU/Linux distributions) and I split the pdf into separate pages with the command:

pdftk name-of-the-file.pdf burst

Then I applied an adapted version of the command that you find in pdf2ps for the transformation of each pdf into a tiff. You find the code in pdf2tif in the downloadable archive.

The most important step is the use of the OCR. I have used the Tesseract OCR. The commands are in the script tess-ocr that you can also find in the archive. They are explained in the note on Tesseract OCR.

It might take some time to process all the tif files. On my box, 200 pages were OCR-ed in an hour. Thus you may just have a look at pages 50-70 (the article on Champollion).

The main result at this stage is the series of pairs of tiff and box files. Of course, there are also txt files (with the text of the pages) and the respective pdf files.

Edit and extract coordinates from box files

The tool that I have used for work with the tiff and the box files is the Box Editor. This is a modified version of the tesseractTrainer.py, a program written by Cătălin Frâncu (see details in the note).

Adjust the box around a word

In this context, you can use the Box Editor for two main tasks: adjust the boxes and copy the coordinates for use in the djvused script.

Using the manual method, you enlarge a box until it surrounds a word (see the image). Then you copy the relevant coordinates.

The djvused script

The creation of the djvused script is explained in the notes on the insertion of text and the insertion of annotations behind the image of handwritten text.

For example, the name of Champollion is inserted as text with the command:

(word 585 903 783 949 "Champollion")

The corresponding annotation might be:

(maparea "" "Champollion" (rect 585 903 198 46) (none) (hilite #00cfcf))

You can find the whole model-script in the archive.

Now, you have to create the djvu files and bundle them together, as it is explained in the note on DjVu e-books. In this particular case, because we have a collection of pdf files we might just use (on the commandline)

pdftk *.pdf cat output 1825-art-on-Champollion.pdf

Of course, the name of the pdf file is just an accident. Use the appropriate name in your case. Then

pdf2djvu 1825-art-on-Champollion.pdf -o 1825-art-on-Champollion.djvu

Read the note on pdf2djvu for more information on pdf2djvu.

Now, one has to pay attention to the fact that pdf2djvu has renamed the component djvu files. In DjView4 you can get the necessary information using the menu: go to View and then to Information (or just press Ctrl+I). Also, in order to build the djvused script, first issue (on the commandline)

djvused name.djvu -e 'output-all' > name.dsed

Use an appropriate “name”. For example, in our case, “1825-art-on-Champollion”.

Add your commands to the dsed file (see the model-script.dsed in the attached archive).

You insert the text and annotations in the djvu file with a command like this:

djvused name.djvu -f name.dsed -s

The result of the insertion of three highlighted annotations is shown in the figure.

DjVu with highlighted text

The basic files

Now, we have a lot of files in the working folder. Should we keep them all? I think that we should keep just a few files.

The most important file is, of course, the djvu file with all the pages, the hidden text and the annotations. Keep this file!

In our example, the name of the file is “1825-art-on-Champollion.djvu”. From this file, on the commandline, it is possible to extract a lot: the size of a page, information on the document, a djvused script, a page or all the pages in tiff format. I have included in the archive a list of relevant commands.

What else? I think that you should keep the manually edited box-files. They will be useful if you come back to the file and want to add more text or annotations. You can keep them in an archive.

What about the images? Should we keep them as pdf files? I do not think the pdf is the best solution. I would rather opt for xcf, the standard format of the Gimp.

A xcf file has layers. You can add new layers with your own graphics and transform them into djvu pages. Later, if you want, you can remove layers from the image.

But how we create and manipulate the xcf files? Is it possible to use the console? Yes. I followed the suggestion of user saulgoode on gimptalk and with an obvious adaptation of his script it is possible to convert all the tiff files in a folder with

gimp -i -b '(tif2xcf "*.tif")' '(gimp-quit 0)'

Working with the Gimp it is possible to modify the images. For example, you can add free-hand signs on the image. Then you have to go back from the xcf to a ppm and convert it to a djvu page. You have to delete the old page from the djvu and insert the new one. Anyway, pay attention with such commands! They might destroy files.

djvm -d 1825-art-on-Champollion.djvu 8

The above command cuts page 8 off. The next puts another page in its place:

djvm -i 1825-art-on-Champollion.djvu pg_0057.djvu 8

Then you have to reinsert the hidden text and annotations! Thus you must prepare the djvused script before you delete pages. And you must take into account the name and characteristics of the new page.

The image shows a combination of drawn mark-up and DjVu style annotation.

Graphical mark-up and DjVu annotation

The example that illustrates this note uses rather large files with a high resolution. In fact, the results of the OCR are good. You are not going, probably, to use manual insertion of OCRed text on such files. In practice, I have noticed that this method is useful with low resolution files (for example, images in the web pages). These files tend to have much smaller sizes.