Unofficial Tesseract OCR Training HOWTO

The Tesseract project has an official training wiki page, but it is verbose and combines instructions for multiple 3.0x minor versions. Those directions are hard to follow because they assume familiarity with the software. Moreover, the Tesseract software itself is not resilient to incorrect usage, which results in obscure error codes (instead of meaningful messages), segmentation faults, or glibc double-free errors.

These instructions are a HOWTO for creating Tesseract language files by training from images of sample text in a new font. They are written for Tesseract 3.02 as distributed by Debian 7.4 (wheezy) in the tesseract-ocr package.

These are not the official instructions, and they may not work for newer or older versions of the Tesseract software. Some steps are omitted for simplification! No support is provided! Use at your own risk.

Outline

Collect images containing sample text
Convert images into a multi-page TIFF
Use Tesseract to generate initial box files
Edit box files to correct boundaries and correct recognized characters
Train to create new language data
Use Tesseract to OCR target text

Tools used:

ImageMagick command-line tools
Tesseract 3.02
jTessBoxEditor (requires Java)

Important:

The Tesseract tools expect files to be named in the formats shown. Deviations in file naming may result in runtime errors, or in failures without an obvious error message.

Collect images containing sample text

Copy images containing sample text into a directory of their own. For the purposes of this example, let this be /home/user/ocr/set0. Files for this example will be of PNG type and have suffix .png.

If the characters are small, it may help to preprocess or resize the images. Whatever preprocessing is applied to the training images should also be applied to the 'real' images that Tesseract will later process. Some ImageMagick convert preprocessing flags to try: -resize 500% -level 25%,55%
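As a rough sketch of applying those flags uniformly, the function below writes a preprocessed copy of every PNG into a subdirectory. The function name preprocess_all and the output directory name "preprocessed" are illustrative assumptions, and the flag values are only the starting points suggested above.

```shell
#!/bin/sh
# Sketch: apply one fixed set of ImageMagick flags to every PNG in a directory.
# preprocess_all and the "preprocessed" subdirectory are hypothetical names.

preprocess_all() {
	src_dir=$1
	out_dir=$src_dir/preprocessed
	mkdir -p "$out_dir" || return 1
	for f in "$src_dir"/*.png
	do
		[ -e "$f" ] || continue	# the glob matched nothing
		convert "$f" -resize 500% -level 25%,55% "$out_dir/$(basename "$f")" || return 1
	done
}
```

Calling preprocess_all /home/user/ocr/set0 would leave the original images untouched. The same flags would then need to be applied to the real images before OCR.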

Convert the images into a multi-page TIFF

Convert the images into a single multi-page TIFF file using ImageMagick's convert tool.

The TIFF file's name must have the format [lang].[fontname].exp[num].tiff, where [lang] should be eng for an initial try, [fontname] is a single word (no spaces and no periods), and [num] should be 0 for the first training set.

For the purposes of this example, the fontname will be "thefontname".

Run this command in the /home/user/ocr/set0 directory:


convert *.png eng.thefontname.exp0.tiff

If more than one training set is to be used, the TIFF file should have the same naming convention but with different values of [num].
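Because a misnamed file can fail later without a clear message, it may be worth checking names before going further. The following is a minimal sketch; is_training_name is a hypothetical helper, not part of Tesseract, and the pattern assumes [num] is made of digits only.

```shell
#!/bin/sh
# Sketch: test a filename against [lang].[fontname].exp[num].tiff.
# is_training_name is a hypothetical helper, not a Tesseract tool.

is_training_name() {
	# lang and fontname: no periods or spaces; num: digits only (assumed).
	printf '%s\n' "$1" | grep -Eq '^[^. ]+\.[^. ]+\.exp[0-9]+\.tiff$'
}

# Report every TIFF in the current directory whose name does not fit.
for f in *.tiff
do
	[ -e "$f" ] || continue
	is_training_name "$f" || echo "unexpected name: $f" >&2
done
```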

The multi-page TIFF file can be viewed using ImageMagick's display command-line tool; press the space bar to advance to the next page.

If multiple training sets are to be used, put each set of images into its own directory, e.g. /home/user/ocr/set1 for the second set. Run the convert command in each directory to create a separate TIFF for each training set. Name the resulting multi-page TIFFs the same way except for the number after exp, e.g. the second training file should be named eng.thefontname.exp1.tiff.
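With several sets, the per-directory convert runs could be scripted. This is only a sketch: build_set_tiffs is a hypothetical helper, and it assumes the directories are named set0, set1, and so on under one base directory, as in this example.

```shell
#!/bin/sh
# Sketch: build one multi-page TIFF per setN directory under a base directory.
# build_set_tiffs is a hypothetical helper; the setN layout follows the example.

build_set_tiffs() {
	base=$1 lang=$2 font=$3
	for dir in "$base"/set[0-9]*
	do
		[ -d "$dir" ] || continue
		num=${dir##*/set}	# .../set1 -> 1
		# Run convert inside the set directory, as in the single-set command.
		( cd "$dir" && convert *.png "$lang.$font.exp$num.tiff" ) || return 1
	done
}
```

For the layout used here, the call would be build_set_tiffs /home/user/ocr eng thefontname.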

Use Tesseract to generate initial box files

Make a new directory for the Tesseract work. Copy the eng.thefontname.exp?.tiff files from the prior step into this directory.


mkdir /home/user/ocr/working
cp /home/user/ocr/set0/eng.thefontname.exp0.tiff /home/user/ocr/working
cp /home/user/ocr/set1/eng.thefontname.exp1.tiff /home/user/ocr/working

Now edit the following script, create-boxes.sh, setting the FONTNAME variable to the chosen font name, which in this case is "thefontname".

create-boxes.sh


#!/bin/sh

LANGNAME=eng
FONTNAME=thefontname

# Run Tesseract on each training TIFF to produce an initial .box file.
for a in "$LANGNAME.$FONTNAME".exp*.tiff
do
	BASENAME=$(basename "$a" .tiff)
	tesseract "$a" "$BASENAME" batch.nochop makebox
done

Run the script without any arguments in the Tesseract work directory to create the initial box files. The box files will be named eng.thefontname.exp0.box, etc.
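Since Tesseract can fail quietly, a quick check that every TIFF actually received a box file may save time before opening the editor. A minimal sketch follows; list_missing_boxes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list training TIFFs that have no non-empty .box file beside them.
# list_missing_boxes is a hypothetical helper name.

list_missing_boxes() {
	for t in "$1"/*.exp*.tiff
	do
		[ -e "$t" ] || continue
		box=${t%.tiff}.box
		# -s is true only for a file that exists and is not empty.
		[ -s "$box" ] || echo "$box"
	done
}
```

Running list_missing_boxes /home/user/ocr/working should print nothing if every box file was created.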

Edit box files to correct boundaries and characters

jTessBoxEditor can be obtained from the VietOCR project's file downloads page. These instructions used jTessBoxEditor-1.0.zip. Unzip the file to find jTessBoxEditor.jar, then start it with Java from the Tesseract working directory using these flags:


java -Xms128m -Xmx1024m -jar jTessBoxEditor.jar

Select the "Box Editor" tab. Choose the "Open" button and load the eng.thefontname.exp0.tiff file. It should load the associated eng.thefontname.exp0.box that is in the same directory. (jTessBoxEditor looks for the .box file based upon the rest of the filename of the .tiff file.)

Split or merge boxes so that each contains exactly one character. Select a box by clicking on it. Then, in the box data table on the left, edit the character assigned to the box if it was not correctly identified. Go to the next page of the TIFF using the next page button at the bottom. Repeat until all pages are corrected, then hit "Save" to save the corrected box file.

Repeat for each .box file.
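After manual editing, a rough sanity check of each box file can catch stray lines before training. The sketch below assumes the Tesseract 3.0x box layout of six space-separated fields per line (symbol, left, bottom, right, top, page) with the origin at the bottom-left of the image; check_box_file is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: flag suspicious lines in a box file after manual editing.
# Assumes the box layout: symbol left bottom right top page.
# check_box_file is a hypothetical helper name.

check_box_file() {
	awk '
		NF != 6 {
			printf "%s:%d: expected 6 fields\n", FILENAME, FNR
			bad = 1
			next
		}
		$2 >= $4 || $3 >= $5 {
			printf "%s:%d: empty or inverted box\n", FILENAME, FNR
			bad = 1
		}
		END { exit bad + 0 }
	' "$1"
}
```

The function returns nonzero if any line was flagged, so it could be run over every .box file in a loop before training.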

Train to create new language data

Edit LANGNAME and FONTNAME in the following script, train-all.sh, and run it in the working directory. It will read the .tiff files and the .box files and call the other Tesseract utilities. The result will be a subdirectory named tessdata that contains the Tesseract files for the chosen language and font.

train-all.sh


#!/bin/sh

LANGNAME=eng
FONTNAME=thefontname

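# Generate a .tr feature file from each TIFF and its corrected .box file.
for a in "$LANGNAME.$FONTNAME".exp*.tiff
do
	BASENAME=$(basename "$a" .tiff)
	tesseract "$a" "$BASENAME" box.train
done

# Collect the set of characters used across all box files.
unicharset_extractor "$LANGNAME.$FONTNAME".exp*.box

# font_properties format: fontname italic bold fixed serif fraktur (each 0 or 1).
echo "$FONTNAME 0 0 0 0 0" > font_properties

shapeclustering -F font_properties -U unicharset "$LANGNAME.$FONTNAME".exp*.tr
mftraining -F font_properties -U unicharset -O "$LANGNAME.unicharset" "$LANGNAME.$FONTNAME".exp*.tr
cntraining "$LANGNAME.$FONTNAME".exp*.tr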

mkdir -p tessdata
cp unicharset tessdata/$LANGNAME.unicharset
cp pffmtable tessdata/$LANGNAME.pffmtable
cp normproto tessdata/$LANGNAME.normproto
cp inttemp tessdata/$LANGNAME.inttemp
cp shapetable  tessdata/$LANGNAME.shapetable

cd tessdata
combine_tessdata $LANGNAME.
cd ..
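Because the training tools may fail without a clear message, it can help to confirm that the expected files exist afterwards. The sketch below checks for the five component files copied by the script plus the combined eng.traineddata file that combine_tessdata is expected to write; check_training_output is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: confirm that training left the expected files in tessdata.
# check_training_output is a hypothetical helper name.

check_training_output() {
	dir=$1 lang=$2 status=0
	for part in unicharset pffmtable normproto inttemp shapetable traineddata
	do
		if [ ! -s "$dir/$lang.$part" ]
		then
			echo "missing or empty: $dir/$lang.$part" >&2
			status=1
		fi
	done
	return $status
}
```

After train-all.sh finishes, check_training_output tessdata eng should return zero and print nothing.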

Use Tesseract to OCR target text

The following script takes an image and prints the identified text to stdout. Edit the script so that the TESSDATA_PREFIX variable points to the parent of the tessdata directory described in the previous section. The TESSDATA_PREFIX path must end in a '/' character, and the tessdata directory from the previous step must be in this directory.
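Both requirements on TESSDATA_PREFIX are easy to get wrong, so a small pre-flight check may be useful. This is a sketch only: check_tessdata_prefix is a hypothetical helper, and it assumes the training step produced tessdata/eng.traineddata.

```shell
#!/bin/sh
# Sketch: check a candidate TESSDATA_PREFIX before using it.
# check_tessdata_prefix is a hypothetical helper name.

check_tessdata_prefix() {
	# The prefix must end in '/'.
	case $1 in
		*/) ;;
		*) echo "prefix must end in '/': $1" >&2; return 1 ;;
	esac
	# The tessdata directory with the trained language must sit directly below it.
	[ -f "${1}tessdata/$2.traineddata" ] || {
		echo "not found: ${1}tessdata/$2.traineddata" >&2
		return 1
	}
}
```

For this example the call would be check_tessdata_prefix /home/user/ocr/working/ eng.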

(There does not appear to be a stdout mode in Tesseract 3.02, so the script below works around this with a named pipe. Version 3.03 is supposed to have a stdout mode, but I have not tried it because much of the rest of the procedure described here may not apply.)

(The image must be a TIFF file. If it is not, use the ImageMagick convert command to convert it from its original format. The TIFF must be a regular file; it cannot be read from a stream.)

ocr-one-image.sh


#!/bin/sh

# Tesseract writes its output to $BASE.txt, so make that path a named pipe.
BASE=f_$$
FILE=$BASE.txt

mkfifo "$FILE"
TESSDATA_PREFIX=/home/user/ocr/working/ \
tesseract "$@" "$BASE" -l eng > /dev/null &
cat "$FILE"
wait
rm "$FILE"

Run it like this:


sh ocr-one-image.sh image-containing-some-text.tiff

The result will be printed to stdout.
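As an alternative to the named pipe, the same effect can be sketched with a temporary directory. This is not the script above but a variant under the same assumptions (language eng, TESSDATA_PREFIX set to /home/user/ocr/working/); ocr_to_stdout is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print OCR text to stdout using a temporary file instead of a FIFO.
# ocr_to_stdout is a hypothetical helper name.

ocr_to_stdout() {
	tmpdir=$(mktemp -d) || return 1
	# Tesseract appends .txt to the output base name it is given.
	TESSDATA_PREFIX=/home/user/ocr/working/ \
	tesseract "$1" "$tmpdir/out" -l eng > /dev/null
	status=$?
	if [ "$status" -eq 0 ]
	then
		cat "$tmpdir/out.txt"
	fi
	rm -rf "$tmpdir"
	return $status
}
```

A temporary file avoids the background process and leaves nothing behind if Tesseract fails, at the cost of writing the text to disk once. It would be called as ocr_to_stdout image-containing-some-text.tiff.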