The official Tesseract training wiki page is verbose and combines instructions for multiple 3.0x minor versions. Its directions are hard to follow because they assume familiarity with the software. Moreover, the Tesseract software itself is not resilient to incorrect usage: mistakes result in obscure error codes (instead of meaningful messages), segfaults, or glibc double-free error messages.
These instructions are a HOWTO for creating Tesseract language files by training from images of sample text in a new font. They are written for Tesseract 3.02 as distributed by Debian 7.4 (wheezy) in the tesseract-ocr package.
These are not the official instructions, and they may not work for newer or older versions of the Tesseract software. Some steps are omitted for simplification! No support is provided! Use at your own risk.
Tools used:
Important:
If the characters are small, it may help to preprocess or resize them.
Whatever preprocessing is done should be the same as is used for the 'real' text that Tesseract will process.
Some ImageMagick convert tool preprocessing flags to try: -resize 500% -level 25%,55%
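As a rough sketch, the flags above could be applied to every sample image in one pass. The function name preprocess_images and the idea of writing into a separate output directory are assumptions made for illustration; the flag values are only the starting points suggested above and will likely need tuning.

```shell
#!/bin/sh
# Sketch: apply the same ImageMagick preprocessing to every PNG in a directory.
# preprocess_images is a hypothetical helper; results go to a separate
# directory so the originals are left untouched.
preprocess_images() {
    src=$1
    out=$2
    mkdir -p "$out"
    for img in "$src"/*.png; do
        [ -e "$img" ] || continue      # the glob matched nothing
        name=$(basename "$img")
        convert "$img" -resize 500% -level 25%,55% "$out/$name"
    done
}
```

It might be called as `preprocess_images /home/user/ocr/raw /home/user/ocr/set0`, where /home/user/ocr/raw is a made-up location for the unprocessed images. The same flags would then have to be applied to the real images before running OCR on them.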
Combine the sample images into a multi-page TIFF file using the ImageMagick convert tool.
The TIFF file's filename must have the format [lang].[fontname].exp[num].tiff, where [lang] should be eng for an initial try, [fontname] is a single word (no spaces and no periods), and [num] should be 0 for the first training set.
For the purposes of this example, the fontname will be "thefontname".
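Because a wrongly named file tends to fail later with an obscure error, it may be worth checking names up front. The following is a minimal sketch of such a check; is_training_name is a hypothetical helper, not part of Tesseract, and the pattern simply encodes the naming rule described above.

```shell
#!/bin/sh
# Sketch: succeed only if the name follows [lang].[fontname].exp[num].tiff,
# with no spaces or periods inside [lang] or [fontname].
# is_training_name is a hypothetical helper.
is_training_name() {
    printf '%s\n' "$1" | grep -Eq '^[^. ]+\.[^. ]+\.exp[0-9]+\.tiff$'
}
```

For example, `is_training_name eng.thefontname.exp0.tiff && echo ok` prints "ok", while a name such as "eng.the font.exp0.tiff" is rejected.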
Run this command in the /home/user/ocr/set0 directory:
convert *.png eng.thefontname.exp0.tiff
If more than one training set is to be used, the TIFF file should have the same naming convention but with different values of [num].
The multi-page TIFF file can be viewed using the ImageMagick display command line tool; hit the space bar to advance to the next page.
If multiple training sets are to be used, put the files into different directories, e.g. /home/user/ocr/set1 for the second set of images. Run the convert command to create a separate TIFF for each training set. Name the resulting multi-page TIFFs similarly except with a different number after exp, e.g. the second training file should be named eng.thefontname.exp1.tiff.
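With several training sets, the per-directory convert runs could be scripted along these lines. This is only a sketch: build_set_tiffs is a hypothetical helper, and it assumes every set lives in a directory named set[num] that holds the PNG samples for that set.

```shell
#!/bin/sh
# Sketch: build one multi-page TIFF per training-set directory (set0, set1, ...).
# build_set_tiffs is a hypothetical helper; it assumes directories are named
# set[num] and reuses [num] as the exp number in the TIFF name.
build_set_tiffs() {
    root=$1
    lang=$2
    font=$3
    for dir in "$root"/set[0-9]*; do
        [ -d "$dir" ] || continue
        num=${dir##*/set}              # ".../set1" -> "1"
        convert "$dir"/*.png "$dir/$lang.$font.exp$num.tiff"
    done
}
```

For the layout used in this example it would be run as `build_set_tiffs /home/user/ocr eng thefontname`.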
mkdir /home/user/ocr/working
cp /home/user/ocr/set0/eng.thefontname.exp0.tiff /home/user/ocr/working
cp /home/user/ocr/set1/eng.thefontname.exp1.tiff /home/user/ocr/working
Now edit this script, create-boxes.sh, setting the FONTNAME variable to the chosen font name, which in this case is "thefontname".
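#!/bin/sh
# Create an initial box file for each training TIFF in the current directory.
LANGNAME=eng
FONTNAME=thefontname
for a in "$LANGNAME.$FONTNAME".exp*.tiff
do
    BASENAME=$(basename "$a" .tiff)
    # Writes $BASENAME.box next to the TIFF.
    tesseract "$a" "$BASENAME" batch.nochop makebox
done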
Run the script without any arguments in the Tesseract work directory to create the initial box files. The box files will be named eng.thefontname.exp0.box, etc.
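Since Tesseract does not always report failures clearly, a quick check that every TIFF actually got a box file can save time. The sketch below assumes it is run in the work directory; missing_box_files is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the name of every expected .box file that is missing or empty.
# missing_box_files is a hypothetical helper; run it in the work directory.
missing_box_files() {
    lang=$1
    font=$2
    for tiff in "$lang.$font".exp*.tiff; do
        [ -e "$tiff" ] || continue     # the glob matched nothing
        box=${tiff%.tiff}.box
        [ -s "$box" ] || echo "$box"
    done
}
```

Running `missing_box_files eng thefontname` prints nothing when all box files are present.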
java -Xms128m -Xmx1024m -jar jTessBoxEditor.jar
Select the "Box Editor" tab. Choose the "Open" button and load the eng.thefontname.exp0.tiff file. It should load the associated eng.thefontname.exp0.box that is in the same directory. (jTessBoxEditor looks for the .box file based upon the rest of the filename of the .tiff file.)
Split or merge boxes so that each contains exactly one character. Select a box by clicking on it over its character. Then, in the box data table on the left, edit the character for each box that was not correctly identified. Go to the next page of the TIFF using the next page button at the bottom. Repeat until all pages are corrected. Then hit "Save" to save the corrected box file.
Repeat for each .box file.
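After editing, a box file can be sanity-checked before training. In Tesseract 3.0x each box-file line has six space-separated fields: the symbol, then left, bottom, right, and top coordinates, then the page number. The sketch below flags lines that do not fit that shape; check_box_file is a hypothetical helper, not a Tesseract tool, and it does not verify that the symbols themselves are correct.

```shell
#!/bin/sh
# Sketch: report box-file lines that do not look like
#   <symbol> <left> <bottom> <right> <top> <page>
# check_box_file is a hypothetical helper; it exits non-zero if any line is bad.
check_box_file() {
    awk '
        NF != 6 {
            printf "%s:%d: expected 6 fields, found %d\n", FILENAME, FNR, NF
            bad = 1
            next
        }
        {
            # Fields 2-6 must be non-negative integers.
            for (i = 2; i <= 6; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "%s:%d: field %d is not a number\n", FILENAME, FNR, i
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

It could be run on each file in turn, for example `check_box_file eng.thefontname.exp0.box`.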
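#!/bin/sh
LANGNAME=eng
FONTNAME=thefontname
# Generate a .tr feature file from each TIFF and its corrected box file.
for a in "$LANGNAME.$FONTNAME".exp*.tiff
do
    BASENAME=$(basename "$a" .tiff)
    tesseract "$a" "$BASENAME" box.train
done
# Build the character set from all of the box files.
unicharset_extractor "$LANGNAME.$FONTNAME".exp*.box
# font_properties fields: fontname italic bold fixed serif fraktur (0 or 1 each).
echo "$FONTNAME 0 0 0 0 0" > font_properties
# Cluster the shapes and train the classifier prototypes.
shapeclustering -F font_properties -U unicharset "$LANGNAME.$FONTNAME".exp*.tr
mftraining -F font_properties -U unicharset -O "$LANGNAME.unicharset" "$LANGNAME.$FONTNAME".exp*.tr
cntraining "$LANGNAME.$FONTNAME".exp*.tr
# Copy the trained files under the language prefix and package them.
mkdir -p tessdata
cp unicharset "tessdata/$LANGNAME.unicharset"
cp pffmtable "tessdata/$LANGNAME.pffmtable"
cp normproto "tessdata/$LANGNAME.normproto"
cp inttemp "tessdata/$LANGNAME.inttemp"
cp shapetable "tessdata/$LANGNAME.shapetable"
cd tessdata
# Produces $LANGNAME.traineddata from the $LANGNAME.* files.
combine_tessdata "$LANGNAME."
cd ..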
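If one of the training tools fails partway through, the script above may carry on and package an incomplete set of files. As a hedged sketch, the intermediate files it copies into tessdata could be checked first; missing_training_files is a hypothetical helper, and the file names are simply the ones the script copies.

```shell
#!/bin/sh
# Sketch: list the trainer outputs that are missing or empty before packaging.
# missing_training_files is a hypothetical helper; run it in the work directory.
missing_training_files() {
    for f in unicharset pffmtable normproto inttemp shapetable; do
        [ -s "$f" ] || echo "$f"
    done
}
```

Empty output from `missing_training_files` means all five files are present and non-empty; it says nothing about whether their contents are good.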
(There does not appear to be a stdout mode in Tesseract 3.02, so this hack is needed. Version 3.03 is supposed to have a stdout mode, but I have not tried it because much of the rest of the procedure described here may not apply to it.)
(The image must be a TIFF file. If it is not, use the ImageMagick convert command to convert it from a different image file format. The TIFF file cannot be a stream.)
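#!/bin/sh
# Tesseract 3.02 only writes to <base>.txt, so make that path a named pipe.
BASE=f_$$
FILE=$BASE.txt
mkfifo "$FILE"
# TESSDATA_PREFIX is the directory that contains tessdata/.
TESSDATA_PREFIX=/home/user/ocr/working/ \
tesseract "$@" "$BASE" -l eng > /dev/null &
# Read the recognized text from the pipe while Tesseract runs in the background.
cat "$FILE"
wait
rm "$FILE"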
Run it like this:
sh ocr-one-image.sh image-containing-some-text.tiff
The result will be printed to stdout.
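If the named pipe causes trouble (for example, if Tesseract exits early and cat is left waiting), a plain temporary file is a possible alternative. This is a sketch of that different approach, not the script above: ocr_to_stdout is a hypothetical name, and it assumes the same TESSDATA_PREFIX and that mktemp -d is available.

```shell
#!/bin/sh
# Sketch: an alternative to the FIFO approach that writes the result to a
# temporary directory and then prints it. ocr_to_stdout is a hypothetical name.
ocr_to_stdout() {
    tmpdir=$(mktemp -d) || return 1
    # Tesseract appends ".txt" to the output base name.
    if TESSDATA_PREFIX=/home/user/ocr/working/ \
        tesseract "$1" "$tmpdir/out" -l eng > /dev/null 2>&1; then
        cat "$tmpdir/out.txt"
        status=0
    else
        status=1
    fi
    rm -rf "$tmpdir"
    return $status
}
```

It would be used as `ocr_to_stdout image-containing-some-text.tiff`. The trade-off is that nothing is printed until Tesseract has finished.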