Stb-tester : Improving OCR accuracy

14 Apr 2014. By William Manley.

TL;DR version: stbt.ocr’s misreadings aren’t due to it not being familiar with your fonts. The root cause is that the OCR system we use (tesseract) is intended for reading the printed word. With this new understanding we’ve made improvements to OCR resulting in a 12× reduction of errors. This will be in the next stb-tester release.
Update 2014-04-23: Improvements have now been merged to master and so will be in stb-tester 0.20.

stb-tester uses the open-source tesseract engine for OCR. This works really well but is not perfect. Tesseract was primarily designed to operate on text which had been printed and then scanned. That is broadly similar to reading text from a screen, but it differs in some important ways:

Scanned Text	Text on Screen
High resolution scans (300dpi)	Lower-resolution anti-aliased fonts
Text at an angle, perhaps curved	Text is perfectly straight
Usually black text on a white (or near white) background	Coloured text on coloured background often with gradients.
Artefacts due to blobs of ink or dust, text not quite joining up and stretched or crinkled pages.	Artefacts due to video compression caused by the capture device (e.g. h264)
Contains oddities related to the limitations of manually typeset text (e.g. ligatures)	Fairly consistent, one glyph per character.

Over the years tesseract has evolved techniques for dealing with each of the problems listed on the left. According to my experiments it is not yet perfectly adapted for dealing with the issues on the right.

Measurement

But first: to improve something we must be able to measure it. Fortunately the YouView UI which I’m using for this contains a 6163-word Terms and Conditions and Privacy Policy which is also available online. This gives me a basis for comparison. I can then OCR the screens and compare the result against the ideal text with the word diff tool dwdiff. This ignores formatting differences and prints statistics that measure how good a job we’ve done:

old: 6278 words  6046 96% common  0 0% deleted  232 3% changed
new: 6199 words  6046 97% common  0 0% inserted 153 2% changed

For this case the number included in the table below would be 232/6278, i.e. 232 of the words of the original text were not correctly recognised.
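As a rough illustration of how such a figure can be derived, the sketch below parses a statistics line in the layout shown above and computes the error rate. It is not the test script mentioned in this post. The function name `error_rate` is hypothetical, the regular expression assumes exactly the layout of the sample output, and counting deleted words as errors alongside changed ones is my assumption.

```python
import re

# Matches the "old:" statistics line in the layout shown in the sample
# output above. Other dwdiff versions or options may print it differently.
STATS_RE = re.compile(
    r"old:\s+(?P<total>\d+)\s+words\s+"
    r"(?P<common>\d+)\s+\d+%\s+common\s+"
    r"(?P<deleted>\d+)\s+\d+%\s+deleted\s+"
    r"(?P<changed>\d+)\s+\d+%\s+changed"
)


def error_rate(stats_line):
    """Return (errors, total, rate) for the reference ("old") text."""
    match = STATS_RE.search(stats_line)
    if match is None:
        raise ValueError("unrecognised statistics line: %r" % stats_line)
    total = int(match.group("total"))
    # Assumption: a reference word that was deleted or changed counts as
    # one misread word.
    errors = int(match.group("deleted")) + int(match.group("changed"))
    return errors, total, errors / total


errors, total, rate = error_rate(
    "old: 6278 words  6046 96% common  0 0% deleted  232 3% changed")
print("%d / %d = %.2f%%" % (errors, total, 100 * rate))  # 232 / 6278 = 3.70%
```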

The test script that I used to generate these results can be found on a branch on github.

Approaches to improve accuracy

Training

Training Tesseract 3 - Tesseract wiki

The theory is that by training tesseract on the fonts you’re using in your specific UI, stb-tester will be able to do a better job at recognising text. It turns out that this isn’t true. Training on the specific font used in the YouView UI had no beneficial effect; in fact the opposite turned out to be the case. We hypothesise that a lot of effort has gone into the eng.traineddata file that tesseract ships with to make it work well with the majority of fonts and text, so training with a specific font has little effect.

Training turns out to be much trickier than you might expect - despite the tesseract wiki proclaiming “And that’s all there is to it” after 12 pages of instructions. Eventually I was able to write a script to automate these instructions. As a point of interest you can find this on my train-tesseract branch on github. Instructions are included there. As training was ineffective I’ve no intention to merge this to stb-tester master at this time.

Result: No useful effect once other improvements are made, harmful in some cases.

Remove ligatures

Ligatures are when two letters are combined into a single glyph. e.g. fi might be rendered as the single glyph (and unicode codepoint) ﬁ. This was used in traditional movable type typesetting to make typesetting easier and look better.

This is not really relevant for our case as we are interested in the content, not the rendering. I’ve tried two approaches to this:

Tell tesseract to not recognise ligatures with -c tessedit_char_blacklist=ﬀﬁﬂﬃﬄﬅﬆ. This is “+noligatures” below.
Expand ligatures after tesseract has recognised them. This is “+replaceligatures” below.

Result: Replacing ligatures is superior to blacklisting them. It reduces the error rate by roughly 0.4 percentage points.
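The second approach, expanding ligatures after recognition, can be sketched in a few lines. This is a minimal illustration rather than the code that went into stb-tester: the function name `replace_ligatures` is hypothetical, and mapping ﬅ to plain "st" (rather than long s followed by t) is my choice, on the basis that we care about content and not rendering.

```python
# Latin ligature code points U+FB00..U+FB06 and their plain-letter
# expansions. These are the same seven characters as in the blacklist above.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t, expanded to plain "st" (an assumption)
    "\ufb06": "st",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text):
    """Expand ligature glyphs in OCR output into their component letters."""
    return text.translate(_LIGATURE_TABLE)


print(replace_ligatures("\ufb01nd the o\ufb03ce"))  # find the office
```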

Normalise Punctuation

Unicode contains a whole bunch of punctuation marks which all look very similar. We consider that, for the purposes of set-top box testing, our users are unlikely to be interested in the differences between a hyphen (‐), a minus sign (−) and any of the four types of unicode dash (‒, –, —, ―). Nor will there be much interest in the differences between the various types of quotation marks.

This approach cheats slightly as, strictly speaking, to match against some known text you need to additionally normalise the expected text. When this is merged to master stbt will provide an API for doing this.

Result: Positive effect, reducing the error rate by roughly 0.3 percentage points.
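To give an idea of what this normalisation involves, here is a minimal sketch. It is not the stbt API referred to above: the function name `normalise_punctuation` is hypothetical, and mapping everything to ASCII `-`, `'` and `"` is an assumption about a sensible target. The same function has to be applied to both the OCR output and the expected text before comparing them.

```python
# Look-alike punctuation mapped to an ASCII equivalent. The choice of
# ASCII targets is an assumption made for this sketch.
PUNCTUATION = {
    "\u2010": "-",  # hyphen
    "\u2212": "-",  # minus sign
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2015": "-",  # horizontal bar
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
_PUNCTUATION_TABLE = str.maketrans(PUNCTUATION)


def normalise_punctuation(text):
    """Collapse visually similar dashes and quotes to a single form."""
    return text.translate(_PUNCTUATION_TABLE)


ocr_output = "\u201cset\u2010top box\u201d \u2013 it\u2019s on"
expected = '"set-top box" - it\'s on'
print(normalise_punctuation(ocr_output) == normalise_punctuation(expected))  # True
```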

Scale text up

Tesseract expects high resolution (300dpi) scans of text with no anti-aliasing. Screen text on the other hand tends to be relatively low resolution (72dpi) and anti-aliased. Tesseract’s first step “thresholds” the image, turning it into a 1-bit black and white image. This throws away any information about the shapes of the letters provided by anti-aliasing before outline recognition is performed.

Ideally we would patch tesseract to take anti-aliasing into account directly during outline recognition, but instead it seems that just scaling the image up before thresholding allows this information to be preserved until the outline extraction step.

Result: This provides the bulk of the improvement in error rate, bringing it down from 8.3% to 1.2%.

There are many image scaling algorithms. I tested against all the algorithms available in imagemagick at 2×, 3× and 4× scaling factor:

Bartlett
Blackman
Bohman
Box
Catrom
Cosine
Cubic
Gaussian
Hamming
Hanning
Hermite
Jinc
Kaiser
Lagrange
Lanczos
LanczosSharp
Lanczos2
Lanczos2Sharp
Mitchell
Parzen
Point
Quadratic
Robidoux
RobidouxSharp
Sinc
SincFast
Spline
Triangle
Welsh

To cut a long story short the best algorithm is “Triangle” at 3× scaling. This turns out to be bilinear scaling which conveniently is the default interpolation mode used by OpenCV’s resize function (cv2.INTER_LINEAR).

3× worked better than 2× or 4× scaling. My theory is that this is because each pixel becomes 9 pixels (3×3), so the sampling point in the middle can stay the same. This theory is unsubstantiated speculation, however. Perhaps someone with a greater understanding of re-sampling can explain this?
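The sampling-point idea can at least be checked arithmetically. The sketch below assumes the common pixel-centre convention, in which destination pixel d reads the source at coordinate (d + 0.5) / factor - 0.5. I believe this is what OpenCV's bilinear resize does, but treat that as an assumption. The function names are hypothetical. The result is consistent with the theory and does not prove it. In particular it says nothing about whether other odd factors would do as well, as they were not tested.

```python
from fractions import Fraction


def source_positions(factor, n_dest):
    """Source coordinate read by each of the first n_dest destination pixels.

    Assumes pixel-centre alignment: src = (dst + 0.5) / factor - 0.5.
    Exact fractions are used so that "lands exactly on a source pixel"
    can be tested without floating-point error.
    """
    return [Fraction(2 * d + 1, 2 * factor) - Fraction(1, 2)
            for d in range(n_dest)]


def hits_source_pixel(factor):
    """True if some destination pixel samples exactly at a source pixel centre."""
    # The sampling pattern repeats every `factor` destination pixels,
    # so checking one period is enough.
    return any(p.denominator == 1 for p in source_positions(factor, factor))


for factor in (2, 3, 4):
    positions = ", ".join(str(p) for p in source_positions(factor, factor))
    print("%dx: %s -> exact hit: %s" % (factor, positions, hits_source_pixel(factor)))
```

Under this convention only the 3× case has a destination pixel (the middle one of each group of three) that reads an original pixel unchanged. At 2× and 4× every output pixel is a blend of neighbouring source pixels.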

Sharpening

The theory is that h264 encoding (such as performed by the Teradek VidiU or the Hauppauge HD-PVR) causes some blurring. Sharpening is intended to make the edges of the words crisper in the hope that tesseract’s thresholding algorithm will then be able to do a better job.

Result: This had a small negative effect.

Results

With scaling and ligature and punctuation normalisation the error rate drops from 8.3% to 0.65%. This is an improvement of 12×. An extract of the full results is included below:

Configuration	Error Count	Error rate
Default	523 / 6278	8.33%
Normalising Punctuation	506 / 6278	8.06%
3× Scaling	78 / 6278	1.24%
Scaling and Replacing Ligatures	56 / 6278	0.89%
Scaling and Replacing Punctuation	63 / 6278	1.00%
Replacing Ligatures	492 / 6278	7.83%
Replacing Punctuation	506 / 6278	8.06%
Default training plus training on FS Me	516 / 6278	8.21%
Training on FS Me font	696 / 6278	11.09%
Scaling and Replacing Ligatures and Punctuation	41 / 6278	0.65%
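The improvement factors follow directly from the error counts. As a small worked check, the sketch below recomputes the error rate and the improvement over the default for three rows of the table. The dictionary and function names are hypothetical, and the counts are copied from the table above.

```python
TOTAL_WORDS = 6278

# Error counts copied from the table above.
ERROR_COUNTS = {
    "Default": 523,
    "3x Scaling": 78,
    "Scaling and Replacing Ligatures and Punctuation": 41,
}


def improvement(name, baseline="Default"):
    """How many times fewer errors `name` has than the baseline configuration."""
    return ERROR_COUNTS[baseline] / ERROR_COUNTS[name]


for name, errors in ERROR_COUNTS.items():
    rate = 100 * errors / TOTAL_WORDS
    print("%-50s %5.2f%%  %4.1fx" % (name, rate, improvement(name)))
```

The last row works out at 523 / 41, or about 12.8, which is where the 12× figure comes from.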

Full results with example output can be found here.

Admittedly this is all a little unscientific. Further testing with different text would be required to estimate error bars on the above measurements but I believe this is conclusive enough to justify the improvements to stb-tester.

Thanks

Thanks to YouView for sponsoring this effort.