Improving OCR accuracy
By William Manley. 14 Apr 2014
- TL;DR version:
stbt.ocr’s misreadings aren’t due to it not being familiar with your fonts. The root cause is that the OCR system we use (tesseract) is intended for reading the printed word. With this new understanding we’ve made improvements to OCR resulting in a 12× reduction of errors. This will be in the next stb-tester release.
- Update 2014-04-23:
- Improvements have now been merged to master and so will be in stb-tester 0.20.
stb-tester uses the open-source tesseract engine for OCR. This works really well but is not perfect. Tesseract was primarily designed to operate on text which had been printed and then scanned. This is broadly similar to OCR on the screen, but differs in some important ways:
|Scanned Text||Text on Screen|
|High resolution scans (300dpi)||Lower-resolution anti-aliased fonts|
|Text at an angle, perhaps curved||Text is perfectly straight|
|Usually black text on a white (or near white) background||Coloured text on coloured background often with gradients.|
|Artefacts due to blobs of ink or dust, text not quite joining up and stretched or crinkled pages.||Artefacts due to video compression performed by the capture device (e.g. h264)|
|Contains oddities related to the limitations of manually typeset text (e.g. ligatures)||Fairly consistent, one glyph per character.|
Over the years tesseract has evolved techniques for dealing with each of the problems listed on the left. According to my experiments it is not yet perfectly adapted for dealing with the issues on the right.
But first: if we want to improve something, we must be able to measure it.
Fortunately the YouView UI which I’m using for this contains a 6163 word Terms
online. This gives me a basis for comparison. I can then OCR the screens
and compare against the ideal with a word diff tool, which
ignores formatting differences and prints statistics to measure how good a
job we’ve done:
old: 6278 words  6046 96% common  0 0% deleted  232 3% changed
new: 6199 words  6046 97% common  0 0% inserted  153 2% changed
For this case the number included in the table below would be 232 / 6278, i.e. 232 of the 6278 words of the original text were not correctly recognised.
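As a rough sketch of how such a statistics line can be turned into an error rate, the Python below parses the "old:" line shown above and divides the misrecognised words by the total word count of the reference text, which is the denominator the results table uses. The function names `parse_stats` and `error_rate` are made up for this illustration and are not part of stb-tester or of the diff tool.

```python
import re

OLD_STATS = "old: 6278 words 6046 96% common 0 0% deleted 232 3% changed"


def parse_stats(line):
    """Split a statistics line into (label, total words, per-category counts)."""
    label, rest = line.split(":", 1)
    total = int(re.search(r"(\d+) words", rest).group(1))
    # Each category looks like "<count> <percent>% <name>", e.g. "232 3% changed".
    counts = {name: int(n) for n, name in re.findall(r"(\d+) \d+% (\w+)", rest)}
    return label.strip(), total, counts


def error_rate(line):
    """Fraction of reference words that were deleted or changed by OCR."""
    _, total, counts = parse_stats(line)
    wrong = counts.get("deleted", 0) + counts.get("changed", 0)
    return wrong / total


if __name__ == "__main__":
    print(f"{error_rate(OLD_STATS):.2%} of words misrecognised")
```

Counting deleted words as errors alongside changed ones is an assumption of this sketch; in the example above no words were deleted, so it makes no difference.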
The test script that I used to generate these results can be found on a branch on github.
Approaches to improve accuracy
- Training Tesseract 3 - Tesseract wiki
The theory is that by training tesseract on the fonts you’re using in your
specific UI stb-tester will be able to do a better job at recognising text. It
turns out that this isn’t true. Training on the specific font used in the
YouView UI had no beneficial effect; in fact it made the results worse.
We hypothesise that a lot of effort has gone into the training data
file that tesseract ships with to make it work well with the majority of fonts
and text, so training with a specific font has little effect.
Training turns out to be much trickier than you might expect - despite the tesseract wiki proclaiming “And that’s all there is to it” after 12 pages of instructions. Eventually I was able to write a script to automate these instructions. As a point of interest you can find this on my train-tesseract branch on github. Instructions are included there. As training was ineffective I’ve no intention to merge this to stb-tester master at this time.
Result: No useful effect once other improvements are made, harmful in some cases.
A ligature is two or more letters combined into a single glyph, e.g. fi might be rendered as the single glyph (and unicode codepoint) ﬁ. Ligatures were used in traditional movable type typesetting to make typesetting easier and look better.
This is not really relevant for our case as we are interested in the content, not the rendering. I’ve tried two approaches to this:
- Tell tesseract to not recognise ligatures with
-c tessedit_char_blacklist=ﬀﬁﬂﬃﬄﬅﬆ. This is “+noligatures” below.
- Expand ligatures after tesseract has recognised them. This is “+replaceligatures” below.
Result: Replacing ligatures is superior to blacklisting them. Results in an improvement of 0.4%.
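A minimal sketch of the "replace ligatures" approach, applied to the text that tesseract returns. The function name `replace_ligatures` is hypothetical and this is not necessarily how stb-tester implements it. The mapping covers the same seven codepoints as the blacklist above; expanding the long-s ligature (U+FB05) to plain "st" is a choice made for this example.

```python
# Latin ligature codepoints and the plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t, expanded to a modern "st" (assumption)
    "\ufb06": "st",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text):
    """Expand any ligature glyphs in OCR output into separate letters."""
    return text.translate(_LIGATURE_TABLE)


if __name__ == "__main__":
    print(replace_ligatures("con\ufb01rm the o\ufb03ce \ufb02oor"))
```

Because this runs after recognition, tesseract is still free to match a ligature glyph where that is its best guess, which may be why it does better than forbidding those characters outright.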
Unicode contains a whole bunch of punctuation marks which all look very similar. For the purposes of set-top box testing we assume that our users are unlikely to be interested in the differences between a hyphen (‐), a minus sign (−) and any of the four types of unicode dash (‒, –, —, ―). Nor will there be much interest in the differences between the various types of quotation marks. We therefore normalise these look-alike characters in the OCR output.
This approach cheats slightly as, strictly speaking, to match against some known text you need to additionally normalise the expected text. When this is merged to master stbt will provide an API for doing this.
Result: Positive effect reducing error rate by ~0.3%
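The normalisation could look something like the sketch below. The name `normalise_punctuation` and the choice of ASCII replacement characters are assumptions made for this illustration, not the API that stbt provides. The important point is that the same function is applied to both the OCR output and the expected text before comparing them.

```python
# Look-alike punctuation mapped to a single canonical character each.
PUNCTUATION = {
    "\u2010": "-",  # hyphen
    "\u2212": "-",  # minus sign
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2015": "-",  # horizontal bar
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
_PUNCTUATION_TABLE = str.maketrans(PUNCTUATION)


def normalise_punctuation(text):
    """Collapse visually similar dashes and quotes to one character each."""
    return text.translate(_PUNCTUATION_TABLE)


def texts_match(ocr_text, expected_text):
    """Compare OCR output with expected text, ignoring punctuation variants."""
    return normalise_punctuation(ocr_text) == normalise_punctuation(expected_text)
```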
Scale text up
Tesseract expects high resolution (300dpi) scans of text with no anti-aliasing. Screen text on the other hand tends to be relatively low resolution (72dpi) and anti-aliased. Tesseract’s first step “thresholds” the image, turning it into a 1-bit black and white image. This happens before outline recognition, so any information about the shapes of the letters provided by anti-aliasing is thrown away.
Ideally we would patch tesseract to directly take anti-aliasing into account during outline recognition, but instead it seems that just scaling the image up before thresholding allows this information to be preserved until the outline extraction step.
Result: This provides the bulk of the improvement in error rate, bringing it down from 8.3% to 1.2%.
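To illustrate why scaling before thresholding helps, here is a toy one-dimensional example in plain Python. It makes two simplifying assumptions: a fixed threshold of 128 (tesseract chooses its threshold adaptively) and a hand-written linear interpolation rather than a real image library. Two rows of pixels with different anti-aliased edges look identical once thresholded at their original resolution, but remain distinguishable if they are scaled up 3× first.

```python
def upscale_linear(row, scale):
    """Linearly interpolate a row of grey levels to `scale` times its length."""
    n = len(row)
    out = []
    for j in range(n * scale):
        # Map the centre of output pixel j back into source coordinates.
        x = (j + 0.5) / scale - 0.5
        x = min(max(x, 0.0), n - 1)
        i = min(int(x), n - 2)
        t = x - i
        out.append(row[i] * (1 - t) + row[i + 1] * t)
    return out


def threshold(row, level=128):
    """Return '#' for ink (dark) and '.' for background (light)."""
    return "".join("#" if value < level else "." for value in row)


# Two strokes whose left edges are anti-aliased by different amounts.
soft_edge = [255, 100, 0, 255]
hard_edge = [255, 20, 0, 255]

if __name__ == "__main__":
    for row in (soft_edge, hard_edge):
        print(threshold(row), threshold(upscale_linear(row, 3)))
```

At the original resolution both rows threshold to the same pattern, so the difference in grey level is lost. After upscaling, the darker edge pixel produces a wider stroke than the lighter one, so some of the sub-pixel shape survives thresholding.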
There are many image scaling algorithms. I tested against all the algorithms available in imagemagick at 2×, 3× and 4× scaling factor:
To cut a long story short the best algorithm is “Triangle” at 3× scaling. This
turns out to be bilinear scaling which conveniently is the default interpolation
mode used by OpenCV’s
resize function (
3× worked better than 2× or 4× scaling. My theory is that because each pixel becomes 9 pixels (3×3), the sampling point in the middle can stay the same. This theory is unsubstantiated speculation however. Perhaps someone with a greater understanding of re-sampling can explain this?
I also tried sharpening the image. The theory is that h264 encoding (such as performed by the Teradek VidiU or the Hauppauge HD-PVR) causes some blurring. Sharpening is intended to make the edges of the words crisper in the hope that tesseract’s thresholding algorithm will then be able to do a better job.
Result: This had a small negative effect.
With scaling and ligature and punctuation normalisation the error rate drops from 8.3% to 0.65%. This is an improvement of more than 12×. An extract of the full results is included below:
|Configuration||Error Count||Error rate|
|Default||523 / 6278||8.33%|
|Normalising Punctuation||506 / 6278||8.06%|
|3× Scaling||78 / 6278||1.24%|
|Scaling and Replacing Ligatures||56 / 6278||0.89%|
|Scaling and Replacing Punctuation||63 / 6278||1.00%|
|Replacing Ligatures||492 / 6278||7.83%|
|Replacing Punctuation||506 / 6278||8.06%|
|Default training plus training on FS Me||516 / 6278||8.21%|
|Training on FS Me font||696 / 6278||11.09%|
|Scaling and Replacing Ligatures and Punctuation||41 / 6278||0.65%|
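The headline figures can be re-derived from the error counts in the table. The short Python check below uses only numbers taken from the table above; the variable and function names are made up for this example.

```python
TOTAL_WORDS = 6278

# Error counts copied from the results table.
ERRORS = {
    "Default": 523,
    "3x Scaling": 78,
    "Scaling and Replacing Ligatures and Punctuation": 41,
}


def rate(config):
    """Error rate for a configuration as a fraction of all words."""
    return ERRORS[config] / TOTAL_WORDS


# How many times fewer errors the best configuration makes than the default.
improvement = ERRORS["Default"] / ERRORS["Scaling and Replacing Ligatures and Punctuation"]

if __name__ == "__main__":
    for config in ERRORS:
        print(f"{config}: {rate(config):.2%}")
    print(f"Improvement: {improvement:.1f}x fewer errors")
```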
Admittedly this is all a little unscientific. Further testing with different text would be required to estimate error bars on the above measurements but I believe this is conclusive enough to justify the improvements to stb-tester.
Thanks to YouView for sponsoring this effort.
- An Overview of the Tesseract OCR Engine - Ray Smith
- Training Tesseract 3 - Tesseract wiki
- Improving the quality of the output - Tesseract wiki
- Tesseract OCR Engine - What it is, where it came from, where it is going - Ray Smith - presented at OSCON 2007.