Tesseract OCR for PHP
A wrapper to work with Tesseract OCR inside PHP.
Installation
First, make sure you have Tesseract OCR installed (v3.03 or greater).
As a composer dependency
{
"require": {
"thiagoalessio/tesseract_ocr": "1.0.0-RC"
}
}
Usage
Basic usage
Given the following image (text.png):
And the following code:
<?php
echo (new TesseractOCR('text.png'))
->run();
The output would be:
The quick brown fox
jumps over the lazy
dog.
Other languages
Given the following image (german.png):
And the following code:
<?php
echo (new TesseractOCR('german.png'))
->run();
The output would be:
griifien
Which is not good, but defining a language:
<?php
echo (new TesseractOCR('german.png'))
->lang('deu')
->run();
Will produce:
grüßen
Multiple languages
Given the following image (multi-languages.png):
And the following code:
<?php
echo (new TesseractOCR('multi-languages.png'))
->lang('eng', 'jpn', 'por')
->run();
The output would be:
I eat 寿司 de maçã
Inducing recognition
Given the following image (8055.png):
And the following code:
<?php
echo (new TesseractOCR('8055.png'))
->whitelist(range('A', 'Z'))
->run();
The output would be:
BOSS
API
->executable('/path/to/tesseract')
Define a custom location for the tesseract executable, if for any reason it is
not present in the $PATH.
->tessdataDir('/path')
Specify a custom location for the tessdata directory.
->userWords('/path/to/user-words.txt')
Specify the location of the user words file.
This is a plain text file containing a list of words that you want tesseract
to treat as normal dictionary words.
Useful when dealing with contents that contain technical terminology, jargon,
etc.
Example of a user words file:
$ cat /path/to/user-words.txt
foo
bar
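If the word list already lives in your application, the file can be generated right before recognition. The following is only a minimal sketch: `writeUserWords` is a hypothetical helper and the temporary file location is an assumption, neither is part of this library.

```php
<?php
// Hypothetical helper (not part of this library): writes one word per line,
// the same layout as the user words file shown above.
function writeUserWords(array $words, string $path): string
{
    file_put_contents($path, implode("\n", $words) . "\n");
    return $path;
}

// Assumed location; any writable path works.
$path = writeUserWords(['foo', 'bar'], sys_get_temp_dir() . '/user-words.txt');

// Only attempt recognition when the wrapper class has been loaded,
// for example through Composer's autoloader.
if (class_exists('TesseractOCR')) {
    echo (new TesseractOCR('text.png'))
        ->userWords($path)
        ->run();
}
```

Writing the file to a temporary directory keeps the list out of version control; a fixed path is just as valid if the words rarely change.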
->userPatterns('/path/to/user-patterns.txt')
Specify the location of the user patterns file.
If the contents you are dealing with have known patterns, this option can
greatly improve tesseract's recognition accuracy.
Example of a user patterns file:
$ cat /path/to/user-patterns.txt
1-\d\d\d-GOOG-441
www.\n\\\*.com
->lang('lang1', 'lang2', 'lang3')
Define one or more languages to be used during the recognition.
They need to be specified as 3-character ISO 639-2 language codes.
Note: For Chinese, the ISO 639-2 code does not work well. Use ->lang('chi_sim', 'chi_tra') instead.
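On the command line, tesseract accepts several languages joined with `+` in its `-l` option. The sketch below illustrates only that joining step; `languageOption` is a hypothetical helper, and whether this wrapper builds the option in exactly this way is an assumption.

```php
<?php
// Hypothetical helper (not part of this library): turns a list of language
// codes into the "-l" option understood by the tesseract command line.
function languageOption(string ...$langs): string
{
    // No languages given: emit nothing and let tesseract use its default.
    return $langs ? '-l ' . implode('+', $langs) : '';
}

echo languageOption('eng', 'jpn', 'por'), PHP_EOL;
echo languageOption('chi_sim', 'chi_tra'), PHP_EOL;
```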
->psm(6)
Specify the Page Segmentation Mode, which instructs tesseract how to
interpret the given image.
Possible psm values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
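Bare numbers such as 6 or 8 are easy to mix up, so it may help to give the modes you use readable names. A minimal sketch: the constant names are hypothetical and not defined by this library, while the values come from the list above.

```php
<?php
// Hypothetical constant names; the values are taken from the psm list above.
const PSM_AUTO         = 3;   // Fully automatic page segmentation, no OSD
const PSM_SINGLE_BLOCK = 6;   // Single uniform block of text
const PSM_SINGLE_LINE  = 7;   // Single text line
const PSM_SINGLE_WORD  = 8;   // Single word
const PSM_SINGLE_CHAR  = 10;  // Single character

// Only attempt recognition when the wrapper class has been loaded,
// for example through Composer's autoloader.
if (class_exists('TesseractOCR')) {
    // 8055.png from the usage section holds a single word.
    echo (new TesseractOCR('8055.png'))
        ->psm(PSM_SINGLE_WORD)
        ->run();
}
```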
->config('configvar', 'value')
Tesseract gives the user fine-grained control through its 660 configuration
variables.
You can see the complete list by running the following command:
$ tesseract --print-parameters
->whitelist(range('a', 'z'), range(0, 9), '-_@')
This is a shortcut for ->config('tessedit_char_whitelist', 'abcdef....').
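To see what string such a call boils down to, the arguments can be flattened by hand. This is an illustrative sketch only: `flattenWhitelist` is a hypothetical function, not the library's actual implementation.

```php
<?php
// Hypothetical helper (not the library's code): concatenates arrays of
// characters and plain strings into a single whitelist string.
function flattenWhitelist(...$args): string
{
    $chars = '';
    foreach ($args as $arg) {
        // range() returns an array; literal strings are appended as-is.
        $chars .= is_array($arg) ? implode('', $arg) : (string) $arg;
    }
    return $chars;
}

$whitelist = flattenWhitelist(range('a', 'z'), range(0, 9), '-_@');
echo $whitelist, PHP_EOL;
// abcdefghijklmnopqrstuvwxyz0123456789-_@
```

The resulting string is the kind of value that would be passed as the second argument to ->config('tessedit_char_whitelist', ...).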
Where to get help
- #tesseract-ocr-for-php on freenode IRC




