DBW

Darjeeling, Bergamot and Walnuts

21.3.10

OCR in Polish on Linux

A mental note to myself on getting text from images, and using a non-standard dictionary.

Software of choice: tesseract, gimagereader, tesseract-polish, Python, emacs.

Installation:

- Kubuntu 9.04
- sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra
- get tesseract-polish source and copy pol.* to /usr/share/tesseract-ocr/tessdata/
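The copy step can be scripted. A minimal Python sketch, assuming the tesseract-polish source has been unpacked into some local directory (the function name and the source path are my own placeholders, not part of any package):

```python
import glob
import os
import shutil


def install_language_data(src_dir, tessdata_dir, lang="pol"):
    """Copy every <lang>.* file from src_dir into tessdata_dir.

    Returns the sorted list of file names that were copied.
    """
    copied = []
    pattern = os.path.join(src_dir, lang + ".*")
    for path in sorted(glob.glob(pattern)):
        shutil.copy(path, tessdata_dir)
        copied.append(os.path.basename(path))
    return copied
```

Called as install_language_data("tesseract-polish", "/usr/share/tesseract-ocr/tessdata/"), where the first argument is wherever the source ended up. Writing to /usr/share needs root, so run it under sudo.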

gImageReader:

This is a quick-and-dirty GUI frontend to tesseract, but it does the job very well. I got the source ... careful, the tarball does not include a top-level "gimagereader" directory, so create one and extract into it. Edit src/main.py to add the Polish dictionary: at lines 239 and 256, add ("pol","Polish") and ("pol","Polish","pl_PL") respectively. Edit src/config.py at lines 111-116: following the if ... !="/", change the outer get_text to set_text. Then run "make install".

$ gimagereader

In the dialog asking for dictionary files, enter /usr/share/tesseract-ocr/tessdata/ and apply.

My images were generated from a PDF using GIMP's PDF import at 200 DPI. Higher resolutions seemed to cause tesseract problems: gimagereader's "recognize" found no text.

Process is as follows:

1. Open image

2. In the language dropdown in the toolbar, select Polish, the new dictionary we added.

3. Select the column of text to be recognized and click "recognize". Repeat for additional columns. The OCR'd text appears in the text box that opens on the right. "Insert At cursor" is a sane default that lets regions be recognized in sequence, but other options are available. Save.
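For batches, the same thing can be done without the GUI by calling tesseract directly. This is only a rough sketch, not what gimagereader does internally: it assumes each column has already been cropped into its own TIFF file (the file names are made up), and it relies on tesseract writing its result to the output base name plus ".txt".

```python
import subprocess


def tesseract_command(image_path, output_base, lang="pol"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize(image_paths, output_path, lang="pol"):
    """OCR each image in order and write the joined text to output_path."""
    with open(output_path, "w", encoding="utf-8") as out:
        for i, image in enumerate(image_paths):
            # tesseract appends ".txt" to the output base name itself.
            base = "%s.part%d" % (output_path, i)
            subprocess.check_call(tesseract_command(image, base, lang))
            with open(base + ".txt", encoding="utf-8") as part:
                out.write(part.read())
```

For example, recognize(["col1.tif", "col2.tif"], "page1.txt") would OCR the two (hypothetical) column images in order and join the text, much like repeating "recognize" with "Insert At cursor".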

1 comment:

  1. Hi, this is my first post on your blog
    Can I use the Polish language when I install gimagereader from the deb package?