Darjeeling, Bergamot and Walnuts: OCR in Polish on Linux

Darjeeling, Bergamot and Walnuts

21.3.10

OCR in Polish on Linux

A mental note to myself on getting text from images, and using a non-standard dictionary.

Software of choice: tesseract, gimagereader, tesseract-polish, Python, emacs.

Installation:

- Kubuntu 9.04
- sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu tesseract-ocr-fra
- get tesseract-polish source and copy pol.* to /usr/share/tesseract-ocr/tessdata/
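
The copy step could also be scripted. This is only a minimal sketch: the function name `install_language_files` is made up for illustration, and it assumes the tesseract-polish source has already been unpacked somewhere and that you have write access to the tessdata directory (normally that means running as root).

```python
import glob
import os
import shutil


def install_language_files(source_dir, tessdata_dir, prefix="pol"):
    """Copy every <prefix>.* file from source_dir into tessdata_dir.

    Returns the names of the copied files, sorted alphabetically.
    """
    copied = []
    pattern = os.path.join(source_dir, prefix + ".*")
    for path in sorted(glob.glob(pattern)):
        shutil.copy(path, tessdata_dir)
        copied.append(os.path.basename(path))
    return copied
```

Calling it with the unpacked source directory and `/usr/share/tesseract-ocr/tessdata/` as arguments would mirror the manual copy above.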

gImageReader:

gImageReader is a quick and dirty GUI frontend to tesseract, but it does the job very well. I got the source ... Careful: the tar does not include a top-level "gimagereader" directory, so create one and extract the archive into it. Edit src/main.py to add the Polish dictionary: at lines 239 and 256, add ("pol","Polish") and ("pol","Polish","pl_PL") respectively. Edit src/config.py at lines 111-116: following the if ... !="/", change the outer get_text to set_text. Then run "make install".
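
Since the archive has no top-level directory, the "create one and extract in there" step can be done in one go from Python. A rough sketch, assuming only the standard library; `extract_into` is a hypothetical helper name, and the archive path is whatever you downloaded.

```python
import os
import tarfile


def extract_into(archive_path, target_dir="gimagereader"):
    """Create target_dir if it is missing and unpack the archive inside it.

    Returns the sorted top-level entries of target_dir after extraction.
    """
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        # Everything lands under target_dir instead of the current directory.
        archive.extractall(target_dir)
    return sorted(os.listdir(target_dir))
```

This keeps the source files from spilling into whatever directory the tar happened to be downloaded to.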

$ gimagereader

In the dialog asking for dictionary files, enter /usr/share/tesseract-ocr/tessdata/, apply.

My images were generated from a PDF using GIMP's PDF import at 200 DPI. Higher resolutions seem to have caused tesseract some problems, to the point that gimagereader's "recognize" found no text.
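
I used GIMP for the conversion, but for many pages a scripted route may be handier. The sketch below swaps in a different tool, poppler's pdftoppm, which is an assumption on my part and not what was used above; it also assumes a pdftoppm build that supports `-png`. The function names are made up for illustration. The resolution is kept at 200 DPI to match the setting that worked.

```python
import subprocess


def pdftoppm_command(pdf_path, output_prefix, dpi=200):
    """Build a pdftoppm call that writes one PNG per page at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, output_prefix]


def render_pages(pdf_path, output_prefix, dpi=200):
    """Run pdftoppm; raises CalledProcessError if the conversion fails."""
    subprocess.check_call(pdftoppm_command(pdf_path, output_prefix, dpi))
```

Whether images produced this way behave the same in tesseract as the GIMP ones is untested.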

The process is as follows:

1. Open image

2. In the language dropdown in the toolbar, select Polish, the new dictionary we added.

3. Select the column of text to be recognized and click "recognize". Repeat for additional columns. The OCR'd text appears in the text box that opens on the right. "Insert at cursor" is a sane default, since it lets regions be recognized in sequence, but other options are available. Save.
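
The same column-by-column process could probably be run without the GUI by calling the tesseract command line directly with `-l pol`. This is a sketch under assumptions: each column has already been cropped and saved as its own image file (the file names and the helper names here are hypothetical), and the images are in a format your tesseract version accepts. The results are joined in order, roughly what "Insert at cursor" does in gimagereader.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="pol"):
    """Build the tesseract call; tesseract writes its text to <output_base>.txt."""
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize_columns(image_paths, lang="pol"):
    """OCR each column image in sequence and join the text in reading order."""
    pieces = []
    for index, image_path in enumerate(image_paths):
        output_base = "column_%d" % index
        subprocess.check_call(tesseract_command(image_path, output_base, lang))
        with open(output_base + ".txt", encoding="utf-8") as handle:
            pieces.append(handle.read())
    return "\n".join(pieces)
```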

1 comment:

fieloryb, July 14, 2010 at 4:57 PM
Hi, this is my first post on your blog
Can I use the Polish language when I install gimagereader from the deb package?