Difference between revisions of "Book scanning"
From Finninday
Revision as of 21:25, 22 December 2009
We have a lot of old books that I would like to scan and index.
I'm inspired by the vibrant community that is forming around this website: http://www.diybookscanner.org/
The scanner-building project looks like fun, but I want to make sure the results will be worthwhile before I plug in my soldering iron and circular saw.
So, first, some proof-of-concept work with the software components.
Software
- I already have a Canon PowerShot SD30, so I should have no problem putting CHDK (http://chdk.wikia.com/wiki/CHDK) on it.
- It looks like Tesseract (http://code.google.com/p/tesseract-ocr/) is the low-hanging fruit in the OCR field. I have been able to download, build, and run a test on it. Out of the box, it only understands TIFF. But it seems there are flavors of TIFF, and the TIFF that ImageMagick's "convert" created made Tesseract barf:
    [rday@snapper tesseract-2.04]$ ccmain/tesseract image3.tiff image3.txt
    Tesseract Open Source OCR Engine
    name_to_image_type:Error:Unrecognized image type:image3.tiff
    IMAGE::read_header:Error:Can't read this image type:image3.tiff
    ccmain/tesseract:Error:Read of file failed:image3.tiff
    Segmentation fault
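Since the trouble seems to come down to which flavor of TIFF a file is, it may help to look at the file header directly. What follows is a minimal Python sketch, not something from the test run above. The function name tiff_flavor is made up for illustration, and it only inspects the first image of a classic (non-BigTIFF) file. It reports the byte order, the Compression tag, and the SamplesPerPixel tag, using the defaults from the TIFF 6.0 specification when a tag is absent. Whether any of these is what actually upsets Tesseract is an assumption that still needs checking.

```python
import struct
import sys

# Names for a few common values of the TIFF Compression tag (tag 259).
COMPRESSION_NAMES = {
    1: "none",
    2: "CCITT RLE",
    3: "CCITT Group 3 fax",
    4: "CCITT Group 4 fax",
    5: "LZW",
    6: "JPEG (old-style)",
    7: "JPEG",
    8: "Deflate",
    32773: "PackBits",
}

TAG_COMPRESSION = 259
TAG_SAMPLES_PER_PIXEL = 277
TYPE_SHORT = 3  # 16-bit unsigned integer field type


def tiff_flavor(data):
    """Describe the first image in a classic TIFF, given the file's bytes."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF")
    if data[:2] == b"II":
        endian, order = "<", "little-endian"
    elif data[:2] == b"MM":
        endian, order = ">", "big-endian"
    else:
        raise ValueError("not a TIFF: bad byte-order mark")

    # Header: magic number 42, then the offset of the first image directory.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF: magic number is not 42")

    # Start from the TIFF 6.0 defaults, used when a tag is missing.
    fields = {TAG_COMPRESSION: 1, TAG_SAMPLES_PER_PIXEL: 1}

    # The directory is a 2-byte entry count followed by 12-byte entries:
    # tag (2), field type (2), value count (4), value or offset (4).
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i
        tag, ftype, count = struct.unpack_from(endian + "HHI", data, entry)
        if tag in fields and ftype == TYPE_SHORT and count == 1:
            # A single SHORT sits in the first 2 bytes of the value field.
            (fields[tag],) = struct.unpack_from(endian + "H", data, entry + 8)

    code = fields[TAG_COMPRESSION]
    return {
        "byte_order": order,
        "compression": COMPRESSION_NAMES.get(code, "unknown (%d)" % code),
        "samples_per_pixel": fields[TAG_SAMPLES_PER_PIXEL],
    }


if __name__ == "__main__":
    # Print the flavor of every file named on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, tiff_flavor(f.read()))
```

Saved under a name of your choosing and run with image3.tiff as its argument, this would show how that file was written. Comparing the result against a TIFF that Tesseract does accept should narrow down which difference matters.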