Book scanning: Difference between revisions
Revision as of 21:30, 22 December 2009
We have a lot of old books that I would like to scan and index.
I'm inspired by the vibrant community that is forming around this website: http://www.diybookscanner.org/
The scanner-building project looks like fun, but I want to make sure the results will be worthwhile before I plug in my soldering iron and circular saw.
So, first, some proof-of-concept work with the software components.
Software
- I already have a Canon PowerShot SD30, so I should have no problem putting CHDK on it.
- It looks like Tesseract (http://code.google.com/p/tesseract-ocr/) is the low-hanging fruit in the OCR field. I have been able to download, build, and run a test on it. Out of the box, it only understands TIFF. But it seems there are flavors of TIFF, and the ImageMagick "convert" created a TIFF that made Tesseract barf:
    [rday@snapper tesseract-2.04]$ ccmain/tesseract image3.tif image3.txt
    Tesseract Open Source OCR Engine
    read_tif_image:Error:Illegal image format:Compression
    ccmain/tesseract:Error:Read of file failed:image3.tif
    Segmentation fault
- Perhaps there is a way to provide more direction to "convert" than just
Here is what "identify" says about the two working TIFFs that came with Tesseract and the one I generated that doesn't work:
    [rday@snapper tesseract-2.04]$ identify *.tif
    eurotext.tif TIFF 1024x800 1024x800+0+0 PseudoClass 2c 1e+02kb
    phototest.tif[1] TIFF 640x480 640x480+0+0 PseudoClass 2c 38kb
    image3.tif[2] TIFF 1701x2800 1701x2800+0+0 DirectClass 1.1mb
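The plain "identify" listing does not show compression, which is the one thing the Tesseract error names. A possible next step, sketched here and not yet tried against Tesseract 2.04: ask "identify" to print the compression and class of each file, then have "convert" write the TIFF uncompressed. The function names tiff_report and to_plain_tiff are made up for illustration, and the sketch assumes ImageMagick's "identify" and "convert" are on the PATH.

```shell
# Sketch only: assumes ImageMagick's "identify" and "convert" are installed.
# Both function names are hypothetical helpers, not part of any tool.

# Print file name, compression type, and class/colorspace for each image.
# %f, %C and %r are ImageMagick format escapes.
tiff_report() {
    identify -format '%f %C %r\n' "$@"
}

# Re-encode an image as an uncompressed TIFF.
# $1 = input image, $2 = output TIFF (placeholder names chosen by the caller).
to_plain_tiff() {
    convert "$1" -compress none "$2"
}
```

Running tiff_report on eurotext.tif, phototest.tif, and image3.tif should show whether the compression really differs between the working files and the failing one. If it does, to_plain_tiff with the original scan as input would be the first thing to try. The DirectClass versus PseudoClass difference may also matter, and this sketch does not address it.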