Difference between revisions of "Book scanning"

From Finninday

Revision as of 21:26, 22 December 2009

We have a lot of old books that I would like to scan and index.

I'm inspired by the vibrant community that is forming around this website: http://www.diybookscanner.org/

The scanner-building project looks like fun, but I want to make sure the results will be worthwhile before I plug in my soldering iron and circular saw.

So, first, some proof-of-concept work with the software components.

Software

  • I already have a Canon Powershot SD30, so I should have no problem putting CHDK on it.
  • It looks like Tesseract is the low-hanging fruit in the OCR field. I have been able to download, build, and run a test on it. Out of the box, it only understands TIFF. But there seem to be several flavors of TIFF, and the one that ImageMagick's "convert" produced made Tesseract barf:
[rday@snapper tesseract-2.04]$ ccmain/tesseract image3.tiff image3.txt
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:image3.tiff
IMAGE::read_header:Error:Can't read this image type:image3.tiff
ccmain/tesseract:Error:Read of file failed:image3.tiff
Segmentation fault
  • Perhaps there is a way to provide more direction to "convert" than just:
convert image3.jpg image3.tiff
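
One way to give "convert" more direction might look like the sketch below. It rests on two assumptions that the log above does not confirm: that this build of Tesseract wants an uncompressed, 8-bit grayscale TIFF, and that it chooses the reader from the file name, so ".tif" may be accepted where ".tiff" was not (the first error comes from name_to_image_type, which hints at this). The options -colorspace, -depth, and -compress are standard ImageMagick options; the function names tif_name and to_plain_tiff are made up for this example.

```shell
#!/bin/sh
# Sketch: convert photos into plain TIFFs for Tesseract 2.04.
# Assumptions (not confirmed by the error log): Tesseract wants an
# uncompressed, 8-bit grayscale TIFF and a ".tif" file name.

# Derive the output name by swapping the extension: image3.jpg -> image3.tif
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert one image with explicit colorspace, bit depth, and no compression.
to_plain_tiff() {
    convert "$1" -colorspace Gray -depth 8 -compress None "$(tif_name "$1")"
}

# Convert every file named on the command line (does nothing with no arguments).
for f in "$@"; do
    to_plain_tiff "$f"
done
```

Saved under a name of your choosing and run with image3.jpg as its argument, this should leave image3.tif next to the original. Running "identify -verbose" on both the old and new TIFF shows the compression and depth that ImageMagick actually wrote, which would help confirm or rule out the assumptions above before trying Tesseract again.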