Book scanning

From Finninday
Revision as of 21:41, 22 December 2009 by Rday (Talk | contribs)

We have a lot of old books that I would like to scan and index.

I'm inspired by the vibrant community that is forming around this website: http://www.diybookscanner.org/

The scanner-building project looks like fun, but I want to make sure the results will be worthwhile before I plug in my soldering iron and circular saw.

So, first, some proof-of-concept work with the software components.

Software

  • I already have a Canon PowerShot SD30, so I should have no problem putting CHDK on it.
  • It looks like Tesseract is the low-hanging fruit in the OCR field. I was able to download, build, and run a test on it. Out of the box, it only reads TIFF. But TIFF comes in several flavors, and the one that ImageMagick's "convert" created made Tesseract barf:
[rday@snapper tesseract-2.04]$ ccmain/tesseract image3.tif image3.txt
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
ccmain/tesseract:Error:Read of file failed:image3.tif
Segmentation fault
  • Perhaps there is a way to provide more direction to "convert" than just

Here is what "identify" says about the two working TIFFs that came with Tesseract and the one I generated that doesn't work:

[rday@snapper tesseract-2.04]$ identify *.tif
eurotext.tif TIFF 1024x800 1024x800+0+0 PseudoClass 2c 1e+02kb 
phototest.tif[1] TIFF 640x480 640x480+0+0 PseudoClass 2c 38kb 
image3.tif[2] TIFF 1701x2800 1701x2800+0+0 DirectClass 1.1mb 
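
The plain "identify" output above does not show the compression scheme, which is what the Tesseract error complains about. Here is a minimal sketch for printing it, assuming ImageMagick's -format option with the %f (file name) and %C (compression type) escapes. The function name show_compression is made up for illustration:

```shell
# Print "<file name>: <compression type>" for each image given.
# Assumes ImageMagick's "identify" is on PATH.
show_compression() {
    identify -format '%f: %C\n' "$@"
}

# Usage: show_compression *.tif
```

A TIFF that Tesseract can read should report no compression here.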
  • The trick is to tell "convert" to use no compression, like so:
[rday@snapper tesseract-2.04]$ convert -compress none /home/rday/Documents/devel/Image\ \(3\).jpg image3.tif
  • The uncompressed test TIFF was 13.6 MB, compared with 1.1 MB for the compressed one that failed.
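
For more than one page, the two steps could be wrapped up roughly like this. It is only a sketch: it assumes "convert" and a "tesseract" binary are on PATH (my test build lives at ccmain/tesseract instead), and the function names tif_name and ocr_page and the pages/ directory are hypothetical.

```shell
# Derive the TIFF name for a source image,
# e.g. "/some/dir/Image (3).jpg" -> "Image (3).tif".
tif_name() {
    base=$(basename "$1")
    printf '%s.tif\n' "${base%.*}"
}

# Convert one page to an uncompressed TIFF in the current directory,
# then run OCR on it. The second tesseract argument is the output base
# name; as far as I can tell, tesseract adds the ".txt" suffix itself.
ocr_page() {
    src=$1
    tif=$(tif_name "$src")
    convert -compress none "$src" "$tif" || return 1
    tesseract "$tif" "${tif%.tif}"
}

# Usage (hypothetical directory of page photos):
#   for page in pages/*.jpg; do ocr_page "$page"; done
```

At roughly 13 MB per uncompressed page, it is probably worth deleting each TIFF once its text file has been written.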

The OCR results were impressive even on a dirty scan that would probably be typical of the pages that I would encounter. I found four errors in the test page. That is encouraging enough to go ahead with a more ambitious test.