Difference between revisions of "Book scanning"

From Finninday

Latest revision as of 19:36, 23 December 2009

We have a lot of old books that I would like to scan and index.

I'm inspired by the vibrant community that is forming around this website: http://www.diybookscanner.org/

The scanner-building project looks like fun, but I want to make sure the results will be worthwhile before I plug in my soldering iron and circular saw.

So, first, some proof-of-concept work with the software components.

Software

  • I have a Canon Powershot S60, so I thought that it would be easy to put CHDK on it, but I was wrong. My camera is too old to be supported. Julie has a Canon SD30, which is supported.
  • It looks like Tesseract is the low-hanging fruit in the OCR field. I have been able to download, build, and run a test on it. Out of the box, it only understands TIFF. But it seems there are flavors of TIFF, and the ImageMagick "convert" created a TIFF that made Tesseract barf:
[rday@snapper tesseract-2.04]$ ccmain/tesseract image3.tif image3.txt
Tesseract Open Source OCR Engine
read_tif_image:Error:Illegal image format:Compression
ccmain/tesseract:Error:Read of file failed:image3.tif
Segmentation fault

Here is what "identify" says about the two working TIFFs that came with Tesseract and the one I generated that doesn't work:

[rday@snapper tesseract-2.04]$ identify *.tif
eurotext.tif TIFF 1024x800 1024x800+0+0 PseudoClass 2c 1e+02kb 
phototest.tif[1] TIFF 640x480 640x480+0+0 PseudoClass 2c 38kb 
image3.tif[2] TIFF 1701x2800 1701x2800+0+0 DirectClass 1.1mb 
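
The default "identify" listing above does not show the compression scheme, which is what the Tesseract error complains about. Here is a minimal sketch of how it could be printed, assuming the installed ImageMagick supports the %C ("image compression type") format escape and reports uncompressed files as "None". The function names are my own, not part of ImageMagick or Tesseract.

```shell
#!/bin/sh
# Print the compression scheme ImageMagick reports for an image file.
# Assumption: the %C format escape is available in this ImageMagick build.
tiff_compression() {
    identify -format '%C\n' "$1"
}

# Hypothetical helper: succeed when a reported scheme means "no compression".
# Assumption: uncompressed files are reported as "None".
is_uncompressed() {
    case "$1" in
        None|none|NONE) return 0 ;;
        *) return 1 ;;
    esac
}
```

For a single-image file this could be used as: is_uncompressed "$(tiff_compression image3.tif)" || echo "image3.tif is compressed"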
  • The trick is to use no compression like so:
[rday@snapper tesseract-2.04]$ convert -compress none /home/rday/Documents/devel/Image\ \(3\).jpg image3.tif
  • The uncompressed test TIFF was 13.6MB.
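
The conversion step above can be wrapped so that file names with spaces and parentheses, like "Image (3).jpg", need no manual escaping. This is only a sketch: the helper names tif_name and jpg_to_uncompressed_tif are hypothetical, and it assumes ImageMagick's "convert" is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: derive the TIFF name for a JPEG path,
# e.g. "Image (3).jpg" -> "Image (3).tif". Fails for non-.jpg names.
tif_name() {
    case "$1" in
        *.jpg|*.JPG) printf '%s.tif\n' "${1%.*}" ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: write an uncompressed TIFF next to the JPEG
# and print the name of the file that was written.
# Quoting "$1" and "$out" keeps unusual file names intact.
jpg_to_uncompressed_tif() {
    out=$(tif_name "$1") || { echo "skipping $1: not a .jpg file" >&2; return 1; }
    convert -compress none "$1" "$out" && printf '%s\n' "$out"
}
```

Calling jpg_to_uncompressed_tif "Image (3).jpg" would then run the same "convert -compress none" command as above and write "Image (3).tif" beside the original.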

The OCR results were impressive even on a dirty scan that would probably be typical of the pages that I would encounter. I found four errors in the test page. That is encouraging enough to go ahead with a more ambitious test.

  • I followed the directions on the Tesseract page and recompiled after installing libtiff-devel. That gave Tesseract the ability to read a compressed TIFF.
  • When testing with a compressed TIFF, the file was 1.1MB in size and it took 1.7 seconds to convert it to text.
  • Post-processing of the image files can be done with Scan Tailor, which is packaged for Ubuntu.

I wasn't sure just how bad an idea it was to capture page images with my Canon S60 (5 megapixels) on its normal settings, which generates JPEGs of about 2000 x 2500 pixels. I converted them to TIFF, tried to read them with Tesseract, and came up with gibberish. So now I know it is a very bad idea.

I also wrote this to string together the steps from jpg to text:

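#!/usr/bin/perl
use strict;
use warnings;

# Convert each JPEG named on the command line to TIFF, then OCR it.
print "I found " . scalar(@ARGV) . " parameters.\n";
foreach my $in (@ARGV) {
    # Derive the TIFF name; skip anything that is not a .jpg file,
    # otherwise convert would be asked to overwrite its own input.
    (my $out = $in) =~ s/\.jpg$/.tif/i
        or do { warn "skipping $in: not a .jpg file\n"; next; };
    print "convert $in to $out... ";
    # List-form system() bypasses the shell, so names with spaces or
    # parentheses (such as "Image (3).jpg") are passed through intact.
    system('convert', $in, $out) == 0
        or do { warn "convert failed for $in\n"; next; };
    print "scan $out... ";
    # tesseract appends ".txt" to the output base name itself.
    (my $txt = $out) =~ s/\.tif$//;
    system('tesseract', $out, $txt) == 0
        or warn "tesseract failed for $out\n";
    print "\n";
}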