Stinkerbelle hard disk failure

I'm jumping the gun a little by titling this page "hard disk failure". But I think the evidence is pretty compelling so far.

1 Symptoms
2 tail -50 /var/log/kern.log
3 What does the error mean?
4 Other instances of this bug:
5 Tutorial on sneaky way to force a bad block
6 How many sectors have been reallocated because of bad writes?
7 hdparm -i /dev/sda
8 fsck finds nothing wrong
9 fsck -cc /dev/sda1
10 smartctl -a /dev/sda
11 lspci -v
12 What changed to cause this?
13 ls -lt /var/cache/apt/archives | head
14 /var/log/apt/history.log
15 problem moves to another disk
16 Try this workaround March 26, 2009
17 Resolution
18 more bad blocks
19 new disk

Symptoms

After an unidentified update on the latest ubuntu channel, the laptop started booting very slowly. But it seemed to work once it finally booted. Normally, any failure after an update is fixed by rebuilding the NVIDIA driver to match the newly updated kernel. But that wasn't the problem this time.

Certain files seem to be unreadable:

root@Stinkerbelle:/var/log# grep "dev sda, sector" syslog
grep: syslog: Input/output error

When booting was taking a long time, I switched to a text console and saw disk errors every three seconds, like this:

tail -50 /var/log/kern.log

Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048486] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048491] ata3.00: irq_stat 0x40000008
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048512] ata3.00: failed command: READ FPDMA QUEUED
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048517] ata3.00: cmd 60/08:00:90:e3:57/00:00:14:00:00/40 tag 0 ncq 4096 in
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048519]          res 41/40:00:90:e3:57/00:00:14:00:00/40 Emask 0x409 (media error) <F>
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048521] ata3.00: status: { DRDY ERR }
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048541] ata3.00: error: { UNC }
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050470] ata3.00: configured for UDMA/133
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050521] sd 2:0:0:0: [sda] Unhandled sense code
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050523] sd 2:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050527] sd 2:0:0:0: [sda]  Sense Key : Medium Error [current] [descriptor]
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050531] Descriptor sense data with sense descriptors (in hex):
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050533]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050560]         14 57 e3 90 
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050564] sd 2:0:0:0: [sda]  Add. Sense: Unrecovered read error - auto reallocate failed
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050587] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 14 57 e3 90 00 00 08 00
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050596] end_request: I/O error, dev sda, sector 341304208
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050651] ata3: EH complete

What does the error mean?

From a whitepaper about Serial ATA Native Command Queuing (http://www.seagate.com/content/pdf/whitepaper/D2c_tech_paper_intc-stx_sata_ncq.pdf)

FPDMA

First party direct memory access

NCQ

Native command queuing

UNC

Uncorrectable Data: An ECC error in the data field could not be corrected (a media error or read instability).

Other instances of this bug:

Lots of bug reports from March 2010 to now. No real solutions or workarounds. Some people are convinced that it is a software problem, other are convinced it is hardware.

Tutorial on sneaky way to force a bad block

http://www.sjvs.nl/?p=12

How many sectors have been reallocated because of bad writes?

root@Stinkerbelle:/etc/modprobe.d# smartctl -a /dev/sda | grep -i reallocated
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

The zeroes at the end indicate that no sectors have been reallocated for bad writes. This problem appears to be with bad reads. Using the technique above, I should be able to get those bad sectors reallocated by triggering a write to the bad sector and watch those counters go up. And then hopefully the read errors will go away.

hdparm -i /dev/sda

root@Stinkerbelle:~# hdparm -i /dev/sda

/dev/sda:

 Model=WDC WD3200BEVT-22ZCT0, FwRev=11.01A11, SerialNo=WD-WXE308DU5524
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=625142448
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=yes: unknown setting WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1,2,3,4,5,6,7

 * signifies the current active mode

fsck finds nothing wrong

I booted to Ubuntu's recovery mode and ran fsck, which found no errors to correct.

Still, I should really boot with a rescue disk since the problem is manifesting in the root partition.

root@Stinkerbelle:/var/log# blkid
/dev/sda1: UUID="2bad8eee-9489-43b8-b0d0-4f1121a0222e" TYPE="ext4" 
/dev/sda5: UUID="52976c5c-ebec-4b0c-8ab2-434be2b91034" TYPE="swap"

Since it is an ext4 filesystem, I should try running these when I can boot from a rescue disk

e2fsck -f /dev/sda1

Force checking even if the file system seems clean

e2fsck -p /dev/sda1

Preen. Auto repair safely

e2fsck -t /dev/sda1

print timing statistics

-v verbose

The I/O errors seem to come from a limited list of sectors:

310846776 341304208

fsck -cc /dev/sda1

root@ubuntu:~# fsck /dev/sda1
fsck from util-linux-ng 2.17.2
e2fsck 1.41.11 (14-Mar-2010)
/dev/sda1: clean, 494829/18767872 files, 12976089/75070464 blocks

fsck thinks the disk is clean. That's not very helpful.

So I force it to run.

root@ubuntu:~# fsck -f /dev/sda1
fsck from util-linux-ng 2.17.2
e2fsck 1.41.11 (14-Mar-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda1: 494829/18767872 files (0.6% non-contiguous), 12976089/75070464 blocks

It still comes up clean. So I tell it to do a check for bad blocks.

root@ubuntu:~# fsck -cc /dev/sda1
fsck from util-linux-ng 2.17.2
e2fsck 1.41.11 (14-Mar-2010)
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern:   1.04% done, 5:04 elapsed

Checking for bad blocks takes a long time. Next time, I'll probably try just manually marking the bad sectors. This run is about 1% done after 5 minutes. So I could be looking at 8 hours to look for bad blocks with the read-write test. But when it is done, it should mark all the currently bad blocks.

Here is the full run of fsck -cc /dev/sda1. It took about 8 hours to do the bad block check.

root@ubuntu:~# fsck -cc /dev/sda1
fsck from util-linux-ng 2.17.2
e2fsck 1.41.11 (14-Mar-2010)
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: done                                
/dev/sda1: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 131571: 42662770 42662771 42662802
Multiply-claimed block(s) in inode 131572: 38855591 38855592
Multiply-claimed block(s) in inode 132012: 40573867 40573868 40573869 40573871 40573872 40574408 40574440
Multiply-claimed block(s) in inode 132068: 44957920 44958012 44958013
Multiply-claimed block(s) in inode 132609: 42189764
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 5 inodes containing multiply-claimed blocks.)

File /var/log/syslog (inode #131571, mod time Wed Feb 15 01:37:14 2012) 
  has 3 multiply-claimed block(s), shared with 1 file(s):
	<The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012)
Clone multiply-claimed blocks<y>? yes

Error reading block 42662770 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 42662771 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 42662802 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

File /var/log/syslog.1 (inode #131572, mod time Thu Jan 12 17:35:01 2012) 
  has 2 multiply-claimed block(s), shared with 1 file(s):
	<The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012)
Clone multiply-claimed blocks<y>? yes

Error reading block 38855591 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 38855592 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

File /var/log/kern.log (inode #132012, mod time Wed Feb 15 01:37:14 2012) 
  has 7 multiply-claimed block(s), shared with 1 file(s):
	<The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012)
Clone multiply-claimed blocks<y>? yes

Error reading block 40573867 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 40573868 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 40573869 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 40573871 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 40573872 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 40574440 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

File /var/log/kern.log.1 (inode #132068, mod time Sun Jan 22 17:31:50 2012) 
  has 3 multiply-claimed block(s), shared with 1 file(s):
	<The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012)
Clone multiply-claimed blocks<y>? yes

Error reading block 44957920 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 44958012 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Error reading block 44958013 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

File /var/log/pm-powersave.log (inode #132609, mod time Tue Feb 14 13:41:28 2012) 
  has 1 multiply-claimed block(s), shared with 1 file(s):
	<The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012)
Clone multiply-claimed blocks<y>? yes

Error reading block 42189764 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (22519, counted=22503).
Fix<y>? yes

Free blocks count wrong for group #1185 (12309, counted=12311).
Fix<y>? yes

Free blocks count wrong for group #1238 (12307, counted=12314).
Fix<y>? yes

Free blocks count wrong for group #1287 (978, counted=979).
Fix<y>? yes

Free blocks count wrong for group #1301 (1384, counted=1387).
Fix<y>? yes

Free blocks count wrong for group #1372 (15604, counted=15607).
Fix<y>? yes


/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: 494829/18767872 files (0.6% non-contiguous), 12976107/75070464 blocks

smartctl -a /dev/sda

I installed and ran smartctl to look for disk hardware errors. It looks like it found some errors, but it also says that the overall health is good.

smartctl 5.41 2011-06-09 r3365 [i686-linux-3.0.0-15-generic-pae] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Scorpio Blue Serial ATA
Device Model:     WDC WD3200BEVT-22ZCT0
Serial Number:    WD-WXE308DU5524
LU WWN Device Id: 5 0014ee 20155be86
Firmware Version: 11.01A11
User Capacity:    320,072,933,376 bytes [320 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Feb 11 12:19:41 2012 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(10800) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 127) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2900
  3 Spin_Up_Time            0x0027   188   187   021    Pre-fail  Always       -       1591
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3233
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       12987
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3031
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       159
193 Load_Cycle_Count        0x0032   037   037   000    Old_age   Always       -       489827
194 Temperature_Celsius     0x0022   096   074   000    Old_age   Always       -       51
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   186   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   100   253   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1128 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1128 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 20 26 1e 40  Error: UNC at LBA = 0x001e2620 = 1975840

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 20 26 1e 14 08      00:07:19.724  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 08      00:07:19.724  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 00 08      00:07:19.724  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:07:19.723  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:07:19.723  SET FEATURES [Set transfer mode]

Error 1127 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 20 26 1e 40  Error: UNC at LBA = 0x001e2620 = 1975840

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 20 26 1e 14 08      00:07:16.653  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 08      00:07:16.653  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 00 08      00:07:16.653  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:07:16.652  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:07:16.652  SET FEATURES [Set transfer mode]

Error 1126 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 20 26 1e 40  Error: UNC at LBA = 0x001e2620 = 1975840

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 20 26 1e 14 08      00:07:13.582  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 08      00:07:13.582  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 00 08      00:07:13.582  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:07:13.581  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:07:13.581  SET FEATURES [Set transfer mode]

Error 1125 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 20 26 1e 40  Error: UNC at LBA = 0x001e2620 = 1975840

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 20 26 1e 14 08      00:07:10.511  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 08      00:07:10.511  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 00 08      00:07:10.510  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:07:10.510  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:07:10.510  SET FEATURES [Set transfer mode]

Error 1124 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 20 26 1e 40  Error: UNC at LBA = 0x001e2620 = 1975840

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 20 26 1e 14 08      00:07:07.151  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 08      00:07:07.151  SET FEATURES [Reserved for Serial ATA]
  27 00 00 00 00 00 00 08      00:07:07.151  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      00:07:07.150  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      00:07:07.150  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         3         -
# 2  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I ordered a new disk, but now I'm not so sure it is a hardware problem.

lspci -v

00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 03) (prog-if 8a [Master SecP PriP])
	Subsystem: Acer Incorporated [ALI] Device 0146
	Flags: bus master, medium devsel, latency 0, IRQ 19
	I/O ports at 01f0 [size=8]
	I/O ports at 03f4 [size=1]
	I/O ports at 0170 [size=8]
	I/O ports at 0374 [size=1]
	I/O ports at 18a0 [size=16]
	Kernel driver in use: ata_piix

00:1f.2 SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA AHCI Controller (rev 03) (prog-if 01 [AHCI 1.0])
	Subsystem: Acer Incorporated [ALI] Device 0146
	Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 46
	I/O ports at 18d8 [size=8]
	I/O ports at 18cc [size=4]
	I/O ports at 18d0 [size=8]
	I/O ports at 18c8 [size=4]
	I/O ports at 18e0 [size=32]
	Memory at fc304000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/4 Maskable- 64bit-
	Capabilities: [70] Power Management version 3
	Capabilities: [a8] SATA HBA v1.0
	Kernel driver in use: ahci
	Kernel modules: ahci

What changed to cause this?

How do I figure out what was updated recently with apt or dpkg?

ls -lt /var/cache/apt/archives | head

drwxr-xr-x 2 root root     4096 2012-02-11 12:15 partial
-rw-r--r-- 1 root root   974052 2012-02-09 13:39 libssl1.0.0_1.0.0e-2ubuntu4.2_i386.deb
-rw-r--r-- 1 root root   509818 2012-02-09 13:39 openssl_1.0.0e-2ubuntu4.2_i386.deb
-rw-r--r-- 1 root root 60048382 2012-02-08 11:21 acroread_9.4.7-1oneiric1_i386.deb
-rw-r--r-- 1 root root    13130 2012-02-08 11:20 acroread-common_9.4.7-1oneiric1_all.deb
-rw-r--r-- 1 root root    47388 2012-02-02 13:07 firefox-globalmenu_10.0+build1-0ubuntu0.11.10.1_i386.deb
-rw-r--r-- 1 root root 18755900 2012-02-02 13:07 firefox_10.0+build1-0ubuntu0.11.10.1_i386.deb
-rw-r--r-- 1 root root     9286 2012-02-02 13:06 firefox-gnome-support_10.0+build1-0ubuntu0.11.10.1_i386.deb
-rw-r--r-- 1 root root   409834 2012-02-02 13:05 firefox-locale-en_10.0+build1-0ubuntu0.11.10.1_all.deb

Looking in kern.log, I can see the onset of this problem. The first instance of the error is at Jan 13 07:55:28.

Now, what else happened around that time?

/var/log/apt/history.log

Starting from the time of the first error and looking backward, these are the updates that were applied:

Start-Date: 2012-01-06  17:17:43
Commandline: aptdaemon role='role-commit-packages' sender=':1.2122'
Upgrade: firefox-globalmenu:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2), 
firefox:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2), 
firefox-gnome-support:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2), 
xul-ext-ubufox:i386 (1.0-0ubuntu2.1, 1.0.2-0ubuntu0.11.10.1), 
firefox-locale-en:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2)
End-Date: 2012-01-06  17:18:05

Start-Date: 2012-01-05  07:41:58
Commandline: aptdaemon role='role-commit-packages' sender=':1.1936'
Upgrade: ubuntuone-client:i386 (2.0.0-0ubuntu2.2, 2.0.0-0ubuntu2.3), 
unity:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), 
gnome-shell:i386 (3.2.1-0ubuntu1, 3.2.1-0ubuntu1.1), 
unity-lens-applications:i386 (0.4.12-0ubuntu2, 0.4.12-0ubuntu2.1), 
unity-services:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), 
banshee:i386 (2.2.1-1ubuntu2, 2.2.1-1ubuntu3), 
libunity-core-4.0-4:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), 
banshee-extension-soundmenu:i386 (2.2.1-1ubuntu2, 2.2.1-1ubuntu3), 
libsyncdaemon-1.0-1:i386 (2.0.0-0ubuntu2.2, 2.0.0-0ubuntu2.3), 
libwebkitgtk-3.0-0:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), 
gir1.2-webkit-3.0:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), 
ibwebkitgtk-3.0-common:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), 
unity-common:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), 
banshee-extension-ubuntuonemusicstore:i386 (2.2.1-1ubuntu2, 2.2.1-1ubuntu3), 
sysvinit-utils:i386 (2.88dsf-13.10ubuntu4, 2.88dsf-13.10ubuntu4.1), 
python-ubuntuone-client:i386 (2.0.0-0ubuntu2.2, 2.0.0-0ubuntu2.3), 
libwebkitgtk-1.0-common:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), 
libwebkitgtk-1.0-0:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), 
sysv-rc:i386 (2.88dsf-13.10ubuntu4, 2.88dsf-13.10ubuntu4.1), 
binutils:i386 (2.21.53.20110810-0ubuntu5, 2.21.53.20110810-0ubuntu5.1), 
initscripts:i386 (2.88dsf-13.10ubuntu4, 2.88dsf-13.10ubuntu4.1)
End-Date: 2012-01-05  07:43:05

problem moves to another disk

After running fsck -cc on the root partition and find some over-allocated blocks, I thought the problem was fixed. Then I tried swapping out the old disk that I had no more confidence in, and installing ubuntu on a new disk, I saw the problem reappear on a different device (cd drive):

[ 2904.123144] sr 0:0:0:0: [sr0] Unhandled sense code
[ 2904.123150] sr 0:0:0:0: [sr0]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 2904.123157] sr 0:0:0:0: [sr0]  Sense Key : Medium Error [current] 
[ 2904.123165] sr 0:0:0:0: [sr0]  ASC=0x10 <<vendor>> ASCQ=0x90
[ 2904.123173] sr 0:0:0:0: [sr0] CDB: Read(10): 28 00 00 04 72 df 00 00 20 00
[ 2904.123190] end_request: I/O error, dev sr0, sector 1166204
[ 2904.123197] Buffer I/O error on device sr0, logical block 291551
[ 2904.123202] Buffer I/O error on device sr0, logical block 291552
[ 2904.123210] Buffer I/O error on device sr0, logical block 291553
[ 2904.123215] Buffer I/O error on device sr0, logical block 291554
[ 2904.123224] Buffer I/O error on device sr0, logical block 291555
[ 2904.123226] Buffer I/O error on device sr0, logical block 291556
[ 2904.123229] Buffer I/O error on device sr0, logical block 291557
[ 2904.123231] Buffer I/O error on device sr0, logical block 291558
[ 2904.123234] Buffer I/O error on device sr0, logical block 291559
[ 2904.123236] Buffer I/O error on device sr0, logical block 291560
[ 2905.315451] SQUASHFS error: squashfs_read_data failed to read block 0x238e4018
[ 2905.315456] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315458] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315465] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315467] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315471] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315473] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315477] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315479] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315482] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315484] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315488] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315489] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315493] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315495] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315498] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315500] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2905.315505] SQUASHFS error: Unable to read fragment cache entry [238e4018]
[ 2905.315506] SQUASHFS error: Unable to read page, block 238e4018, size c6df
[ 2954.027018] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[ 2954.218484] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[ 2954.616512] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[ 2954.784208] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[ 2954.973560] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[ 2997.992568] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)

The install failed.

Try this workaround March 26, 2009

http://ubuntuforums.org/showthread.php?t=1034762

It's Kernel Bug on ata ACPI. I put "options libata noacpi=1" on /etc/modprobe.d/options and the ERROR is gone.

Resolution

The annoying symptom is fixed. No more I/O Errors and slow boot. And no more errors in kern.log.

The error that I'm counting in kern.log is this:

Feb 14 14:56:53 Stinkerbelle kernel: [40683.124114] ata3.00: failed command: READ FPDMA QUEUED

Here is the histogram of the error in terms of counts of errors per day:

julie@Stinkerbelle:~$ grep FPDMA /var/log/kern.log | awk '{ print $1, $2};'  | uniq -c
     18 Jan 23
     18 Jan 24
     18 Jan 25
     18 Jan 26
     18 Jan 27
     18 Jan 28
     30 Jan 29
     18 Jan 30
     18 Jan 31
     18 Feb 1
     18 Feb 2
     18 Feb 3
     18 Feb 4
     24 Feb 5
     18 Feb 6
     18 Feb 7
     18 Feb 8
     18 Feb 9
     18 Feb 10
   1450 Feb 11
    312 Feb 12
    342 Feb 13
    143 Feb 14

The pattern repeats back to Jan 15 where the error appeared 18 times per day. This also shows that as of Feb 20, the errors have stopped.

So the resolution was to let fsck look for bad blocks. But I have no reason to believe the disk will not continue to generate bad blocks. I'll be swapping in a new disk soon.

more bad blocks

Before I swapped in a new disk, the laptop started to manifest the same problem again. This time, it prevented the laptop from booting. It was handy that I still had the USB stick rescue device and it even had a persistence partition that allowed it to remember how to connect to my local wifi.

This time I tried doing a read-only test for bad blocks to see how that would work.

root@ubuntu:~# fsck -c /dev/sda1
fsck from util-linux-ng 2.17.2
e2fsck 1.41.11 (14-Mar-2010)
/dev/sda1: recovering journal
Error reading block 37279930 (Attempt to read block from filesystem resulted in short read).  Ignore error<y>? yes

Force rewrite<y>? yes

Clearing orphaned inode 9700500 (uid=1001, gid=1001, mode=0100600, size=1)
Clearing orphaned inode 11797125 (uid=1001, gid=1001, mode=0100644, size=20291)
Truncating orphaned inode 131331 (uid=0, gid=0, mode=0100644, size=2408)
Already cleared block #0 (40593919) found in orphaned inode 131331.
Checking for bad blocks (read-only test):  92.22% done, 2:17:10 elapsed

new disk

The old disk continued to manifest the bad blocks problem every few days. Each time, running "fsck -kcc /dev/sda1" would fix the problem, but it would take about 8 hours. My first try at putting in a new disk and reinstalling the OS encountered a suspiciously similar error in the CD install media that I created. I decided not to investigate that further, but tried the install again, this time using a bootable USB stick made for the purpose. That install went smoothly and the system hasn't had any further errors.

Running smartctl on the new disk revealed that it wasn't a new disk at all, but had been pulled from some other system after about 100 days of use. So I'm not entirely sure that this disk will have a long life, but who can guarantee a long life anyway?