Stinkerbelle hard disk failure
I'm jumping the gun a little by titling this page "hard disk failure". But I think the evidence is pretty compelling so far.
Symptoms
After an unidentified update on the latest ubuntu channel, the laptop started booting very slowly. But it seemed to work once it finally booted. Normally, any failure after an update is fixed by rebuilding the NVIDIA driver to match the newly updated kernel. But that wasn't the problem this time.
Certain files seem to be unreadable:
root@Stinkerbelle:/var/log# grep "dev sda, sector" syslog grep: syslog: Input/output error
When booting was taking a long time, I switched to a text console and saw disk errors every three seconds, like this:
tail -50 /var/log/kern.log
Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048486] ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048491] ata3.00: irq_stat 0x40000008 Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048512] ata3.00: failed command: READ FPDMA QUEUED Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048517] ata3.00: cmd 60/08:00:90:e3:57/00:00:14:00:00/40 tag 0 ncq 4096 in Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048519] res 41/40:00:90:e3:57/00:00:14:00:00/40 Emask 0x409 (media error) <F> Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048521] ata3.00: status: { DRDY ERR } Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.048541] ata3.00: error: { UNC } Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050470] ata3.00: configured for UDMA/133 Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050521] sd 2:0:0:0: [sda] Unhandled sense code Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050523] sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050527] sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor] Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050531] Descriptor sense data with sense descriptors (in hex): Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050533] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050560] 14 57 e3 90 Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050564] sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050587] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 14 57 e3 90 00 00 08 00 Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050596] end_request: I/O error, dev sda, sector 341304208 Feb 11 13:27:33 Stinkerbelle kernel: [ 3498.050651] ata3: EH complete
What does the error mean?
From a whitepaper about Serial ATA Native Command Queuing (http://www.seagate.com/content/pdf/whitepaper/D2c_tech_paper_intc-stx_sata_ncq.pdf)
FPDMA
- First party direct memory access
NCQ
- Native command queuing
UNC
- Uncorrectable Data: An ECC error in the data field could not be corrected (a media error or read instability).
Other instances of this bug:
Lots of bug reports from March 2010 to now. No real solutions or workarounds. Some people are convinced that it is a software problem, other are convinced it is hardware.
- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/550559
- http://ubuntuforums.org/archive/index.php/t-1034762.html
Tutorial on sneaky way to force a bad block
How many sectors have been reallocated because of bad writes?
root@Stinkerbelle:/etc/modprobe.d# smartctl -a /dev/sda | grep -i reallocated 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
The zeroes at the end indicate that no sectors have been reallocated for bad writes. This problem appears to be with bad reads. Using the technique above, I should be able to get those bad sectors reallocated by triggering a write to the bad sector and watch those counters go up. And then hopefully the read errors will go away.
hdparm -i /dev/sda
root@Stinkerbelle:~# hdparm -i /dev/sda /dev/sda: Model=WDC WD3200BEVT-22ZCT0, FwRev=11.01A11, SerialNo=WD-WXE308DU5524 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50 BuffType=unknown, BuffSize=8192kB, MaxMultSect=16, MultSect=off CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=625142448 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 AdvancedPM=yes: unknown setting WriteCache=enabled Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5,6,7 * signifies the current active mode
fsck finds nothing wrong
I booted to Ubuntu's recovery mode and ran fsck, which found no errors to correct.
Still, I should really boot with a rescue disk since the problem is manifesting in the root partition.
root@Stinkerbelle:/var/log# blkid /dev/sda1: UUID="2bad8eee-9489-43b8-b0d0-4f1121a0222e" TYPE="ext4" /dev/sda5: UUID="52976c5c-ebec-4b0c-8ab2-434be2b91034" TYPE="swap"
Since it is an ext4 filesystem, I should try running these when I can boot from a rescue disk
e2fsck -f /dev/sda1
Force checking even if the file system seems clean
e2fsck -p /dev/sda1
Preen. Auto repair safely
e2fsck -t /dev/sda1
print timing statistics
-v verbose
The I/O errors seem to come from a limited list of sectors:
310846776 341304208
fsck -cc /dev/sda1
root@ubuntu:~# fsck /dev/sda1 fsck from util-linux-ng 2.17.2 e2fsck 1.41.11 (14-Mar-2010) /dev/sda1: clean, 494829/18767872 files, 12976089/75070464 blocks
fsck thinks the disk is clean. That's not very helpful.
So I force it to run.
root@ubuntu:~# fsck -f /dev/sda1 fsck from util-linux-ng 2.17.2 e2fsck 1.41.11 (14-Mar-2010) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information /dev/sda1: 494829/18767872 files (0.6% non-contiguous), 12976089/75070464 blocks
It still comes up clean. So I tell it to do a check for bad blocks.
root@ubuntu:~# fsck -cc /dev/sda1 fsck from util-linux-ng 2.17.2 e2fsck 1.41.11 (14-Mar-2010) Checking for bad blocks (non-destructive read-write test) Testing with random pattern: 1.04% done, 5:04 elapsed
Checking for bad blocks takes a long time. Next time, I'll probably try just manually marking the bad sectors. This run is about 1% done after 5 minutes. So I could be looking at 8 hours to look for bad blocks with the read-write test. But when it is done, it should mark all the currently bad blocks.
Here is the full run of fsck -cc /dev/sda1. It took about 8 hours to do the bad block check.
root@ubuntu:~# fsck -cc /dev/sda1 fsck from util-linux-ng 2.17.2 e2fsck 1.41.11 (14-Mar-2010) Checking for bad blocks (non-destructive read-write test) Testing with random pattern: done /dev/sda1: Updating bad block inode. Pass 1: Checking inodes, blocks, and sizes Running additional passes to resolve blocks claimed by more than one inode... Pass 1B: Rescanning for multiply-claimed blocks Multiply-claimed block(s) in inode 131571: 42662770 42662771 42662802 Multiply-claimed block(s) in inode 131572: 38855591 38855592 Multiply-claimed block(s) in inode 132012: 40573867 40573868 40573869 40573871 40573872 40574408 40574440 Multiply-claimed block(s) in inode 132068: 44957920 44958012 44958013 Multiply-claimed block(s) in inode 132609: 42189764 Pass 1C: Scanning directories for inodes with multiply-claimed blocks Pass 1D: Reconciling multiply-claimed blocks (There are 5 inodes containing multiply-claimed blocks.) File /var/log/syslog (inode #131571, mod time Wed Feb 15 01:37:14 2012) has 3 multiply-claimed block(s), shared with 1 file(s): <The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012) Clone multiply-claimed blocks<y>? yes Error reading block 42662770 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 42662771 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 42662802 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes File /var/log/syslog.1 (inode #131572, mod time Thu Jan 12 17:35:01 2012) has 2 multiply-claimed block(s), shared with 1 file(s): <The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012) Clone multiply-claimed blocks<y>? yes Error reading block 38855591 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 38855592 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes File /var/log/kern.log (inode #132012, mod time Wed Feb 15 01:37:14 2012) has 7 multiply-claimed block(s), shared with 1 file(s): <The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012) Clone multiply-claimed blocks<y>? yes Error reading block 40573867 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 40573868 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 40573869 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 40573871 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 40573872 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 40574440 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes File /var/log/kern.log.1 (inode #132068, mod time Sun Jan 22 17:31:50 2012) has 3 multiply-claimed block(s), shared with 1 file(s): <The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012) Clone multiply-claimed blocks<y>? yes Error reading block 44957920 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 44958012 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Error reading block 44958013 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes File /var/log/pm-powersave.log (inode #132609, mod time Tue Feb 14 13:41:28 2012) has 1 multiply-claimed block(s), shared with 1 file(s): <The bad blocks inode> (inode #1, mod time Wed Feb 15 01:40:28 2012) Clone multiply-claimed blocks<y>? yes Error reading block 42189764 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Free blocks count wrong for group #0 (22519, counted=22503). Fix<y>? yes Free blocks count wrong for group #1185 (12309, counted=12311). Fix<y>? yes Free blocks count wrong for group #1238 (12307, counted=12314). Fix<y>? yes Free blocks count wrong for group #1287 (978, counted=979). Fix<y>? yes Free blocks count wrong for group #1301 (1384, counted=1387). Fix<y>? yes Free blocks count wrong for group #1372 (15604, counted=15607). Fix<y>? yes /dev/sda1: ***** FILE SYSTEM WAS MODIFIED ***** /dev/sda1: 494829/18767872 files (0.6% non-contiguous), 12976107/75070464 blocks
smartctl -a /dev/sda
I installed and ran smartctl to look for disk hardware errors. It looks like it found some errors, but it also says that the overall health is good.
smartctl 5.41 2011-06-09 r3365 [i686-linux-3.0.0-15-generic-pae] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: Western Digital Scorpio Blue Serial ATA Device Model: WDC WD3200BEVT-22ZCT0 Serial Number: WD-WXE308DU5524 LU WWN Device Id: 5 0014ee 20155be86 Firmware Version: 11.01A11 User Capacity: 320,072,933,376 bytes [320 GB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Sat Feb 11 12:19:41 2012 PST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (10800) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 127) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x303f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2900 3 Spin_Up_Time 0x0027 188 187 021 Pre-fail Always - 1591 4 Start_Stop_Count 0x0032 097 097 000 Old_age Always - 3233 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 9 Power_On_Hours 0x0032 083 083 000 Old_age Always - 12987 10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0 11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 097 097 000 Old_age Always - 3031 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 159 193 Load_Cycle_Count 0x0032 037 037 000 Old_age Always - 489827 194 Temperature_Celsius 0x0022 096 074 000 Old_age Always - 51 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 186 000 Old_age Always - 10 198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0009 100 253 051 Pre-fail Offline - 0 SMART Error Log Version: 1 ATA Error Count: 1128 (device log contains only the most recent five errors) CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register [HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 1128 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 20 26 1e 40 Error: UNC at LBA = 0x001e2620 = 1975840 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 60 08 00 20 26 1e 14 08 00:07:19.724 READ FPDMA QUEUED ef 10 02 00 00 00 00 08 00:07:19.724 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 00 08 00:07:19.724 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 00 08 00:07:19.723 IDENTIFY DEVICE ef 03 46 00 00 00 00 08 00:07:19.723 SET FEATURES [Set transfer mode] Error 1127 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 20 26 1e 40 Error: UNC at LBA = 0x001e2620 = 1975840 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 60 08 00 20 26 1e 14 08 00:07:16.653 READ FPDMA QUEUED ef 10 02 00 00 00 00 08 00:07:16.653 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 00 08 00:07:16.653 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 00 08 00:07:16.652 IDENTIFY DEVICE ef 03 46 00 00 00 00 08 00:07:16.652 SET FEATURES [Set transfer mode] Error 1126 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 20 26 1e 40 Error: UNC at LBA = 0x001e2620 = 1975840 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 60 08 00 20 26 1e 14 08 00:07:13.582 READ FPDMA QUEUED ef 10 02 00 00 00 00 08 00:07:13.582 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 00 08 00:07:13.582 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 00 08 00:07:13.581 IDENTIFY DEVICE ef 03 46 00 00 00 00 08 00:07:13.581 SET FEATURES [Set transfer mode] Error 1125 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 20 26 1e 40 Error: UNC at LBA = 0x001e2620 = 1975840 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 60 08 00 20 26 1e 14 08 00:07:10.511 READ FPDMA QUEUED ef 10 02 00 00 00 00 08 00:07:10.511 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 00 08 00:07:10.510 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 00 08 00:07:10.510 IDENTIFY DEVICE ef 03 46 00 00 00 00 08 00:07:10.510 SET FEATURES [Set transfer mode] Error 1124 occurred at disk power-on lifetime: 12987 hours (541 days + 3 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 40 51 00 20 26 1e 40 Error: UNC at LBA = 0x001e2620 = 1975840 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- 60 08 00 20 26 1e 14 08 00:07:07.151 READ FPDMA QUEUED ef 10 02 00 00 00 00 08 00:07:07.151 SET FEATURES [Reserved for Serial ATA] 27 00 00 00 00 00 00 08 00:07:07.151 READ NATIVE MAX ADDRESS EXT ec 00 00 00 00 00 00 08 00:07:07.150 IDENTIFY DEVICE ef 03 46 00 00 00 00 08 00:07:07.150 SET FEATURES [Set transfer mode] SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Completed without error 00% 3 - # 2 Short offline Completed without error 00% 1 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
I ordered a new disk, but now I'm not so sure it is a hardware problem.
lspci -v[edit]
00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 03) (prog-if 8a [Master SecP PriP]) Subsystem: Acer Incorporated [ALI] Device 0146 Flags: bus master, medium devsel, latency 0, IRQ 19 I/O ports at 01f0 [size=8] I/O ports at 03f4 [size=1] I/O ports at 0170 [size=8] I/O ports at 0374 [size=1] I/O ports at 18a0 [size=16] Kernel driver in use: ata_piix 00:1f.2 SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA AHCI Controller (rev 03) (prog-if 01 [AHCI 1.0]) Subsystem: Acer Incorporated [ALI] Device 0146 Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 46 I/O ports at 18d8 [size=8] I/O ports at 18cc [size=4] I/O ports at 18d0 [size=8] I/O ports at 18c8 [size=4] I/O ports at 18e0 [size=32] Memory at fc304000 (32-bit, non-prefetchable) [size=2K] Capabilities: [80] MSI: Enable+ Count=1/4 Maskable- 64bit- Capabilities: [70] Power Management version 3 Capabilities: [a8] SATA HBA v1.0 Kernel driver in use: ahci Kernel modules: ahci
What changed to cause this?
How do I figure out what was updated recently with apt or dpkg?
ls -lt /var/cache/apt/archives | head
drwxr-xr-x 2 root root 4096 2012-02-11 12:15 partial -rw-r--r-- 1 root root 974052 2012-02-09 13:39 libssl1.0.0_1.0.0e-2ubuntu4.2_i386.deb -rw-r--r-- 1 root root 509818 2012-02-09 13:39 openssl_1.0.0e-2ubuntu4.2_i386.deb -rw-r--r-- 1 root root 60048382 2012-02-08 11:21 acroread_9.4.7-1oneiric1_i386.deb -rw-r--r-- 1 root root 13130 2012-02-08 11:20 acroread-common_9.4.7-1oneiric1_all.deb -rw-r--r-- 1 root root 47388 2012-02-02 13:07 firefox-globalmenu_10.0+build1-0ubuntu0.11.10.1_i386.deb -rw-r--r-- 1 root root 18755900 2012-02-02 13:07 firefox_10.0+build1-0ubuntu0.11.10.1_i386.deb -rw-r--r-- 1 root root 9286 2012-02-02 13:06 firefox-gnome-support_10.0+build1-0ubuntu0.11.10.1_i386.deb -rw-r--r-- 1 root root 409834 2012-02-02 13:05 firefox-locale-en_10.0+build1-0ubuntu0.11.10.1_all.deb
Looking in kern.log, I can see the onset of this problem. The first instance of the error is at Jan 13 07:55:28.
Now, what else happened around that time?
/var/log/apt/history.log
Starting from the time of the first error and looking backward, these are the updates that were applied:
Start-Date: 2012-01-06 17:17:43 Commandline: aptdaemon role='role-commit-packages' sender=':1.2122' Upgrade: firefox-globalmenu:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2), firefox:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2), firefox-gnome-support:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2), xul-ext-ubufox:i386 (1.0-0ubuntu2.1, 1.0.2-0ubuntu0.11.10.1), firefox-locale-en:i386 (8.0+build1-0ubuntu0.11.10.3, 9.0.1+build1-0ubuntu0.11.10.2) End-Date: 2012-01-06 17:18:05 Start-Date: 2012-01-05 07:41:58 Commandline: aptdaemon role='role-commit-packages' sender=':1.1936' Upgrade: ubuntuone-client:i386 (2.0.0-0ubuntu2.2, 2.0.0-0ubuntu2.3), unity:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), gnome-shell:i386 (3.2.1-0ubuntu1, 3.2.1-0ubuntu1.1), unity-lens-applications:i386 (0.4.12-0ubuntu2, 0.4.12-0ubuntu2.1), unity-services:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), banshee:i386 (2.2.1-1ubuntu2, 2.2.1-1ubuntu3), libunity-core-4.0-4:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), banshee-extension-soundmenu:i386 (2.2.1-1ubuntu2, 2.2.1-1ubuntu3), libsyncdaemon-1.0-1:i386 (2.0.0-0ubuntu2.2, 2.0.0-0ubuntu2.3), libwebkitgtk-3.0-0:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), gir1.2-webkit-3.0:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), ibwebkitgtk-3.0-common:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), unity-common:i386 (4.24.0-0ubuntu2b1, 4.24.0-0ubuntu2.1), banshee-extension-ubuntuonemusicstore:i386 (2.2.1-1ubuntu2, 2.2.1-1ubuntu3), sysvinit-utils:i386 (2.88dsf-13.10ubuntu4, 2.88dsf-13.10ubuntu4.1), python-ubuntuone-client:i386 (2.0.0-0ubuntu2.2, 2.0.0-0ubuntu2.3), libwebkitgtk-1.0-common:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), libwebkitgtk-1.0-0:i386 (1.4.3-0ubuntu3, 1.4.3-0ubuntu4), sysv-rc:i386 (2.88dsf-13.10ubuntu4, 2.88dsf-13.10ubuntu4.1), binutils:i386 (2.21.53.20110810-0ubuntu5, 2.21.53.20110810-0ubuntu5.1), initscripts:i386 (2.88dsf-13.10ubuntu4, 2.88dsf-13.10ubuntu4.1) End-Date: 2012-01-05 07:43:05
problem moves to another disk
After running fsck -cc on the root partition and find some over-allocated blocks, I thought the problem was fixed. Then I tried swapping out the old disk that I had no more confidence in, and installing ubuntu on a new disk, I saw the problem reappear on a different device (cd drive):
[ 2904.123144] sr 0:0:0:0: [sr0] Unhandled sense code [ 2904.123150] sr 0:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 2904.123157] sr 0:0:0:0: [sr0] Sense Key : Medium Error [current] [ 2904.123165] sr 0:0:0:0: [sr0] ASC=0x10 <<vendor>> ASCQ=0x90 [ 2904.123173] sr 0:0:0:0: [sr0] CDB: Read(10): 28 00 00 04 72 df 00 00 20 00 [ 2904.123190] end_request: I/O error, dev sr0, sector 1166204 [ 2904.123197] Buffer I/O error on device sr0, logical block 291551 [ 2904.123202] Buffer I/O error on device sr0, logical block 291552 [ 2904.123210] Buffer I/O error on device sr0, logical block 291553 [ 2904.123215] Buffer I/O error on device sr0, logical block 291554 [ 2904.123224] Buffer I/O error on device sr0, logical block 291555 [ 2904.123226] Buffer I/O error on device sr0, logical block 291556 [ 2904.123229] Buffer I/O error on device sr0, logical block 291557 [ 2904.123231] Buffer I/O error on device sr0, logical block 291558 [ 2904.123234] Buffer I/O error on device sr0, logical block 291559 [ 2904.123236] Buffer I/O error on device sr0, logical block 291560 [ 2905.315451] SQUASHFS error: squashfs_read_data failed to read block 0x238e4018 [ 2905.315456] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315458] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315465] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315467] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315471] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315473] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315477] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315479] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315482] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315484] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315488] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315489] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315493] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315495] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315498] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315500] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2905.315505] SQUASHFS error: Unable to read fragment cache entry [238e4018] [ 2905.315506] SQUASHFS error: Unable to read page, block 238e4018, size c6df [ 2954.027018] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [ 2954.218484] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [ 2954.616512] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [ 2954.784208] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [ 2954.973560] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) [ 2997.992568] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
The install failed.
Try this workaround March 26, 2009
http://ubuntuforums.org/showthread.php?t=1034762
It's Kernel Bug on ata ACPI. I put "options libata noacpi=1" on /etc/modprobe.d/options and the ERROR is gone.
Resolution
The annoying symptom is fixed. No more I/O Errors and slow boot. And no more errors in kern.log.
The error that I'm counting in kern.log is this:
Feb 14 14:56:53 Stinkerbelle kernel: [40683.124114] ata3.00: failed command: READ FPDMA QUEUED
Here is the histogram of the error in terms of counts of errors per day:
julie@Stinkerbelle:~$ grep FPDMA /var/log/kern.log | awk '{ print $1, $2};' | uniq -c 18 Jan 23 18 Jan 24 18 Jan 25 18 Jan 26 18 Jan 27 18 Jan 28 30 Jan 29 18 Jan 30 18 Jan 31 18 Feb 1 18 Feb 2 18 Feb 3 18 Feb 4 24 Feb 5 18 Feb 6 18 Feb 7 18 Feb 8 18 Feb 9 18 Feb 10 1450 Feb 11 312 Feb 12 342 Feb 13 143 Feb 14
The pattern repeats back to Jan 15 where the error appeared 18 times per day. This also shows that as of Feb 20, the errors have stopped.
So the resolution was to let fsck look for bad blocks. But I have no reason to believe the disk will not continue to generate bad blocks. I'll be swapping in a new disk soon.
more bad blocks[edit]
Before I swapped in a new disk, the laptop started to manifest the same problem again. This time, it prevented the laptop from booting. It was handy that I still had the USB stick rescue device and it even had a persistence partition that allowed it to remember how to connect to my local wifi.
This time I tried doing a read-only test for bad blocks to see how that would work.
root@ubuntu:~# fsck -c /dev/sda1 fsck from util-linux-ng 2.17.2 e2fsck 1.41.11 (14-Mar-2010) /dev/sda1: recovering journal Error reading block 37279930 (Attempt to read block from filesystem resulted in short read). Ignore error<y>? yes Force rewrite<y>? yes Clearing orphaned inode 9700500 (uid=1001, gid=1001, mode=0100600, size=1) Clearing orphaned inode 11797125 (uid=1001, gid=1001, mode=0100644, size=20291) Truncating orphaned inode 131331 (uid=0, gid=0, mode=0100644, size=2408) Already cleared block #0 (40593919) found in orphaned inode 131331. Checking for bad blocks (read-only test): 92.22% done, 2:17:10 elapsed
new disk[edit]
The old disk continued to manifest the bad blocks problem every few days. Each time, running "fsck -kcc /dev/sda1" would fix the problem, but it would take about 8 hours. My first try at putting in a new disk and reinstalling the OS encountered a suspiciously similar error in the CD install media that I created. I decided not to investigate that further, but tried the install again, this time using a bootable USB stick made for the purpose. That install went smoothly and the system hasn't had any further errors.
Running smartctl on the new disk revealed that it wasn't a new disk at all, but had been pulled from some other system after about 100 days of use. So I'm not entirely sure that this disk will have a long life, but who can guarantee a long life anyway?