The RAID “Write Hole” Problem

When deploying a RAID array with built-in redundancy, it is easy to become complacent and assume that only a serious, easily detectable fault can cause a failure. There is, however, the “Write Hole” phenomenon, which can occur as the result of a power failure, is almost undetectable, and if left unchecked could result in data corruption. It can affect any RAID scheme that either mirrors data or writes calculated parity to another disk, including RAID 1, RAID 5 and RAID 6.

Although the problem is fairly rare, occurring only if the power fails while data is being written to the RAID, it can lead to serious issues. If data recovery is required at a later date, it is impossible to determine which disk in the RAID was not updated correctly. If you are located in an area that suffers frequent power outages, the problems could quickly mount up without anyone knowing.

Power Failure Data Loss

In an ideal world, when data is written to a RAID array, the data and parity information would be committed to the disks at precisely the same time. However, this is almost never the case, so any interruption in the power at such a critical moment could result in either the data or the parity information not being written or, in the case of a mirrored pair, in one disk still containing a copy of the old data.

If the parity information is incorrect, it will be recalculated if data is subsequently written to the same stripe. If this does not happen, however, the problem remains hidden: should a disk later fail and require a rebuild, the stale parity may restore the old data if the disk that was updated is the one that failed. If, on the other hand, the disk that was not updated holds data rather than parity, there is an immediate problem, which again could go undetected for a long time.
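The stale-parity case can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a single stripe with two data blocks and one XOR parity block, held in memory rather than on real disks. The helper name xor_blocks and the block contents are made up for the illustration.

```python
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length blocks byte by byte, as RAID 5 style parity does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Power fails after the new d0 reaches its disk but before parity is updated.
d0 = b"CCCC"
# parity = xor_blocks(d0, d1)   # this update never happened

# Case 1: the disk holding d0 later fails and is rebuilt from d1 and parity.
rebuilt_d0 = xor_blocks(d1, parity)   # yields the OLD d0, not b"CCCC"

# Case 2: the disk holding d1 later fails and is rebuilt from d0 and parity.
rebuilt_d1 = xor_blocks(d0, parity)   # no longer matches the real d1

print(rebuilt_d0, rebuilt_d1)
```

In the first case the rebuild silently brings back the old data. In the second case a block that was never being written at the time of the power failure is reconstructed incorrectly, which is why the damage can surface far from the original write.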

Will Synchronisation Solve The Problem?

If the problem is confined solely to the parity data or, in the case of a mirrored RAID, to the secondary copy, then resynchronising the array, whereby the parity data is recalculated, will resolve the issue. This recommendation is often made, but it is only useful under these circumstances.

As explained above, if the parity data was updated but the new data never reached the data stripe, resynchronising will not correct the situation, even though the parity data will have been recalculated. Worse, recalculating the parity from the stale data removes any opportunity to recover the correct data from the parity.
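A short sketch shows why resynchronising does not help in this case. It assumes the same simplified model of one stripe with two data blocks and XOR parity; the variable names and block contents are illustrative only, and a real controller would have no way of knowing that the parity is the newer of the two.

```python
def xor_blocks(a, b):
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

old_d0, d1 = b"AAAA", b"BBBB"
new_d0 = b"CCCC"

# Power fails after the parity is updated but before the new data is written.
disk_d0 = old_d0                       # data write was lost
disk_parity = xor_blocks(new_d0, d1)   # parity already reflects the new data

# Before resynchronising, the new data could in principle still be derived.
recoverable_before = xor_blocks(disk_parity, d1)

# Resynchronising recalculates parity from whatever is on the data disks.
disk_parity = xor_blocks(disk_d0, d1)
recoverable_after = xor_blocks(disk_parity, d1)

print(recoverable_before, recoverable_after)
```

After the resynchronisation the stripe is consistent again, but it is consistent with the old data, and the only remaining trace of the new data has been overwritten.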

UPS and Data Recovery

With an uninterruptible power supply (UPS) protecting the RAID system, the “write hole” phenomenon should no longer be an issue, as the RAID should remain running long enough for any pending writes to complete, with a controlled shutdown usually being initiated. It is important, though, to ensure that any installed UPS is regularly checked, as its correct operation is paramount in avoiding data corruption through an unscheduled power outage.

As explained above, if the “write hole” phenomenon occurs, it is impossible to determine which disk contains the correct data. In a few cases, manual inspection by a data recovery specialist may reveal the correct data if a clear error in file system metadata is encountered. It is therefore important to take all necessary precautions to ensure that the “write hole” phenomenon is never an issue, as the consequences can be very serious.
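The difficulty of identifying the correct copy can be seen in a minimal sketch of a mirror comparison. It assumes the two halves of a RAID 1 pair are available as lists of blocks in memory; the function name find_mismatches and the sample blocks are hypothetical and do not correspond to any particular RAID tool.

```python
def find_mismatches(primary, secondary):
    """Return the indices of blocks that differ between two mirror copies."""
    return [i for i, (a, b) in enumerate(zip(primary, secondary)) if a != b]

# Block 1 was being written when the power failed: only one copy was updated.
primary = [b"blk0", b"NEW1", b"blk2"]
secondary = [b"blk0", b"old1", b"blk2"]

mismatches = find_mismatches(primary, secondary)
print(mismatches)
```

The comparison reports that block 1 differs, but nothing in the blocks themselves says which copy is the newer one. Without additional information, such as file system metadata that is visibly inconsistent on one side, either copy could be the correct one.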

