Monday 20 May 2013

Fixing Database Corruption in a VLDB

A client of mine phoned me last week with an error message that had appeared in their application. My worst fears were confirmed when I heard the words "logical consistency-based I/O error".

The error message reported was:
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x31b2521f; actual: 0x4bd50f4c). It occurred during a read of page (3:8191715) in database ID 17 at offset 0x00000f9fdc6000 in file 'C:\MSSQL\Data\Data2.ndf'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.


The first thing to do is identify the database using the database ID presented in the error message, which in my case is 17.

USE master;
GO 
SELECT *
FROM sys.databases
WHERE database_id = 17;
 


So my corrupted database is a VLDB and the client will not be happy with the amount of downtime required to perform DBCC CHECKDB. The last full backup was taken almost 18 hours earlier and the client does not want to lose a full day's worth of work. The database is in simple recovery model so page level restores are out of the question. My task was simple - recover the page without losing data and without serious down time.

My solution was actually rather simple, but it requires you to have a good copy of the corrupted data. This can be a recent backup taken before the corruption or, as in my case, a subscriber with a valid copy of the data.

The very first step for me was to backup the corrupted database and restore it onto another server. The purpose of this is to run a full DBCC CHECKDB to identify if we have more corruption somewhere in the database.

DBCC CHECKDB() WITH NO_INFOMSGS, ALL_ERRORMSGS;

This would take an estimated 16 hours so we can check the results later. I was lucky enough to have a situation where I had no further corruption.

The next step is to identify the object in the error message. The error message states which page is corrupted, in our case (3:8191715). We can use the undocumented command DBCC PAGE to fetch the page header.

-- Trace to send output to screen
DBCC TRACEON (3604)
-- and print the page contents on the screen
-- DBCC PAGE (database_id, file_id, page_id, print option)

DBCC PAGE (17, 3, 8191715, 1);

In the page header you will find Metadata: ObjectID and Metadata: IndexID. It is important to note that if you have any IndexID > 1 then it refers to a secondary index and you can simply drop and recreate the affected index. IndexID 0 refers to a heap and IndexID 1 refers to a clustered index in which case we will need to fix the problem. I was dealing with a clustered index on a large table.

Now we know which table is affected we must first identify any further corrupt pages in our table. We can use DBCC CHECKTABLE to check one table.

DBCC CHECKTABLE('MyTable', NOINDEX) WITH NO_INFOMSGS, ALL_ERRORMSGS;

In my case I was dealing with only one page. If you have more pages then repeat the next step for each page BEFORE running the repair.

Now we know which pages are corrupt we must find out which rows are on this page. First find out what columns your clustered key is made up of, in my case the column ID. We must now output the contents of the page using print option 3 in DBCC PAGE.

-- Trace to send output to screen
DBCC TRACEON (3604)
-- and print the page contents on the screen
-- DBCC PAGE (database_id, file_id, page_id, print option)

DBCC PAGE (17, 3, 8191715, 3);

The output of this command will list the page slots, so look for "Slot 0 column 1" after the header. You will find the value of the first ID column, in my case 36240, and at the end of the page look for column 1 in the last Slot number and you will find the largest ID for that page, in my case 36450. Now you know which rows are in the affected page you can fetch those rows from a valid backup or other valid copy.

INSERT INTO MyTable_BACKUP
SELECT * FROM MyTable
WHERE ID BETWEEN 36240 AND 36450;

Copy the data into a staging table and run the repair, which will delete the corrupted page from the database file.

DBCC CHECKTABLE('MyTable') WITH ALLOW_DATA_LOSS;


Now the page is gone we can simply restore the missing rows from our backup.

INSERT INTO MyTable
SELECT * FROM MyTable_BACKUP;


Your VLDB is now fixed.

On a side note, if you have replication as I do then you will need to delete the missing rows from the subscriber before inserting them into the repaired table. You do not need to disable replication while running the repair.


No comments:

Post a Comment