Production Disasters

Just finished a marathon weekend. On Thursday, a disk on one of our production servers decided that it was time and went foooo! The box has a hot-swap thing, but something went wrong (which is always the case in production) and even the hot-swap was not working. Since the storage came from the filer, all eight production databases were down. The sysadmins got the hardware fixed, and then it was time for the Oracle databases.

Since I was very new to the organization, I wasn’t allowed to check the extent of the damage to the Oracle databases. When the db was started, DBWR complained that a couple of datafiles were missing. So they restored those files from the most recent backup and tried a recovery.

It should have been pretty simple, considering that it was just another recovery of datafiles. The problem was the way the recovery was attempted. After restoring a few datafiles, one doesn’t attempt an incomplete recovery on them. Scanning the alert.log of the db in question, I found that the statement used was ‘recover database until cancel’.

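How would you expect the db to recover from this? The rest of the datafiles are at time, say, t1, while the restored datafiles are back at t1-x. The restored files need a complete recovery to catch up to t1; an incomplete recovery can stop short of that and leave the two sets out of step. Just a tiny mistake is enough to make your weekend schedule go haywire.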
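To make the t1 versus t1-x point concrete, here is a toy model in Python. It is only a sketch: the `Datafile` class, the integer "times", and the file names are made up for illustration, and this is not how Oracle tracks checkpoints internally. It only shows that stopping recovery early leaves the restored files behind the rest.

```python
from dataclasses import dataclass


@dataclass
class Datafile:
    name: str
    checkpoint: int  # hypothetical logical time the file is current to


def recover(files, redo_until):
    """Roll forward every file that is behind, but never past redo_until.

    Files that are already ahead of redo_until are left untouched,
    since recovery cannot move a current file backwards.
    """
    return [Datafile(f.name, max(f.checkpoint, redo_until)) for f in files]


def is_consistent(files):
    """The database can only open normally if all files agree on one time."""
    return len({f.checkpoint for f in files}) == 1


T1 = 100  # time of the surviving datafiles
X = 30    # age of the backup, so restored files sit at T1 - X

files = [
    Datafile("system01.dbf", T1),      # survived the crash
    Datafile("users01.dbf", T1 - X),   # restored from backup
    Datafile("users02.dbf", T1 - X),   # restored from backup
]

complete = recover(files, redo_until=T1)       # apply all the redo
cancelled = recover(files, redo_until=T1 - 10) # cancel before reaching T1

print(is_consistent(complete))   # restored files caught up to T1
print(is_consistent(cancelled))  # restored files still behind system01.dbf
```

In the cancelled case the restored files stop at 90 while the surviving file is still at 100, which is the mismatch described above.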

What would it take to have a tested procedure for disasters? Something that gives the on-call DBA a checklist to analyze the extent of the damage (if any) and lists the steps required to get the db back in production.

After this incident, I’ve made a personal commitment to have a ‘Recovery Plan’ which would address most of the crashes and the steps that are needed to get the db up. This would definitely make the life of the on-call DBA a lot easier.
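As a starting point, such a plan could be as simple as an ordered checklist per crash scenario. The sketch below is in Python and is only an illustration: the `Step` and `RecoveryPlan` classes are hypothetical, and the step wording is my own example for the "a few datafiles lost" case, not a tested procedure. A real plan has to be rehearsed against your own backups.

```python
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    done: bool = False


@dataclass
class RecoveryPlan:
    scenario: str
    steps: list

    def next_step(self):
        """Return the first step not yet done, or None when finished."""
        for step in self.steps:
            if not step.done:
                return step
        return None

    def complete_next(self):
        """Mark the next pending step as done, enforcing the order."""
        step = self.next_step()
        if step is not None:
            step.done = True
        return step


# Example scenario only; the steps are illustrative, not a vetted runbook.
lost_datafiles = RecoveryPlan(
    scenario="A few datafiles lost, the rest of the database intact",
    steps=[
        Step("Check the alert.log for the first error and the files it names"),
        Step("List the files that need recovery (e.g. query V$RECOVER_FILE)"),
        Step("Restore only the damaged datafiles from the latest backup"),
        Step("Run a complete recovery on the restored files, not UNTIL CANCEL"),
        Step("Open the database and confirm the applications can connect"),
    ],
)

while (step := lost_datafiles.next_step()) is not None:
    print("TODO:", step.description)
    lost_datafiles.complete_next()
```

Forcing the steps to be completed in order is the point: the on-call DBA assesses the damage before touching any recovery command.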
