Monday, May 17, 2010

Real world example what it can cause using hidden parameters without Oracle Support’s direct control

As we know, there are many undocumented initialization parameters in Oracle Database and sometimes it is called "hidden parameters". These parameters is critical tool for Oracle Support Engineers in case of emergency problems relating customers At the same time, Oracle strongly recommends using these parameters only and only under Oracle Support’s direct control and confirmation. Otherwise, database may be corrupted and Oracle may not support it.

Here, I would like to write real world example what happened to me when I used hidden parameter without informing and confirmation by Oracle Support.

Database: 10.2.0.4
OS: HP-UX
Database size: 2 Tb
Business availability: 24*7

22:40 PM – Database Server rebooted itself. During OS startup, Oracle tried to startup and open database. Database mounted, unfortunately there was problem during crash recovery case:

From alert log:

RECOVERY OF THREAD 1 STUCK AT BLOCK 902035 OF FILE 6
Sat Apr 10 23:01:23 2010
Aborting crash recovery due to error 1172



23:30 PM – Fortunately (but this happiness was temporary, pls read further), datafile 6 was INDEX tablespace, tried manual recovery, but no success:

ORA-00600: internal error code, arguments: [3020], [6], [902035], [26067859], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 6, block# 902035)
ORA-10564: tablespace XXX
ORA-01110: data file 6: 'xxx'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 12918
Sat Apr 10 23:23:04 2010
Media Recovery failed with error 600

Checking standby database, the same problem is there.

That is too bad, below error means 'STUCK RECOVERY'. There is an inconsistency between the information stored in the redo and the information stored in a database block being recovered.

The only solution was to restore database and recover until time (when first ORA-00600 appeared).

Database size is enough huge and restoring database and recovering would take too much time. I decided to mark the problem block (902035) as corrupt one and allowing recovery to proceed.

01:30 AM
SQL>ALTER DATABASE RECOVER datafile 6 allow 1 corruption;
Database altered.

The ALLOW integer CORRUPTION clause lets you specify, in the event of logfile corruption, the number of corrupt blocks that can be tolerated while allowing recovery to proceed.

Starting manual recovery, Oracle marked block (902035) as corrupt one and did not raise error (it is expected behavior and error will be written into alert log).
From alert log:

Media Recovery Start
ALLOW CORRUPTION option must use serial recovery
Recovery of Online Redo Log: Thread 1 Group 3 Seq 5237 Reading mem 0
CORRUPTING BLOCK 902035 OF FILE 6 AND CONTINUING RECOVERY
ORA-10567: Redo is inconsistent with data block (file# 6, block# 902035)
ORA-10564: tablespace PMS_INDEXES
ORA-01110: data file 6: '/rtb/admin/u03/ORACLE/admin/tspace/pmsind1.dbf'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 12918
Sun Apr 11 02:22:46 2010
Media Recovery Complete


Oh my God! Now, the same problem with UNDO tablespace!

Recovery of Online Redo Log: Thread 1 Group 3 Seq 5237 Reading mem 0
RECOVERY OF THREAD 1 STUCK AT BLOCK 105 OF FILE 2
Aborting crash recovery due to error 1172
ORA-01172: recovery of thread 1 stuck at block 105 of file 2
ORA-01151: use media recovery to recover block, restore backup if needed
ORA-1172 signalled during: alter database open...

02:20 AM – Thinking… The first problem was related to INDEX tablespace and losing it should not make so big problem. However, there was problem with REDO and UNDO which is critical component of consistent Oracle database. Oracle is so smart in terms of consistent database issues, I decided to try the same operations (marking blocks as corrupt) with UNDO tablespace. If something goes wrong, Oracle should not let me open database and I would be forced to restore and recover until time.

SQL>ALTER DATABASE RECOVER datafile 2 allow 1 corruption;
Database altered.

From alert log:

Media Recovery Start
ALLOW CORRUPTION option must use serial recovery
CORRUPTING BLOCK 148066 OF FILE 2 AND CONTINUING RECOVERY
Sun Apr 11 02:27:35 2010
Errors in file /rtb/app/oracle/10.2.0/rdbms/log/admin_ora_5074.trc:
ORA-10567: Redo is inconsistent with data block (file# 2, block# 105)
ORA-10560: block type 'KTU UNDO BLOCK'
Media Recovery CompleteCompleted: ALTER DATABASE RECOVER datafile 2 allow 1 corruption


Well,

02:35 AM
SQL>ALTER DATABASE OPEN;

From alert log:

Started redo scan
Completed redo scan
52196 redo blocks read, 9534 data blocks need recovery
Started redo application at
Thread 1: logseq 5237, block 643846
Recovery of Online Redo Log: Thread 1 Group 3 Seq 5237 Reading mem 0
Sun Apr 11 02:28:19 2010
Completed redo application

Here I though database will be opened, unfortunately but it was not:

ORA-01578: ORACLE data block corrupted (file # 2, block # 105)
ORA-01110: data file 2: '/rtb/admin/u01/ORACLE/admin/tspace/undopms.dbf'
Sun Apr 11 02:28:23 2010
Error 1578 happened during db open, shutting down database
USER: terminating instance due to error 1578

Block 105 is the one which we corrupted earlier in order to by pass stuck recovery. But now, Oracle needs to read that block and it could not. Restoring that block could success if it was not changed from last backup. Because, if I restore that block from backup, then I would have to recover it. Calculations showed that it will take so much time. Skipping this procedure.

Well, what can I do?

Hopefully there was UNDO segment which is needed by Oracle during recovery phase. Decided to go with hidden parameter _offline_rollback_segments. This tells Oracle which rollback segments have to be forcefully offlined.

Added below event into init.ora file:
event = 10015 trace name context forever, level 10

After starting up database, trace file will be generated in USER_DUMP_DEST. The trace file has a message that says "error recovering tx(#, #) object #". Tx(#, #) refers to the transaction information and the object # is the object id of the object that has a corruption. Checked trace file and found that problem rollback segment was: “_SYSSMU7$”. Added _offline_rollback_segments= _SYSSMU7$ line into init file and restarted database and good news!

From alert log:

Completed: ALTER DATABASE OPEN

But, wait: after opening database, frequent errors appeared:

ORA-12012: error on auto execute of job 423
ORA-01578: ORACLE data block corrupted (file # 2, block # 105)
ORA-01110: data file 2: 'XXX

Well, Tried to create new UNDO tablespace and assign it active one. Restarted database, however the same error in alert log:

ORA-12012: error on auto execute of job 423
ORA-01578: ORACLE data block corrupted (file # 2, block # 105)
ORA-01110: data file 2: 'XXX

Dropped old UNDO tablespace – success. But real problems started from that time:

Massive and strange ORA-00600 in alert log:

Errors in file xxxx.trc:
ORA-00600: internal error code, arguments: [25027], [21], [8536530], [], [], [],
ORA-00600: internal error code, arguments: [25027], [21], [8536530], [], [], [],
ORA-00600: internal error code, arguments: [25027], [21], [8536530], [], [], [],
….

CPU is 100%, too bad performance in database and so on.

Well, quick search on ORA-600 lookup tool, it tells hopefully, Oracle DATA DICTIONARY corrupted. What does it mean, I had not to drop the rollback segment once the database is up. Because it forced the active transactions in the rollback segment to be committed and so we might have to rebuild the database. Instead, once the database is brought up, I had to drop offlined rollback segment first, then drop tablespace itself.


04:00 AMToo bad – I corrupted database. What I could do, is I have to rebuild database. Rebuild means either restore and recover database until time and at least additional 5 hours (2 TB) or plan something else.

I decided to use Transportable Tablespaces feature. Idea was that:

Export dump file of business tablespaces from problematic database.
Export dump file of logical structure (FULL=y ROWS=n) from problematic database
Create new database and import business tablespaces into new database using transportable tablespace feature
Import remain objects (procedure, packages, dblinks and etc) into new database.
Open database

It was successful and database opened.

However, it was great lesson what it can cause using hidden parameters without Oracle Support’s direct control.

5 comments:

Babu said...

Nice..

Babu
http://babudba.blogspot.com

Teymur Hajiyev said...

Welcome, Babu!

Unknown said...

Big risk, but great result

Punit said...

Really a god Job,
Similar situation i faced long time back but solved the problem after long hours by using hidden paramters.
but that happened in the day time.

Hayat Mohammad Khan said...

It nice article, we are facing this problem now.

We have shipped all the datafiles related to index table space showing courrpution, but it now asking for other data files which are not related to them...

ORA-10564: tablespace CASHL_IDX
ORA-01110: data file 31: '/dev/rcashdat14'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 559991
Wed Mar 19 12:01:51 2014

Asking for file# 50, which is not related to this index

Regards

Hayat Mohammad Khan
Senior DBA
PTCL,
Pakistan