Thursday, February 9, 2012

RAC crash and ORA-27504 error

A few days ago, at a customer site, a RAC 10g cluster went down. The RAC has 2 nodes, rac1 and rac2 (with ASM).
It happened that on the rac2 node the CPU fan stopped working and all local disks (2 in RAID 1) were broken, so rac1 was the only node I managed to connect to.

On rac1, CRS was healthy, but the ASM and DB instances were down??

After a little investigation, I found an error in the ASM alert log, something about "instance stopped ** ORA-27504"!
I tried to start ASM manually, but got this:

After reading Metalink and forums, I found that it could be something with the interconnect.
This was helpful:  RAC Single Instance (ASM) startup fails with ORA-27300/ORA-27301/ORA-27302/ORA-27303 [ID 331934.1]
From the Note:
"This is caused by the private NIC is not connected to a switch. The NIC driver detects that there is no cable connected and, as oracle tests the state of the NIC, the ASM instance startup fails with above error. Similarly any RAC database single instance would also fails with the same error stack."

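The Note's point is that the instance tests the link state of the private NIC. A minimal sketch of checking that by hand, assuming the nodes run Linux (the OS is not stated in this post) and using the kernel's /sys/class/net/<interface>/carrier file. The function name link_state and the interface name eth1 are my own placeholders, not anything from Oracle:

```python
from pathlib import Path


def link_state(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return 'up', 'down', or 'unknown' for a Linux network interface.

    Assumption: Linux sysfs layout, where the 'carrier' file holds 1
    when the NIC sees a link and 0 when no cable/switch is detected.
    """
    carrier = Path(sysfs_root) / iface / "carrier"
    try:
        value = carrier.read_text().strip()
    except OSError:
        # The interface does not exist, or it is administratively down
        # (the kernel then refuses to report a carrier value).
        return "unknown"
    if value == "1":
        return "up"
    if value == "0":
        return "down"
    return "unknown"
```

Calling link_state("eth1") on the node, with eth1 replaced by the real interconnect interface, would be expected to return "down" in a situation like the one in the Note, where the private NIC has no link to a switch.
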
So then we focused on the network and system logs, and found out that 2 days earlier there had been some switching between the interconnect network interfaces because of a failure.
After we called the customer to check, they reported that the switch had also died! :-)
OK, while the customer was finding a new switch, I wanted to try to start that ASM instance the way the error suggested: set _disable_interface_checking = TRUE
After changing init+ASM1.ora and running the startup command, I got an error about a wrong init.ora!
Fortunately, the switch was replaced with a new one in a short time, so I started the ASM instance after that without problems. The DB instance also started after that, and everybody was happy.

But it was bugging me: can I somehow use that parameter in a case like this to start ASM without checking the interface?
So, after a lot of Googling and reading, I opened an SR on Metalink and found out this:

1) You cannot modify it manually and put it in init.ora.
2) You should first fix the network interface/switch issues, then restart the DB and ASM, then issue the command below for future use.

3) Run it for both the ASM and the database instance.

And, in the future, you should not get this particular error.

Every day we learn something, right? :-)