Wednesday, September 19, 2012

STANDBY did not release space in FRA, despite large reclaimable space

At customer site, we have following configuration: 3-node ASM RAC and 1-node ASM standby, both with FRA, also in ASM, version.

Few days ago, we get error, that archiving on standby was stopped.

In alert.log on primary site, on log_archive_dest_2 (to standby) was error:
ORA-07286: sksagdi: cannot obtain device information
In alert.log, on standby site:

Did you see that ?  0.00% used , cca 200GB available. FRA is 250GB size.

I did not want to delete archivelogs, on this site there is not tape to backup archivelogs with RMAN to tape, or disk (there is no space). After I check ASM group that is used for FRA, I saw that there is space to expand FRA:

 SO, to quick resolve error, I increase FRA:

and immediatelly after that, archivelogs start comming and applying.

But, it is bothered me, that, with archivelog delete policy set "TO APPLIED ON ALL STANDBY;", and RMAN regular backups on primary, archivelogs must be deleted and set to reclaimable space immediately after they applied on standby db.

When I query v$recovery_file_dest (before increasing):

Well, Oracle should be now release reclaimable space and start use this new free space . But, it is not.

Here are, how you can check is archivelogs became reclaimable:

The following query will list all the archivelogs included in v$flash_recovery_area_usage, those with rectype of 11 are the ones that are reclaimable (I reduced to last 3 with rownum, to compare with max values in next 2 queries):

which archivelogs are applied:

which archivelogs are tranfered to standby side:   
This tells me that space should be free, because, obviously, archivelogs was marked as reclaimable soon as they applied on standby (max values from last queries is the same) and I opened SR on Metalink, and try investigate further...

In the note Archiver Hung On Ora-15041, FRA not Cleaning Files Automatically. [ID 316281.1], I found the same simptoms as I have, but, that case is if you have  log_archive_dest_1 parameter set directly to some mount point or  ASM group, files will get generated but will not be considered by FRA algorithm for cleaning up when space pressure mounts up.
You must set log_archive_dest_1='location=use_db_recovery_file_dest', then Oracle will check check for files whose space is eligible for reclaiming.

But, I have this already, and in the Note, you have a little guery to verify this:

All 3 locations is YES on IS_RECOVERY_DEST_FILE column, v$archived_log.IS_RECOVERY_DEST_FILE=YES, so it should be automatically  managed.

After a few reviews and uploads with Oracle support, we find that this is BUG !!! :-)

Yes, and this is bug from version :-)  (Bug 14227959: STANDBY DID NOT RELEASE SPACE IN FRA)

So, here are the workarround:

First, you can set this parameter, to provide more detailed information about the space pressure:
and solution (for me that worked):
after that, I decreased FRA from 450 GB to 370 (used files was about 360GB), to check how Oracle will free the space.

Than, I watch alert.log 2-3 days, and space in FRA, and now, it is always something like that:

And, I asked for recomended value, how often I must execute this refresh function for aged files in FRA, and answer was to execute dbms_backup_restore.refreshAgedFiles on standby at regular intervals (once in 1 hour).

At the end, here are all files that I reference for this error:

FRA: Flashback logs are not being deleted when space is needed for archive logs [ID 1481739.1]
V$Flash_recovery_area_usage is not being updated [ID 1471471.1]
Space issue in Flash Recovery Area( FRA ) [ID 829755.1]
Database Crashed With ORA-19815 ORA-19809 ORA-16038 [ID 829254.1]
How is the space pressure managed in the Flash Recovery Area - An Example. [ID 315098.1]

Hope this helps someone.

Tuesday, May 22, 2012

OEM does not send alert notification mails! Did you maybe change SYSMAN password ? :-)

A few days ago, I got a new  db to maintain and administer. After first look, I started EM console, and ask customer for SMTP parameters. 
Of course, there is no SYS and SYSMAN password :-). After consultation with customer, we decide to change SYSMAN password, so I can configure mail alerting from OEM. 
I successfully reset SYSMAN password, and create one account with sysdba privilege, because we don't have SYS password:

 Ok, now is all ready for EM console.. I set SMTP server, mail account, rules, notification, schedule,.. Test mail for account and mail server is coming. But, when I schedule alerting and lower some thresholds for metrics (CPU, tablespace used,..)(don't want to wait some real alert :-) ) to get first alert mail, nothing happened!!!

Hmmm... Double check everything again..nothing...
Than, I found something strange: SYSMAN account is getting locked!
So, I unlock, but, after few seconds, again.. Aha, someone trying to use SYSMAN with old pass, obviosly.
OK, I solved this temporarily with creating another profile with failed_login_attempts = unlimited and alter SYSMAN with that new profile. (SYSMAN was in DEFAULT profile with failed_login_attempts = 10)

 But, still, mail alerting does not working!

Quick check for jobs parameter and do I have some alerts at all ? Quick searching on forums, net , metalink.. nothing useful a lot..
Than, I try one "trick"...I start to create SR on metalink, describe my problem in details with keywords, and on step 3 or 4, you get advises to look some Notes, based on problem/keywords you enter in describing your issue. 4 Notes is pop in my case, 3 of them was some usual general Note, but one does not ! :-)
One was: How To Change the Password of the Database User Sysman (DB Control Repository Schema) [ID 259379.1]

Waaaiit ?!!!

 I WAS change pass for SYSMAN and write SR about mail alerting !! Of course, after reading a Note, I found that, when you change SYSMAN pass, you need to stop/reset EM console !! Because my DB is, it was enough to : (before you must manually change file) After start EM console, my mailbox get 25 messages immediately ! :-) And, finally, I change SYSMAN user again to DEFAULT profile, and everything was ok, without locking!

What I learned today:
 1) Always backup old password hash when you change user password :-(
2) Use Metalink, it is there for us :-)

Thursday, February 9, 2012

RAC crash and ora-27504 error

A few days ago, at customer site, RAC 10g goes down. Rac has 2 nodes, rac1, rac2 (asm).
Happened that on rac2 node cpu fan stop working, all local disks (2 in raid1) was broken, so I manage to successfully connect to rac1.

On rac1, CRS was healthy, but, ASM and DB instance was down ??

After a little investigating, find in ASM alert log error something about "instance stopped ** ORA-27504"!
I tried to manually start ASM, but get this:

After reading Metalink and forums, find that it could be something with interconnect.
This was helpfully:  RAC Single Instance (ASM) startup fails with ORA-27300/ORA-27301/ORA-27302/ORA-27303 [ID 331934.1]
From Note:
"This is caused by the private NIC is not connected to a switch. The NIC driver detects that there is no cable connected and, as oracle tests the state of the NIC, the ASM instance startup fails with above error. Similarly any RAC database single instance would also fails with the same error stack."

So, then we focus on network and system logs, and find out that 2 days before was some switching between interconnect network interface besause failure.
After calling customer to check, they report to us that switch is also go away ! :-)
Ok, while customer was in finding a new switch, I wanted to try start that ASM instance, like the error said: set _disable_interface_checking = TRUE
After change init+ASM1.ora, and startup command, I get error about wrong init.ora!
Fortunately, switch was replaced with new one in short time, so, I start ASM instance after that without problems. DB instance also started after that, and everybody was happy.

But, is bugging me that can I somehow use that parameter and in this case to start ASM without looking to interface!
So, after lot off Google and reading, I opened SR on Metalink and find out this:

1) You can not modify it manually and put it in init.ora.
2) You should first fix the network interface/switch issues then restart the db and asm then issue the below command for future purpose.

3) Run it for both asm and database instance

And, in the future, you should not get this particular error.

Every day we learn something, right ? :-)