Lately I started implementing Data Guard - Physical Standby - as a  DRP environment for our production E-Businsess Suite database and I must  share with you one issue I encountered during implementation. 
I chose one of our test environments as a primary instance and I used a new server, which was prepared to the standby database in production, as the server for the standby database in test. Both are Red-Hat enterprise linux 4.
The implementation process went fast with no special issues (at lease I thought so...), everything seems to work fine, archived logs were transmitted from the primary server to the standby server and successfully applied on the standby database. I even executed switchover to the standby server (both database and application tier), and switchover back to the primary server with no problems.
The standby database was configured for maximum performance mode, I also created standby redo log files and LGWR was set to asynchronous (ASYNC) network transmission.
The exact setting from init.ora file:
At this stage, when the major  part of the implementation had been done, I found some time to deal with  some other issues, like interfaces to other systems, scripts, configure  rsync for concurrent log files, etc... , and some modifications to the  setup document I wrote during implementation.
While doing those other issues, I left the physical standby instance active so archive log files are transmitted and applied on the standby instance. After a couple of hours I noticed the following error in the primary database alert log file:
I don't  remember if I've ever had a corruption in redo log file before...  
What is wrong?! Is it something with the physical standby instance ?? Actually, if it's something with the standby instance I would have expected for a corruption in the standby redo log files not the primary's..
The primary instance resides on a Netapp volume, so I checked the mount option in /etc/fstab but they were fine. I asked our infrastructure team to check if something went wrong with the network during the time I got the corruption, but they reported that there was no error or something unusual.
Ok, I had no choice but to reconstruct the physical standby database, since when an archive log file is missing, the standby database is out of sync'. I set the 'log_archive_dest_state_2' to defer so no further archive log will be transferred to the standby server, cleared the corrupted redo log files (alter database clear unarchived logfile 'logfile.log') and reconstruct the physical standby database.
Meanwhile (copy database files takes long...), I checked documentation again, maybe I missed something, maybe I configured something wrong.. I have read a lot and didn't find anything that can shed some light on this issue.
At this stage, the standby was up and ready. First, I held up the redo transport service (log_archive_dest_state_2='defer') to see if I'll get a corruption when standby is off. After one or two days with no corruption I activated the standby.
Then I saw the following sentence in Oracle® Data Guard Concepts and Administration 10g Release 2 (10.2):
"All members of a Data Guard configuration must run an Oracle image that is built for the same platform. For example, this means a Data Guard configuration with a primary database on a 32-bit Linux on Intel system can have a standby database that is configured on a 32-bit Linux on Intel system"
One moment, I thought to myself, the standby server is based on AMD processors and the primary server is based on Intel's.. Is it the problem?!
When talking about same platform, is the meaning same processors also? Isn't it sufficient to have same 32 bit OS on x86 machines?
Weird but I had to check it...
Meanwhile, I got a corruption in redo log file again which assured there is a real problem and it wasn't accidentally.
So I used another AMD based server (identical to the standby server) and started all over again – primary and standby instances. After two or three days with no corruption I started to believe the difference in the processors was the problem. But one day later I got a corruption again (Oh no…)
I must say that on the one hand I was very frustrated, but on the other hand it was a relief to know it's not the difference in the processors.
It was so clear that when I'll find out the problem it will be something stupid..
So it is not the processors, not the OS and not the network. What else can it be?!
And here my familiarity with the "filesystemio_option" initialization parameter begins (thanks to Oracle Support!). I don't know how I missed this note before, but all is written here - Note 437005.1: Redo Log Corruption While Using Netapps Filesystem With Default Setting of Filesystemio_options Parameter.
When the redo log files are on a netapp volume, "filesystemio_options" must be set to "directio" (or "setall"). When "filesystemio_options" is set to "none" (like my instance before), read/writes to the redo log files are using the OS buffer cache. Since netapp storage is based on NFS (which is stateless protocol), when performing asynchronous writing over the network, the consistency of writes is not guaranteed. Some writes can be lost. By setting the "filesystemio_options" to "directio", writes bypasses the OS cache layer so no write will be lost.
Needless to say that when I set it to "directio" everything was fine and I haven't gotten any corruption again.
I chose one of our test environments as a primary instance and I used a new server, which was prepared to the standby database in production, as the server for the standby database in test. Both are Red-Hat enterprise linux 4.
The implementation process went fast with no special issues (at lease I thought so...), everything seems to work fine, archived logs were transmitted from the primary server to the standby server and successfully applied on the standby database. I even executed switchover to the standby server (both database and application tier), and switchover back to the primary server with no problems.
The standby database was configured for maximum performance mode, I also created standby redo log files and LGWR was set to asynchronous (ASYNC) network transmission.
The exact setting from init.ora file:
| log_archive_dest_2='SERVICE=[SERVICE_NAME] LGWR ASYNC=20480 OPTIONAL REOPEN=15 MAX_FAILURE=10 NET_TIMEOUT=30' | 
While doing those other issues, I left the physical standby instance active so archive log files are transmitted and applied on the standby instance. After a couple of hours I noticed the following error in the primary database alert log file:
| ARC3: Log  corruption near block 146465 change 8181238407160 time ? Mon Mar 2 13:04:43 2009 Errors in file [ORACLE_HOME]/admin/[CONTEXT_NAME]/bdump/[sid]_arc3_16575.trc: ORA-00354: corrupt redo log block header ORA-00353: log corruption near block 146465 change 8181238407160 time 02/03/2009 11:57:54 ORA-00312: online log 3 thread 1: '[logfile_dir]/redolog3.ora' ARC3: All Archive destinations made inactive due to error 354 Mon Mar 2 13:04:44 2009 ARC3: Closing local archive destination LOG_ARCHIVE_DEST_1: '[archivelog_dir]/arch_[xxxxx].arc' (error 354)([SID]) Committing creation of archivelog '[archivelog_dir]/arch_[xxxxx].arc' (error 354) ARCH: Archival stopped, error occurred. Will continue retrying Mon Mar 2 13:04:45 2009 ORACLE Instance [SID] - Archival Error | 
What is wrong?! Is it something with the physical standby instance ?? Actually, if it's something with the standby instance I would have expected for a corruption in the standby redo log files not the primary's..
The primary instance resides on a Netapp volume, so I checked the mount option in /etc/fstab but they were fine. I asked our infrastructure team to check if something went wrong with the network during the time I got the corruption, but they reported that there was no error or something unusual.
Ok, I had no choice but to reconstruct the physical standby database, since when an archive log file is missing, the standby database is out of sync'. I set the 'log_archive_dest_state_2' to defer so no further archive log will be transferred to the standby server, cleared the corrupted redo log files (alter database clear unarchived logfile 'logfile.log') and reconstruct the physical standby database.
Meanwhile (copy database files takes long...), I checked documentation again, maybe I missed something, maybe I configured something wrong.. I have read a lot and didn't find anything that can shed some light on this issue.
At this stage, the standby was up and ready. First, I held up the redo transport service (log_archive_dest_state_2='defer') to see if I'll get a corruption when standby is off. After one or two days with no corruption I activated the standby.
Then I saw the following sentence in Oracle® Data Guard Concepts and Administration 10g Release 2 (10.2):
"All members of a Data Guard configuration must run an Oracle image that is built for the same platform. For example, this means a Data Guard configuration with a primary database on a 32-bit Linux on Intel system can have a standby database that is configured on a 32-bit Linux on Intel system"
One moment, I thought to myself, the standby server is based on AMD processors and the primary server is based on Intel's.. Is it the problem?!
When talking about same platform, is the meaning same processors also? Isn't it sufficient to have same 32 bit OS on x86 machines?
Weird but I had to check it...
Meanwhile, I got a corruption in redo log file again which assured there is a real problem and it wasn't accidentally.
So I used another AMD based server (identical to the standby server) and started all over again – primary and standby instances. After two or three days with no corruption I started to believe the difference in the processors was the problem. But one day later I got a corruption again (Oh no…)
I must say that on the one hand I was very frustrated, but on the other hand it was a relief to know it's not the difference in the processors.
It was so clear that when I'll find out the problem it will be something stupid..
So it is not the processors, not the OS and not the network. What else can it be?!
And here my familiarity with the "filesystemio_option" initialization parameter begins (thanks to Oracle Support!). I don't know how I missed this note before, but all is written here - Note 437005.1: Redo Log Corruption While Using Netapps Filesystem With Default Setting of Filesystemio_options Parameter.
When the redo log files are on a netapp volume, "filesystemio_options" must be set to "directio" (or "setall"). When "filesystemio_options" is set to "none" (like my instance before), read/writes to the redo log files are using the OS buffer cache. Since netapp storage is based on NFS (which is stateless protocol), when performing asynchronous writing over the network, the consistency of writes is not guaranteed. Some writes can be lost. By setting the "filesystemio_options" to "directio", writes bypasses the OS cache layer so no write will be lost.
Needless to say that when I set it to "directio" everything was fine and I haven't gotten any corruption again.
 
 
No comments:
Post a Comment