This is a little note to myself on how to fix a corrupt spfile in clustered ASM. I hope you find it useful, too.
Let’s assume you made a change to the ASM (server) parameter file that causes an issue. You are most likely to notice this once CRS is restarted but parts of the stack fail to come up. If “crsctl check crs” reports any component as not started, you can try to find out where in the bootstrap process you are stuck. Here is the output from my system:
[root@rac12pri1 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE OFFLINE STABLE
ora.cluster_interconnect.haip
1 ONLINE OFFLINE rac12pri1 STARTING
ora.crf
1 ONLINE OFFLINE STABLE
ora.crsd
1 ONLINE OFFLINE STABLE
ora.cssd
1 ONLINE ONLINE rac12pri1 STABLE
ora.cssdmonitor
1 ONLINE ONLINE rac12pri1 STABLE
ora.ctssd
1 ONLINE ONLINE rac12pri1 OBSERVER,STABLE
ora.diskmon
1 OFFLINE OFFLINE STABLE
ora.drivers.acfs
1 ONLINE ONLINE rac12pri1 STABLE
ora.evmd
1 ONLINE INTERMEDIATE rac12pri1 STABLE
ora.gipcd
1 ONLINE ONLINE rac12pri1 STABLE
ora.gpnpd
1 ONLINE ONLINE rac12pri1 STABLE
ora.mdnsd
1 ONLINE ONLINE rac12pri1 STABLE
ora.storage
1 ONLINE OFFLINE STABLE
--------------------------------------------------------------------------------
[root@rac12pri1 ~]#
I noticed that lots of components are not started. If you are interested in the startup order and the dependencies between processes, they are documented in the Clusterware Administration and Deployment Guide – Chapter 1, Figure 1-2.
Another useful piece of information is the Clusterware alert.log. Unlike Oracle Clusterware 11.2, where the log information resided under $GRID_HOME, the 12c CRS logs have moved to the ADR. A quick look at the alert.log showed:
2015-07-28 09:16:51.247 [OCSSD(11611)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 11611
2015-07-28 09:16:52.347 [OCSSD(11611)]CRS-1713: CSSD daemon is started in hub mode
2015-07-28 09:16:57.974 [OCSSD(11611)]CRS-1707: Lease acquisition for node rac12pri1 number 1 completed
2015-07-28 09:16:59.076 [OCSSD(11611)]CRS-1605: CSSD voting file is online: /dev/vdc1; details in /u01/app/oracle/diag/crs/rac12pri1/crs/trace/ocssd.trc.
2015-07-28 09:16:59.089 [OCSSD(11611)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1.
2015-07-28 09:17:08.198 [OCSSD(11611)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac12pri1 .
2015-07-28 09:17:10.276 [OCTSSD(11694)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 11694
2015-07-28 09:17:11.261 [OCTSSD(11694)]CRS-2403: The Cluster Time Synchronization Service on host rac12pri1 is in observer mode.
2015-07-28 09:17:11.469 [OCTSSD(11694)]CRS-2407: The new Cluster Time Synchronization Service reference node is host rac12pri1.
2015-07-28 09:17:11.469 [OCTSSD(11694)]CRS-2401: The Cluster Time Synchronization Service started on host rac12pri1.
2015-07-28 09:17:43.016 [ORAROOTAGENT(11376)]CRS-5019: All OCR locations are on ASM disk groups [CHM], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/rac12pri1/crs/trace/ohasd_orarootagent_root.trc".
2015-07-28 09:18:05.139 [OCSSD(11611)]CRS-1625: Node rac12pri2, number 2, was shut down
2015-07-28 09:18:05.139 [OCSSD(11611)]CRS-1625: Node rac12pri3, number 3, was shut down
2015-07-28 09:18:05.139 [OCSSD(11611)]CRS-1625: Node rac12pri4, number 4, was shut down
In other words, CSSD has found the block device I use for the voting files and concluded its initial work. However, the Oracle root agent (orarootagent) cannot proceed, since none of the OCR locations on ASM can be opened. Checking the trace file at that particular time I can see where the problem is:
2015-07-28 09:17:42.989946*:kgfo.c@2846: kgfoCheckMount dg=CHM ok=0
2015-07-28 09:17:42.990045 : USRTHRD:3741497088: {0:9:3} -- trace dump on error exit --
2015-07-28 09:17:42.990057 : USRTHRD:3741497088: {0:9:3} Error [kgfoAl06] in [kgfokge] at kgfo.c:2850
2015-07-28 09:17:42.990067 : USRTHRD:3741497088: {0:9:3} ORA-15077: could not locate ASM instance serving a
required diskgroup
2015-07-28 09:17:42.990077 : USRTHRD:3741497088: {0:9:3} Category: 7
2015-07-28 09:17:42.990115 : USRTHRD:3741497088: {0:9:3} DepInfo: 15077
2015-07-28 09:17:42.990382 : USRTHRD:3741497088: {0:9:3} -- trace dump end --
2015-07-28 09:17:42.990408 :CLSDYNAM:3741497088: [ora.storage]{0:9:3} [start] retcode = 7, kgfoCheckMount(CHM)
2015-07-28 09:17:42.990423 :CLSDYNAM:3741497088: [ora.storage]{0:9:3} [start] (null) category: 7, operation:
kgfoAl06, loc: kgfokge, OS error: 15077,
other: ORA-15077: could not locate ASM instance serving a required diskgroup
So there is not a single ASM instance that could serve the required diskgroup. Hmmm… So maybe I have to back out the change I just made. I have developed a habit of creating backups (pfiles) of spfiles prior to implementing changes. But even if there is no backup of the spfile, I can still get the system back, and here are the steps I used. Just as with a database, I need to:
- Create a temporary pfile on the file system
- Start ASM using this temporary pfile
- Create a backup of my (bad) spfile from the ASM disk group
- Extract all parameters
- Create a proper pfile that I use to start the cluster with
- Convert that pfile to an spfile in ASM
Fixing the problem
The first step is to create a temporary pfile. Using the ASM instance’s alert.log, I can scroll up to a point in time before my change to check which parameters are needed. The following values are just an example; your settings will differ!
...
Using parameter settings in server-side spfile +CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125
System parameters with non-default values:
  large_pool_size          = 12M
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "/dev/vd*1"
  asm_diskgroups           = "DATA"
  asm_diskgroups           = "RECO"
  asm_power_limit          = 1
NOTE: remote asm mode is remote (mode 0x202; from cluster type)
Cluster communication is configured to use the following interface(s) for this instance
  169.254.106.70
  169.254.184.41
cluster interconnect IPC version: Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
...
The new pfile, /tmp/init+ASM1.ora, has the following contents:
large_pool_size = 12M
remote_login_passwordfile = "EXCLUSIVE"
asm_diskstring = "/dev/vd*1"
asm_diskgroups = "DATA"
asm_diskgroups = "RECO"
asm_power_limit = 1
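If the alert.log excerpt is long, the parameter section can be pulled out with a little awk rather than copied by hand. This is just a sketch under the assumption that the log follows the format shown above (parameters sit between the “System parameters with non-default values:” header and the first NOTE: line); the sample file below simulates that excerpt.

```shell
# Simulate the alert.log excerpt shown above (hypothetical sample file).
cat > /tmp/alert_excerpt.log <<'EOF'
Using parameter settings in server-side spfile +CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125
System parameters with non-default values:
  large_pool_size           = 12M
  remote_login_passwordfile = "EXCLUSIVE"
  asm_diskstring            = "/dev/vd*1"
  asm_diskgroups            = "DATA"
  asm_diskgroups            = "RECO"
  asm_power_limit           = 1
NOTE: remote asm mode is remote (mode 0x202; from cluster type)
EOF

# Keep only the lines between the "System parameters" header
# and the first NOTE: line, and write them as the temporary pfile.
awk '/System parameters with non-default values:/ {flag=1; next}
     /^NOTE:/ {flag=0}
     flag' /tmp/alert_excerpt.log > /tmp/init+ASM1.ora

cat /tmp/init+ASM1.ora
```

Remember to review the result and remove the offending parameter before using the pfile.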
I can now start the first ASM instance:
[oracle@rac12pri1 ~]$ sqlplus / as sysasm

SQL*Plus: Release 12.1.0.2.0 Production on Tue Jul 28 09:23:23 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup pfile='/tmp/init+ASM1.ora'
ASM instance started

Total System Global Area 1140850688 bytes
Fixed Size                  2933400 bytes
Variable Size            1112751464 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
SQL>
The alert.log also records the location of the spfile; you should back it up now (using asmcmd or any other tool). Using the backup, you should be able to reconstruct your spfile, but make sure to take the offending parameter out.
I decided to create the spfile as spfileASM.ora in ASM. I amended my temporary pfile with the settings from the recovered spfile and put it back into the cluster:
SQL> create spfile='+CHM/rac12pri/spfileASM.ora' from pfile='/tmp/init+ASM1.ora';

File created.
Why the name change? You cannot manually create files in ASM that have Oracle Managed File (OMF) names. Trying to create the spfile under its original name causes an error:
SQL> create spfile='+CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125' from pfile='/tmp/init+ASM1.ora';
create spfile='+CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125' from pfile='/tmp/init+ASM1.ora'
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file
+CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125
ORA-15177: cannot operate on system aliases
The really nice thing is that the new spfile location is reflected in the Grid Plug and Play (GPnP) profile immediately. The ASM alert.log showed:
2015-07-28 09:25:01.323000 +01:00
NOTE: updated gpnp profile ASM SPFILE to
NOTE: header on disk 0 advanced to format #2 using fcn 0.0
2015-07-28 09:25:58.332000 +01:00
NOTE: updated gpnp profile ASM diskstring: /dev/vd*1
NOTE: updated gpnp profile ASM diskstring: /dev/vd*1
NOTE: updated gpnp profile ASM SPFILE to +CHM/rac12pri/spfileASM.ora
And the XML profile is updated too (reformatted for better readability):
[oracle@rac12pri1 ~]$ gpnptool get -o-
<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile"
xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile"
xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd"
ProfileSequence="7" ClusterUId="886a0e42a...5d805357c76a"
ClusterName="rac12pri" PALocation="">
<gpnp:Network-Profile>
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.100.0" Adapter="eth0" Use="public"/>
<gpnp:Network id="net2" IP="192.168.101.0" Adapter="eth1" Use="cluster_interconnect"/>
<gpnp:Network id="net3" IP="192.168.102.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
</gpnp:HostNetwork>
</gpnp:Network-Profile>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/vd*1" SPFile="+CHM/rac12pri/spfileASM.ora" Mode="remote"/>
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">...</ds:Signature>
</gpnp:GPnP-Profile>
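Rather than eyeballing the whole XML, the SPFile attribute can be extracted with a one-liner. A quick sketch, assuming you saved the gpnptool output to a file first (simulated here with the ASM-Profile line from the profile above):

```shell
# Simulate saving the profile, e.g. "gpnptool get -o /tmp/gpnp_profile.xml"
# (only the line we care about, taken from the output above).
cat > /tmp/gpnp_profile.xml <<'EOF'
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/vd*1" SPFile="+CHM/rac12pri/spfileASM.ora" Mode="remote"/>
EOF

# Pull out the value of the SPFile attribute.
sed -n 's/.*SPFile="\([^"]*\)".*/\1/p' /tmp/gpnp_profile.xml
```

If the printed path matches the spfile you just created, the profile update went through.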
This should be it: the correct values have been restored, the spfile is back on shared storage, and I should be able to start the cluster with this combination. After issuing the stop/start commands to CRS, all was indeed well:
[root@rac12pri1 ~]# crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@rac12pri1 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr
ONLINE ONLINE rac12pri1 STABLE
ora.CHM.dg
ONLINE ONLINE rac12pri1 STABLE
ora.DATA.dg
ONLINE ONLINE rac12pri1 STABLE
ora.LISTENER.lsnr
ONLINE ONLINE rac12pri1 STABLE
ora.RECO.dg
ONLINE ONLINE rac12pri1 STABLE
ora.net1.network
ONLINE ONLINE rac12pri1 STABLE
ora.ons
ONLINE ONLINE rac12pri1 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE rac12pri1 STABLE
ora.LISTENER_SCAN2.lsnr
1 ONLINE ONLINE rac12pri1 STABLE
ora.LISTENER_SCAN3.lsnr
1 ONLINE ONLINE rac12pri1 STABLE
ora.MGMTLSNR
1 ONLINE ONLINE rac12pri1 169.254.1.137 192.16
8.101.10 192.168.102
.10,STABLE
ora.asm
1 ONLINE ONLINE rac12pri1 STABLE
2 ONLINE OFFLINE STABLE
3 ONLINE OFFLINE STABLE
ora.cdb.db
1 OFFLINE OFFLINE Instance Shutdown,ST
ABLE
2 OFFLINE OFFLINE STABLE
3 OFFLINE OFFLINE STABLE
4 OFFLINE OFFLINE STABLE
ora.cvu
1 ONLINE ONLINE rac12pri1 STABLE
ora.mgmtdb
1 ONLINE ONLINE rac12pri1 Open,STABLE
ora.ncdb.db
1 ONLINE ONLINE rac12pri1 Open,STABLE
2 ONLINE OFFLINE STABLE
3 ONLINE OFFLINE STABLE
4 ONLINE OFFLINE STABLE
ora.ncdb.fotest.svc
1 ONLINE OFFLINE STABLE
2 ONLINE ONLINE rac12pri1 STABLE
ora.oc4j
1 ONLINE ONLINE rac12pri1 STABLE
ora.rac12pri1.vip
1 ONLINE ONLINE rac12pri1 STABLE
ora.rac12pri2.vip
1 ONLINE INTERMEDIATE rac12pri1 FAILED OVER,STABLE
ora.rac12pri3.vip
1 ONLINE INTERMEDIATE rac12pri1 FAILED OVER,STABLE
ora.rac12pri4.vip
1 ONLINE INTERMEDIATE rac12pri1 FAILED OVER,STABLE
ora.scan1.vip
1 ONLINE ONLINE rac12pri1 STABLE
ora.scan2.vip
1 ONLINE ONLINE rac12pri1 STABLE
ora.scan3.vip
1 ONLINE ONLINE rac12pri1 STABLE
--------------------------------------------------------------------------------
Time to start Clusterware on the other nodes and to report “We are back and running” :)
Reference
- How to start up the ASM instance when the spfile is misconstrued or lost ? (Doc ID 1313657.1)
- Clusterware Administration and Deployment Guide – Troubleshooting