This is a little note to myself on how to fix a corrupt spfile in clustered ASM. I hope you find it useful, too.
Let’s assume you made a change to the ASM (server) parameter file that causes an issue. You are most likely to notice this once CRS is restarted but parts of the stack fail to come up. If “crsctl check crs” reports any component as not started, you can try to find out where in the bootstrap process you are stuck. Here is the output from my system:
[root@rac12pri1 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE OFFLINE STABLE
ora.cluster_interconnect.haip
1 ONLINE OFFLINE rac12pri1 STARTING
ora.crf
1 ONLINE OFFLINE STABLE
ora.crsd
1 ONLINE OFFLINE STABLE
ora.cssd
1 ONLINE ONLINE rac12pri1 STABLE
ora.cssdmonitor
1 ONLINE ONLINE rac12pri1 STABLE
ora.ctssd
1 ONLINE ONLINE rac12pri1 OBSERVER,STABLE
ora.diskmon
1 OFFLINE OFFLINE STABLE
ora.drivers.acfs
1 ONLINE ONLINE rac12pri1 STABLE
ora.evmd
1 ONLINE INTERMEDIATE rac12pri1 STABLE
ora.gipcd
1 ONLINE ONLINE rac12pri1 STABLE
ora.gpnpd
1 ONLINE ONLINE rac12pri1 STABLE
ora.mdnsd
1 ONLINE ONLINE rac12pri1 STABLE
ora.storage
1 ONLINE OFFLINE STABLE
--------------------------------------------------------------------------------
[root@rac12pri1 ~]#
I noticed that lots of components are not started. If you are interested in the startup order and the dependencies between processes, they are documented in the Clusterware Administration and Deployment Guide – Chapter 1, Figure 1-2.
Another useful piece of information is the Clusterware alert.log. Unlike Oracle Clusterware 11.2, where the log information resided under $GRID_HOME, the 12c CRS logs have moved to the ADR. A quick look at the alert.log showed:
2015-07-28 09:16:51.247 [OCSSD(11611)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 11611
2015-07-28 09:16:52.347 [OCSSD(11611)]CRS-1713: CSSD daemon is started in hub mode
2015-07-28 09:16:57.974 [OCSSD(11611)]CRS-1707: Lease acquisition for node rac12pri1 number 1 completed
2015-07-28 09:16:59.076 [OCSSD(11611)]CRS-1605: CSSD voting file is online: /dev/vdc1; details in /u01/app/oracle/diag/crs/rac12pri1/crs/trace/ocssd.trc.
2015-07-28 09:16:59.089 [OCSSD(11611)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1.
2015-07-28 09:17:08.198 [OCSSD(11611)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac12pri1 .
2015-07-28 09:17:10.276 [OCTSSD(11694)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 11694
2015-07-28 09:17:11.261 [OCTSSD(11694)]CRS-2403: The Cluster Time Synchronization Service on host rac12pri1 is in observer mode.
2015-07-28 09:17:11.469 [OCTSSD(11694)]CRS-2407: The new Cluster Time Synchronization Service reference node is host rac12pri1.
2015-07-28 09:17:11.469 [OCTSSD(11694)]CRS-2401: The Cluster Time Synchronization Service started on host rac12pri1.
2015-07-28 09:17:43.016 [ORAROOTAGENT(11376)]CRS-5019: All OCR locations are on ASM disk groups [CHM], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/rac12pri1/crs/trace/ohasd_orarootagent_root.trc".
2015-07-28 09:18:05.139 [OCSSD(11611)]CRS-1625: Node rac12pri2, number 2, was shut down
2015-07-28 09:18:05.139 [OCSSD(11611)]CRS-1625: Node rac12pri3, number 3, was shut down
2015-07-28 09:18:05.139 [OCSSD(11611)]CRS-1625: Node rac12pri4, number 4, was shut down
In other words, CSSD has found the block device I use for the voting files and concluded its initial work. However, the Oracle root agent (orarootagent) cannot proceed, since none of the OCR locations on ASM can be opened. Checking the trace file at that particular time I can see where the problem is:
2015-07-28 09:17:42.989946*:kgfo.c@2846: kgfoCheckMount dg=CHM ok=0
2015-07-28 09:17:42.990045 : USRTHRD:3741497088: {0:9:3} -- trace dump on error exit --
2015-07-28 09:17:42.990057 : USRTHRD:3741497088: {0:9:3} Error [kgfoAl06] in [kgfokge] at kgfo.c:2850
2015-07-28 09:17:42.990067 : USRTHRD:3741497088: {0:9:3} ORA-15077: could not locate ASM instance serving a
required diskgroup
2015-07-28 09:17:42.990077 : USRTHRD:3741497088: {0:9:3} Category: 7
2015-07-28 09:17:42.990115 : USRTHRD:3741497088: {0:9:3} DepInfo: 15077
2015-07-28 09:17:42.990382 : USRTHRD:3741497088: {0:9:3} -- trace dump end --
2015-07-28 09:17:42.990408 :CLSDYNAM:3741497088: [ora.storage]{0:9:3} [start] retcode = 7, kgfoCheckMount(CHM)
2015-07-28 09:17:42.990423 :CLSDYNAM:3741497088: [ora.storage]{0:9:3} [start] (null) category: 7, operation:
kgfoAl06, loc: kgfokge, OS error: 15077,
other: ORA-15077: could not locate ASM instance serving a required diskgroup
So there is not a single ASM instance that could serve the required diskgroup. Hmmm… So maybe I have to back out the change I just made. I have developed a habit of creating backups (pfiles) of spfiles prior to implementing changes. But even if there is no backup of the spfile, I can still get the system back, and here are the steps I used. Just as with a database, I need to:
- Create a temporary pfile on the file system
- Start ASM using this temporary pfile
- Create a backup of my (bad) spfile from the ASM disk group
- Extract all parameters
- Create a proper pfile that I use to start the cluster with
- Convert that pfile to an spfile in ASM
Fixing the problem
The first step is to create a temporary pfile. Using the ASM instance’s alert.log, I can scroll up to a point in time before my change to check which parameters are needed. The following values are just an example; your settings will differ!
...
Using parameter settings in server-side spfile +CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125
System parameters with non-default values:
  large_pool_size          = 12M
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "/dev/vd*1"
  asm_diskgroups           = "DATA"
  asm_diskgroups           = "RECO"
  asm_power_limit          = 1
NOTE: remote asm mode is remote (mode 0x202; from cluster type)
Cluster communication is configured to use the following interface(s) for this instance
  169.254.106.70
  169.254.184.41
cluster interconnect IPC version: Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
...
The new pfile, /tmp/init+ASM1.ora, has the following contents:
large_pool_size = 12M
remote_login_passwordfile = "EXCLUSIVE"
asm_diskstring = "/dev/vd*1"
asm_diskgroups = "DATA"
asm_diskgroups = "RECO"
asm_power_limit = 1
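If the alert.log excerpt is long, the parameter section can be pulled out with a little awk rather than copied by hand. This is just a sketch under the assumption that the log follows the format shown above (parameters sit between the “System parameters with non-default values:” header and the first NOTE: line); the sample file below simulates that excerpt.

```shell
# Simulate the alert.log excerpt shown above (hypothetical sample file).
cat > /tmp/alert_excerpt.log <<'EOF'
Using parameter settings in server-side spfile +CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125
System parameters with non-default values:
  large_pool_size           = 12M
  remote_login_passwordfile = "EXCLUSIVE"
  asm_diskstring            = "/dev/vd*1"
  asm_diskgroups            = "DATA"
  asm_diskgroups            = "RECO"
  asm_power_limit           = 1
NOTE: remote asm mode is remote (mode 0x202; from cluster type)
EOF

# Keep only the lines between the "System parameters" header
# and the first NOTE: line, and write them as the temporary pfile.
awk '/System parameters with non-default values:/ {flag=1; next}
     /^NOTE:/ {flag=0}
     flag' /tmp/alert_excerpt.log > /tmp/init+ASM1.ora

cat /tmp/init+ASM1.ora
```

Remember to review the result and remove the offending parameter before using the pfile.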
I can now start the first ASM instance:
[oracle@rac12pri1 ~]$ sqlplus / as sysasm

SQL*Plus: Release 12.1.0.2.0 Production on Tue Jul 28 09:23:23 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup pfile='/tmp/init+ASM1.ora'
ASM instance started

Total System Global Area 1140850688 bytes
Fixed Size                  2933400 bytes
Variable Size            1112751464 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
SQL>
The alert.log also records the location of the spfile; you should back it up now (using asmcmd or any other tool). Using the backup, you should be able to reconstruct your spfile, but make sure to take the offending parameter out.
I decided to create the spfile as spfileASM.ora in ASM. I amended my temporary pfile with the settings from the recovered spfile and put it back into the cluster:
SQL> create spfile='+CHM/rac12pri/spfileASM.ora' from pfile='/tmp/init+ASM1.ora';

File created.
Why the name change? You cannot manually create files in ASM that have Oracle Managed File (OMF) names. Trying to create the spfile under its original name causes an error:
SQL> create spfile='+CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125' from pfile='/tmp/init+ASM1.ora';
create spfile='+CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125' from pfile='/tmp/init+ASM1.ora'
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file
+CHM/rac12pri/ASMPARAMETERFILE/registry.253.885820125
ORA-15177: cannot operate on system aliases
The really nice thing is that the new spfile location is reflected in the Grid Plug and Play (GPnP) profile immediately. The ASM alert.log showed:
2015-07-28 09:25:01.323000 +01:00
NOTE: updated gpnp profile ASM SPFILE to
NOTE: header on disk 0 advanced to format #2 using fcn 0.0
2015-07-28 09:25:58.332000 +01:00
NOTE: updated gpnp profile ASM diskstring: /dev/vd*1
NOTE: updated gpnp profile ASM diskstring: /dev/vd*1
NOTE: updated gpnp profile ASM SPFILE to +CHM/rac12pri/spfileASM.ora
And the XML profile is updated too (reformatted for better readability):
[oracle@rac12pri1 ~]$ gpnptool get -o-
<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile"
xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile"
xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd"
ProfileSequence="7" ClusterUId="886a0e42a...5d805357c76a"
ClusterName="rac12pri" PALocation="">
<gpnp:Network-Profile>
<gpnp:HostNetwork id="gen" HostName="*">
<gpnp:Network id="net1" IP="192.168.100.0" Adapter="eth0" Use="public"/>
<gpnp:Network id="net2" IP="192.168.101.0" Adapter="eth1" Use="cluster_interconnect"/>
<gpnp:Network id="net3" IP="192.168.102.0" Adapter="eth2" Use="asm,cluster_interconnect"/>
</gpnp:HostNetwork>
</gpnp:Network-Profile>
<orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/vd*1" SPFile="+CHM/rac12pri/spfileASM.ora" Mode="remote"/>
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">...</ds:Signature>
</gpnp:GPnP-Profile>
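Rather than eyeballing the whole XML, the SPFile attribute can be extracted with a one-liner. A quick sketch, assuming you saved the gpnptool output to a file first (simulated here with the ASM-Profile line from the profile above):

```shell
# Simulate saving the profile, e.g. "gpnptool get -o /tmp/gpnp_profile.xml"
# (only the line we care about, taken from the output above).
cat > /tmp/gpnp_profile.xml <<'EOF'
<orcl:ASM-Profile id="asm" DiscoveryString="/dev/vd*1" SPFile="+CHM/rac12pri/spfileASM.ora" Mode="remote"/>
EOF

# Pull out the value of the SPFile attribute.
sed -n 's/.*SPFile="\([^"]*\)".*/\1/p' /tmp/gpnp_profile.xml
```

If the printed path matches the spfile you just created, the profile update went through.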
This should be it: the correct values have been restored, the spfile is back on shared storage, and I should be able to start the cluster with this combination. After issuing the stop/start commands to CRS, all was indeed well:
[root@rac12pri1 ~]# crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@rac12pri1 ~]# crsctl stat res -t
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr
ONLINE ONLINE rac12pri1 STABLE
ora.CHM.dg
ONLINE ONLINE rac12pri1 STABLE
ora.DATA.dg
ONLINE ONLINE rac12pri1 STABLE
ora.LISTENER.lsnr
ONLINE ONLINE rac12pri1 STABLE
ora.RECO.dg
ONLINE ONLINE rac12pri1 STABLE
ora.net1.network
ONLINE ONLINE rac12pri1 STABLE
ora.ons
ONLINE ONLINE rac12pri1 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE rac12pri1 STABLE
ora.LISTENER_SCAN2.lsnr
1 ONLINE ONLINE rac12pri1 STABLE
ora.LISTENER_SCAN3.lsnr
1 ONLINE ONLINE rac12pri1 STABLE
ora.MGMTLSNR
1 ONLINE ONLINE rac12pri1 169.254.1.137 192.16
8.101.10 192.168.102
.10,STABLE
ora.asm
1 ONLINE ONLINE rac12pri1 STABLE
2 ONLINE OFFLINE STABLE
3 ONLINE OFFLINE STABLE
ora.cdb.db
1 OFFLINE OFFLINE Instance Shutdown,ST
ABLE
2 OFFLINE OFFLINE STABLE
3 OFFLINE OFFLINE STABLE
4 OFFLINE OFFLINE STABLE
ora.cvu
1 ONLINE ONLINE rac12pri1 STABLE
ora.mgmtdb
1 ONLINE ONLINE rac12pri1 Open,STABLE
ora.ncdb.db
1 ONLINE ONLINE rac12pri1 Open,STABLE
2 ONLINE OFFLINE STABLE
3 ONLINE OFFLINE STABLE
4 ONLINE OFFLINE STABLE
ora.ncdb.fotest.svc
1 ONLINE OFFLINE STABLE
2 ONLINE ONLINE rac12pri1 STABLE
ora.oc4j
1 ONLINE ONLINE rac12pri1 STABLE
ora.rac12pri1.vip
1 ONLINE ONLINE rac12pri1 STABLE
ora.rac12pri2.vip
1 ONLINE INTERMEDIATE rac12pri1 FAILED OVER,STABLE
ora.rac12pri3.vip
1 ONLINE INTERMEDIATE rac12pri1 FAILED OVER,STABLE
ora.rac12pri4.vip
1 ONLINE INTERMEDIATE rac12pri1 FAILED OVER,STABLE
ora.scan1.vip
1 ONLINE ONLINE rac12pri1 STABLE
ora.scan2.vip
1 ONLINE ONLINE rac12pri1 STABLE
ora.scan3.vip
1 ONLINE ONLINE rac12pri1 STABLE
--------------------------------------------------------------------------------
Time to start Clusterware on the other nodes and to report “We are back and running” :)
Reference
- How to start up the ASM instance when the spfile is misconstrued or lost ? (Doc ID 1313657.1)
- Clusterware Administration and Deployment Guide – Troubleshooting