One of the many new manuals that showed up in Oracle Database 12.2 is the Oracle Autonomous Health Framework User’s Guide. This book contains a lot of information, some of which we already know from previous releases, and some new stuff as well. One of the new chapters caught my eye in particular:
“Proactively Detecting and Diagnosing Performance Issues for Oracle RAC”
Chapter 5 introduces a new feature called the Cluster Health Advisor (CHA) that should help you better understand the performance of your cluster. Really? Time to start the VMs and put it to the test.
My system is based on Oracle Linux 7.3 with UEK4. I created a 2 node 12.2.0.1.0 RAC; there weren't any patches out at the time of writing, so this feature might still change a little. My database, named CDB, features 2 PDBs: swingbench1 and swingbench2 (you can already guess where this is heading). It appears that my cluster is a "flex cluster" operating in standard mode. You noticed during configuration that you defined the nodes as "HUB nodes", didn't you? I have screenshots from my RAC installation that I'll put into a blog post when I find the time. In the meantime head over to oracle-base.com where you can find a RAC 12.2 installation walkthrough for Oracle Linux on VirtualBox.
The plan for this post is to overload the system deliberately to provoke the health advisor to give me recommendations about the cluster and/or the database.
The official documentation is found here: https://docs.oracle.com/database/122/ATNMS/monitoring-rac-with-cha.htm
How is the Cluster Health Advisor implemented?
As per the documentation, there is a Cluster Health Advisor daemon process on each node, and sure enough there is:
[oracle@rac122pri1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.DATA.dg
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.MGMT.dg
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.OCR.dg
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.RECO.dg
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.chad
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.net1.network
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.ons
               ONLINE  ONLINE       rac122pri1               STABLE
               ONLINE  ONLINE       rac122pri2               STABLE
ora.proxy_advm
               OFFLINE OFFLINE      rac122pri1               STABLE
               OFFLINE OFFLINE      rac122pri2               STABLE
...
The resource in question is “ora.chad”, and its profile defines dependencies to the GIMR or Grid Infrastructure Management Repository:
[oracle@rac122pri1 ~]$ crsctl stat res ora.chad -p | grep "DEPENDENCIES\b"
START_DEPENDENCIES=hard(global:ora.mgmtdb) pullup(global:ora.mgmtdb)
STOP_DEPENDENCIES=hard(global:intermediate:ora.mgmtdb)
[oracle@rac122pri1 ~]$
The documentation further states that the Cluster Health Advisor (CHA) daemon monitors cluster nodes automatically. Databases registered in Clusterware, however, must be explicitly registered with the monitoring daemon.
The main tool for interacting with CHA is chactl (there's a surprise). It can take quite a few options:
[oracle@rac122pri1 bin]$ chactl -h
CLSCH-3659 : invalid command line syntax
Usage:
chactl monitor cluster [-model <model_name> [-force]]
chactl monitor database -db <db_unique_name> [-model <model_name> [-force]]
chactl unmonitor database -db <db_unique_name>
chactl status [cluster|database [-db <db_unique_name>]] [-verbose]
chactl config [cluster|database -db <db_unique_name>]
chactl calibrate {cluster|database -db <db_unique_name>} -model <model_name> [-force]
        [-timeranges 'start=<timestamp>,end=<timestamp>,...']
        [-kpiset 'name=<kpi_name> min=<value> max=<value>,...']
  WHERE:
    -interval <interval> : interval is in hrs
    -timeranges 'start=<timestamp>,end=<timestamp>,...' :
        Timestamp must be in format 'YYYY-MM-DD HH24:MI:SS'
    KPI for db:
        CPUPERCENT    - CPU utilization             - Percent
        IOREAD        - Disk read                   - Mbyte/sec
        DBTIMEPERCALL - Database time per user call - usec/call
        IOWRITE       - Disk write                  - Mbyte/sec
        IOTHROUGHPUT  - Disk throughput             - IO/sec
    KPI for cluster:
        CPUPERCENT    - CPU utilization             - Percent
        IOREAD        - Disk read                   - Mbyte/sec
        IOWRITE       - Disk write                  - Mbyte/sec
        IOTHROUGHPUT  - Disk throughput             - IO/sec
chactl query diagnosis [-cluster|-db <db_unique_name>]
        [-start <timestamp> -end <timestamp>] [-htmlfile <file_name>]
chactl query model [-name <model_name> [-verbose]]
chactl query repository
chactl query calibration {-cluster|-db <db_unique_name>}
        [-timeranges 'start=<timestamp>,end=<timestamp>,...']
        [-kpiset 'name=<kpi_name> min=<value> max=<value>,...'] [-interval <interval>]
  WHERE:
    -interval <interval> : interval is in hrs
    -timeranges 'start=<timestamp>,end=<timestamp>,...' :
        Timestamp must be in format 'YYYY-MM-DD HH24:MI:SS'
    KPI for db:
        CPUPERCENT    - CPU utilization             - Percent
        IOREAD        - Disk read                   - Mbyte/sec
        DBTIMEPERCALL - Database time per user call - usec/call
        IOWRITE       - Disk write                  - Mbyte/sec
        IOTHROUGHPUT  - Disk throughput             - IO/sec
    KPI for cluster:
        CPUPERCENT    - CPU utilization             - Percent
        IOREAD        - Disk read                   - Mbyte/sec
        IOWRITE       - Disk write                  - Mbyte/sec
        IOTHROUGHPUT  - Disk throughput             - IO/sec
chactl remove model -name <model_name>
chactl rename model -from <model_name> -to <model_name>
chactl import model -name <model_name> -file <file_name> [-force]
chactl export model -name <model_name> -file <file_name>
chactl set maxretention -time <retention_time>
chactl resize repository -entities <number_of_entities> [-force | -eval]
[oracle@rac122pri1 bin]$
The output of the “help” command has been reformatted for readability.
And yes, the documentation is right: at first only the cluster is monitored.
[oracle@rac122pri1 ~]$ chactl status
monitoring nodes rac122pri2, rac122pri1
not monitoring databases

[oracle@rac122pri1 ~]$ chactl status -verbose
monitoring nodes rac122pri2, rac122pri1 using model DEFAULT_CLUSTER
not monitoring databases
[oracle@rac122pri1 ~]$
Monitoring my database
In the next step I wanted to add my database to the monitoring configuration. That is easy:
[oracle@rac122pri1 ~]$ chactl monitor database -db CDB
[oracle@rac122pri1 ~]$ chactl status -verbose
monitoring nodes rac122pri2, rac122pri1 using model DEFAULT_CLUSTER
monitoring database cdb, instances CDB2, CDB1 using model DEFAULT_DB
Creating load and causing trouble
With the monitoring in place it was time to create some trouble. I started a swingbench run that deliberately overloaded the CPUs on my cluster. To add to the fun, I rebooted a node during the swingbench execution. Although that didn't stop processing (I used the connection pool method to establish sessions), it caused all of my sessions to pile up on the surviving node. The node held its head up, but performance took a nosedive.
Any recommendations?
The Cluster Health Advisor can be interrogated using the "query diagnosis" command. After causing sufficient trouble I asked for a first assessment. Interestingly enough, I queried the repository right after rebooting node 2. I hadn't checked at the time, but node 2 was running the GIMR – hence this output:
[oracle@rac122pri1 bin]$ chactl query diagnosis -db CDB \
> -start "2017-03-15 05:20:50" -end "2017-03-15 05:55:50"
CLSCH-2005 : Oracle Cluster Health Analysis Service (OCHAD) failed due to a
Grid Infrastructure Management Repository error.
Unable to start the Universal Connection Pool:
oracle.ucp.UniversalConnectionPoolException: Cannot get Connection from
Datasource: java.sql.SQLRecoverableException: Listener refused the connection
with the following error:
ORA-12514, TNS:listener does not currently know of service requested in
connect descriptor
After a little while the failed resources were up again, and I could get some output:
[oracle@rac122pri1 bin]$ chactl query diagnosis -db CDB \
> -start "2017-03-15 05:20:50" -end "2017-03-15 05:55:50"
2017-03-15 05:54:40.0  Database cdb  DB Recovery Read IO Performance (CDB1) [detected]
2017-03-15 05:55:00.0  Database cdb  DB Recovery Read IO Performance (CDB1) [cleared]

Problem: DB Recovery Read IO Performance
Description: CHA detected that recovery reads are slower than expected.
Cause: The Cluster Health Advisor (CHA) detected that a database instance left
the cluster unexpectedly and instance recovery had a larger impact on
performance than expected because more database blocks than expected had to be
recovered. The data blocks to which recovery was applied were read from disk
or from the global cache.
Action: Decrease the target for the mean time to recover ( MTTR target ) to
increase the incremental checkpointing rate.
[oracle@rac122pri1 bin]$
Spot on!
SQL> show parameter mttr

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_mttr_target               integer     0
SQL>
This is typical for DBCA-created databases based on the General_Purpose.dbc template. Good catch!
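Acting on the advice is a one-liner. A minimal sketch, assuming a target of 60 seconds is acceptable for your workload (the value here is purely illustrative, not a recommendation):

```shell
# Set an MTTR target of 60 seconds on all instances (illustrative value).
# Any non-zero FAST_START_MTTR_TARGET enables fast-start checkpointing.
sqlplus / as sysdba <<'EOF'
ALTER SYSTEM SET fast_start_mttr_target = 60 SCOPE=BOTH SID='*';
EOF
```

Keep in mind that a more aggressive checkpointing rate trades recovery time for extra write I/O during normal operation.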
Now the system was still struggling under load:
SQL> select count(*), a.inst_id, b.name from gv$session a, v$pdbs b
  2  where a.con_id = b.con_id and a.username = 'SOE' group by a.inst_id, b.name;

  COUNT(*)    INST_ID NAME
---------- ---------- ------------------------------
        75          1 SWINGBENCH1
The pool was set to 75 sessions, which should have been split between both nodes. With the second node down, however, there was only one node left for them to run on.
The operating system didn’t look too happy either:
[root@rac122pri1 trace]# top -b -n 1 | head -n 15
top - 06:05:08 up 21:14,  2 users,  load average: 15.47, 13.66, 8.69
Tasks: 1403 total,  10 running, 1393 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.0 us,  2.2 sy,  0.0 ni, 92.5 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16167836 total,  5510468 free,  4714544 used,  5942824 buff/cache
KiB Swap:   782332 total,   782332 free,        0 used.  7338672 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10339 root      20   0  158952   5516   3608 R  16.7  0.0   0:00.03 top
10341 root      20   0   28604   5428   3728 R  11.1  0.0   0:00.02 perl
20798 oracle    20   0 2545588 812828 805112 S  11.1  5.0   0:40.34 oracle_20798_cd
20824 oracle    20   0 2543552 817728 809924 S  11.1  5.1   0:40.01 oracle_20824_cd
20828 oracle    20   0 2545592 810800 803068 S  11.1  5.0   0:39.88 oracle_20828_cd
24365 oracle    20   0 2543524 588452 581008 S  11.1  3.6   0:22.72 oracle_24365_cd
24457 oracle    20   0 2543748 596728 588752 S  11.1  3.7   0:22.56 oracle_24457_cd
28569 oracle    20   0 2543524 552440 545016 S  11.1  3.4   0:20.18 oracle_28569_cd
[root@rac122pri1 trace]#
Although I guess that is not too bad for a VM with only 4 cores! Inevitably, though, there was trouble, although it appears to have passed:
[oracle@rac122pri1 bin]$ chactl query diagnosis -db CDB \
> -start "2017-03-15 05:20:00" -end "2017-03-15 06:20:00"
2017-03-15 05:54:40.0  Database cdb  DB Recovery Read IO Performance (CDB1) [detected]
2017-03-15 05:55:00.0  Database cdb  DB Recovery Read IO Performance (CDB1) [cleared]
2017-03-15 05:58:55.0  Database cdb  Long Waits for Global Cache Message Requests (CDB1) [detected]
2017-03-15 05:58:55.0  Database cdb  Long Redo Log Sync Wait (CDB1) [detected]
2017-03-15 06:04:25.0  Database cdb  Long Waits for Global Cache Message Requests (CDB1) [cleared]
2017-03-15 06:04:25.0  Database cdb  Long Redo Log Sync Wait (CDB1) [cleared]

Problem: DB Recovery Read IO Performance
Description: CHA detected that recovery reads are slower than expected.
Cause: The Cluster Health Advisor (CHA) detected that a database instance left
the cluster unexpectedly and instance recovery had a larger impact on
performance than expected because more database blocks than expected had to be
recovered. The data blocks to which recovery was applied were read from disk
or from the global cache.
Action: Decrease the target for the mean time to recover ( MTTR target ) to
increase the incremental checkpointing rate.

Problem: Long Waits for Global Cache Message Requests
Description: CHA detected that global cache messages are taking a long time.
Cause: The Cluster Health Advisor (CHA) detected that global cache messages
have not completed and are blocking database processes. It is likely that
Global Cache Server Processes (LMS) on another node are unresponsive or are
waiting for CPU or Memory.
Action: Check whether CHA reports any issues related to severe performance
degradation caused by CPU or memory starvation on other nodes in the cluster
and follow the suggestions to resolve those problems.

Problem: Long Redo Log Sync Wait
Description: CHA detected that commits are blocked for several seconds.
Cause: The Cluster Health Advisor (CHA) detected that commits are blocked for
several seconds because cluster commit propagation messages are very slow. It
is likely that a Global Cache Server Process (LMS) on another node is
unresponsive or is waiting for CPU or Memory.
Action: Check whether CHA reports any issues related to severe performance
degradation on other nodes in the cluster. Resolve the problems found on other
nodes and instances of this database in the cluster.
[oracle@rac122pri1 bin]$
I really like the text representation because I like the command line. For those of us who prefer a more aesthetically pleasing representation of the data, you can also create an HTML version of the findings by specifying the -htmlfile flag.
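A sketch of what that looks like, based on the usage message shown earlier (the output path is my choice and entirely arbitrary):

```shell
# Write the diagnosis for the same time window to an HTML report
# instead of printing it to the terminal.
chactl query diagnosis -db CDB \
  -start "2017-03-15 05:20:00" -end "2017-03-15 06:20:00" \
  -htmlfile /tmp/cha_cdb_diagnosis.html
```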
What else?
Going over the list of options again I found a few more that look interesting. For instance, what’s the footprint of this information in the repository? Query it:
[oracle@rac122pri1 bin]$ chactl query repository
specified max retention time(hrs) : 72
available retention time(hrs)     : 307
available number of entities      : 17
allocated number of entities      : 4
total repository size(gb)         : 15.00
allocated repository size(gb)     : 1.29
[oracle@rac122pri1 bin]$
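Judging by the usage message, these values can be changed if the defaults don't suit you. A sketch I have not tried on my system (the numbers are illustrative):

```shell
# Raise the maximum retention to 7 days (the -time argument is in hours)
chactl set maxretention -time 168

# Evaluate, without actually applying, a repository resize for
# 32 monitored entities (hosts plus database instances)
chactl resize repository -entities 32 -eval
```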
And what are these models I see when querying the status? DEFAULT_DB and DEFAULT_CLUSTER exist:
[oracle@rac122pri1 bin]$ chactl query model
Models: DEFAULT_CLUSTER, DEFAULT_DB

[oracle@rac122pri1 bin]$ chactl query model -name DEFAULT_DB
Model: DEFAULT_DB
Target Type: DATABASE
Version: 12.2.0.1_0
OS Calibrated on:
Calibration Target Name:
Calibration Date:
Calibration Time Ranges:
Calibration KPIs:
[oracle@rac122pri1 bin]$
The documentation elaborates: it is possible to calibrate the tool so its models better reflect your environment.
This is actually quite useful; however, I doubt many clusters will deviate from the standard model. Anyway, if you want to calibrate CHA for a specific workload, you start off by querying it to check whether it has sufficient data. In my case I could dream up a scenario called "swingbench run".
The cluster as such comes first:
[oracle@rac122pri1 bin]$ chactl query calibration -cluster -timeranges \
> 'start=2017-03-15 05:20:00,end=2017-03-15 06:20:00'
Cluster name : rac122pri
Start time : 2017-03-15 05:20:00
End time : 2017-03-15 06:30:00
Total Samples : 1647
Percentage of filtered data : 100%

1) Disk read (ASM) (Mbyte/sec)

MEAN     MEDIAN   STDDEV   MIN    MAX
1.19     0.45     3.71     0.00   86.69

<25       <50      <75      <100    >=100
99.64%    0.18%    0.12%    0.06%   0.00%

2) Disk write (ASM) (Mbyte/sec)

MEAN     MEDIAN   STDDEV   MIN    MAX
3.14     0.42     4.67     0.00   61.67

<50       <100     <150     <200    >=200
99.94%    0.06%    0.00%    0.00%   0.00%

3) Disk throughput (ASM) (IO/sec)

MEAN     MEDIAN   STDDEV   MIN    MAX
339.59   100.00   512.97   0.00   9700.00

<5000     <10000   <15000   <20000  >=20000
99.88%    0.12%    0.00%    0.00%   0.00%

4) CPU utilization (total) (%)

MEAN     MEDIAN   STDDEV   MIN    MAX
37.89    16.80    33.90    4.30   99.90

<20       <40      <60      <80     >=80
55.68%    7.35%    3.76%    4.74%   28.48%

[oracle@rac122pri1 bin]$
As per the command's help message, you can limit the output to specific KPIs using the -kpiset flag. Similarly, you can check whether you have enough data for the database:
[oracle@rac122pri1 bin]$ chactl query calibration -db CDB -timeranges \
> 'start=2017-03-15 05:20:00,end=2017-03-15 06:20:00'
Database name : cdb
Start time : 2017-03-15 05:20:00
End time : 2017-03-15 06:30:00
Total Samples : 1632
Percentage of filtered data : 100%

1) Disk read (ASM) (Mbyte/sec)

MEAN     MEDIAN   STDDEV   MIN    MAX
1.16     0.46     3.61     0.00   86.69

<25       <50      <75      <100    >=100
99.69%    0.12%    0.12%    0.06%   0.00%

2) Disk write (ASM) (Mbyte/sec)

MEAN     MEDIAN   STDDEV   MIN    MAX
3.17     0.43     4.69     0.00   61.67

<50       <100     <150     <200    >=200
99.94%    0.06%    0.00%    0.00%   0.00%

3) Disk throughput (ASM) (IO/sec)

MEAN     MEDIAN   STDDEV   MIN    MAX
341.61   100.00   514.31   0.00   9700.00

<5000     <10000   <15000   <20000  >=20000
99.88%    0.12%    0.00%    0.00%   0.00%

4) CPU utilization (total) (%)

MEAN     MEDIAN   STDDEV   MIN    MAX
37.94    16.75    33.99    4.30   99.90

<20       <40      <60      <80     >=80
55.82%    7.17%    3.62%    4.66%   28.74%

5) Database time (per user call) (usec/call)

MEAN       MEDIAN   STDDEV       MIN    MAX
108878.33  6712.80  2778378.28   0.00   93077792.00

<10000000  <20000000  <30000000  <40000000  <50000000  <60000000  <70000000  >=70000000
99.88%     0.00%      0.00%      0.00%      0.00%      0.00%      0.06%      0.06%

[oracle@rac122pri1 bin]$
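Typing those timestamp strings by hand is fiddly and error-prone. A small helper to build a -timeranges argument covering the last hour, assuming GNU date (as found on Oracle Linux):

```shell
# Build a -timeranges argument for the last hour in the
# 'YYYY-MM-DD HH24:MI:SS' format that chactl expects (requires GNU date).
end=$(date '+%Y-%m-%d %H:%M:%S')
start=$(date -d '1 hour ago' '+%Y-%m-%d %H:%M:%S')
timeranges="start=${start},end=${end}"
echo "${timeranges}"
```

It can then be used as, for example, `chactl query calibration -cluster -timeranges "${timeranges}"`.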
Be careful though: sometimes you don't have enough data. I saw a message like this after setting the window to 30 minutes:
[oracle@rac122pri1 ~]$ chactl query calibration -cluster \
> -timeranges 'start=2017-03-15 05:50:00,end=2017-03-15 06:20:00'
Cluster name : rac122pri
Start time : 2017-03-15 05:50:00
End time : 2017-03-15 06:20:00
Total Samples : 687
Percentage of filtered data : 100%
The number of data samples may not be sufficient for calibration.
...
It looks like you can leave it at that, or create a new, more accurate model with finer granularity. Interesting new features await!
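Judging by the chactl usage message shown earlier, creating and activating such a model looks straightforward. A sketch, untested, with the model name being my invention:

```shell
# Build a new model (hypothetical name 'swingbench_run') from the one-hour
# window sampled above, then switch the database's monitoring over to it.
chactl calibrate database -db CDB -model swingbench_run \
  -timeranges 'start=2017-03-15 05:20:00,end=2017-03-15 06:20:00'
chactl monitor database -db CDB -model swingbench_run
```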