In the previous blog post I used libvirt and KVM in my lab environment to simulate a transient disk failure and to show how to recover from it. This post takes that example a step further: I am simulating another disk failure, but this time I won't pretend the issue can be fixed and the disk put back. In other words, I am simulating the effect of disk_repair_time hitting zero.
Most of what I am covering here is an extension of the previous post. I'll mention the main details here for your benefit, but I'd like to invite you to refer back to the previous post for more detail.
The idea is to show you the output in the ASM alert.log and the effect of the lost disk on the V$ views.
As with the previous post, the code examples in this one are for demonstration purposes only!
The setup in a nutshell
I am using Oracle Linux 7.2 with UEK4 as the host system; KVM and libvirt are responsible for the guests (virtual machines). The guest VM used for this example is named asmtest and also runs Oracle Linux 7.2. There are 9 ASM disks – 8 of them for +DATA, which uses normal redundancy across 2 failure groups. I added +RECO with external redundancy for the sake of completeness, but this post is about +DATA. To keep it simple I used an Oracle Restart configuration patched to the July 2016 PSU.
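The disk group layout can be read from the V$ASM_DISK output further down. Purely as a sketch – the exact disk paths, the presence of /dev/asm-disk01 as the soon-to-fail disk, and the lowered disk_repair_time are assumptions on my part – +DATA could have been created along these lines:

SQL> create diskgroup DATA normal redundancy
  2  failgroup batch1 disk '/dev/asm-disk01','/dev/asm-disk02','/dev/asm-disk03','/dev/asm-disk04'
  3  failgroup batch2 disk '/dev/asm-disk05','/dev/asm-disk06','/dev/asm-disk07','/dev/asm-disk08';

SQL> -- shorten the repair timer for the demo (assumed); 10 minutes matches the
SQL> -- 600 second countdown in the alert.log later, the default being 3.6 hours
SQL> alter diskgroup DATA set attribute 'disk_repair_time' = '10m';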
Removing the disk
As with the previous example, I am using libvirt to remove a "LUN" temporarily from the guest (a sketch of the commands follows the journal output below). And sure enough, the VM picks this up. This is the relevant output, obtained via journalctl -f:
Sep 29 15:34:36 asmtest kernel: sd 2:0:0:0: [sda] Synchronizing SCSI cache
Sep 29 15:34:36 asmtest kernel: sd 2:0:0:0: [sda]
Sep 29 15:34:36 asmtest kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 29 15:34:36 asmtest kernel: sd 2:0:0:0: [sda]
Sep 29 15:34:36 asmtest kernel: Sense Key : Illegal Request [current]
Sep 29 15:34:36 asmtest kernel: sd 2:0:0:0: [sda]
Sep 29 15:34:36 asmtest kernel: Add. Sense: Logical unit not supported
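For reference, the LUN removal itself happens on the KVM host. What follows is a minimal sketch of how that could look with virsh; the guest name asmtest is real, but the disk target name is an assumption on my part, hence the domblklist call first:

# on the KVM host: list the block devices attached to the asmtest guest
virsh domblklist asmtest

# detach the "failed" LUN from the running guest (target name sda assumed)
virsh detach-disk asmtest sda --live

Re-attaching the disk later with virsh attach-disk would be the transient-failure scenario from the previous post; in this post the disk stays gone.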
In the previous post (the one I keep referring to) I copied the part of the ASM instance's alert.log that showed the disk repair timer ticking down. This time I am simulating the case where – for whatever reason – the transient failure could not be fixed in time. In that case, this is what you would see in the alert.log:
2016-09-29 15:38:21.752000 +01:00
WARNING: Started Drop Disk Timeout for Disk 1 (DATA_0001) in group 1 with a value 600
WARNING: Disk 1 (DATA_0001) in group 1 will be dropped in: (600) secs on ASM inst 1
2016-09-29 15:41:25.379000 +01:00
WARNING: Disk 1 (DATA_0001) in group 1 will be dropped in: (416) secs on ASM inst 1
2016-09-29 15:44:29.012000 +01:00
WARNING: Disk 1 (DATA_0001) in group 1 will be dropped in: (232) secs on ASM inst 1
2016-09-29 15:47:32.643000 +01:00
WARNING: Disk 1 (DATA_0001) in group 1 will be dropped in: (48) secs on ASM inst 1
2016-09-29 15:50:36.259000 +01:00
WARNING: Disk 1 (DATA_0001) in group 1 will be dropped in: (0) secs on ASM inst 1
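While the countdown is running you can also watch it from the ASM instance: the REPAIR_TIMER column of V$ASM_DISK (the same columns I query further down) holds the remaining seconds. A quick sketch:

SQL> select name, mount_status, mode_status, state, repair_timer
  2  from v$asm_disk
  3  where group_number = 1
  4  order by disk_number;

While the disk is offline you should see DATA_0001 with MODE_STATUS OFFLINE and REPAIR_TIMER counting down towards zero. If the underlying problem can be fixed before it reaches zero, the disk is brought back with an ALTER DISKGROUP ... ONLINE DISK command as shown in the previous post; here I let the timer expire.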
The last line in the above output is definitely NOT what you want to see. If at all possible, you should fix whatever is keeping DATA_0001 (in this example) offline before the timer expires. Once you see the message about the disk being dropped, you are facing the inevitable rebalance operation. Here is the continued output from the ASM alert.log:
WARNING: PST-initiated drop of 1 disk(s) in group 1(.4232658126))
SQL> alter diskgroup DATA drop disk DATA_0001 force /* ASM SERVER */
NOTE: cache closing disk 1 of grp 1: (not open) DATA_0001
NOTE: requesting all-instance membership refresh for group=1
GMON updating for reconfiguration, group 1 at 40 for pid 23, osid 3216
NOTE: cache closing disk 1 of grp 1: (not open) DATA_0001
NOTE: group DATA: updated PST location: disk 0003 (PST copy 0)
NOTE: group DATA: updated PST location: disk 0000 (PST copy 1)
NOTE: group 1 PST updated.
NOTE: membership refresh pending for group 1/0xfc493cce (DATA)
GMON querying group 1 at 41 for pid 16, osid 2701
NOTE: cache closing disk 1 of grp 1: (not open) _DROPPED_0001_DATA
SUCCESS: refreshed membership for 1/0xfc493cce (DATA)
SUCCESS: alter diskgroup DATA drop disk DATA_0001 force /* ASM SERVER */
SUCCESS: PST-initiated drop disk in group 1(4232658126))
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: starting rebalance of group 1/0xfc493cce (DATA) at power 1
Starting background process ARB0
ARB0 started with pid=20, OS id=3655
NOTE: assigning ARB0 to group 1/0xfc493cce (DATA) with 1 parallel I/O
2016-09-29 15:50:40.257000 +01:00
NOTE: restored redundancy of control and online logs in DATA
NOTE: Rebalance has restored redundancy for any existing control file or redo log in disk group DATA
NOTE: restored redundancy of control and online logs in DATA
2016-09-29 15:50:51.655000 +01:00
NOTE: requesting all-instance membership refresh for group=1
GMON updating for reconfiguration, group 1 at 42 for pid 21, osid 3719
NOTE: cache closing disk 1 of grp 1: (not open) _DROPPED_0001_DATA
NOTE: group DATA: updated PST location: disk 0003 (PST copy 0)
NOTE: group DATA: updated PST location: disk 0000 (PST copy 1)
NOTE: group 1 PST updated.
2016-09-29 15:50:54.647000 +01:00
GMON updating for reconfiguration, group 1 at 43 for pid 21, osid 3719
NOTE: cache closing disk 1 of grp 1: (not open) _DROPPED_0001_DATA
NOTE: group DATA: updated PST location: disk 0003 (PST copy 0)
NOTE: group DATA: updated PST location: disk 0000 (PST copy 1)
NOTE: group 1 PST updated.
2016-09-29 15:50:57.571000 +01:00
NOTE: membership refresh pending for group 1/0xfc493cce (DATA)
GMON querying group 1 at 44 for pid 16, osid 2701
GMON querying group 1 at 45 for pid 16, osid 2701
NOTE: Disk _DROPPED_0001_DATA in mode 0x0 marked for de-assignment
SUCCESS: refreshed membership for 1/0xfc493cce (DATA)
NOTE: Attempting voting file refresh on diskgroup DATA
2016-09-29 15:51:03.589000 +01:00
NOTE: stopping process ARB0
SUCCESS: rebalance completed for group 1/0xfc493cce (DATA)
The last line indicates that the rebalance operation is complete. Since my disk group was almost empty, the rebalance took almost no time. But indeed, when you check V$ASM_DISK, the disk is gone:
SQL> r
1 select name,path,disk_number,mount_status,header_status,mode_status,state,failgroup,repair_timer
2* from v$asm_disk where group_number = 1 order by disk_number
NAME PATH DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST STATE FAILGROUP REPAIR_TIMER
---------- --------------- ----------- ------- ------------ ------- -------- ---------- ------------
DATA_0000 /dev/asm-disk02 0 CACHED MEMBER ONLINE NORMAL BATCH1 0
DATA_0002 /dev/asm-disk03 2 CACHED MEMBER ONLINE NORMAL BATCH1 0
DATA_0003 /dev/asm-disk06 3 CACHED MEMBER ONLINE NORMAL BATCH2 0
DATA_0004 /dev/asm-disk07 4 CACHED MEMBER ONLINE NORMAL BATCH2 0
DATA_0005 /dev/asm-disk08 5 CACHED MEMBER ONLINE NORMAL BATCH2 0
DATA_0006 /dev/asm-disk04 6 CACHED MEMBER ONLINE NORMAL BATCH1 0
DATA_0007 /dev/asm-disk05 7 CACHED MEMBER ONLINE NORMAL BATCH2 0
7 rows selected.
SQL> select count(*), failgroup from v$asm_disk where name like 'DATA%' group by failgroup;
COUNT(*) FAILGROUP
---------- ----------
3 BATCH1
4 BATCH2
My system is fully operational, and the rebalance did not run into any space problems. Space problems are the last thing you want to run into during a rebalance; I did some research on this subject earlier and documented it in a blog post.
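For anything but a near-empty disk group the rebalance takes considerably longer, and it is worth keeping an eye both on its progress and on the space situation. A sketch of the kind of queries I use for that (v$asm_operation for the rebalance, v$asm_diskgroup for the space figures):

SQL> -- progress of an ongoing rebalance (no rows once it has finished)
SQL> select group_number, operation, state, power, sofar, est_work, est_minutes
  2  from v$asm_operation;

SQL> -- can the disk group still restore redundancy after another failure?
SQL> select name, type, total_mb, free_mb, required_mirror_free_mb, usable_file_mb
  2  from v$asm_diskgroup
  3  where name = 'DATA';

A negative USABLE_FILE_MB is the warning sign here: it means there is not enough free space left in the disk group to restore full redundancy after another disk failure.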
In a future blog post I am going to investigate what happens when a partner disk of DATA_0001 fails while the repair timer is ticking down.