2022-12-03

from 200 active sessions down to 2 - no silver bullets

 Recently I had some fun with an Oracle Enterprise Manager Cloud control installation (13.5.0.10). The Repository is 19.15 in a MAA Gold configuration: RAC and DataGuard, also the OEM has 2 WLS nodes with a load balancer in front.  

The system was not available for some hours (nothing to blame the architecture, and nothing I can discuss here). 

After everything was working again, the Average Active Sessions on the repository were quite high: around 200. It was clear there was a lot of work to be processed, all the events coming in from the agents to be collected, aggregated and so on. Still that load was something to investigate and improve - if possible. 
The first information I focused on were the high cluster related waits (all the gc ... events) together with buffer busy waits. I picket the top SQL which contributes to these waits: SQL_ID:79rrazqdmaq8h
INSERT INTO EM_METRIC_KEYS (METRIC_KEY_ID, KEY_PART_1, KEY_PART_2, KEY_PART_3, KEY_PART_4, KEY_PART_5, KEY_PART_6, KEY_PART_7, METRIC_KEY_GUID, CREATION_DATE)
  VALUES (EM_METRIC_KEY_ID_SEQ.NEXTVAL, :B1 , :B2 , :B3 , :B4 , :B5 , :B6 , :B7 , :B8 , SYSDATE)
  RETURNING METRIC_KEY_ID INTO :O0
At a first glance, the sequence might be a source for such problems. But as the version is greater than 19.10 already, dynamic sequence cache should mitigate this issue already (and _dynamic_sequence_cache was not set). After a little bit of investigation which objects could be affected at all (e.g. in v$active_session_history.CURRENT_OBJ# for the specific SQL_ID and wait event or in v$session using p1 (file#) and p2 (block#) and then identify the object in DBA_EXTENTS). In this case it's still sequence related, but not SEQ$. Instead the majority of these waits were on the index on EM_METRIC_KEYS(METRIC_KEY_ID).  The solution is still the same: Increase the CACHE for the sequence to a value where each involved instance can use it's only index block based on proper index block split (In my case a CACHE size of 5000 die the trick).
Some other INSERT statements also suffered from the same problems and could be improved by the same way. 
This helped to reduce the AAS down to ~80.

As these type of INSERTs disappeared from the top statements, the next to focus on is an UPDATE: 
SQL_ID: 7hyx2gndnw242
UPDATE EM_EVENT_BUS_QUEUES 
  SET LAST_DEQUEUE_TIME = MGMT_GLOBAL.SYSDATE_UTC 
  WHERE QNAME = :B1
Again a lot of Cluster waits with some Concurrency. The specific wait events looked quite familiar to the previous case. Here several sessions are fighting for the same block to do an update there. As the consistency of the block needs to be communicated clusterwide, and the block itself needs to travel back & forth even cache-fusion and modern Exadata hardware can only help so much. 
A quick check on the table EM_EVENT_BUS_QUEUES_E (the one behind the view EM_EVENT_BUS_QUEUES) showed there are only 89 rows in it (at that time, for this specific system). This leads to the idea to provide each session with direct access to the QUEUE name it is searching for (in the hope the same QUEUE is not processed by different sessions at the same time). 
Doing so is simple: 
ALTER TABLE SYSMAN.EM_EVENT_BUS_QUEUES_E PCTFREE 99; 
ALTER TABLE SYSMAN.EM_EVENT_BUS_QUEUES_E MOVE ONLINE;
This wastes a little bit of space, but contention on the blocks is eliminated. 
Again there were several statements and objects which could be threatened by this method. 
AAS went down to ~35. 

Next top statement is slightly different: 
SQL_ID: 747d8puab2h4u
BEGIN 
  DBMS_APPLICATION_INFO.SET_ACTION(:1 ); 
  EM_LOADERJOB.dequeu_job(:2 ,:3 ,:4 ,:5 ,:6 ,:7 ); 
END;
The wait events are still cluster related. But PL/SQL itself does rarely cause such events, and I was lazy digging into the dequeu_job procedure. But lazyness pays off if it leads me to a solution: in my case a simple google search for 747d8puab2h4u leads to the MOS document 
the solution provided there did not help, but some more search - this time within MOS - led to 
and further to 
and at last 
EM13c : Setting Up RAC Services In Repository Database (Doc ID 2693517.1)
There several tasks are explained, but based on my specific issue, I only focussed on one: Loaderjob service
To implement this specific suggestion, I created a dedicated service on the database with only 1 active instance (don't forget to create this service afterwards on the dataguard, or your next switchover test will become a disaster). 
Then this service was configured to the OMS:
emctl set property -name "oracle.sysman.core.pbs.gcloader.connectDescriptor" -value "\(DESCRIPTION=\(ADDRESS_LIST=\(ADDRESS=\(PROTOCOL=TCP\)\(HOST=<primary racinstance>\)\(PORT=1521\)\)\)\(CONNECT_DATA=\(SERVICE_NAME=...loaderjob\)\)\)"
 Of course, the connection string needs to contain both SCANs, in case of switchover tests ... 
This requires the OMS to restart.
After even one OMS restarted, the AAS ... was still at ~25, but the cluster waits mostly disappeared.

The last statement to improve was a SELECT: 
SQL_ID: dc3p6u6d0uvzn
select 1 
from "SYSMAN"."AQ$_EM_LOADERJOB_Q_TBL_L" 
where msgid = :1 
  and subscriber# = :2 
  and name = :3 
  and address# = :4
This SELECT is only ON CPU - at least no clusterwaits but still something worth to improve, if possible. 
The important observation here was the number of rows in this table. At the time of investigation it was 1. At the same time the segment used significant more blocks than required for 1 row. This ON CPU did just one thing: checking some blocks in buffercache IF there are any rows. And these checks were done at a high pace which created the still significant load. 
The solution for this last task was quite simple: 
alter table "SYSMAN"."AQ$_EM_LOADERJOB_Q_TBL_L" MOVE online;   
(shrink space might also have done the trick)
After that last step, the AAS was down to ~2.



These improvements in performance (or maybe more important, reduction in resource consumption) did take some time - and I did not show all the dead ends in between. 
There is also no silver bullet at all. Just a simple focus on top resource consumers. Of course basic knowledge about the system helps, as well as minimal skills in using a search engine (both online available an don software vendors support page).  
In this specific case, tuning & diagnostic pack was licensed. But all the information used here was available in simple data dictionary and dynamic performance views, if required. 



Keine Kommentare: