Recently I had some fun with an Oracle Enterprise Manager Cloud Control installation (13.5.0.10). The repository is 19.15 in an MAA Gold configuration: RAC and Data Guard; the OEM side has 2 WLS nodes with a load balancer in front.
The system was not available for some hours (nothing to blame on the architecture, and nothing I can discuss here).
After everything was working again, the Average Active Sessions (AAS) on the repository were quite high: around 200. It was clear there was a lot of work to be processed: all the events coming in from the agents had to be collected, aggregated and so on. Still, that load was something to investigate and improve - if possible.
The first thing I focused on were the high cluster related waits (all the gc ... events) together with buffer busy waits. I picked the top SQL contributing to these waits: SQL_ID 79rrazqdmaq8h
At first glance, the sequence might be a source of such problems. But as the version is already greater than 19.10, the dynamic sequence cache should mitigate this issue (and _dynamic_sequence_cache was not set). A little investigation showed which objects were affected at all (e.g. via v$active_session_history.CURRENT_OBJ# for the specific SQL_ID and wait event, or via v$session using p1 (file#) and p2 (block#) and then identifying the object in DBA_EXTENTS). In this case it was still sequence related, but not SEQ$: the majority of these waits were on the index on EM_METRIC_KEYS(METRIC_KEY_ID). The solution is still the same: increase the CACHE of the sequence to a value where each involved instance can use its own index blocks, based on proper index block splits (in my case a CACHE size of 5000 did the trick).
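For reference, a sketch of the lookup and the fix. The ASH query requires the Diagnostics Pack, and the sequence name at the end is a placeholder - check which SYSMAN sequence actually feeds METRIC_KEY_ID on your system:

```sql
-- map v$session.p1 (file#) and p2 (block#) to the segment being waited on:
select owner, segment_name, segment_type
  from dba_extents
 where file_id = :p1
   and :p2 between block_id and block_id + blocks - 1;

-- or, with Diagnostics Pack, straight from ASH for the statement in question:
select current_obj#, count(*)
  from v$active_session_history
 where sql_id = '79rrazqdmaq8h'
 group by current_obj#
 order by count(*) desc;

-- then widen the cache so each instance fills its own index leaf blocks
-- (EM_METRIC_KEY_SEQ is a placeholder name):
alter sequence SYSMAN.EM_METRIC_KEY_SEQ cache 5000;
```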
Some other INSERT statements suffered from the same problem and could be improved the same way. This helped to reduce the AAS to ~80.
As these types of INSERTs disappeared from the top statements, the next one to focus on was an UPDATE: SQL_ID 7hyx2gndnw242
UPDATE EM_EVENT_BUS_QUEUES
SET LAST_DEQUEUE_TIME = MGMT_GLOBAL.SYSDATE_UTC
WHERE QNAME = :B1
Again a lot of cluster waits with some concurrency. The specific wait events looked quite familiar from the previous case. Here several sessions are fighting for the same block to do an update there. As the consistency of the block needs to be communicated cluster-wide, and the block itself needs to travel back and forth, even cache fusion and modern Exadata hardware can only help so much.
A quick check on the table EM_EVENT_BUS_QUEUES_E (the one behind the view EM_EVENT_BUS_QUEUES) showed there were only 89 rows in it (at that time, for this specific system). This led to the idea of providing each session with direct access to the QUEUE name it is searching for (in the hope that the same QUEUE is not processed by different sessions at the same time).
Doing so is simple:
ALTER TABLE SYSMAN.EM_EVENT_BUS_QUEUES_E PCTFREE 99;
ALTER TABLE SYSMAN.EM_EVENT_BUS_QUEUES_E MOVE ONLINE;
This wastes a little bit of space, but contention on the blocks is eliminated.
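A quick way to verify the effect of the move is to count rows per block - after PCTFREE 99 every row should sit in its own block:

```sql
-- rows per data block; with PCTFREE 99 no block should hold more than one row
select dbms_rowid.rowid_block_number(rowid) as block#,
       count(*) as rows_in_block
  from SYSMAN.EM_EVENT_BUS_QUEUES_E
 group by dbms_rowid.rowid_block_number(rowid)
having count(*) > 1;
```

If this returns no rows, each of the 89 queue rows occupies its own block and the sessions no longer compete for the same buffer.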
Again there were several statements and objects which could be treated by this method.
The next top consumer was a PL/SQL call, SQL_ID 747d8puab2h4u. Its wait events were still cluster related. But PL/SQL itself rarely causes such events, and I was too lazy to dig into the dequeue_job procedure. Laziness pays off, though, if it leads to a solution: in my case a simple Google search for 747d8puab2h4u led to a MOS document;
the solution provided there did not help, but some more searching - this time within MOS - led from one note to another, and at last to
EM13c : Setting Up RAC Services In Repository Database (Doc ID 2693517.1)
Several tasks are explained there, but based on my specific issue I focused on only one: the Loaderjob service.
To implement this suggestion, I created a dedicated service on the database with only 1 active instance (don't forget to create this service on the Data Guard standby as well, or your next switchover test will become a disaster). Then this service was configured in the OMS:
emctl set property -name "oracle.sysman.core.pbs.gcloader.connectDescriptor" -value "\(DESCRIPTION=\(ADDRESS_LIST=\(ADDRESS=\(PROTOCOL=TCP\)\(HOST=<primary racinstance>\)\(PORT=1521\)\)\)\(CONNECT_DATA=\(SERVICE_NAME=...loaderjob\)\)\)"
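The service itself was created with srvctl; a sketch with placeholder names (db_unique_names, instance names and the service name will differ on every system) - the point is the singleton layout with exactly one preferred instance:

```shell
# primary: singleton service, one preferred instance, the rest only available
srvctl add service -db emrep_prim -service loaderjob \
       -preferred emrep1 -available emrep2
srvctl start service -db emrep_prim -service loaderjob

# same definition on the standby, or the next switchover will miss the service
srvctl add service -db emrep_stby -service loaderjob \
       -preferred emrep1 -available emrep2
```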
Of course, the connect descriptor needs to contain both SCANs, in case of switchover tests ...
This requires an OMS restart.
After even one OMS was restarted, the AAS was still at ~25, but the cluster waits had mostly disappeared.
The last statement to improve was a SELECT: SQL_ID: dc3p6u6d0uvzn
select 1
from "SYSMAN"."AQ$_EM_LOADERJOB_Q_TBL_L"
where msgid = :1
and subscriber# = :2
and name = :3
and address# = :4
This SELECT is only ON CPU - at least no cluster waits, but still something worth improving, if possible.
The important observation here was the number of rows in this table: at the time of investigation it was 1. At the same time the segment used significantly more blocks than required for 1 row. The ON CPU time came from just one thing: checking blocks in the buffer cache for any rows. And these checks were done at such a high pace that they still created significant load.
The solution for this last task was quite simple:
alter table "SYSMAN"."AQ$_EM_LOADERJOB_Q_TBL_L" MOVE online;
(shrink space might also have done the trick)
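The mismatch between content and size is easy to see in the dictionary; a quick sketch:

```sql
-- 1 row in the table, but how many blocks does the segment occupy?
select blocks, bytes/1024/1024 as mb
  from dba_segments
 where owner = 'SYSMAN'
   and segment_name = 'AQ$_EM_LOADERJOB_Q_TBL_L';
```

The MOVE resets the high water mark, so the frequent emptiness checks touch only a handful of blocks afterwards.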
After that last step, the AAS was down to ~2.
These improvements in performance (or, maybe more important, the reduction in resource consumption) took some time - and I did not show all the dead ends in between.
There is no silver bullet at all, just a simple focus on the top resource consumers. Of course basic knowledge about the system helps, as well as minimal skill in using a search engine (both the public ones and the one on the software vendor's support page).
In this specific case, the Tuning & Diagnostics Pack was licensed. But all the information used here was also available in plain data dictionary and dynamic performance views, if required.
One of my customers faced a problem with wrong results. The most remarkable detail was the inconsistency of these wrong results: they depended on the queries he ran. So his first assumption was an error in the SQL parser/optimizer/wherever. But as I stripped down the correct/wrong SQLs more and more to get a simple testcase, maybe something to provide to Oracle Support, it became clear the SQLs were fine - just one physical representation of the data was wrong!
Yes, even in a perfectly normalized relational database, there is a lot of physically duplicated data. One might think of materialized views, but in fact, every index is an implicit duplication of some data for optimized physical access.
Each RDBMS promises to keep this duplicated data in sync (if the word had been used in IT in the last millennium, they would have called it autonomous), and each RDBMS sometimes fails at doing so.
I tried to identify those rows and indexes where the information in the index and in the table differs. Then the customer needs to understand the data and decide which data is correct, store this data in the table block, and re-create all indexes which do not match the table's data.
The basic idea is to select the rows from the index and from the table and compare their data. Even if this sounds easy, Oracle's optimizer puts a lot of effort into avoiding what it sees as unnecessary work. Getting the same column of the same row from 2 different sources is exactly such avoidable work. I first tried some queries where I got the questionable columns (and their ROWIDs) in a sub-query and then joined to the table again, retrieving the table's row and value. Something like:
select /*+ qb_name(main) */ t.pk, t.problematic_column
from TABLE t
where t.rowid in
(
select /*+ qb_name(get_rowid) */ ti.rowid
from TABLE ti
where ti.problematic_column='123456'
);
But to avoid the optimizer's cleverness, I had to use hints like NO_MERGE, NO_UNNEST, NO_ELIMINATE_SQ(@GET_ROWID), INDEX, MATERIALIZE - and there might be even more in higher versions.
My final approach is maybe not as elegant in pure SQL, but for me it provides more stability:
with
  function get_problematic_column (rid in varchar2) return number is
    cid number;
  begin
    select problematic_column into cid
      from TABLE t
     where rowid = rid;
    return cid;
  end;
select /*+ INDEX(ti SUSPICIOUS_INDEX) qb_name(get_rowid) */
       ti.rowid, ti.pk,
       ti.problematic_column as problematic_column_in_index,
       get_problematic_column(ti.rowid) as problematic_column_at_rowid
  from TABLE ti
 where ti.problematic_column = '123456'
/
This is not the most beautiful query, but it should give a good idea. The function get_problematic_column does a TABLE ACCESS BY USER ROWID, and we can be very sure this will not change and will provide the data as it is stored in the table's block. The query block get_rowid should access the table via SUSPICIOUS_INDEX - I only have to ensure the WHERE clause matches the index columns.
With small adaptations I could test all indexes for invalid columns.
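To repeat the check for every index of the table, the candidate indexes and their columns can be listed from the dictionary; a sketch (owner and table name are placeholders, like TABLE throughout this post):

```sql
-- one check query per index: index name plus its columns in position order
select index_name,
       listagg(column_name, ', ')
         within group (order by column_position) as indexed_columns
  from dba_ind_columns
 where table_owner = 'OWNER'
   and table_name  = 'TABLE'
 group by index_name;
```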
In this case the table's data was valid and one index needed to be rebuilt. Much easier than opening a Service Request at MOS.
If anyone wants to know the underlying hardware: it's an Exadata cluster (I was never told the exact specifications) with a proper ASM diskgroup configuration.