2022-12-29

AHF Insights

Oracle recently released a new version of its Autonomous Health Framework (AHF). At the time of writing the version is 22.3.1. 
As the development is quite active there, one of the new features is AHF Insights with a nice description: 

AHF Insights provides deeper diagnostic insights into Oracle diagnostic collections collected by AHF diagnostic utilities, Oracle EXAchk, Oracle Trace File Analyzer, Exawatcher, and Cluster Health Monitor.

This sounds promising, so I was tempted to give it a try! 
My first attempt failed: I gave it a chance at my sandbox in my laptop, but right now AHF Insights only work on Engineered Systems. I didn't found this documented somewhere - but ok. The good news: I was told AHF Insights will be available in future versions also for non-Engineered environments. As this is the first release, I can follow the argument here. 

So I had to grab an Engineered System machine, install AHF, let it run for some time and then just run 

ahf analysis create --type insights
I also could be executed for a specific time in the past (with --last n{m|h}) or some range (--from DATETIME --to DATETIME) - But as I was not searching for a specific issue, I took the default 2 Hours
The execution took some minutes, but it was faster than an EXAchk
The zipfile created has a fileformat like <hostname>_insights_YYYY_MM_DD_HH24_MI_SS - it contains one directory web containing an index.html and a lot of subfolders:
.
└── web
    ├── css
    │   ├── alta
    │   │   └── 11.1.1
    │   │       └── web
    │   │           ├── fonts
    │   │           └── images
    │   │               └── sprites
    │   ├── fonts
    │   ├── images
    │   └── redwood
    │       └── 11.1.1
    │           ├── common
    │           └── web
    │               ├── fonts
    │               └── images
    ├── dynamicHtml
    ├── icons
    ├── js
    │   ├── jsons
    │   │   ├── insights
    │   │   ├── meta
    │   │   └── topology
    │   ├── libs
    │   │   ├── oj
    │   │   │   └── v11.1.1
    │   │   │       └── resources
    │   │   │           └── root
    │   │   └── require
    │   ├── viewModels
    │   └── views
    │       ├── compliance
    │       │   └── js
    │       │       └── input
    │       ├── dbParameter
    │       │   └── js
    │       │       └── input
    │       ├── ganttPage
    │       │   └── js
    │       │       └── input
    │       ├── os
    │       │   └── js
    │       │       └── input
    │       └── rpm
    │           └── js
    │               └── input
    └── log


the welcome.html is all you need to open. In my case the first page looks like this: 
It contains some basic information about 
  • the AHF version 
  • the time range it covers
  • System Topology
  • Insights it found



The Cluster information contains basic details about the cluster itself, it's resources and the ASM disks. 
The summary information has anCopy as text button which generates a nice output: 

========================================================================================================================
Area                           | Dimension                      | Value
========================================================================================================================
System                         | Type                           | Exadata
System                         | Total Node Count               | 2
Cluster                        | Node Count                     | 1
Cluster                        | GI Version                     | 19.17.0.0.0
Cluster                        | Timezone                       | Europe/
Cluster                        | Cluster Name                   | 
Cluster                        | CRS Home                       | /u01/app/19.0.0.0/grid
Databases                      | Count                          | 1
Databases                      | Database Home                  | /u02/
Database Servers               | Count                          | 2
Database Servers               | Hardware Model                 | Standard PC (Q35 + ICH9, 2009)
Database Servers               | Image Version                  | 21.2.13.0.0.220602
Database Servers               | Operating System               | Linux x86_64
Database Servers               | Operating System Version       | 4.14.35-2047.511.5.5.1.el7uek.x86_64
========================================================================================================================
Unfortunately this feature isn't present on all the pages. 
The Cluster Resources and ASM disks just contain information about all OCR resources and the ASM diskgroups. 

The Databases Topology information provides some basic information about the CDBs and their PDBs, together with some very basic instance information like cpu_count, sga_target, pga_targetpga_limit and processes

On the Insights section, those with some findings worth to check are highlighted in red. On my system, there are 4 such sections. I will show some of them as examples: 


The Timeline gives a graph of all relevant events, in my case there are only 2 types of events on one hostname.  Below there are filters which can be used if many events appear on the timeline. Also a list of Timestamp, Type, Event, Hostname and Description of each (filtered) event is shown. 



If you move the mouse to the right top of the chart, a list of icons becomes visible for an interactive interaction with the graph:
This can be used in all charts in the Insights.



In the Operating System Issues Page, a lot of detailed information about CPU, Memory, IO, Network, and Processes is shown, both as Charts in Metrics and Grouped Details in the Reports section.  
A nice detail in Process Aggregation is the aggregation on instance basis - so I can check the aggregated memory, processes, etc. for each instance. 


In the Best Practices a list of all the checks, their findings and recommendations is given. 
I always be careful with "Best Practices" - in this particular case the underscore parameters were given by Oracle support in a specific service request. 
But of course, you can see them as recommendations, either follow them or at least document the specific reason you did NOT follow them for a given purpose. 



In the last section, Recommended Software, you can see the early stage of the software. 
For me, 19.17.0.0.0 is close enough to 19.17.0.0.221018.
Maybe in a future release, Oracle will agree ;) 

At all, I can highly recommend to use AHF Insights whenever you need to investigate an issue which might have a wider scope than one single SQL or instance. Even as a first release, it can provide a lot of useful information in a nice and well structured report. 

2022-12-22

OEM 13.5 - add cman target with emcli

 Yesterday I had some fun adding a cman to Oracle Enterprise Manager. 

Using the web frontend is fine for a single target, but it doesn't scale well. But OEM also has a very fine command line utility: emcli

emcli has a high amount of parameters, in my case I'm interested in emcli add_target. This verb has basically 4 parameters: -name, -type -host and -propertiesname and host are kind of obvious, type I already identified as oracle_cman, but the properties are of some mystery. I did not find the list of properties somewhere in the documentation, and also google didn't help. 

So I mixed find and grep together in som random methods until I found the file 

 ./middleware/plugins/oracle.sysman.db.oms.plugin_13.5.1.0.0/metadata/targetType/oracle_cman.xml

Beside some other information (e.g. the name of the perlscripts which test the target) there is also a list of all properties:


  <InstanceProperties>
     <InstanceProperty NAME="Machine" CREDENTIAL="FALSE" OPTIONAL="FALSE">
       <Display>
         <Label NLSID="oracle_cman_Machine_iprop">Machine Name</Label>
       </Display>
     </InstanceProperty>
     <InstanceProperty NAME="Port" CREDENTIAL="FALSE" OPTIONAL="FALSE">
       <Display>
         <Label NLSID="oracle_cman_Port_iprop">Port Number</Label>
       </Display>
     </InstanceProperty>
     <InstanceProperty NAME="Protocol" CREDENTIAL="FALSE" OPTIONAL="FALSE">
       <Display>
         <Label NLSID="oracle_cman_Protocol_iprop">Connection Protocol</Label>
       </Display>
     </InstanceProperty>
     <InstanceProperty NAME="CManOraDir" CREDENTIAL="FALSE" OPTIONAL="FALSE">
       <Display>
         <Label NLSID="oracle_cman_CmanOraDir_iprop">cman.ora Directory</Label>
       </Display>
     </InstanceProperty>
     <InstanceProperty NAME="CManName" CREDENTIAL="FALSE" OPTIONAL="FALSE">
       <Display>
         <Label NLSID="oracle_cman_CmanName_iprop">Connection Manager Name</Label>
       </Display>
     </InstanceProperty>
     <InstanceProperty NAME="OracleHome" CREDENTIAL="FALSE" OPTIONAL="FALSE">
       <Display>
         <Label NLSID="oracle_cman_OracleHome_iprop">Oracle Home</Label>
       </Display>
     </InstanceProperty>
     <InstanceProperty NAME="Password" CREDENTIAL="TRUE" OPTIONAL="TRUE">
       <Display>
         <Label NLSID="oracle_cman_OracleHome_iprop">Connection Manager Password</Label>
       </Display>
     </InstanceProperty>


But all this effort would not have been necessary 😁  

At my tests, I hit some issues, and so identified

emcli add_target -name=cman_xxx -type=oracle_cman -host="xxx.yyy.zzz"

threw a proper error:

Error: Required properties are missing: Machine, Port, Protocol, CManOraDir, CManName, OracleHome


This syntax worked for me:


emcli add_target -name=cman_xxx -type=oracle_cman \
  -host="xxx.yyy.zzz" \
  -properties="Machine:xxx.yyy.zzz;Port:1521;Protocol:TCP;CManOraDir:<tns_admin>;CManName:cman_xxx;OracleHome:<oracle_home>;Password:🍻🍻🍻"

My next goal is to test if these values are stored correct.
For this I stumbled about the Note 

Emcli Command To Get Target Properties Information Like ORACLE HOME (Doc ID 2329892.1)


emcli list -resource="TargetProperties" -search="TARGET_NAME ='cman_xxx'" -column="PROPERTY_NAME,PROPERTY_VALUE" -script 


PROPERTY_NAME  PROPERTY_VALUE
CManNamecman_xxx
CManOraDir<TNS_ADMIN>
Machinexxx.yyy.zzz
OracleHome<ORACLE_HOME>
Port1521
ProtocolTCP
orcl_gtp_osLinux
orcl_gtp_platformx86_64

unfortunately, the Password isn't shown here and can not be checked by this method.

2022-12-21

OEM 13.5 - add cman target

Right now I have to add some connection manager instances to Oracle Enterprise manager (13.5.0.10). I did not find any automatic discovery, so I had to enter the required values manually. But not all of these values are quite clear (to me) and all the documentation I found was ... of little use. 

Luckily I stumbled across $ORACLE_HOME/plugins/oracle.sysman.db.agent.plugin_*/scripts/cmanresp.pl
This script seem to check if cman is available at all. To understand the meaning of these properties, the code is quite useful. Here some relevant parts: 

my $oracleHome = $ENV{CMAN_ORACLE_HOME};
my $executable = $ENV{CMAN_ORACLE_HOME} . "/bin/cmctl";
my $name = $ENV{CMAN_NAME};
my $cmanOraDir = $ENV{CMAN_ORA_DIR};
my $machine = $ENV{CMAN_MACHINE};
my $protocol = $ENV{CMAN_PROTOCOL};
my $port = $ENV{CMAN_PORT};
my $password = $ENV{CMAN_PASSWORD};

#set default protocol
if(!defined($protocol) or  $protocol eq "")
{
  $protocol = "TCP";
}

#Set environment variable ORACLE_HOME
$ENV{ORACLE_HOME} = $oracleHome;

my $address = "(ADDRESS=(PROTOCOL=$protocol)(HOST=$machine)(PORT=$port))";

my $responseTime;

#----------------------------------------------
#Execute tnsping and see if cman is reachable.
#-----------------------------------------------
#check tnsping availability
if ( !-x "$oracleHome/bin/tnsping" )
 {
  print "em_result=|0|tnsping not found in $oracleHome/bin\n";
  exit;
 }     
Connection Protocol, Machine Name and Port Number should match exactly the values as in cmans connection string - they are combined to (ADDRESS=(PROTOCOL=$protocol)(HOST=$machine)(PORT=$port)).
The default Protocol is TCP - as it's a required field, this should be entered there. 

Another question I raised (and could not answer before) was if or how the Password was used. In this script it's quite easy visible:

#Command to get the cman status
if($password eq "")
{
  $command = "$executable show status -c $name";
}
else
{
  $command = "$executable show status -c $name -p $password";
}

If no password is given, cmctl is used without the -p parameter. 

The properties Oracle Home (ORACLE_HOME) and cman.ora directory (TNS_ADMIN) are obvious, even for me. 


That's a short post, at least my documentation-to-self.

2022-12-03

from 200 active sessions down to 2 - no silver bullets

 Recently I had some fun with an Oracle Enterprise Manager Cloud control installation (13.5.0.10). The Repository is 19.15 in a MAA Gold configuration: RAC and DataGuard, also the OEM has 2 WLS nodes with a load balancer in front.  

The system was not available for some hours (nothing to blame the architecture, and nothing I can discuss here). 

After everything was working again, the Average Active Sessions on the repository were quite high: around 200. It was clear there was a lot of work to be processed, all the events coming in from the agents to be collected, aggregated and so on. Still that load was something to investigate and improve - if possible. 
The first information I focused on were the high cluster related waits (all the gc ... events) together with buffer busy waits. I picket the top SQL which contributes to these waits: SQL_ID:79rrazqdmaq8h
INSERT INTO EM_METRIC_KEYS (METRIC_KEY_ID, KEY_PART_1, KEY_PART_2, KEY_PART_3, KEY_PART_4, KEY_PART_5, KEY_PART_6, KEY_PART_7, METRIC_KEY_GUID, CREATION_DATE)
  VALUES (EM_METRIC_KEY_ID_SEQ.NEXTVAL, :B1 , :B2 , :B3 , :B4 , :B5 , :B6 , :B7 , :B8 , SYSDATE)
  RETURNING METRIC_KEY_ID INTO :O0
At a first glance, the sequence might be a source for such problems. But as the version is greater than 19.10 already, dynamic sequence cache should mitigate this issue already (and _dynamic_sequence_cache was not set). After a little bit of investigation which objects could be affected at all (e.g. in v$active_session_history.CURRENT_OBJ# for the specific SQL_ID and wait event or in v$session using p1 (file#) and p2 (block#) and then identify the object in DBA_EXTENTS). In this case it's still sequence related, but not SEQ$. Instead the majority of these waits were on the index on EM_METRIC_KEYS(METRIC_KEY_ID).  The solution is still the same: Increase the CACHE for the sequence to a value where each involved instance can use it's only index block based on proper index block split (In my case a CACHE size of 5000 die the trick).
Some other INSERT statements also suffered from the same problems and could be improved by the same way. 
This helped to reduce the AAS down to ~80.

As these type of INSERTs disappeared from the top statements, the next to focus on is an UPDATE: 
SQL_ID: 7hyx2gndnw242
UPDATE EM_EVENT_BUS_QUEUES 
  SET LAST_DEQUEUE_TIME = MGMT_GLOBAL.SYSDATE_UTC 
  WHERE QNAME = :B1
Again a lot of Cluster waits with some Concurrency. The specific wait events looked quite familiar to the previous case. Here several sessions are fighting for the same block to do an update there. As the consistency of the block needs to be communicated clusterwide, and the block itself needs to travel back & forth even cache-fusion and modern Exadata hardware can only help so much. 
A quick check on the table EM_EVENT_BUS_QUEUES_E (the one behind the view EM_EVENT_BUS_QUEUES) showed there are only 89 rows in it (at that time, for this specific system). This leads to the idea to provide each session with direct access to the QUEUE name it is searching for (in the hope the same QUEUE is not processed by different sessions at the same time). 
Doing so is simple: 
ALTER TABLE SYSMAN.EM_EVENT_BUS_QUEUES_E PCTFREE 99; 
ALTER TABLE SYSMAN.EM_EVENT_BUS_QUEUES_E MOVE ONLINE;
This wastes a little bit of space, but contention on the blocks is eliminated. 
Again there were several statements and objects which could be threatened by this method. 
AAS went down to ~35. 

Next top statement is slightly different: 
SQL_ID: 747d8puab2h4u
BEGIN 
  DBMS_APPLICATION_INFO.SET_ACTION(:1 ); 
  EM_LOADERJOB.dequeu_job(:2 ,:3 ,:4 ,:5 ,:6 ,:7 ); 
END;
The wait events are still cluster related. But PL/SQL itself does rarely cause such events, and I was lazy digging into the dequeu_job procedure. But lazyness pays off if it leads me to a solution: in my case a simple google search for 747d8puab2h4u leads to the MOS document 
the solution provided there did not help, but some more search - this time within MOS - led to 
and further to 
and at last 
EM13c : Setting Up RAC Services In Repository Database (Doc ID 2693517.1)
There several tasks are explained, but based on my specific issue, I only focussed on one: Loaderjob service
To implement this specific suggestion, I created a dedicated service on the database with only 1 active instance (don't forget to create this service afterwards on the dataguard, or your next switchover test will become a disaster). 
Then this service was configured to the OMS:
emctl set property -name "oracle.sysman.core.pbs.gcloader.connectDescriptor" -value "\(DESCRIPTION=\(ADDRESS_LIST=\(ADDRESS=\(PROTOCOL=TCP\)\(HOST=<primary racinstance>\)\(PORT=1521\)\)\)\(CONNECT_DATA=\(SERVICE_NAME=...loaderjob\)\)\)"
 Of course, the connection string needs to contain both SCANs, in case of switchover tests ... 
This requires the OMS to restart.
After even one OMS restarted, the AAS ... was still at ~25, but the cluster waits mostly disappeared.

The last statement to improve was a SELECT: 
SQL_ID: dc3p6u6d0uvzn
select 1 
from "SYSMAN"."AQ$_EM_LOADERJOB_Q_TBL_L" 
where msgid = :1 
  and subscriber# = :2 
  and name = :3 
  and address# = :4
This SELECT is only ON CPU - at least no clusterwaits but still something worth to improve, if possible. 
The important observation here was the number of rows in this table. At the time of investigation it was 1. At the same time the segment used significant more blocks than required for 1 row. This ON CPU did just one thing: checking some blocks in buffercache IF there are any rows. And these checks were done at a high pace which created the still significant load. 
The solution for this last task was quite simple: 
alter table "SYSMAN"."AQ$_EM_LOADERJOB_Q_TBL_L" MOVE online;   
(shrink space might also have done the trick)
After that last step, the AAS was down to ~2.



These improvements in performance (or maybe more important, reduction in resource consumption) did take some time - and I did not show all the dead ends in between. 
There is also no silver bullet at all. Just a simple focus on top resource consumers. Of course basic knowledge about the system helps, as well as minimal skills in using a search engine (both online available an don software vendors support page).  
In this specific case, tuning & diagnostic pack was licensed. But all the information used here was available in simple data dictionary and dynamic performance views, if required.