berxblog: Oracle

Posts mit dem Label Oracle werden angezeigt. Alle Posts anzeigen

2022-11-15

spfile parameters derived from cpu_count 11.1 up to 21

In comparison to previous versions sga_target is bigger again, now it's 12g (otherwise instances with higher ocunts wouldn't even start)

#!/usr/bin/ksh  -x

integer i=1
while ((i <= 255));
do
  print " $i ";

  echo "*.audit_file_dest='/_some_dir_/BX1/adump'
*.audit_trail='db'
*.compatible='21.0.0.0.0'
*.control_files='/u02/oradata/CDB1/control01.ctl','/u02/oradata/CDB1/control02.ctl'
*.db_block_size=8192
*.db_create_file_dest='/u02/oradata'
*.db_name='cdb1'
*.db_unique_name='cdb1'
*.diagnostic_dest='/u01/appl/oracle'
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=12g
*.enable_pluggable_database=true
*.undo_tablespace='UNDOTBS1'
*._disable_cpu_check='TRUE'
*.cpu_count=$i" > /_some_dir_/cpus/initBX1_$i.ora

  echo "shutdown abort;
startup pfile='/_some_dir_/cpus/initBX1_$i.ora' ;

insert into init_parameters
select '21.08'     ,
       $i         ,
       a.ksppinm  ,
       b.ksppstvl ,
       c.ksppstvl
from x\$ksppi a, x\$ksppcv b, x\$ksppsv c
where a.indx = b.indx and a.indx = c.indx;

commit;

select distinct(\"CPUs\") from init_parameters;

shutdown;
exit;"> cpus/21doit$i.sql

sqlplus "/ as sysdba" @cpus/21doit$i.sql >cpus/21log$i.log

sleep 5

  (( i = i + 1));
done

To get a good picture of the changes, feel free to select parameters, versions or whatever you want below.
Raw data is available at GistHub.
The js app is available here.
(It's based on an idea PerfSheet.js by Luca Canali.

As some of the values are not numbers, I applied this mapping:

TRUE	1
FALSE	0
OFF	0
ADAPTIVE	-1
INSTANCE	-2

2018-03-31

spfile parameters derived from cpu_count

In Oracle databases some spfile parameters are derived based on others, if they are not set explicit.
One of the main drivers I discovered is CPU_COUNT.
So I did some tests based in the past for version 11.1.0.7 and 12.1.0.2.
To simulate a machine with sufficient CPUs, I had to use a trick described in a previous post. Using _disable_cpu_check threw an ORA-7445:[ksucre] in 12.2 with cpu_count >126.

the script I used this time is basically (of course with 180 instead of 122 for 18.0).

#!/usr/bin/ksh  -x

integer i=1
while ((i <= 255));
do
  print " $i ";

  echo "*.audit_file_dest='/_some_dir_/BX1/adump'
*.audit_trail='db'
*.compatible='12.2.0.1.0'
*.control_files='/_some_dir_/BX1/controlfile/o1_mf_djd017z2_.ctl', '/_some_dir_/fast_recovery_area/BX1/BX1/controlfile/o1_mf_djd01834_.ctl'
*.db_block_size=8192
*.db_create_file_dest='/_some_dir_/oradata'
*.db_name='BX1'
*.db_recovery_file_dest='/_some_dir_/fast_recovery_area/BX1'
*.db_recovery_file_dest_size=7G
*.db_unique_name='BX1'
*.diagnostic_dest='/appl/oracle'
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=7g
*.undo_tablespace='UNDOTBS1'
*.max_string_size='EXTENDED'
#*._disable_cpu_check='TRUE'
*.cpu_count=$i" > /_some_dir_/cpus/initBX1_$i.ora

  echo "shutdown abort;
startup pfile='/_some_dir_/cpus/initBX1_$i.ora' ;

insert into init_parameters
select '12.2'     ,
       $i         ,
       a.ksppinm  ,
       b.ksppstvl ,
       c.ksppstvl
from x\$ksppi a, x\$ksppcv b, x\$ksppsv c
where a.indx = b.indx and a.indx = c.indx;

commit;

select distinct(\"CPUs\") from init_parameters;

shutdown;
exit;"> cpus/122doit$i.sql

sqlplus "/ as sysdba" @cpus/122doit$i.sql >cpus/122log$i.log

sleep 5

  (( i = i + 1));
done

	11.1.0.7	12.1.0.2	12.2.0.1	18.0.0.0
_cursor_db_buffers_pinned	x	x	x	x
_db_block_buffers	x	x	x	x
_db_block_lru_latches	x	x	x	x
__db_cache_size	x	x	x	x
_db_file_exec_read_count			x	x
db_file_multiblock_read_count			x	x
_db_handles		x	x	x
db_writer_processes	x	x	x	x
dml_locks		x	x	x
_enqueue_hash		x	x	x
_enqueue_hash_chain_latches	x	x	x	x
_enqueue_locks	x	x	x	x
_enqueue_resources		x	x	x
_flashback_generation_buffer_size	x	x	x	x
_gc_fast_index_split_wait			x
_gcs_min_cr_slaves			x	x
_kghdsidx_count	x	x
_hang_resolution_scope			x	x
_ksmsq_hintmaxproc				x
__large_pool_size		x	x	x
_lms_rollbacks			x	x
log_buffer	x	x	x	x
_log_parallelism_max	x	x	x	x
_log_simultaneous_copies	x	x	x	x
_max_log_write_parallelism			x	x
_max_outstanding_log_writes		x	x	x
_max_pending_scn_bcasts		x	x	x
_messages		x	x	x
_num_longop_child_latches	x	x	x	x
_num_rlslaves			x	x
parallel_max_servers	x	x	x	x
_parallel_min_message_pool	x	x	x	x
parallel_min_servers		x	x	x
parallel_servers_target		x	x	x
pga_aggregate_limit		x	x	x
_pmon_max_slaves			x	x
_pmon_slaves_arr_size			x	x
processes		x	x	x
_px_proactive_slave_alloc_threshold		x	x	x
_session_allocation_latches		x	x	x
sessions		x	x	x
sga_max_size			x	x
shared_pool_reserved_size		x	x	x
__shared_pool_size	x	x	x	x
_shared_server_num_queues	x	x	x	x
_small_table_threshold	x	x	x	x
_spin_count	x	x	x	x
_sqlmon_max_plan		x	x	x
_super_shared_conversion_threshold			x	x
transactions		x	x	x
_use_single_log_writer		x	x	x

TRUE	1
FALSE	0
OFF	0
ADAPTIVE	-1
INSTANCE	-2

2017-12-12

manage ORACLE_HOMEs and patches in a centralised way

Managing many ORACLE_HOMEs in a complex environment can be tricky, especially when questions about their patches & bugs or comparisons should be done.

It's required to understand the different types of patches available. This can be (according to the documentation & my interpretation). Especially the differences between SYSTEM PATCH (in our case only Bundle Patches), COMPOSITE and SINGLETON patches, and BUGs - of course.

A way to handle all the information is to use the details provided by opatch and the patches, store them in some tables and afterwards query them.

For ORACLE_HOME related information, all the information required can be generated by a simple opatch command:

${ORACLE_HOME}/OPatch/opatch lsinventory -bugs_fixed -xml ${bugs_fixed_file}

This then needs to be uploaded into a table. For this purpose we use a SQLcl inspired by Kris Rices SQLcl examples.

${SQLCL} "${conn_string}" << EOF
script
var fileName="${bugs_fixed_file}";
var theBlob = conn.createBlob();
var outBlob = theBlob.setBinaryStream(1);
var path = java.nio.file.FileSystems.getDefault().getPath(fileName);
theBlob.setBytes(1, java.nio.file.Files.readAllBytes(path));
// ctx.write("length: " + theBlob.length());
var HashMap = Java.type("java.util.HashMap");
bind_map = new HashMap();
bind_map.put("theblob", theBlob);
bind_map.put("name", "${name}");
bind_map.put("version", "${version}");
try {
var retval = util.execute("insert into oh_registry (name, version, bugs_fixed) values (:name, :version, blob2clob(:theblob))", bind_map);
// ctx.write(retval + "\n");
}
catch (e) {
ctx.write("\n\n ERROR:" + e + "\n\n");
}
sqlcl.setStmt("select name, version, dbms_lob.getlength(bugs_fixed) from oh_registry where name='${name}'");
sqlcl.run();
/
commit;
exit
EOF

(blob2clob is just a wrapper for DBMS_LOB.CONVERTTOCLOB)

With this information from several ORACLE_HOMEs, fancy stuff is possible with simple xml queries:

Which patches are different between 2 ORACLE_HOMEs?

select
    sub1.patchid as OH1
    , sub2.patchid as OH2
    , NVL2(sub1.patchid
         , NVL(sub1.patchdescription , 
                 (SELECT md.DESCRIPTION 
                  FROM ORACLE_PATCHES_MANUALDESC md
                  WHERE md.patchnum = sub1.patchid) 
              )
         , NVL(sub2.patchdescription ,
                 (SELECT md.DESCRIPTION 
                  FROM ORACLE_PATCHES_MANUALDESC md
                  WHERE md.patchnum = sub2.patchid)          
               )
          )
      description
from
    (select
        xml1.patchid
        , nvl(xml1.patchdescription, (select description from oracle_patches where patchnum=xml1.patchid)) patchdescription
    from
       oh_registry
       , XMLTable(
                '/InventoryInstance/patches/patch' PASSING xmltype(bugs_fixed) 
                COLUMNS  
                        patchID number path 'patchID'
                        , patchDescription varchar2(64) path 'patchDescription'
                ) xml1
    where 
        name='${home1}') sub1
    full outer join
    (select
        xml2.patchid 
        , nvl(xml2.patchdescription, (select description from oracle_patches where patchnum=xml2.patchid)) patchdescription
    from
       oh_registry
       , XMLTable(
                '/InventoryInstance/patches/patch' PASSING xmltype(bugs_fixed) 
                COLUMNS  
                        patchID number path 'patchID'
                        , patchDescription varchar2(64) path 'patchDescription'
                ) xml2
    where 
        name='${home2}') sub2 on (sub1.patchid = sub2.patchid)
where
    (sub1.patchid is null
    or sub2.patchid is null) or ( ${full_precidate} != 0)
order by
    sub1.patchid
    , sub2.patchid
;

which can show results like

rdbms_12102_Oct17BPJa_beta2 rdbms_12102_Oct17BPJa_beta5 DESCRIPTION
--------------------------- --------------------------- -------------------------------------------------------------------------------------
                   19450139                             KN LNX PERFORMANCE ISSUE WHEN RUNNING GATHER TABLE STATS WITH INCREMENTAL STATS
                                               21385422 12C DB ORA-46264 DURING DBMS_AUDIT_MGMT OPERATION
                                               26988490 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.171017 FOR BUGS 19450139 20807398

So between these 2 ORACLE_HOMEs, 1 patch disappeared and 2 new came in.

A similar query can provide the delta of BUGs:
select

sub1.bug as OH1
    , sub2.bug as OH2
    , case 
        when sub1.bug is null then sub2.description
        else sub1.description
    end description
from
    (select
        xml1.bug 
        , xml1.description 
    from
       oh_registry
       , XMLTable(
                '//bug' PASSING xmltype(bugs_fixed)
                COLUMNS
                        bug number path '@id'
                        , description varchar(64) path 'description'
                ) xml1
    where
        name='${home1}') sub1
    full outer join
    (select
        xml2.bug 
        , xml2.description
    from
       oh_registry
       , XMLTable(
                '//bug' PASSING xmltype(bugs_fixed)
                COLUMNS
                        bug number path '@id'
                        , description varchar(64) path 'description'
                ) xml2
    where
        name='${home2}') sub2 on (sub1.bug = sub2.bug)
where
    ( sub1.bug is null
    or sub2.bug is null ) or ( ${full_precidate} != 0)
order by
    sub1.bug
    , sub2.bug
;

rdbms_12102_Oct17BPJa_beta2 rdbms_12102_Oct17BPJa_beta5 DESCRIPTION
--------------------------- --------------------------- ----------------------------------------------------------------
                                               20807398 ORA-00600 [KGL-HASH-COLLISION] WITH FIX TO BUG 20465582
                                               21385422 12C DB ORA-46264 DURING DBMS_AUDIT_MGMT OPERATION
                                               21529241 DBMS_STATS ORA-06502  PL/SQL  NUMERIC OR VALUE ERROR

This shows quite good the difference between BUGs and Patches.

Also to query all ORACLE_HOMEs which contain a specific BUG is possible:
Inthis case, I'm querying for Bug:22652097

WITH dim_1
AS (SELECT name as OH_NAME, 
                xml1.patchid,
                Nvl(xml1.patchdescription, (SELECT description
                                            FROM   oracle_patches_manualdesc
                                            WHERE  patchnum = xml1.patchid))
                   patchdescription,
                bugs
         FROM   oh_registry,
                XMLTABLE( '/InventoryInstance/patches/patch' passing xmltype(bugs_fixed)
                COLUMNS
                    patchid NUMBER path 'patchID'
                  , bugs xmltype path 'bugs'
                  , patchdescription
                VARCHAR2(64) path 'patchDescription' ) xml1
--         WHERE  name = 'ebs_rdbms_12102_Apr17b_1'
         )
SELECT dim_1.OH_NAME as NAME,
       patchid,
       patchdescription,
       bug,
       bugdescription
FROM   dim_1,
       XMLTABLE( '//bug' passing bugs COLUMNS
           bug NUMBER path '@id'
         , bugdescription VARCHAR(64) path 'description' 
               )
WHERE bug= to_number('${bug}')
ORDER  BY dim_1.OH_NAME,
          patchid,
          bug;

NAME                                PATCHID PATCHDESCRIPTION                                                        BUG BUGDESCRIPTION
-------------------------------- ---------- ---------------------------------------------------------------- ---------- ----------------------------------------------------------------
rdbms_12102_Apr17BPJa              25929584 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170418 FOR BUGS 186   22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Apr17BPJb              25929584 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170418 FOR BUGS 186   22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Apr17BPJc              25929584 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170418 FOR BUGS 186   22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Apr17BPJd              25929584 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170418 FOR BUGS 186   22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Aug17BPJa_beta2        27011973 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170718 FOR BUGS 261   22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Jan17BPJa              25635590 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170117 FOR BUGS 189   22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Oct17BPJa              26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)                 22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Oct17BPJa_beta1        26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)                 22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Oct17BPJa_beta2        26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)                 22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Oct17BPJa_beta3        26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)                 22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Oct17BPJa_beta4        26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)                 22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Oct17BPJa_beta5        26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)                 22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS
rdbms_12102_Oct17BPJb_beta1        26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)                 22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATIS

So a lot of useful queries can be done when information regarding ORACLE_HOMEs is stored in a RDBMS.

Similar things can be done for patches as well. patches contain a lot of useful information. Those are stored in inventory.xml, actions.xml and if they exist in bundle.xml, and any README* files. If these files are loaded into tables similar to the method shown above, they can be used to generate even more insight. Some care must be taken as SYSTEM and COMPOSITE patches contain other patches which must be processed as well (together with their dependencies). We normalized the xml files to several tables during the load for easier/faster queries.

It's possible to search for all patches which will fix a given bug:

BUGNUM DESCRIPTION
---------- -----------------------------------------------------------------------------
  22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATISTICS FEATURES

single patches

  PATCHNUM DESCRIPTION
---------- ------------------------------------------------------------------------------
  26886035 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170418 FOR BUGS 21156276 22652097
  27011973 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170718 FOR BUGS 26165305 19450139
  25304579
  26022324
  26166813
  26816088
  25929584 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170418 FOR BUGS 18650065 18961555
  26165162
  22652097 PROVIDE SEPARATE CONTROLS FOR ADAPTIVE PLANS AND ADAPTIVE STATISTICS FEATURES
  25635590 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170117 FOR BUGS 18961555 19450139
  26165461
  26757390
  26898071 MERGE REQUEST ON TOP OF DATABASE BP 12.1.0.2.170718 FOR BUGS 21156276 22652097
  26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)

14 rows selected.

composite patches

  PATCHNUM DESCRIPTION
---------- ----------------------------------------------------------------------------------
  26717470 Database Bundle Patch : 12.1.0.2.171017 (26717470)

bundle patches

  PATCHNUM DESCRIPTION
---------- ----------------------------------------------------------------------------------
  26635880 OracleÂ® Database Patch 26635880 - Database Proactive Bundle Patch 12.1.0.2.171017

Or the difference between 2 merge patches:

26886035_BUGS 25929584_BUGS BUG_DESCRIPTION
------------- ------------- -------------------------------------------------------------------------------
     21156276               ORA-00600   [QESDPSIGERROR], [ADAPTIVE JOIN RESOLUTION FAILED], [394]
                   18961555 STATIC PL/SQL BASELINE REPRODUCTION BROKEN BY FIX 18020394
                   19450139 KN LNX PERFORMANCE ISSUE WHEN RUNNING GATHER TABLE STATS WITH INCREMENTAL STATS
                   20508819 WRONG RESULTS OR ORA-7445 WHEN _OPTIMIZER_AGGR_GROUPBY_ELIM=TRUE
                   20807398 ORA-00600 [KGL-HASH-COLLISION] WITH FIX TO BUG 20465582
                   21529241 DBMS_STATS ORA-06502  PL/SQL  NUMERIC OR VALUE ERROR
                   21826068 WRONG RESULTS STILL WHEN _OPTIMIZER_AGGR_GROUPBY_ELIM=TRUE
                   22324460 ORA-600 QKSANGETTEXTSTR 1 AND CORRUPT PLANS WITH SPM
                   22913528 WRONG RESULTS WITH PARTITION PRUNING AND MIN/MAX SCANS
                   18650065 WRONG RESULTS ON QUERY WITH SUBQUERY USING OR EXISTS
                   21171382 AUTO DOP COMPUTES A HIGH DOP UNNECESSARILY

Using the available information about ORACLE_HOMEs and Patches together it's even possible to check if a patch can be applied to an ORACLE_HOME, or where they conflict.

...
conflicts on:

ACTION          ARGUMENTS
--------------- ----------------------------------------------------------------------------------------------------
copy            %ORACLE_HOME%/rdbms/admin/prvtbpw.plb

1 row selected.

conflict details:

PATCHDESC                                          ACTION          ARGUMENTS
-------------------------------------------------- --------------- --------------------------------------------------
20925154 - ORA-39126  WORKER UNEXPECTED FATAL ERRO copy            %ORACLE_HOME%/rdbms/admin/prvtbpw.plb
25740076 -                                         copy            %ORACLE_HOME%/rdbms/admin/prvtbpw.plb

2 rows selected.

In this case the Patch:25740076 with

<oracle .rdbms.dbscripts="" opt_req="O" version="12.1.0.2.0">
  <copy file_name="rdbms/admin/prvtbpm.plb" name="prvtbpm.plb" path="%ORACLE_HOME%/rdbms/admin">
  <copy file_name="rdbms/admin/prvtkupc.plb" name="prvtkupc.plb" path="%ORACLE_HOME%/rdbms/admin">
  <copy file_name="rdbms/admin/prvtbpw.plb" name="prvtbpw.plb" path="%ORACLE_HOME%/rdbms/admin">

conflicts with the ORACLE_HOME which contains
<patchid>25740076</patchid>
whith

<patchactions>
  <patchaction><archive name="libserver12.a" object_name="lib/libserver12.a/kupv.o" path="%ORACLE_HOME%/lib"></archive></patchaction>
  <patchaction><make change_dir="%ORACLE_HOME%/rdbms/lib" make_file="ins_rdbms.mk" make_target="ioracle"></make></patchaction>
  <patchaction><copy file_name="rdbms/admin/prvtbpm.plb" name="prvtbpm.plb" path="%ORACLE_HOME%/rdbms/admin"></copy></patchaction>
  <patchaction><copy file_name="rdbms/admin/prvtkupc.plb" name="prvtkupc.plb" path="%ORACLE_HOME%/rdbms/admin"></copy></patchaction>
  <patchaction><copy file_name="rdbms/admin/prvtbpw.plb" name="prvtbpw.plb" path="%ORACLE_HOME%/rdbms/admin"></copy></patchaction>
...

My Oracle Support comes to the same conclusion in it's tool to check for conflicts.

2016-12-24

interconnect fragmentation kills the cluster

On a particular Oracle 2 node cluster (12.1) we faced random instances failing. Servicerequests at Oracle were open with limited result, as it was quite random and we could not link it to any trigger.
As it looked somehow like a communication problem between the 2 nodes, network team has checked the switches involved - without any outcome.
Even crashing instances were a problem already, it get worse one day when one node rebooted (according to the clusters alert.log and cssd.log due to network heartbeat issues) and then the clusterstack did not start anymore.

2016-12-12 03:35:34.203 [CLSECHO(54825)]CRS-10001: 12-Dec-16 03:35 AFD-9204: AFD device driver installed or loaded status: 'false' 
2016-12-12 09:17:25.698 [OSYSMOND(1247)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 1247 
2016-12-12 09:17:25.699 [CSSDAGENT(1248)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 1248 
2016-12-12 09:17:25.854 [OCSSD(1264)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 1264 
2016-12-12 09:17:26.899 [OCSSD(1264)]CRS-1713: CSSD daemon is started in hub mode 
2016-12-12 09:17:32.220 [OCSSD(1264)]CRS-1707: Lease acquisition for node yyy2 number 2 completed 
2016-12-12 09:17:33.280 [OCSSD(1264)]CRS-1605: CSSD voting file is online: ORCL:ASM_OCR_VOTE_1; details in /xxx/app/grid/diag/crs/yyy1/crs/trace/ocssd.trc. 
2016-12-12 09:17:33.289 [OCSSD(1264)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1. 
2016-12-12 09:27:25.925 [CSSDAGENT(1248)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:22951} in /xxx/app/grid/diag/crs/yyy2/crs/trace/ohasd_cssdagent_root.trc. 
2016-12-12 09:27:25.925 [OCSSD(1264)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc 
2016-12-12 09:27:25.926 [OCSSD(1264)]CRS-1603: CSSD on node yyy2 shutdown by user. 
Mon Dec 12 09:27:30 2016 
Errors in file /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc (incident=857): 
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] [] 
Incident details in: /xxx/app/grid/diag/crs/yyy/crs/incident/incdir_857/ocssd_i857.trc

CSS trace is filled with messages reporting no connectivity with node1:

2016-12-12 09:27:20.375584 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:20.585624 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:20.633907 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:20.633912 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:20.656587 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120596, LATS 232008644, lastSeqNo 11120595, uniqueness 1476197219, timestamp 1481534839/2789302712 
2016-12-12 09:27:20.868210 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:21.375702 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:21.585813 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:21.634038 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:21.634046 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:21.657538 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120597, LATS 232009644, lastSeqNo 11120596, uniqueness 1476197219, timestamp 1481534840/2789303712 
2016-12-12 09:27:21.868336 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:22.375830 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:22.586063 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:22.634195 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:22.634203 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet

After even more investigation on the Network another SR was filed.

Due to previous SRs oswatcher was installed already, and there we found the important information in the netstats segment:

zzz ***Fri Dec 9 14:54:54 GMT 2016 
Ip: 
13943376329 total packets received 
129843 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11934989273 incoming packets delivered 
11631767391 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498052793 reassemblies required 
494739589 packets reassembled ok 
353229 packet reassembles failed 
411073325 fragments received ok 
2109526776 fragments created

and after 2 minutes:

zzz ***Fri Dec 9 14:56:55 GMT 2016 
Ip: 
13943469180 total packets received 
129849 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11935067348 incoming packets delivered 
11631828206 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498069258 reassemblies required 
494741345 packets reassembled ok 
359542 packet reassembles failed 
411073565 fragments received ok 
2109528513 fragments created

The important part are the 6313 packet reassembles failed. In comparison to 16465 reassemblies required.

This led to some notes which describe both our symptoms (instance and cluster stack failure)

RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)

and

The CRSD is Intermediate State and Not Joining to the Cluster (Doc ID 2168576.1)

Reassembly happens when the sender wants so send more data than fits into a single packet. In this cluster the MTU size is 1500 - and in our example we had 16465 datagrams which needed to be reassembled, but 6131 failed. There are some variables in the Linux kernel, they can affect the buffer used in kernel to reassembly fragmented datagrams.

The solution for our system was to increase 2 parameters:

net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640

These can be changed in the running system in

/proc/sys/net/ipv4/ipfrag_low_thresh
/proc/sys/net/ipv4/ipfrag_high_thresh

and for persistent changes in sysctl.conf

Unfortunately these parameters were not mentioned in any of the prerequisit scripts I found.

With all these knowledge, we identified an important difference to other clusters: This one is the only with MTU 1500 - so much more fragmented packages needed carehere.

After the issue itself was solved, I wondered if it can be found on a vanilla 12.1 crs installation.
(vanilla in comparison to our setup where oswatcher was installed due to the first SRs).
Yes, our beloved -MGMTDB holds the information already! It's in the documentation as well (Troubleshooting Oracle Clusterware) and in the output of oclumon dumpnodeview I can see
IPReasFail - Number of failures detected by the IPv4 reassembly algorithm

Node: yyy1 Clock: '16-11-26 06.55.27 Etc/GMT' SerialNo:443 
NICS: 
bond0 netrr: 159.021 netwr: 181.688 neteff: 340.709 nicerrors: 0 pktsin: 412 pktsout: 358 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 412 innonunicast: 0 type: PUBLIC 
lo netrr: 37.722 netwr: 37.722 neteff: 75.443 nicerrors: 0 pktsin: 95 pktsout: 95 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 95 innonunicast: 0 type: PUBLIC 
bond1 netrr: 2350.313 netwr: 42989.510 neteff: 45339.823 nicerrors: 0 pktsin: 1927 pktsout: 31345 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 1927 innonunicast: 0 type: PRIVATE 
PROTOCOL ERRORS: 
IPHdrErr: 0 IPAddrErr: 102203 IPUnkProto: 0 IPReasFail: 59886 IPFragFail: 0 TCPFailedConn: 12598 TCPEstRst: 335559 TCPRetraSeg: 67276584 UDPUnkPort: 40134 UDPRcvErr: 0

Unfortunately the format is kind of clumsy - I will need to dig into it's tables for a better output - especially for quick but powerful reports during problems.

During my research, I discovered it's not an oracle-only problem, others are affected as well (and provide a great description).

2016-01-15

Instance parameters derived from cpu_count - 12.1.0.2

About 6 years ago I wanted to know which instance parameters are derived from cpu_count. So it tested a 11.1.0.7 DB - in that version there 21 parameters changed based on the value of cpu_count.
Some DB versions passed by so I decided it's time for another check. This time it's 12.1.0.2 without any PSUs/patches. The machine is the same class as previous, so it took some time.

the script is nearly the same today:

!/usr/bin/ksh  -x

integer i=1
while ((i <= 128));
do
  print " $i ";

  echo "*.audit_file_dest='/appl/oracle/admin/BERX3_GLOBAL/adump'
*.audit_trail='db'
*.compatible='12.1.0.2.0'
*.control_files='/appl/oracle/oradata/BERX3_GLOBAL/controlfile/o1_mf_c9h3bxlw_.ctl','/appl/oracle/fast_recovery_area/BERX3_GLOBAL/controlfile/o1_mf_c9h3bxpg_.ctl'
*.db_block_size=8192
*.db_create_file_dest='/appl/oracle/oradata'
*.db_name='BERX3_GL'
*.db_recovery_file_dest='/appl/oracle/fast_recovery_area'
*.db_recovery_file_dest_size=4815m
*.db_unique_name='BERX3_GLOBAL'
*.diagnostic_dest='/appl/oracle'
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=13g
*.undo_tablespace='UNDOTBS1'
*.cpu_count=$i" > /appl/home/oracle/instance_parameters/cpus/initBERX2_$i.ora

  echo "startup pfile='/appl/home/oracle/instance_parameters/cpus/initBERX2_$i.ora' ;

insert into init_parameters
select $i         ,
       a.ksppinm  ,
       b.ksppstvl ,
       c.ksppstvl
from x\$ksppi a, x\$ksppcv b, x\$ksppsv c
where a.indx = b.indx and a.indx = c.indx;

commit;

select distinct(\"CPUs\") from init_parameters;

shutdown;

exit;"> cpus/doit$i.sql

sqlplus "/ as sysdba" @cpus/doit$i.sql >cpus/log$i.log

sleep 5

  (( i = i + 1));
done

And all the values interesting is available again.
There are 40 parameters now changing value based on cpu_count - so a lot more of things which will change, when you "only" add/remove CPUs or alter instance caging.

Here are the graphs for 12.1.0.2 - and 11.1.0.7 if they exist already. Some look quite different.

__db_cache_size