Freitag, 7. Juli 2017

The Slow Tires

Once there was a man with a car. On this car he had 4 tires.
As the car is a modern one it has a nice board computer which collects many measurements. One is the rotation per minute of the tires.
One day it took more time for the man to get home than usually. So he decided to check his car's computer for any values which could lead to that delay.
After some minutes searching he made an alarming observation: This days average rotations per minute of the tires were much lower than the values of the day before. So he did a quick check if he found some errors in the computers log, but as everything seemed ok he was puzzled.
The shop where he bought the tires was still open and so he went there and ask a technician to check if the tires were ok, as they were slower today than yesterday. A technician checked correct pressure and even made sure the tire balancing was ok.
But there still was the lower rotation per minute for this day, and no explanation.
Next to the tire retailer there was a garage, which owner was friends with the technician of the tire shop. He agreed to check the car, if there was anything he could find about this days slow tires. But there wasn't anything. No errors, motor control all ok, the shock absorbers were even less utilised this day than the previous day.
During the investigation our man tried to remember if anything has been different to the previous day. He remembered some things: the day before he had lunch, which he skipped on this day to get home early. And even the coke in the cup holder was empty. So in theory the car should even be lighter than the day before. He told the two engineers about his memories, but they didn't care about some decagram changes in weight. At least the mechanic confirmed the average rotation was slower on this day. It matched the slower rotation in the gearbox, something which was not available for a customer in the cars memory. So the measurement was correct.
All three men were puzzled. But the mechanic had a friend in highway administration who had access to the highways measurements. Maybe there wa a technical issue there?
But the highway was in good shape, quite similar as the day before. Nothing was broken, all signs working ok. No accident reported. Asphalt temperature was slightly higher than the day before, but perfect within the tires comfort zone. The mechanic got one more information from his friend: Highway administration was more pleased about this day, as the utilisation was higher when the man went home on this day compared to his yesterdays commute, but still within acceptable boundaries to avoid heavy complains.
Now the men had checked virtually everything which came to their mind: tires, many components of the car, highway. Everything fine, no errors or accidents, totally within acceptable boundaries.
Big data, big confusion.
The tire retailer grabbed his smartphone in frustration to check new gossip in the app of his preferred local radio station. By accident he hit the traffic tab with his calloused fingers and saw a nice histogram of the traffic in highway segments around the city. At the travel time of our man on this day there were much more cars in his highway segment than at the day before. He showed this nice graph to the others and slowly it all made sense:
They all know from own experience, more cars on the highway will lead to slower average speed at some filling level. Traffic is still smooth but the individual just can't do high speed anymore, regardless of legal limitations or police.
Of course at this point the utilization increases, as from the highway administration perspective, there are more tires on the road now. Even when each one rotates slightly slower, the total sum of all of them is higher than a single car at highest speed ever could create.
Our man just could not drive as fast as yesterday as many other cars were in his way.
And the tires could not rotate faster for exactly that reason.
With all this insight our mechanic once again called his friend and asked him if he know a way to improve the situation. The answer was simple:
Build a dedicated highway for you. Never share it with anyone. Not even with your best friend. IT will go perfect most of the time. But the one day you want to do the surprise birthday party, exactly this friend will be on the highway ahead of you and block all your best intended efforts.

I'm sorry for all who love the english language - if you want to translate this to proper english, please do so and inform me, I'll remove this content and link to yours!


Sonntag, 25. Juni 2017

Hints, up and down

Last week was AOUG conference 2017. There I attended Lothar Flatz' Any Hint, anywhere. There Lothar mentioned it's not required to have hints in the first / topmost SELECT. Even the sentence itself was clear for me, the consequences and possibilities were not at that time.
During the presentation and some discussion with Lothar afterwards, some possibilities were shown.



Preparation:


drop table T1;
create table T1 as 
select rownum as rn, 'A' as const from dual connect by rownum < 1000000;

create unique index T1I1 on T1 (rn);
create index T1I2 on T1 (const);

update T1 set const='B' where rn=42;
create view V1 as select * from T1;


SQLs with direct hints:


select /*+ qb_name(test1) */ const from V1
where rn = 42;
with plan

Plan hash value: 4265153043
 
-----------------------------------------------------
| Id  | Operation                   | Name | E-Rows |
-----------------------------------------------------
|   0 | SELECT STATEMENT            |      |        |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1   |      1 |
|*  2 |   INDEX UNIQUE SCAN         | T1I1 |      1 |
-----------------------------------------------------

and with another query
select /*+ qb_name(test2) */ avg(rn) average
from V1 
where const='A';
Plan hash value: 3724264953
 
--------------------------------------------
| Id  | Operation          | Name | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |        |
|   1 |  SORT AGGREGATE    |      |      1 |
|*  2 |   TABLE ACCESS FULL| T1   |    999K|
--------------------------------------------

the hints work like a charm:
select /*+ qb_name(test1) FULL(@"SEL$63695B56" "T1") */ const from V1
where rn = 42;

Plan hash value: 3617692013
 
-------------------------------------------
| Id  | Operation         | Name | E-Rows |
-------------------------------------------
|   0 | SELECT STATEMENT  |      |        |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |
-------------------------------------------

and
select /*+ qb_name(test2) INDEX(@"SEL$0E4D0AA2" "T1"@"SEL$1" ("T1"."CONST")) */ avg(rn) average
from V1 
where const='A';
Plan hash value: 2287954029
 
--------------------------------------------------------------
| Id  | Operation                            | Name | E-Rows |
--------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |        |
|   1 |  SORT AGGREGATE                      |      |      1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1   |    999K|
|*  3 |    INDEX RANGE SCAN                  | T1I2 |    999K|
--------------------------------------------------------------


Hints hidden in the view:


When I "hide" the hints inside the view definition:
create or replace view V1 as 
select /*+ FULL(@"SEL$63695B56" "T1") INDEX(@"SEL$0E4D0AA2" "T1"@"SEL$1" ("T1"."CONST")) */ * from T1;

they still work:

select /*+ qb_name(test1) */ const from V1
where rn = 42;

Plan hash value: 3617692013
 
-------------------------------------------
| Id  | Operation         | Name | E-Rows |
-------------------------------------------
|   0 | SELECT STATEMENT  |      |        |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |
-------------------------------------------

and

select /*+ qb_name(test2) */ avg(rn) average
from V1 
where const='A';

Plan hash value: 2287954029
 
--------------------------------------------------------------
| Id  | Operation                            | Name | E-Rows |
--------------------------------------------------------------
|   0 | SELECT STATEMENT                     |      |        |
|   1 |  SORT AGGREGATE                      |      |      1 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1   |    999K|
|*  3 |    INDEX RANGE SCAN                  | T1I2 |    999K|
--------------------------------------------------------------


When you can edit only parts of the query:

Sometimes an application allows only changes on parts of the query, like additional filter in the WHERE clause:
(let's reset the view:)
create or replace view V1 as select * from T1;

As the hint can only be in the SELECT part, we need to create an "artificial" select:
select /*+ qb_name(test1) */ const from V1
where rn = 42
  and 1 = (select /*+ FULL(@"SEL$63695B56" "T1") */1 from dual);

so it works as well:
Plan hash value: 317588836
 
--------------------------------------------
| Id  | Operation          | Name | E-Rows |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |        |
|*  1 |  FILTER            |      |        |
|*  2 |   TABLE ACCESS FULL| T1   |      1 |
|   3 |   FAST DUAL        |      |      1 |
--------------------------------------------


Conclusion:

There are many ways to place a hint, so as long as you can manipulate any part of the query, ther eis a chance to inject a hint.

Samstag, 24. Juni 2017

Method R Profiler 7.0 available

I'm seldom doing product announcements on my blog. So there must be a reason why I do not follow this rule.

Method-R recently released a new version of it's powerful SQL-Trace profiler.
Of course there are many improvements on the profiler engine itself, but as the prodict was very stable already they will only help in rare edge cases, most people would not benefit dramatically.

But there is a new featuere which brings a lot of added value to the tool and improves time to repair/fix/enlightenment for many readers of the trace file:
They added an explanatory text to many sections.
Here an example (from their sample file - slightly edited to better fit the size of this blog):

Your experience duration was dominated by “cell single block physical read” and “CPU: FETCH dbcalls” calls. These are all high-productivity calls; their dominance means that the Oracle Database is working hard for your application. Now you should determine whether all of that work is actually necessary. Consult the Profile by Cursor and Profile by Statement sections for the next step of your diagnosis.
SubroutineDurationCallsDuration per call (seconds)
secondsRmeanminskewmax
1cell single block physical read4.64469.9%6,8990.00070.0005▁▁▁█▁▁▁▁▁▁▁0.028
2CPU: FETCH dbcalls2.38135.8%7,9470.00030.0000▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁█▁▁2.071

Unfortunately I did not manage to copy all the nice formatting into this blog, so i created a screenshot to show the original picture.

As you can see there is a new text with more in detail explanation about the Subroutine and where to go next for deeper analysis. I see this as a big advantage for everyone who does not work with SQL-Trace files every day. (well, even for those it's helpful).

There is a very nice demo version available, so if you have a trace file to analyze and want to check, if you see added value by this suggestion, give it a try.

Beside the new Profiler, Method-R has redone their web page as well (and I expect more news to follow). It's worth to invest some minutes and have a glimpse.

Dienstag, 11. April 2017

search for the missing ORA-04068

Oracle Active Dataguard is a great software for offloading read only activities from the primary nodes to nodes which is not utilized anyhow (at least during non-disaster times).
A very good example for doing so is regular extraction of data for processing in a data warehouse. That's also the case which led to this post.

The setting is just slightly more complex: both, primary and standby site are RACs - of course when you plan to switchover/failover, they should be somehow similar. (And in case of a disaster, it's planned to failover an disable all not-so-important activities; warehouse extract would be one of those).

Also the offloaded activities are slightly more complex than a simple query. In this case PL/SQL code is included. According to the documentation, that's not an issue at all - it's still read only. But in this DB (11.2.0.4), it was an issue: Sometimes the result was not as expected.

Now I first need to define the expectation: in this special case, the ERROR is the expectation, and no error the wrong result. Whenever a package (with global variables) is changed in a session, all other sessions which initialized the package before the change, but called it afterwards, must get

04068, 00000, "existing state of packages%s%s%s has been discarded"
// *Cause:  One of errors 4060 - 4067 when attempt to execute a stored
//          procedure.
// *Action: Try again after proper re-initialization of any application's
//          state.

Then the application can catch this error, re-initiate the package and continue.
that's how it should be.

But we sometimes had strange results in the test environment. After some investigation, we found it and I simplified it to this Testcase:

Testcase: 
(results are only shown when of any interest) 
INST1_ ... Instance1
INST2_ ... Instance2
PRI ... Primary DB 
ADG ... Active DataGuard 
it's important the sessions are not closed within the test!

prepare user:
SYS@INST1_PRI:
=============
-- create the user 
create user SR_TEST1 identified by "xxx" default tablespace users temporary tablespace temp; 
grant connect, create session, create procedure, create table to SR_TEST1; 
alter user SR_TEST1 quota unlimited on users; 

SESSION1@INST1_PRI:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 

create table tab1 (id number, ver varchar2(30)); 
insert into tab1 values (1, '_'); 
commit; 

create or replace PACKAGE PACK AS 

first_load date := to_date('3333-01-01 01:01:01', 'YYYY-MM-DD HH24:MI:SS'); 

PROCEDURE proc; 

END PACK; 
/ 

CREATE OR REPLACE 
PACKAGE BODY PACK AS 

PROCEDURE proc AS 
BEGIN 
if first_load > sysdate then 
first_load := sysdate; 
end if; 
DBMS_OUTPUT.PUT_LINE('1: loaded first on ' || to_char(first_load,'YYYY-MM-DD HH24:MI:SS') ); 
NULL; 
END proc; 

END PACK; 
/ 
update tab1 set ver='1' where id=1; 
commit; 

create or replace procedure proc1 
is 
begin 
SR_TEST1.pack.proc; 
end; 
/ 

-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 
select * from tab1; 

> 1: loaded first on 2017-03-15 09:52:00 
> 
> PL/SQL procedure successfully completed. 
> 
select * from tab1; 
> 
> ID VER 
>---------- ------------------------------ 
> 1 1 

SESSION2@INST2_PRI:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 

-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 
select * from tab1; 

> 1: loaded first on 2017-03-15 09:53:00 
> 
> PL/SQL procedure successfully completed. 
> 
select * from tab1; 
> 
> ID VER 
>---------- ------------------------------ 
> 1 1 

SESSION3@INST1_ADG:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 
-- here the redo apply is running
select sid from v$session where program like '%MRP0%';

>       SID
>----------
>         2


-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

> 1: loaded first on 2017-03-15 09:53:21 
> 
> PL/SQL procedure successfully completed. 
> 
> select * from tab1; 
> 
> ID VER 
> ---------- ------------------------------ 
> 1 1 

SESSION4@INST1_ADG:
==================
-- connect & create objects 
connect SR_TEST1/"xxx" 
set serveroutput on 
-- NO redo apply is running hereselect inst_id, sid from gv$session where program like '%MRP0%';

> no rows selected

-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

> 1: loaded first on 2017-03-15 09:54:00 
> 
> PL/SQL procedure successfully completed. 
> 
> select * from tab1; 
> 
> ID VER 
> ---------- ------------------------------ 
> 1 1 

SESSION1@INST1_PRI:
==================
-- change package body 

CREATE OR REPLACE 
PACKAGE BODY PACK AS 

PROCEDURE proc AS 
BEGIN 
if first_load > sysdate then 
first_load := sysdate; 
end if; 
DBMS_OUTPUT.PUT_LINE('2: loaded first on ' || to_char(first_load,'YYYY-MM-DD HH24:MI:SS') ); 
NULL; 
END proc; 

END PACK; 
/ 
update tab1 set ver='2' where id=1; 
commit; 


-- test 
set serveroutput on 
exec SR_TEST1.proc1 
select * from tab1; 

> 2: loaded first on 2017-03-15 09:55:51 
> 
> PL/SQL procedure successfully completed. 
> 
> 
> ID VER 
> ---------- ------------------------------ 
> 1 2 

SESSION2@INST2_PRI:
==================
-- test 
set serveroutput on 
exec SR_TEST1.proc1 
SR_TEST1@EBSSID051 > BEGIN SR_TEST1.proc1; END; 

* 
ERROR at line 1: 
ORA-04068: existing state of packages has been discarded 
ORA-04061: existing state of package body "SR_TEST1.PACK" has been invalidated 
ORA-04065: not executed, altered or dropped package body "SR_TEST1.PACK" 
ORA-06508: PL/SQL: could not find program unit being called: "SR_TEST1.PACK" 
ORA-06512: at "SR_TEST1.PROC1", line 4 
ORA-06512: at line 1 


> select * from tab1; 

ID VER 
---------- ------------------------------ 
1 2 

-- THIS Is the expected result 

SESSION3@INST1_ADG:
==================
-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

* 
ERROR at line 1: 
ORA-04068: existing state of packages has been discarded 
ORA-04061: existing state of package body "SR_TEST1.PACK" has been invalidated 
ORA-04065: not executed, altered or dropped package body "SR_TEST1.PACK" 
ORA-06508: PL/SQL: could not find program unit being called: "SR_TEST1.PACK" 
ORA-06512: at "SR_TEST1.PROC1", line 4 
ORA-06512: at line 1 

select * from tab1; 

ID VER 
---------- ------------------------------ 
1 2 

SESSION4@INST1_ADG:
==================
-- test & initiate 
set serveroutput on 
exec SR_TEST1.proc1 

1: loaded first on 2017-03-15 09:54:00 

PL/SQL procedure successfully completed. 

select * from tab1; 

ID VER 
---------- ------------------------------ 
1 2 
-- HERE you see the update on the table is applied on ADG already, but the procedure output is still prefixed with 1: - not 2: as it should be.

SESSION4@INST2_ADG:
==================
-- to check: the package is there! 
set pages 99 
select text from all_source where name = 'PACK' 
and type = 'PACKAGE BODY' 
3 order by line; 

TEXT 
------------------------
PACKAGE BODY PACK AS 

PROCEDURE proc AS 
BEGIN 
if first_load > sysdate then 
first_load := sysdate; 
end if; 
DBMS_OUTPUT.PUT_LINE('2: loaded first on ' || to_char(first_load,'YYYY-MM-DD HH24:MI:SS') ); 

NULL; 
END proc; 

END PACK; 

12 rows selected. 

Even the testcase is clear and simple, it was not that easy to identify the root cause in a more complex life environment. Special thanks to Andy Sayer who helped me to sort and refine my ideas over twitter.

With this testcase I was able to open a SR at Oracle, and after some some Support-ping-pong I got a useful information:
It is known as Bug: 18357555: ADD SUPPORT FOR KGLIV_BRNULPALL TO KGLCLUSTERMESSAGE() / KQLMBIVG()

I'm only aware of one Patch:25808607 (for PSU:11.2.0.4.160419 - together with some other patches) which is available right now. But you can ask for your own, if you have an Active Dataguard and RAC, and Packages with global variables.

Mittwoch, 5. April 2017

Can I flush the shared pool of all RAC Instances in an Active DataGuard?

That's a good question.

And the answer is simple: Yes, but NO!

Well, ok; that's not very helpful. But whenever is anything simple in modern IT?

So I'll show why Yes, it's possible:

first as SYS create a dummy user without special permissions and a function to drop the shared pool:

create user berx identified by xxx;
grant connect to berx;
grant select on v_$instance to berx;
    
create or replace function sys.flush_shared_pool_bx return varchar2
as
begin
        execute immediate 'alter system flush shared_pool';
  return 'Flush_Shared_Pool executed on ' || sysdate || ' by ' || sys_context('USERENV', 'SESSION_USER');

end;
/
grant execute on sys.flush_shared_pool_bx  to public;

SELECT * FROM TABLE(gv$(CURSOR(
  SELECT INSTANCE_NUMBER, sys.flush_shared_pool_bx flush_result
  FROM v$instance)));  

Unfortunately I can not use the gv$ function as user berx directly.

So a view does the trick:
create view v_flush_shared_pool_bx
as 
SELECT * FROM TABLE(gv$(CURSOR(
  SELECT INSTANCE_NUMBER, sys.flush_shared_pool_bx flush_result
  FROM v$instance))); 

grant select on v_flush_shared_pool_bx to public;

create or replace
function f_gv_flush
return DBMS_DEBUG_VC2COLL
PIPELINED 
IS
  res varchar2(120);
   cursor c1 is
     SELECT INSTANCE_NUMBER || ' - ' || FLUSH_RESULT as in_result
 FROM TABLE(gv$(CURSOR(
  SELECT INSTANCE_NUMBER, sys.flush_shared_pool_bx flush_result
  FROM v$instance)));
begin
  FOR res_row in c1
  LOOP
      pipe row (to_char(res_row.in_result)) ;
  END LOOP;
  return;
end;
/

and now it's so simple to flush on all instances:

connect berx/

select * from sys.v_flush_shared_pool_bx;
  
INSTANCE_NUMBER FLUSH_RESULT
--------------- --------------------------------------------------
              1 Flush_Shared_Pool executed on 10-APR-17 by BERX
              3 Flush_Shared_Pool executed on 10-APR-17 by BERX
              2 Flush_Shared_Pool executed on 10-APR-17 by BERX


or

select * from table(sys.f_gv_flush);

COLUMN_VALUE
-----------------------------------------------------
1 - Flush_Shared_Pool executed on 10-APR-17 by BERX
2 - Flush_Shared_Pool executed on 10-APR-17 by BERX
3 - Flush_Shared_Pool executed on 10-APR-17 by BERX

So obviously the answer if I an flush shared pool in all RAC instances - even in Active DataGuard is: Yes.

But at the same time the answer is NO, as I'm using the gv$ function - and this is not documented, so it's not wise to use it.

If it was not about an Active DataGuard, I'd use a scheduler job to run the flush shared_pool on all instances, but as the ADG is read only, I can't use it.

If you have another idea, how to do this (only with oracle means, no cronjob or similar), please let me know!

Mittwoch, 8. März 2017

force connections to use SERVICE_NAME

During the setup of a project I had a small observation: The Oracle Database is running on a 4 node RAC, but we (DBAs) wanted to provide only 2 instances to the application team.

So we created a dedicated service with Preferred instances: inst1,inst2 and Available instances: inst3.inst4

But after some minutes I saw sessions running on all 4 instances. Those sessions used the Service SYS$USERS - so they managed to "guess" the SID and created a connetionstring with SID, not SERVICE_NAME.

We contacted them and asked to use the connection string we provided (with SERVICE_NAME) which was no problem at all.

But the question remains - how to avoid connections with SID?

One idea was to create a after logon on database trigger and there check for special allowed users, hosts, or whatever. This seems fine, but there are situations, where SYS$USERS are used by default even from users, which should not use it for connections. One example are scheduler jobs. this can be solved by changing the SERVICE of DEFAULT_JOB_CLASS to the desired service:
DBMS_SCHEDULER.SET_ATTRIBUTE ('DEFAULT_JOB_CLASS','SERVICE','PREFERRED_SERVICE');

This requires additional actions and care - and a good documentation to avoid future issues.

As my initial question was to stop connections using SID through the listener, not interfere with generic bequeath connections. So I focused on the listener:

Even the SID is registered automatically when an instance registers to the listener, it can be specified manually as well.
So I tried to abuse this feature a little bit.

My config:
SID_LIST_LISTENER=
  (SID_LIST=
    (SID_DESC=
      (ORACLE_HOME=/tmp/berx1)
      (SID_NAME=ORCLSID))
)

but the listener searches for an oracle binary in this fake ORACLE_HOME
TNS-01201: Listener cannot find executable /tmp/berx1/bin/oracle for SID EDWP123
so I created a binary there:
ln -s /bin/false  /tmp/berx1/bin/oracle
and now the listener started fine.

A connection attempt using the SID fails now with
ERROR:
ORA-12537: TNS:connection closed

and the listener.log shows the errors
08-MAR-2017 11:14:20 * (CONNECT_DATA=(SID=ORCLSID)(CID=(PROGRAM=sqlplus)(HOST=myhost)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=1.2.3.4)(PORT=50714)) * establish * ORCLSID * 12518
TNS-12518: TNS:listener could not hand off client connection
 TNS-12547: TNS:lost contact
  TNS-12560: TNS:protocol adapter error
   TNS-00517: Lost contact
    Linux Error: 32: Broken pipe

That's more or less a solution to my question.
Of course I do not recommend to implement it, at least not without proper testing.
Especially in a DataGuard setup a SID connection is required for some activities, so there  a dedicated listener for DataGuard activities is required in such a configuration.

Samstag, 24. Dezember 2016

interconnect fragmentation kills the cluster

On a particular Oracle 2 node cluster (12.1) we faced random instances failing. Servicerequests at Oracle were open with limited result, as it was quite random and we could not link it to any trigger.
As it looked somehow like a communication problem between the 2 nodes, network team has checked the switches involved - without any outcome.
Even crashing instances were a problem already, it get worse one day when one node rebooted (according to the clusters alert.log and cssd.log due to network heartbeat issues) and then the clusterstack did not start anymore.

2016-12-12 03:35:34.203 [CLSECHO(54825)]CRS-10001: 12-Dec-16 03:35 AFD-9204: AFD device driver installed or loaded status: 'false' 
2016-12-12 09:17:25.698 [OSYSMOND(1247)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 1247 
2016-12-12 09:17:25.699 [CSSDAGENT(1248)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 1248 
2016-12-12 09:17:25.854 [OCSSD(1264)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 1264 
2016-12-12 09:17:26.899 [OCSSD(1264)]CRS-1713: CSSD daemon is started in hub mode 
2016-12-12 09:17:32.220 [OCSSD(1264)]CRS-1707: Lease acquisition for node yyy2 number 2 completed 
2016-12-12 09:17:33.280 [OCSSD(1264)]CRS-1605: CSSD voting file is online: ORCL:ASM_OCR_VOTE_1; details in /xxx/app/grid/diag/crs/yyy1/crs/trace/ocssd.trc. 
2016-12-12 09:17:33.289 [OCSSD(1264)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1. 
2016-12-12 09:27:25.925 [CSSDAGENT(1248)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:22951} in /xxx/app/grid/diag/crs/yyy2/crs/trace/ohasd_cssdagent_root.trc. 
2016-12-12 09:27:25.925 [OCSSD(1264)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc 
2016-12-12 09:27:25.926 [OCSSD(1264)]CRS-1603: CSSD on node yyy2 shutdown by user. 
Mon Dec 12 09:27:30 2016 
Errors in file /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc (incident=857): 
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] [] 
Incident details in: /xxx/app/grid/diag/crs/yyy/crs/incident/incdir_857/ocssd_i857.trc 

CSS trace is filled with messages reporting no connectivity with node1:
2016-12-12 09:27:20.375584 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:20.585624 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:20.633907 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:20.633912 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:20.656587 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120596, LATS 232008644, lastSeqNo 11120595, uniqueness 1476197219, timestamp 1481534839/2789302712 
2016-12-12 09:27:20.868210 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:21.375702 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:21.585813 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:21.634038 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:21.634046 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:21.657538 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120597, LATS 232009644, lastSeqNo 11120596, uniqueness 1476197219, timestamp 1481534840/2789303712 
2016-12-12 09:27:21.868336 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:22.375830 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:22.586063 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:22.634195 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:22.634203 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 

After even more investigation on the Network another SR was filed.

Due to previous SRs oswatcher was installed already, and there we found the important information in the netstats segment:

zzz ***Fri Dec 9 14:54:54 GMT 2016 
Ip: 
13943376329 total packets received 
129843 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11934989273 incoming packets delivered 
11631767391 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498052793 reassemblies required 
494739589 packets reassembled ok 
353229 packet reassembles failed 
411073325 fragments received ok 
2109526776 fragments created 

and after 2 minutes:
zzz ***Fri Dec 9 14:56:55 GMT 2016 
Ip: 
13943469180 total packets received 
129849 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11935067348 incoming packets delivered 
11631828206 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498069258 reassemblies required 
494741345 packets reassembled ok 
359542 packet reassembles failed 
411073565 fragments received ok 
2109528513 fragments created 

The important part are the 6313 packet reassembles failed. In comparison to 16465 reassemblies required.

This led to some notes which describe both our symptoms (instance and cluster stack failure)

RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)

and

The CRSD is Intermediate State and Not Joining to the Cluster (Doc ID 2168576.1)



Reassembly happens when the sender wants so send more data than fits into a single packet. In this cluster the MTU size is 1500 - and in our example we had 16465 datagrams which needed to be reassembled, but 6131 failed. There are some variables in the Linux kernel, they can affect the buffer used in kernel to reassembly fragmented datagrams.

The solution for our system was to increase 2 parameters:

net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640 

These can be changed in the running system in
/proc/sys/net/ipv4/ipfrag_low_thresh
/proc/sys/net/ipv4/ipfrag_high_thresh
and for persistent changes in sysctl.conf

Unfortunately these parameters were not mentioned in any of the prerequisit scripts I found.

With all these knowledge, we identified an important difference to other clusters: This one is the only with MTU 1500 - so much more fragmented packages needed carehere.

After the issue itself was solved, I wondered if it can be found on a vanilla 12.1 crs installation.
(vanilla in comparison to our setup where oswatcher was installed due to the first SRs).
Yes, our beloved -MGMTDB holds the information already! It's in the documentation as well (Troubleshooting Oracle Clusterware) and in the output of oclumon dumpnodeview I can see
IPReasFail - Number of failures detected by the IPv4 reassembly algorithm
Node: yyy1 Clock: '16-11-26 06.55.27 Etc/GMT' SerialNo:443 
NICS: 
bond0 netrr: 159.021 netwr: 181.688 neteff: 340.709 nicerrors: 0 pktsin: 412 pktsout: 358 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 412 innonunicast: 0 type: PUBLIC 
lo netrr: 37.722 netwr: 37.722 neteff: 75.443 nicerrors: 0 pktsin: 95 pktsout: 95 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 95 innonunicast: 0 type: PUBLIC 
bond1 netrr: 2350.313 netwr: 42989.510 neteff: 45339.823 nicerrors: 0 pktsin: 1927 pktsout: 31345 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 1927 innonunicast: 0 type: PRIVATE 
PROTOCOL ERRORS: 
IPHdrErr: 0 IPAddrErr: 102203 IPUnkProto: 0 IPReasFail: 59886 IPFragFail: 0 TCPFailedConn: 12598 TCPEstRst: 335559 TCPRetraSeg: 67276584 UDPUnkPort: 40134 UDPRcvErr: 0 

Unfortunately the format is kind of clumsy - I will need to dig into it's tables for a better output - especially for quick but powerful reports during problems.


During my research, I discovered it's not an oracle-only problem, others are affected as well (and provide a great description).