Samstag, 9. Mai 2009

odyssey of a session - Part 2

An hour after I finished my work on explaining and supporting one poor session, I got a call from my honoured colleague again.
'The session stopped doing any sorts in v$session_longops and also no more segments on the indices are written. What's going on?'
My first guess was 'Did it just finish?', but that was not the case. The table still seemed to be empty (so no commit yet). The session was still active, but SQL_HASH_VALUE was NULL. Also sampling v$session_wait showed diferent events, so the process was doing anything. We just did not know, what.
Even I didn't know, what's going on, I knew something is going on. I couldn't pin it down with the knowledge I have using plain oracle methods, so I decided to switch to another method. First OStackProf came to my mind, but unfortunately I was sitting on an Apple Computer at this moment, and I didn't want to care about VBS on Apple. Not a big deal at all, because after a short crawling through Tanles box of marvelous miracles, I grabbed os_explain.
Let's have a look what the process is doing.
After running
pstack 11460 | ./os_explain
some times (5 or 6 where enough) it was clear there is no ordinary execution plan running. We got 2 stacks ( with little changes in the end, but this didn't count). Putting these side by side for easier camparison:
ksupop                  ksupop      
ksudlc ksudlc
ktcrsp ktcrsp
ktursp ktursp
ktubko ktubko
kcbgcur ktundo
kcbzib kqrpre
kcfrbd kqrpre1
ksfdread kslfre
I highlighted the first diference:
kcbgcur vs. ktundo.
To decipher these names (ok, in this particular case it wasn't too difficult, even for me) we had a look in Note:175982.1 - ORA-600 Lookup Error Categories. kcbgcur will be more or less Kernel Cache Buffer Current (don't know about the g) and ktundo is about undo and rollback management.
This still didn't answer all my questions, but what can a session be doing if it doesn't execute a SQL, (there is no sign for PL/SQL, java or similar currently executed by the session) and there is a lot of work performed in Cache and Undo? The next idea was: Rollback!
v$transaction came into my mind. And a little bit googling led me to this query:

SELECT a.sid, a.username, b.xidusn, b.used_urec, b.used_ublk
FROM v$session a, v$transaction b
WHERE a.saddr = b.ses_addr;

Yes, USED_UREC was decreasing! So the poor session, which worked for mor than 12 hours now, did a rollback.
Some phone calls later, the operations-guy who started the whole script told us 'Yes, about noon there where some hanging situations on my PC, but Toad looks ok again, I'm just waiting for the script to finish'.
Poor guy, and lucky me for enabling Dead connection Detection in sqlnet.ora. Otherwise the whole statement would have run until the end and started the rollback THEN.

This was the last chapter in the odyssey of my little poor session. After the rollback finished, it just closed (as the client connection died before).

Some words at the end?
I assume, the script which caused the problem was never tested properly.
Even with limited knowledge, a more or less systematic aproach is possible.
There is a lot of free information and tools out there. Just try to know about them before you need them.
I should try to learn more about sessions in rollback.
No fancy Oracle-features where needed. No statspack, no AWR (as the DB is 9i - did I mention that?). Not even dtrace.
It was a great day for me, working with a DB again and not only talking about DBs in endless meetings.

1 Kommentar:

Tanel Poder hat gesagt…

Yep ktursp means Kernel Transaction Undo Rollabck to SavePoint, which happens when a database call fails and has to be rolled back (to the implicit savepoint which was created when that call started).

Snapper would have probably shown that this session was doing rollback - by incrementing this statistic: "rollback changes - undo records applied"

Btw, the first fundamental principle of systematic troubleshooting is "Understand first, then fix". The exact tools you use are of secondary importance, but you can save time/effort by knowing the best tool for given problem. I always start from snapper (or plain query against v$session_wait) and only rarely need to look into process stack, Oracle's built instrumentation is usually good enough...