Saturday, 1 June 2019

deleting orphaned CommVault backups


In the current environment, CommVault is used as the central backup solution. The setup is unusual in that CommVault's scheduler is not used for any activities. All backups are scheduled by a separate system which also takes some infrastructure limitations into account. These include (among others): shared I/O resources of consolidated databases on a cluster, shared backup network resources of clusters, and shared media agents in different datacenters.
Retention is also handled by RMAN alone; the retention policy of CommVault's Storage Pools is set to infinite.
Unfortunately, for reasons that are not yet fully clear, there are some orphaned backup pieces in CommVault which are no longer visible in the RMAN catalog. They still need to be deleted to reclaim storage (and to reduce the number of entries in the precious Dedup DataBase).
RMAN cannot delete a backup piece it doesn't know about, so we must mimic its calls to the SBT_TAPE interface to send the delete request again.

First, we need the list of backup pieces from CommVault. This requires a query against CommVault's SQL Server database:
select cbi.clientname
     , cbi.instance
     , cbi.startdate
     , cbi.enddate
     , cbi.jobid
     , ora.archFileName
--   , obi.storagePolicy
--   , obi.isAged
--   , obi.Copyname
from [CommServ].[dbo].[CommCellBackupInfo] cbi
  join [CommServ].[dbo].[archFileOracle] ora on cbi.jobid = ora.jobId
  join [CommServ].[dbo].[CommCellOracleBackupInfo] obi on cbi.jobid = obi.jobid
where cbi.isAged = 0
  and obi.isAged = 0
  and cbi.enddate <= getdate()-60
--  and cbi.jobid = &jobid
order by cbi.clientname, cbi.instance, cbi.startdate

Some lines in this query are commented out; they can be useful for other questions. cbi.enddate is filtered to <= getdate()-60: RMAN retention is set to 40 days in this environment, and the query adds some spare days on top.
Every archFileName must first be cross-checked against RC_BACKUP_PIECE.HANDLE in the RMAN catalog. If the piece is still known to RMAN, there is no need to delete the corresponding CommVault job.
(A job in CommVault can contain several RMAN backup pieces, but the job is only deleted once RMAN has requested deletion of all of its backup pieces.)
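The cross-check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name orphaned_jobs is hypothetical, the inputs are assumed to be already exported from the two databases, and archFileName is assumed to match RC_BACKUP_PIECE.HANDLE exactly.

```python
from collections import defaultdict


def orphaned_jobs(commvault_rows, rman_handles):
    """Return the job IDs whose backup pieces are all unknown to RMAN.

    commvault_rows: iterable of (jobid, archFileName) pairs from the query above.
    rman_handles:   iterable of RC_BACKUP_PIECE.HANDLE values from the catalog.
    """
    known = set(rman_handles)

    # Group the backup pieces by the CommVault job that contains them.
    pieces_by_job = defaultdict(list)
    for job_id, arch_file_name in commvault_rows:
        pieces_by_job[job_id].append(arch_file_name)

    # A job is a candidate only if none of its pieces is still known to RMAN.
    return sorted(
        job_id
        for job_id, pieces in pieces_by_job.items()
        if not any(piece in known for piece in pieces)
    )
```

A job that still has at least one piece in the catalog is left alone, which matches the rule that a job can only go away as a whole.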

As I wrote earlier, a backup piece unknown to RMAN cannot be deleted by RMAN. There is no officially supported way to do so, but Bug 1419888 : BACKUP ON ERROR DOES NOT DELETE PIECES GENERATED contains a clue how it can be done:
connect / as sysdba;
declare
  dn varchar2(255);
begin
  -- allocate an SBT_TAPE device
  dn := dbms_backup_restore.deviceAllocate(type=>'sbt_tape', noio=>TRUE);
  -- the third argument is the handle of the backup piece
  dbms_backup_restore.deleteBackupPiece(0,0,'04c9av2p', 0,0,1);
  dbms_backup_restore.deviceDeallocate();
end;
/

DBMS_BACKUP_RESTORE.DeviceAllocate also has a params parameter, which can be used to pass exactly those SBT_TAPE parameters that would otherwise be set in an RMAN script.
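The PL/SQL block above deletes a single piece. For a list of orphaned handles, the same block could be generated rather than typed. The following is a minimal sketch: build_delete_block is a hypothetical helper, the calls are copied unchanged from the block above, and the trailing "/" assumes the text is run in SQL*Plus.

```python
import re

# Accept only characters commonly seen in backup piece handles (assumption).
_SAFE_HANDLE = re.compile(r"[A-Za-z0-9_.\-]+")


def build_delete_block(handles):
    """Return one anonymous PL/SQL block that deletes all given handles."""
    handles = list(handles)
    if not handles:
        raise ValueError("no handles given")

    lines = [
        "declare",
        "  dn varchar2(255);",
        "begin",
        "  dn := dbms_backup_restore.deviceAllocate(type=>'sbt_tape', noio=>TRUE);",
    ]
    for handle in handles:
        # Refuse anything that could break out of the quoted literal.
        if not _SAFE_HANDLE.fullmatch(handle):
            raise ValueError(f"unexpected characters in handle: {handle!r}")
        lines.append(
            f"  dbms_backup_restore.deleteBackupPiece(0,0,'{handle}', 0,0,1);"
        )
    lines += ["  dbms_backup_restore.deviceDeallocate();", "end;", "/"]
    return "\n".join(lines)
```

Printing build_delete_block(["04c9av2p"]) should give the same statements as the hand-written block.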


In a perfect world, this post would end here as the backup pieces would disappear.
Unfortunately it doesn't.

For some reason (which is not clear at all yet), these backup pieces still exist in CommVault. All entries in $ORACLE_HOME/rdbms/log/sbtio.log look fine:
440786 6b9d2 05/23 11:29:21 ### sbtinit: define trace file name=~~~ORACLE_HOME~~~/rdbms/log/sbtio.log
440786 6b9d2 05/23 11:29:21 ### sbtinit2: changing a trace level to: 0
440786 6b9d2 05/23 11:29:21 ### sbtinit2: API Client supports Oracle MM API Version: 2.0
440786 6b9d2 05/23 11:29:21 ### sbtinit2: Got argc=0.
440786 6b9d2 05/23 11:29:21 ### sbtinit2: exit - PID:440784, TID:440786; retVal=0
440786 6b9d2 05/23 11:29:21 ### sbtremove2: enter - PID:440784, TID:440786.
440786 6b9d2 05/23 11:29:21 ### sbtremove2: removing file: [~~~DBNAME~~~_1113069663_DF0_20181219_dbtl6i8l_1_1]...
440786 6b9d2 05/23 11:29:21 ### sbtremove2: exit - PID:440784, TID:440786; retVal=0
440786 6b9d2 05/23 11:29:21 ### sbtremove2: enter - PID:440784, TID:440786.
440786 6b9d2 05/23 11:29:21 ### sbtremove2: removing file: [~~~DBNAME~~~_1113069663_DF0_20181219_d6tl6i8g_1_1]...
440786 6b9d2 05/23 11:29:21 ### sbtremove2: exit - PID:440784, TID:440786; retVal=0
440786 6b9d2 05/23 11:29:21 ### sbtremove2: enter - PID:440784, TID:440786.
440786 6b9d2 05/23 11:29:21 ### sbtremove2: removing file: [~~~DBNAME~~~_1113069663_DF0_20181219_d8tl6i8i_1_1]...
440786 6b9d2 05/23 11:29:21 ### sbtremove2: exit - PID:440784, TID:440786; retVal=0
440786 6b9d2 05/23 11:29:21 ### sbtremove2: enter - PID:440784, TID:440786.
440786 6b9d2 05/23 11:29:21 ### sbtremove2: removing file: [~~~DBNAME~~~_1113069663_DF0_20181219_dbtl6i8l_2_1]...
440786 6b9d2 05/23 11:29:21 ### sbtremove2: exit - PID:440784, TID:440786; retVal=0
440786 6b9d2 05/23 11:29:21 ### CvHandleClean: before clean cv_handle=c2113170 ThreadNumber=1
440786 6b9d2 05/23 11:29:21 ### CvHandleClean: ThreadDeleteTidN [440786]
440786 6b9d2 05/23 11:29:21 ### sbtend: exit  - PID:440784, TID:440786; retVal=0 ThreadNumber=0>
<.........
440786 6b9d2 05/23 11:29:21 ### sbtend: ***** BEGIN SBT PERFORMANCE STATISTICS(SBTPS) for Client API Session: PID:440784, TID:440786   ********
(Apologies for the redacted entries; they are easy to spot by the ~~~ markers.)
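With many pieces, reading such a log by eye gets tedious. A small parser can pair each "removing file" line with the retVal of the following sbtremove2 exit line. This is a sketch that assumes the line layout shown in the excerpt above; removal_results is a hypothetical name.

```python
import re

# Patterns follow the sbtio.log lines shown above (layout is an assumption).
REMOVING = re.compile(r"### sbtremove2: removing file: \[?([^\]\s]+)\]")
EXIT = re.compile(r"### sbtremove2: exit .*retVal=(-?\d+)")


def removal_results(log_lines):
    """Return (handle, retVal) pairs for every sbtremove2 call in the log."""
    results = []
    pending = None  # handle seen in the most recent "removing file" line
    for line in log_lines:
        match = REMOVING.search(line)
        if match:
            pending = match.group(1)
            continue
        match = EXIT.search(line)
        if match and pending is not None:
            results.append((pending, int(match.group(1))))
            pending = None
    return results
```

Note that a retVal of 0 only says the SBT library accepted the request. As this post shows, it does not prove that the piece is gone from CommVault.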

So, back to the first query in this post, the one that runs on CommVault's SQL Server. It returns a JOBID as well.

As dbms_backup_restore.deleteBackupPiece and CommVault are no friends, I still need another way to get rid of the orphaned backups. CommVault has a command to delete a specific backup job:
/opt/commvault/Base64/qoperation agedata -delbyjobid -j <JOBID> -delJobsOnAllCopies -ft Q_DATA_LOG
Unfortunately, this command requires quite high privileges (being the owner/creator of the job is not sufficient):

(Full) Administrative Management permission. But that is another story.

Besides these complications, which will drive any security manager crazy, the tool itself still tries to make life harder than necessary. When deleting a specific job, these prompts must be answered:
This command deletes the job on all copies, do you want to continue (y/n) ? [n] : y
To confim this action, type "Delete jobs on all copies"  : Delete jobs on all copies

Apart from the typo in "confim", there is obviously no parameter to qoperation that says "yes, I'm an adult". It insists on asking (!?)
But for every problem there is a solution. In this case it's called expect. It can simulate practically any input-output conversation with an uncooperative program. Here, the expect script looks like this:
#!/usr/bin/expect -f
# the job ID is the first command line argument
set job_id [lindex $argv 0]

# wait indefinitely for each prompt
set timeout -1
spawn /opt/commvault/Base64/qoperation agedata -delbyjobid -j $job_id -delJobsOnAllCopies -ft Q_DATA_LOG
match_max 100000
# first prompt: confirm the deletion
expect -exact "This command deletes the job on all copies, do you want to continue (y/n) ? \[n\] : "
send -- "y\r"
# second prompt: type the confirmation phrase ("confim" is the tool's own spelling)
expect -exact "y\r\nTo confim this action, type \"Delete jobs on all copies\"  : "
send -- "Delete jobs on all copies\r"
expect eof
It takes the Job_ID as first parameter and steers qoperation through all its demands.
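To work through the whole list of orphaned jobs, the expect script can be called once per JOBID. The sketch below makes some assumptions: the script is saved as ./delete_job.exp (a hypothetical name) and is executable, and delete_jobs is a hypothetical helper. Because the script ends with a plain "expect eof", its exit status reflects expect itself and not necessarily the result reported by qoperation.

```python
import subprocess


def delete_jobs(job_ids, script="./delete_job.exp", run=subprocess.run):
    """Run the expect script once per job ID; return the IDs that failed."""
    # Validate everything first so that no job is deleted from a bad list.
    jobs = [str(job_id) for job_id in job_ids]
    for job in jobs:
        if not (job.isascii() and job.isdigit()):
            raise ValueError(f"job ID is not numeric: {job!r}")

    failed = []
    for job in jobs:
        # Arguments are passed as a list, so no shell quoting is involved.
        result = run([script, job])
        if result.returncode != 0:
            failed.append(job)
    return failed
```

The run parameter exists only to allow a dry run with a stand-in function before touching real backups.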

In the end, this post shows how to work around complications which should not exist, caused by problems which should not exist either.
This says a lot about IT in 2019.
