www.joostdelijser.com
SPECTRUM PROTECT / TSM

TSM for VMware and Hyper-V Datamovers monitoring

For quite some time I had been looking for a simple, efficient, and centralized way to monitor whether TSM datamovers were alive and refreshing their
schedules with the TSM / Spectrum Protect server. When a datamover instance hangs, it typically causes all of the next day's VM backups to be missed.
If it happens in the middle of a schedule, the large number of missed VMs will alert you that it is hanging. But more often than not, a datamover scheduler service hangs on a
single problematic VM, or on a timeout with the hypervisor, usually near the end of its backup window. With only one or a few VMs failed or missed,
you don't suspect that it may be hanging until the next day, when all VMs for that datamover instance are missed.
So I dug into the TSM tables and logs to see if there was any way to detect whether the scheduler was alive and refreshing its schedule with the TSM server.

I found that in the actlog table, the scheduler service of a datamover instance identifies itself to the TSM server as TDP for VMware or Hyper-V,
whereas the Client Acceptor Daemon identifies as a Baclient. This allows us to monitor datamover instances by their schedule refresh times.
The default refresh interval of the Client Acceptor Daemon (which restarts the scheduler service and initiates a new refresh with TSM) is 4 hours
above a certain version (8.1.?); earlier versions refresh at 8-hour intervals. So by checking whether the datamover's scheduler service has recently contacted
the TSM server, we can tell if it's alive or hanging. I've used this monitoring for over 6 months now, and it's really effective: it detects
all hanging datamovers and avoids next-day missed VM backups in both VMware and Hyper-V environments.

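To check a single datamover instance, use the TSM SQL command below. Message number 406 is ANR0406I, the "session started" message, and the 5-hour window covers one 4-hour refresh interval plus some margin: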

select message from actlog where message like '%<YOURDATAMOVERNAME>%TDP%' and DATE_TIME>current_timestamp-5 hours and msgno='406' and severity='I' 

If you have only a few datamover instances, you can easily add this query to a daily Operations Center (OC) report mail, and run it at a time when you are sure the backup window is done.

If you get a "no match found" for a datamover, then it is either hanging or long running (which is also good to know).
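
The single-datamover check could be wrapped in a small shell function along these lines. This is only a sketch under assumptions: the variable names DSMADMC, TSM_ID and TSM_PW, the "monitor" admin ID and the function names are made up for illustration, and it assumes the message text returned from actlog starts with its ANR0406I prefix.

```shell
#!/bin/sh
# Sketch only: DSMADMC, TSM_ID, TSM_PW and both function names are
# illustrative assumptions, not something taken from the article.
DSMADMC="${DSMADMC:-dsmadmc}"
TSM_ID="${TSM_ID:-monitor}"     # hypothetical admin ID
TSM_PW="${TSM_PW:-}"

# Build the actlog query from the article for one datamover name ($1)
# and a look-back window in hours ($2).
build_query() {
    printf "select message from actlog where message like '%%%s%%TDP%%' and DATE_TIME>current_timestamp-%s hours and msgno='406' and severity='I'" "$1" "$2"
}

# Return 0 if at least one matching session-start message came back.
# Searching for ANR0406I avoids having to recognise the "no match found"
# reply; this assumes the message column includes its ANR prefix.
datamover_alive() {
    "$DSMADMC" -id="$TSM_ID" -password="$TSM_PW" -dataonly=yes \
        "$(build_query "$1" "${2:-5}")" 2>/dev/null | grep -q 'ANR0406I'
}
```

A call such as datamover_alive DM_ESX01 5 || echo "DM_ESX01 may be hanging" would then flag a silent datamover; DM_ESX01 is an invented node name.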

You can check the TSM server to see if the datamover still has any active sessions, or check its local schedule log to see if the timestamp is recent.
If not, it is hanging, and you have to stop the scheduler service, and restart the client acceptor daemon to make it refresh its schedule.
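
On a Linux datamover, the "is the schedule log recent" check could be scripted roughly as follows. This is a sketch, not a fixed recipe: the function name and the 60-minute default are assumptions, the log path depends on how each datamover instance is configured, and find -mmin is a GNU/BSD extension rather than strict POSIX.

```shell
#!/bin/sh
# Sketch only: log_is_fresh and its default threshold are assumptions.
# $1 = path to the schedule log of one datamover instance
# $2 = maximum age in minutes (default 60)
# Returns 0 if the log was modified within the threshold,
#         1 if it is older, 2 if the file does not exist.
log_is_fresh() {
    log=$1
    max_age_min=${2:-60}
    [ -f "$log" ] || return 2
    # find prints the path only when the file was modified less than
    # max_age_min minutes ago, so empty output means the log is stale.
    [ -n "$(find "$log" -mmin "-$max_age_min" 2>/dev/null)" ]
}
```

For example, log_is_fresh /path/to/dsmsched.log 120 || echo "schedule log is stale", where the path is a placeholder for the instance's own schedule log.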

To avoid false positives for long-running datamovers, you may have to increase the
'-5 hours' part of the command, or change the time of day at which you check.


If, as in our environment, you have a lot of datamovers, you can create a shell script that launches this TSM command, for instance hourly, on your various servers.
Optionally, you can add the datamovers to a node group and query the names of the datamovers from that node group,
for instance:

select node_name from nodes where nodegroup='DATAMOVERS'

Output that to a text file, and then use something like

while IFS= read -r line; do

to launch the first select command for every datamover in the list.
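
Filled out, such a loop might look like the sketch below. It is an illustration under assumptions rather than the exact script used here: check_datamover, check_all, the DSMADMC / TSM_ID / TSM_PW variables and the "monitor" admin ID are invented names, and the check again assumes the actlog message text carries its ANR0406I prefix.

```shell
#!/bin/sh
# Sketch only: all function and variable names here are assumptions.

# Run the article's actlog query for one datamover ($1) and window ($2 hours).
# stdin is redirected so dsmadmc cannot swallow the list being read below.
check_datamover() {
    query="select message from actlog where message like '%$1%TDP%' and DATE_TIME>current_timestamp-$2 hours and msgno='406' and severity='I'"
    "${DSMADMC:-dsmadmc}" -id="${TSM_ID:-monitor}" -password="${TSM_PW:-}" \
        -dataonly=yes "$query" </dev/null 2>/dev/null | grep -q 'ANR0406I'
}

# Check every datamover named in a text file ($1), one name per line.
# Prints a warning per silent datamover and returns non-zero if any was found.
check_all() {
    list=$1
    window=${2:-5}
    hung=0
    while IFS= read -r line; do
        [ -n "$line" ] || continue          # skip blank lines
        if ! check_datamover "$line" "$window"; then
            echo "WARNING: no scheduler contact from $line in the last $window hours"
            hung=$((hung + 1))
        fi
    done < "$list"
    [ "$hung" -eq 0 ]
}
```

Calling check_all datamovers.txt 5 from cron every hour would then print one warning line per suspect datamover; datamovers.txt is a placeholder name for the file holding the node group query output.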