www.joostdelijser.com
SPECTRUM PROTECT / TSM
TSM for VMware and Hyper-V Datamovers monitoring
For quite some time I had been looking for a simple, efficient, and centralized way to monitor whether TSM datamovers were alive and refreshing their schedules with the TSM / Spectrum Protect server. When a datamover instance hangs, it typically causes all of the next day's VM backups to be missed.
If it happens in the middle of a schedule, the large number of missed VMs will alert you that it's hanging. But more often than not, a datamover scheduler service hangs on a single problematic VM or on a timeout with the hypervisor, usually near the end of its backup window. With only one or a few VMs failed or missed, you don't suspect that it may be hanging until the next day, when all VMs for that datamover instance are missed.
So I dug into the TSM tables and logs to see if there was any way to detect whether the scheduler was alive and refreshing its schedule with the TSM server.
I found that in the actlog table, the scheduler service of a datamover instance identifies itself to the TSM server as TDP for VMware or Hyper-V, whereas the Client Acceptor Daemon identifies as a Baclient. This allows us to monitor datamover instances by their schedule refresh times.
The default refresh interval of the Client Acceptor Daemons (which restart the scheduler service and initiate a new refresh with TSM) is 4 hours above a certain version (8.1.?); earlier versions refresh at 8-hour intervals. So by checking whether the datamover's scheduler service has recently contacted the TSM server, we can tell if it's alive or hanging. I've used this monitoring for over 6 months now, and it's really effective: it detects all hanging datamovers and avoids next-day missed VM backups in both VMware and Hyper-V environments.
To check a single datamover instance, use the TSM SQL command:
select message from actlog where message like '%<YOURDATAMOVERNAME>%TDP%' and
DATE_TIME>current_timestamp-5 hours and msgno='406' and
severity='I'
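To run this check from a script rather than by hand, a minimal sketch could look like the following. The function names `build_query` and `check_datamover` and the variables `DSMADMC`, `TSM_ID` and `TSM_PASSWORD` are made up for illustration. It assumes the `dsmadmc` administrative client is installed and that an admin ID with query rights exists. Instead of relying on a particular return code, it simply looks for "TDP" in the returned messages.

```shell
#!/bin/sh
# Sketch only: helper names and the DSMADMC / TSM_ID / TSM_PASSWORD
# variables are assumptions, not part of TSM itself.

# Build the actlog query for one datamover and a look-back window in hours.
build_query() {
    dm=$1
    hours=${2:-5}
    printf "select message from actlog where message like '%%%s%%TDP%%' and DATE_TIME>current_timestamp-%s hours and msgno='406' and severity='I'" "$dm" "$hours"
}

# Run the query through dsmadmc. Prints the matching messages and returns 0
# if the scheduler contacted the server in the window, returns 1 otherwise.
check_datamover() {
    # "|| true" keeps a non-zero dsmadmc exit from aborting under "set -e".
    out=$("${DSMADMC:-dsmadmc}" -id="$TSM_ID" -password="$TSM_PASSWORD" \
        -dataonly=yes "$(build_query "$1" "${2:-5}")" 2>/dev/null) || true
    case $out in
        *TDP*) printf '%s\n' "$out"; return 0 ;;
        *)     return 1 ;;
    esac
}

# Example (hypothetical node name):
#   TSM_ID=monitor TSM_PASSWORD=secret check_datamover DM_VMWARE_01 5 \
#       || echo "DM_VMWARE_01: no scheduler contact in the last 5 hours"
```

Passing the password on the command line makes it visible in the process list, so a dedicated low-privilege admin ID is probably a good idea here.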
If you have only a few datamover instances, you can easily add this query to a daily OC report mail and run it at a time when you are sure the backup window is done. If you get a "no match found" for a datamover, then it is either hanging or long running (which is also good to know).
You can check the TSM server to see if the datamover still has any active sessions, or check the local schedule log to see if the timestamp is recent.
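For datamovers running on Linux, the schedule log check can be scripted as well. This is only a sketch: the function name `log_is_fresh` is made up, the log path in the example is an assumption (use whatever your instance's schedule log option points to), and it relies on `find -mmin`, which GNU and BSD find support. It looks at the file's modification time rather than parsing the timestamps inside the log.

```shell
#!/bin/sh
# Sketch only: log_is_fresh is a hypothetical helper name.

# Succeeds if the given file was modified less than N minutes ago
# (default 300, matching the 5-hour window of the actlog query).
# Fails if the file is older or does not exist.
log_is_fresh() {
    log=$1
    minutes=${2:-300}
    [ -n "$(find "$log" -mmin "-$minutes" 2>/dev/null)" ]
}

# Example (assumed path):
#   log_is_fresh /opt/tivoli/tsm/client/ba/bin/dsmsched.log 300 \
#       || echo "schedule log has not been written to for 5 hours"
```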
If not, it is hanging, and you have to stop the scheduler service and restart the client acceptor daemon to make it refresh its schedule.
To avoid false positives for long-running datamovers, you may have to increase the '-5 hours' part of the command, or change the time of day at which you check.
If, like in our environment, you have a lot of datamovers, you can create a shell script to launch this TSM command, for instance hourly, on your various servers. Optionally, you can add the datamovers to a nodegroup and query the names of the datamovers from that nodegroup.
For instance:
select node_name from nodes where nodegroup='DATAMOVERS'
Output that to a text file, and then use something like
while IFS= read -r line; do
to launch the first select command for every datamover in the list.
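Spelled out, the loop could look roughly like this. It is a sketch under assumptions: `report_silent_datamovers` is a made-up name, the list file holds one datamover name per line, and the second argument is whatever command or function you use to run the first select for one datamover and exit non-zero when nothing is found.

```shell
#!/bin/sh
# Sketch only: report_silent_datamovers and the check command passed to it
# are hypothetical names.

# $1 = text file with one datamover name per line
# $2 = command that exits 0 if the given datamover has recently contacted
#      the TSM server, non-zero otherwise
# Prints the name of every datamover for which the check fails.
report_silent_datamovers() {
    list=$1
    check=$2
    while IFS= read -r line; do
        [ -n "$line" ] || continue    # skip blank lines
        # stdin is redirected so the check cannot swallow the rest of the list
        "$check" "$line" >/dev/null 2>&1 </dev/null || printf '%s\n' "$line"
    done < "$list"
}

# Example (file name and check command are placeholders):
#   report_silent_datamovers datamovers.txt my_check_command
```

Anything this prints is a datamover that is either hanging or long running, so the output can go straight into a mail or a monitoring alert.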