Hello,
After a DMA restart (because of a Windows update, for example), what should we check in the system to make sure it is fully available again (services, logs, ...)?
Is there any documentation on the subject that you can provide?
Thank you.
Best regards
Bruno Sousaa
Hi Bruno,
As far as I know, there is no specific documentation that covers a DataMiner restart. However, you could use the following MOP as a reference:
This document (together with the recommendations from other members of the community) can give you a baseline of what to check after restarting the DMA. On top of that, you will probably need to add extra checks specific to your integration.
Hi Bruno,
Off the top of my head:
- Verify there are no (new) errors or notices in the Alarm Console.
- Verify all (critical) elements, services, and redundancy groups are available and active in the Surveyor.
- Verify the critical elements are successfully polling data (e.g. by opening the trending, confirming parameter updates, or looking at the Stream Viewer). It is also a good idea to check that the element logging shows no (new) errors.
- If this is a cluster, verify in System Center > Agents that all agents are running.
- If this is a failover system, verify that the failover status (right-click the agent name) is green.
- Verify the SLWatchdog2.txt log file does not contain any RTEs (run-time errors). Note that these will only appear after 15 minutes or more.
Other than this, you can verify that the most commonly used apps (e.g. Booking Manager, Resources, ...) are working as expected. But this will heavily depend on your system.
Good list. I’d also include a check of the Microsoft Element polling the restarted DMA: in production environments, it’s worth monitoring a few key processes (e.g. the SL* ones) and keeping an eye on the trends for server KPIs, such as CPU, VM size and similar (it usually helps to have a view with all the DMAs to aggregate the relevant info).
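For a quick manual spot check on the server itself, the process part can be sketched in Python. This is only an illustration, not an official tool: it parses the output of the Windows `tasklist /FO CSV /NH` command and reports which expected processes are not running. The function name `missing_processes` and the list of expected process names are assumptions; adjust the list to the SL* processes that matter on your system.

```python
import csv
import io
import subprocess
import sys


def missing_processes(tasklist_csv, expected):
    """Return the expected image names that do not appear in tasklist CSV output.

    tasklist_csv is the text printed by 'tasklist /FO CSV /NH', where the
    first column of each row is the image name. Names are compared
    case-insensitively.
    """
    running = {
        row[0].lower()
        for row in csv.reader(io.StringIO(tasklist_csv))
        if row
    }
    return sorted(name for name in expected if name.lower() not in running)


if __name__ == "__main__":
    # Example list only (assumption): extend it with the processes you care about.
    expected = ["SLDataMiner.exe", "SLNet.exe", "SLWatchdog.exe"]
    if sys.platform != "win32":
        print("tasklist is only available on Windows")
    else:
        output = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH"],
            capture_output=True, text=True, check=True,
        ).stdout
        missing = missing_processes(output, expected)
        print("All expected processes are running" if not missing
              else "Not running: " + ", ".join(missing))
```

This only tells you whether a process exists at that moment; trending CPU and memory over time is still better done through the monitoring element.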
Besides SLWatchdog2.txt, SLErrors.txt is also a good place to look.
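The log checks can be partly scripted as well. Below is a minimal Python sketch that lists the lines of a log file containing a given marker. The folder `C:\Skyline DataMiner\Logging`, the marker string `RTE`, and the function name `find_marked_lines` are assumptions for illustration; the exact layout of the log entries is not described in this thread, so check the marker against a real log file before relying on it.

```python
from pathlib import Path


def find_marked_lines(log_path, marker, encoding="utf-8"):
    """Return (line_number, text) pairs for lines that contain marker.

    The match is a plain, case-sensitive substring test.
    """
    hits = []
    # errors="replace" keeps the scan going if the file has undecodable bytes.
    with Path(log_path).open(encoding=encoding, errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            if marker in text:
                hits.append((number, text))
    return hits


if __name__ == "__main__":
    # Assumed folder and marker: adjust both to your installation.
    log_file = Path(r"C:\Skyline DataMiner\Logging") / "SLWatchdog2.txt"
    if not log_file.exists():
        print(f"{log_file} not found")
    else:
        hits = find_marked_lines(log_file, "RTE")
        print(f"{log_file.name}: {len(hits)} matching line(s)")
        # Show only the most recent matches.
        for number, text in hits[-10:]:
            print(f"  {number}: {text}")
```

The same function can be pointed at SLErrors.txt with a different marker. Since run-time errors only show up after 15 minutes or more, it makes sense to run the check again some time after the restart.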