Hello,
After a DMA restart (because of a Windows update, for example), what should we check in the system to make sure it is fully available again (services, logs, etc.)?
Is there any documentation on the subject that you can provide?
Thank you.
Best regards
Bruno Sousaa
Hi Bruno,
Off the top of my head:
- Verify there are no (new) errors or notices in the Alarm Console.
- Verify all (critical) elements, services, and redundancy groups are available and active in the Surveyor.
- Verify the critical elements are successfully polling data (e.g. by opening the trending, confirming parameter updates, or looking at the Stream Viewer). It is also worth checking that the element logging shows no (new) errors.
- If this is a cluster, verify in System Center > Agents that all agents are running.
- If this is a Failover system, verify that the Failover status (right-click the agent name) is green.
- Verify the SLWatchdog2.txt log file does not contain any RTEs (run-time errors). Note that these only appear after 15 minutes or more.
Other than this, you can verify that the most commonly used apps (e.g. Booking Manager, Resources, ...) are working as expected, but this will depend heavily on your system.
Good list. I'd also include a check of the Microsoft element polling the restarted DMA: in production environments, it's worth monitoring a few key processes (e.g. the SL* ones) and keeping an eye on the trends for server KPIs such as CPU and VM size. It usually helps to have a view containing all the DMAs, so that the relevant info is aggregated in one place.
Besides SLWatchdog2.txt, SLErrors.txt is also a good place to look.
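
If you want to script the log check mentioned in this thread, something along the lines of the Python sketch below could work. It is only an illustration, not an official tool: the log folder (C:\Skyline DataMiner\Logging), the search pattern and the function names are assumptions on my side, so adjust them to what you actually see in your own SLWatchdog2.txt. The same helper can be pointed at SLErrors.txt with a pattern of your choice.

```python
import re
from pathlib import Path

# Assumption: default DataMiner log folder; adjust to your installation.
LOG_DIR = Path(r"C:\Skyline DataMiner\Logging")

# Assumption: RTE entries can be recognised by the standalone word "RTE" or
# the phrase "run-time error". The exact wording in your logs may differ.
RTE_PATTERN = re.compile(r"\bRTE\b|run-time error", re.IGNORECASE)


def find_matching_lines(lines, pattern=RTE_PATTERN):
    """Return (line_number, text) for every line that matches the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip()))
    return hits


def scan_log(name, log_dir=LOG_DIR, pattern=RTE_PATTERN):
    """Scan one log file; return None when the file does not exist."""
    path = log_dir / name
    if not path.is_file():
        return None
    # errors="replace" keeps the scan going if the file has odd bytes in it.
    with path.open(encoding="utf-8", errors="replace") as handle:
        return find_matching_lines(handle, pattern)


if __name__ == "__main__":
    hits = scan_log("SLWatchdog2.txt")
    if hits is None:
        print("SLWatchdog2.txt not found, check LOG_DIR")
    else:
        print(f"SLWatchdog2.txt: {len(hits)} matching line(s)")
        for number, text in hits:
            print(f"  line {number}: {text}")
```

Since RTEs only show up after 15 minutes or more, it makes sense to run this a while after the restart rather than immediately.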