Automatic failover failed

Solved | 432 views | 28th January 2025 | failover ping-pong deadlock
Joerg Stumpf [DevOps Advocate] | 2nd October 2024 | 1 Comment

Hi Dojo

We have a DataMiner Failover setup (DMA-1 and DMA-2) in which DMA-1 is the active agent.
Today some SL* processes crashed on DMA-1, and DMA-1 restarted itself.

The VIP was released by the online DMA-1, but neither DMA-1 nor DMA-2 went online.

DMA-2 detected that DMA-1 had failed and reported the following in SLFailover.txt:
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|ActionsOnFirstFail|ERR|0|18|First Failure
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|18|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|18|Refreshed Failover Config
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshed Failover Config
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshed Failover Config
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|HandleSetDrsState|CRU|0|109|Ignored SetDrsState request by Redundancy on DMA-1 because all heartbeat paths are failing (prevent ping-pong)
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|DoFail|INF|0|18|Reached 6 failures via x.x.x.x.1 => checking whether DataMiner Failover state needs to change
SLNet.exe|NotifyMaxFailuresReached|CRU|0|18|GOING ONLINE (AUTOMATIC DATAMINER FAILOVER)
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Offline
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Preparing to go online
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Notifying buddy agent to go offline (DMA-2 (x.x.x.x.2) wants to go online)...
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Trying to notify other agent to go offline via one of x.x.x.x.1;x.x.x.x.1
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Failed setting state for 'x.x.x.x.1' to Offline: x.x.x.x.1 ignored request: All heartbeat paths are failing. Preventing ping-pong.
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Offline (couldn't force other agent to go offline)
SLNet.exe|DoSwitch|CRU|0|18|NOT switching state: failed to force buddy agent to offline
SLNet.exe|DoSwitch|CRU|0|18|Local agent = Offline

DMA-1 started and tried to connect to the running DMA-2, but reported the following in SLFailover.txt:

SLNet.exe|SendHeartBeat|CRU|0|77|Going online because partner x.x.x.x.2 is offline (partner lastonline: 2024-02-06 10:43:30 < local lastonline: 2024-02-12 09:35:45)
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Offline
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Preparing to go online
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Notifying buddy agent to go offline (DMA-1 (x.x.x.x.1) wants to go online)...
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Trying to notify other agent to go offline via one of x.x.x.x.2 (agent appears to be missing)
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Failed setting state for 'x.x.x.x.2' to Offline: x.x.x.x.2 ignored request: All heartbeat paths are failing. Preventing ping-pong.
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Offline (couldn't force other agent to go offline)
SLNet.exe|DoSwitch|CRU|0|77|NOT switching state: failed to force buddy agent to offline
SLNet.exe|DoSwitch|CRU|0|77|Local agent = Offline
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|77|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|77|Refreshed Failover Config
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|98|Previous check for x.x.x.x.2 (thread 77) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|98|Previous check for x.x.x.x.2 (thread 77) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|HandleSetDrsState|CRU|0|21|Ignored SetDrsState request by Redundancy on DMA-2 because all heartbeat paths are failing (prevent ping-pong)
SLNet.exe|SendHeartBeat|ERR|0|77|!!!WARNING!!! Agent 'x.x.x.x.2' is not able to correctly sync at the moment.
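
For anyone comparing the two excerpts above, here is a minimal Python sketch that splits such pipe-delimited lines and counts the entries per method and level. The field names (process, method, level, thread) are my own reading of the excerpts, not documented names, the meaning of the fourth field (always 0 here) is unknown to me, and `parse_line` and `summarize` are hypothetical helpers, not part of DataMiner.

```python
from collections import Counter
from typing import NamedTuple, Optional


class FailoverLogEntry(NamedTuple):
    timestamp: Optional[str]  # only set when the line starts with a date
    process: str              # assumed meaning, e.g. "SLNet.exe"
    method: str               # assumed meaning, e.g. "SendHeartBeat"
    level: str                # assumed meaning, e.g. "ERR", "INF", "CRU"
    thread: str               # assumed meaning, e.g. "137"
    message: str


def parse_line(line: str) -> Optional[FailoverLogEntry]:
    """Parse one pipe-delimited line; return None if it does not fit the layout."""
    parts = line.strip().split("|")
    timestamp = None
    # Some excerpts carry a leading timestamp field before the process name.
    if parts[0][:1].isdigit():
        timestamp, parts = parts[0], parts[1:]
    if len(parts) < 6:
        return None
    process, method, level, _unknown, thread = parts[:5]
    # The message itself may contain '|', so rejoin whatever is left.
    message = "|".join(parts[5:])
    return FailoverLogEntry(timestamp, process, method, level, thread, message)


def summarize(lines):
    """Count entries per (method, level) and collect the refused state changes."""
    counts = Counter()
    refused = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        counts[(entry.method, entry.level)] += 1
        if "ping-pong" in entry.message.lower():
            refused.append(entry)
    return counts, refused
```

Feeding the lines of both SLFailover.txt excerpts to `summarize` lists every request that was refused with the ping-pong message, which makes it easier to see that each agent refused the other's request.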

By the way, the log entry (partner lastonline: 2024-02-06 10:43:30 < local lastonline: 2024-02-12 09:35:45) is confusing: 2024-02-12 was the date of the last update of the "failover config" on DMA-1.

2024/02/12 11:11:38.791|SLNet.exe|ActionsOnFirstNonFail|INF|0|173|First success after failures
2024/02/12 11:11:38.791|SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|173|Refreshing Failover Config...
2024/02/12 11:11:38.792|SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|173|Refreshed Failover Config

Neither DMA acquired the VIP, so both DMAs stayed offline until I stopped DMA-2 and restarted DMA-1.

How can I prevent the Failover system from ending up in such a deadlock?

Joerg Stumpf [DevOps Advocate] Selected answer as best 28th January 2025
Marieke Goethals [SLC] [DevOps Catalyst] commented 28th January 2025

I see that this question has been inactive for some time. Do you still need help with this? If not, could you select the answer (using the ✓ icon) to indicate that the question is resolved?

1 Answer

Robin Devos [SLC] [DevOps Advocate] | Posted 15th October 2024 | 1 Comment

Hi Joerg

I'll need to make some assumptions and guesses, as I don't know the full details of how the Failover is configured (sync, heartbeat) and don't have all the log files (mainly SLNet and NATS).
Both agents indicate that "all heartbeat paths are failing (prevent ping-pong)".

In a normal scenario, when the offline agent detects a failing heartbeat on the online agent, it takes over.
Here, all heartbeats were failing, which means both agents could end up constantly switching state (i.e. ping-pong).
=> To prevent this, each agent remains in its current state, offline or online, which is what happened here.
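
To illustrate the behavior described above, here is a toy Python model. It is only a sketch based on the log messages in the question, not how DataMiner actually implements this; `Agent`, `handle_go_offline_request` and `try_take_over` are hypothetical names, and the single `all_heartbeats_failing` flag is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    online: bool
    all_heartbeats_failing: bool = False  # assumption: one flag per agent

    def handle_go_offline_request(self) -> bool:
        """Return True if this agent accepts a request to go offline."""
        if self.all_heartbeats_failing:
            # Ping-pong guard: refuse state changes while no heartbeat path works.
            return False
        self.online = False
        return True


def try_take_over(local: Agent, buddy: Agent) -> bool:
    """The local agent goes online only if the buddy agrees to go offline."""
    if not buddy.handle_go_offline_request():
        return False  # local agent keeps its current state
    local.online = True
    return True


# Situation from the question: DMA-1 has released the VIP and both agents
# report that all heartbeat paths are failing.
dma1 = Agent("DMA-1", online=False, all_heartbeats_failing=True)
dma2 = Agent("DMA-2", online=False, all_heartbeats_failing=True)

try_take_over(dma2, dma1)  # refused by DMA-1
try_take_over(dma1, dma2)  # refused by DMA-2
print(dma1.online, dma2.online)
```

In this model both agents stay offline because each one refuses the other's request, which matches the "NOT switching state: failed to force buddy agent to offline" entries on both sides. If the guard is not active on the buddy, the same call succeeds and the requesting agent goes online.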

I would advise you to re-evaluate the Failover configuration. The following links should help with this:

Preferred configuration using virtual IP addresses (best practice) | DataMiner Docs
Advanced Failover options | DataMiner Docs

Note - depending on the DMA version, it might also be the following known issue:
Failover Agents remain offline after upgrade | DataMiner Docs

Joerg Stumpf [DevOps Advocate] commented 28th January 2025

Thanks Robin

An upgrade to the current release resolved the issue.
I appreciate your support
