Masters disappear...randomly
Duncan Ellis
Posts: 162
Hi All
Anyone come across a situation where masters in a system look like they are offline - they are ping-able but no comms on 80/23/1319.
When you try to log in to the offending masters on telnet, you just get a blank screen and a cursor but no login prompt.
There are 22 masters in this system and they have been running fine for about 5 years then suddenly this started to happen 3 months ago. The masters are all NI700's plus 1 x NI4100. All up to date on their last firmware available (v3 not 4) and they are all password secured.
We are not in control of the network; we are on our own VLAN and our access is effectively local, via TeamViewer.
Any thoughts are welcome.....
Comments
Hi @Duncan Ellis , try grabbing some logs from the masters.
The best way to do so on the NI master is to establish a telnet connection and send the command "msg on all". Make sure your telnet session is logging; I would suggest using something like PuTTY.
Once you catch it happening in the logs, it will be a lot easier to tell what was going on up to the moment the master fell offline.
Another thing you could try doing, if your network team allows it, is to do a wireshark trace while mirroring the network port the master is connected to. This will capture all traffic coming to and from the master and the specific port it is tied into.
While this may not be your issue, it bears discussion in the same space....
We have often seen NIs fail to accept new IP connection requests, even while still servicing existing connections. It is usually due to some network irritant that appears to consume connection handles faster than the NetLinx master releases them. The typical scenario is outside hackers pummeling the system when the NI is at least partially exposed on the internet. You can watch the endless attempts in TELNET... with IP addresses from all over the world, at hundreds an hour - and only imagine how many more connections are being attempted on ports you can't easily watch.
We have also seen it occur when some local device is jittering on and offline rapidly for an extended time. This can be a defective or out-of-range panel, and can even be caused by duplicate addresses on panels. One of the duplicates will fail to connect due to the existing connection, but will keep retrying forever, until the NI runs out of resources... from which it generally cannot recover without a reboot. (Newest firmwares and the new NX processors do better, they say...)
AMX's original panel intercom module was good at causing this in large systems. The auto-discovery took long enough that many connections released and started over, until the whole IP system was flapping... then nothing.
You can cause and witness this effect by attempting to FTP a few hundred files in or out of the NI with the default settings on most FTP tools. The tool (like a browser) will attempt to move as many files simultaneously as possible, each one consuming an IP handle. What you see will be a "421 Error" and no more connections will be allowed. It might clear itself in a while, but sometimes needs a reboot. This is why you select FTP's "Limit Number of Simultaneous Connections" to "1" if you do file-intense NI work.
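For anyone scripting file transfers rather than using a GUI FTP tool, the same "one connection at a time" idea can be sketched roughly as below. This is only an illustration using Python's standard ftplib; the function name upload_one_at_a_time is made up for this example, and whether a given master's FTP server is happy with it is an assumption to verify on site.

```python
import os
from ftplib import FTP  # used in the usage example below


def upload_one_at_a_time(ftp, local_paths):
    """Send files sequentially over one already-open FTP session.

    ftplib transfers block until they finish, so only one data
    connection is open at any moment (the scripted equivalent of
    limiting simultaneous connections to 1).
    """
    sent = []
    for path in local_paths:
        name = os.path.basename(path)
        with open(path, "rb") as fh:
            # STOR uploads the file under its base name in the current remote directory.
            ftp.storbinary(f"STOR {name}", fh)
        sent.append(name)
    return sent
```

Usage would look something like opening one session with FTP("192.168.1.10"), calling login() with the master's credentials, and then calling upload_one_at_a_time(ftp, list_of_files). The address here is a placeholder, not one from this system.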
ALSO check in a multi-NI system to be certain you don't have slave-loops. If two masters each have each other in their connection list, they can spend a lot of IP time arguing over whose connection wins.
thanks guys
@John Nagy - the topology is all single links back to the main master and route mode is direct. I've been through all of the masters and there is only one link in each. Thanks for all of the info though - I think the network team have put a device on the network that constantly pings all devices. I have yet to prove that, but the yacht was in for its 5 year update - on which we had nothing to do - and the network had alterations, and the problem seems to stem from that. Thank you again!!!!
@Harman_quirk - logs are cleared by reboot though, aren't they? When this happens you can't get near the logs, and the masters that this happens to are random. I think I may have to use Wireshark, like you suggested, to find this. Thank you!!!!
A new network? Look for loops. They do wacky things. A clue is that often the outages will be on a regular interval. We found one that tanked the network every 11 minutes, and another that did 18 minute cycles. Inattention and too many wires resulted in multiple paths to the same switches. Boo.
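If you jot down the times the masters drop off, a quick check for a regular interval could look like the sketch below. It is only a rough illustration: the function names and the 10% tolerance are assumptions picked for the example, not anything from AMX or from this system.

```python
from datetime import datetime
from statistics import mean, pstdev


def outage_intervals(timestamps):
    """Return the gaps, in minutes, between consecutive outage start times."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]


def looks_periodic(timestamps, tolerance=0.1):
    """True if the gaps are nearly equal.

    `tolerance` is the allowed spread (standard deviation divided by
    the mean gap); 0.1 is an arbitrary starting point.
    """
    gaps = outage_intervals(timestamps)
    if len(gaps) < 2:
        return False  # need at least three outages to compare gaps
    avg = mean(gaps)
    return avg > 0 and pstdev(gaps) / avg < tolerance
```

Feed it datetime values taken from your notes or from a capture; a True result with a mean gap of, say, 11 or 18 minutes would point towards something cyclic on the network rather than a random fault on the masters.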
Not a new network as such, just some replacement switches and probably some other kit we are not privy to.
We are sat on our own VLAN and nothing has changed for us. So I'm going to disconnect us from their network for the Christmas trip, and the chief eng can reconnect us if they need support, and we can see what happens. If nothing happens, it can be lobbed back to IT for them to sort out.
You might want to do a port scan of the master when it's not working instead of a ping. The NIs communicate with each other on port 1319. If for some reason that port is blocked, the master will appear offline. But if it works at all, then I don't think this is the problem - as switches do not tend to block or unblock ports willy-nilly.
In days long past I would see this behavior a lot with cheaper Linksys WRxxxx routers/switches. The only solutions were to a) change out the router/switches or b) write the code to reboot the NIs on a regular basis.
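If a proper port scanner isn't to hand on the remote machine, a minimal TCP connect check of the three ports mentioned in this thread (23, 80 and 1319) might look like this sketch. The function name check_ports and the 2 second timeout are assumptions made for the example.

```python
import socket

# Ports named in the thread: telnet, HTTP, and the master-to-master port.
PORTS = (23, 80, 1319)


def check_ports(host, ports=PORTS, timeout=2.0):
    """Return {port: True/False}; True means a TCP connection was accepted."""
    results = {}
    for port in ports:
        try:
            # create_connection performs a full TCP handshake, then we close it.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            results[port] = False
    return results
```

Calling check_ports("192.168.1.10") (a placeholder address) returns something like a dict of port to True/False. Note that True only means the port accepted a connection; a telnet port that connects but never shows a login prompt would still report True, which in itself helps separate a blocked port from a hung service.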
Duncan, do let us know what resolves this for you.
Hi Guys
To answer Eric's response - 1319 is also blocked at this point and so is 80. It has only started happening since the yacht went in for its 5 year service/upgrade.
I still don't know what causes this, but it's something from the main network. I had the Chief Engineer disconnect us from the main network and everything stayed live. When you reconnect to the main network, it takes about 20 minutes and then masters start to drop off... I'll be speaking to the network admin in the new year....