Connection Manager Issues, NI Master, V3.60 Firmware
Reese Jacobs
Posts: 347
Looking for some ideas on troubleshooting a Connection Manager issue - running a fairly large project on an NI-3100 master with v3.60 firmware (the last of the v3 firmware versions). The project has 20 touch panels, of which all but 3 are wired (the wireless panels are MVP-8400s). The system is extremely stable and has plenty of volatile memory available (between 10 and 12 MB), so a memory leak is not the issue. No issue with any of the message queues - no messages pending, no messages ignored, everything looks perfectly normal. Nothing abnormal in the log file right up to the point of system failure.
Periodically, anywhere from 10 to 21 days after booting the system, the Connection Manager will close all of its socket connections to the touch panels (and any connected EXB devices) and then refuse to accept any new touch panel connection requests. The system is still operating and I can Telnet into the master and see the log file, memory, queues, etc. Everything looks normal except that all of the socket connections to the touch panels are closed (dead sockets) and no touch panels or EXB devices can connect. No abnormal events are logged just before the Connection Manager closes the first touch panel connection; it then quickly terminates all of the other touch panel connections until there are no panels or EXB devices online. The Link/Activity light on the master is still blinking normally and the Status light flashes periodically.
In addition to the incoming TCP connections from touch panels and EXB devices being shut down and not permitted to reconnect, the system also has a handful of outgoing TCP connections to devices like the Elk M1 security panel, Lutron P5 processor, and Denon AVR receiver, and all of these outgoing connections are also terminated and not allowed to reconnect. The master will gladly accept incoming Telnet connections, and I can poke around the log file, statistics, and other information, but as I said before, based on an examination of the log file it does not seem to be the result of any single event. System statistics for memory and message queues all indicate everything is normal - except for the fact that no incoming or outgoing TCP connections will work any longer!
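For reference, here is a rough sketch of the kind of IP_CLIENT_OPEN retry and ONERROR logging I am describing for the outgoing connections (the local device number, IP address, and port below are placeholders, not my actual code). Once the master is in this state, even a handler like this never manages to re-open the socket:

DEFINE_DEVICE
dvElkM1 = 0:3:0                  // local IP socket on the master (placeholder port number)

DEFINE_VARIABLE
CHAR    cElkIP[15] = '192.168.1.50'   // placeholder address
INTEGER nElkPort   = 2101             // placeholder port

DEFINE_START
IP_CLIENT_OPEN(dvElkM1.PORT, cElkIP, nElkPort, IP_TCP)

DEFINE_EVENT
DATA_EVENT[dvElkM1]
{
    ONLINE:
    {
        SEND_STRING 0, "'Elk M1 socket online at ', DATE, ' ', TIME"
    }
    OFFLINE:
    {
        SEND_STRING 0, "'Elk M1 socket offline at ', DATE, ' ', TIME, ' - retrying in 30s'"
        WAIT 300    // 30 seconds, then try to re-open
        {
            IP_CLIENT_OPEN(dvElkM1.PORT, cElkIP, nElkPort, IP_TCP)
        }
    }
    ONERROR:
    {
        // DATA.NUMBER carries the socket error code - worth logging before the next retry
        SEND_STRING 0, "'Elk M1 socket error ', ITOA(DATA.NUMBER), ' at ', DATE, ' ', TIME"
        WAIT 300
        {
            IP_CLIENT_OPEN(dvElkM1.PORT, cElkIP, nElkPort, IP_TCP)
        }
    }
}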
Anyone experienced similar problems or have any troubleshooting recommendations?
Comments
As a side note: 20 panels on an NI is right at the top of my threshold of pain, and I tend to keep it below that. It's not that it isn't "technically" doable; it's just taxing enough on the system that it can get buggy. That's my opinion.
On your side note, I am sure 20 panels is pushing the NI, but unfortunately there are quite a few ICSNet devices on this system, so the NX controllers are simply not an option. I understand the licensing and cost issue with respect to ICSNet, but not having an upgrade path beyond the NI-x100 series processors for sites with ICSNet devices is problematic. I have also been reluctant to upgrade to the v4 series firmware since the 64MB NI processor is already pushing the memory limit on this project. That is why I had a separate post trying to acquire an NI-3100/256 or NI-4100/256, so I could upgrade to the latest Duet firmware (even though the project is not using any Duet modules) to see if that helps. Based on what I have read in other posts it will not really help the problem, but it is worth a try. I do not have sufficient memory headroom to upgrade to v4 on the 64MB model.
Thanks again.
Other irritations can be local devices, including switches, that are acting up and making/breaking connections. Most recently, at two sites, NetGear ProSafe switches caused this with intermittent operation (very hard to locate other than by substitution, and worse when the failure is long-cycle). The NI log sometimes gives a clue where to look, with a trail of IP connections dropped and made. We have seen this caused by a panel at the WiFi fringe, connecting and dropping endlessly. (We were unable to make the panel go offline when our techs were on site... repeatedly... then finally the customer showed them where he puts the panel when not in use - tucked neatly under the TV, where it is shielded from sight... and from WiFi!)
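If you suspect a panel doing that, something along these lines makes the churn jump out in the Diagnostics output without digging through the whole log - just a counter per panel, bumped on every OFFLINE (the panel device numbers here are placeholders):

DEFINE_DEVICE
dvTP1 = 10001:1:0    // placeholder panel device numbers
dvTP2 = 10002:1:0
dvTP3 = 10003:1:0

DEFINE_VARIABLE
DEV     dvPanels[] = { dvTP1, dvTP2, dvTP3 }
INTEGER nOfflineCount[3]     // one drop counter per panel

DEFINE_EVENT
DATA_EVENT[dvPanels]
{
    OFFLINE:
    {
        STACK_VAR INTEGER nIdx
        nIdx = GET_LAST(dvPanels)
        nOfflineCount[nIdx] = nOfflineCount[nIdx] + 1
        SEND_STRING 0, "'Panel ', ITOA(dvPanels[nIdx].NUMBER), ' offline (drop #', ITOA(nOfflineCount[nIdx]), ') at ', DATE, ' ', TIME"
    }
    ONLINE:
    {
        STACK_VAR INTEGER nIdx
        nIdx = GET_LAST(dvPanels)
        SEND_STRING 0, "'Panel ', ITOA(dvPanels[nIdx].NUMBER), ' back online at ', TIME"
    }
}

A panel sitting at the WiFi fringe will rack up dozens of drops a day, while a healthy one stays near zero.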
Your circumstance may be something else, but in every similar issue we've seen (and there have been lots of them, including the early ENOVA end-point boxes that couldn't stay online!), a network issue was burning available connections on the NetLinx faster than it could clean them up. While we have not used v4 firmware a lot, I think we've seen marginal improvement in network robustness with it. Yes, it takes a memory hit, but if you have 10 MB free now, you should still be able to run v4. Be sure your Duet memory is minimized to 3 MB (although this will impair the web server in the NI, it should survive with really slow response).
Worst case, build in a reboot every 7 days.
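Something like this minimal relative timeline covers that (the timeline ID and the 7-day interval are just example values; an absolute time-of-day reboot in the small hours is friendlier on an occupied site, but this is the short version):

DEFINE_CONSTANT
LONG TL_WEEKLY_REBOOT = 1

DEFINE_VARIABLE
LONG lRebootTimes[1] = { 604800000 }    // 7 days in milliseconds

DEFINE_START
TIMELINE_CREATE(TL_WEEKLY_REBOOT, lRebootTimes, 1, TIMELINE_RELATIVE, TIMELINE_REPEAT)

DEFINE_EVENT
TIMELINE_EVENT[TL_WEEKLY_REBOOT]
{
    SEND_STRING 0, "'Scheduled weekly reboot at ', DATE, ' ', TIME"
    REBOOT(0:1:0)    // reboot the master
}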
Are the Cisco switches SMB SG300-xx or Catalyst class? I've had DHCP server issues with the SG300s not assigning IPs to devices in the bind tables in earlier firmware revs, and I think they've been the gremlin behind other issues, where clients can't connect, that I'm still trying to figure out.
On the occasion that you need to find the IP of a wayward NetLinx, and IPSCAN is either inconvenient or inconclusive, waiting on LISTEN in Studio for 5 minutes yields the info you want... with minimal annoyance to the network.
Your mileage may vary.
One of the problems was with an office here in Reykjavik (24 projectors, some QSC Basis DSPs, a few 7" touch panels, and other LAN-based bits and pieces). The master had been running the program for over 5 years with no problems and almost no reboots. Then all of a sudden the Ethernet part of the project started to act up. I put in a reboot once a week. The program worked for a few months, then acted up, so I put in a reboot every night. It worked for a few days and then acted up again, cutting the connections mid-day. I upgraded the firmware to 3.60 from something really old, but nothing worked. I looked at logs and even put in a computer running Wireshark to scan for something out of the ordinary. No storms, nothing.
I blamed the network guys for doing something/everything wrong, and they told me that I was to blame. They had upgraded their Cisco Catalyst system, so I just assumed they had put in a bad filter. They did not admit to anything and I had no proof.
Something had to be done. I grabbed my test NX-1200 and swapped it in for the NI-700, modifying the code to put in a timeline but otherwise using the original code with no reboots. The system was up for 2 weeks without a glitch, so I then replaced my test NX with the customer's new NX-1200. It has been 6 months now and no troubles.
I don't know if I am experiencing faulty memory cards or something bad with the network part of the NI hardware, but this is the 4th project in a short period of time that has gone faulty, all due to network problems. I have had the same problem with an NI-2000 and a heavily loaded NI-3100.
You might want to look at the power supply feeding the NI masters. I have had entire sets of adjacent racks, each with an NI, fall over with what appeared to be a network issue thanks to power brownouts.
A short power interruption would get the network interface into a state where it would be active on layer 1 but wouldn't negotiate a connection with the switch. The lights on both the NI interface and the switch port would appear to be active.
There are a bunch of firmware revisions over the last few years, on both the v3 and v4 NI firmware, that apparently resolve that type of issue, which for us typically presented itself after short power interruptions or drops.
We're running the most recent v4 NI firmware and haven't seen the issue appear again. I am not sure whether the most recent v3 version has solved this one (I went from v3 to v4 before it was resolved for us), so take a look at the times these things drop off and compare them against any other device on site that may log power events.
By the way, about power supplies and testing by substituting the NetLinx: I've seen variation in how NIs react to low voltage, and some are fine far lower than others. Just saying, always measure with a load - a bad supply can show 12 or more volts with no load and drop to 6 with an ordinary load.