
Connection Manager Issues, NI Master, V3.60 Firmware

Looking for some ideas on troubleshooting a Connection Manager issue - running a fairly large project on an NI-3100 master with v3.60 firmware (the last of the v3 firmware versions). The project has 20 touch panels, of which all but 3 are wired (the wireless panels are MVP-8400s). The system is extremely stable and has plenty of volatile memory available (between 10 and 12 MB), so memory leakage is not an issue. No issues with any of the message queues - no messages pending, no messages ignored; everything looks perfectly normal. Nothing abnormal in the log file right up to the point of system failure.

Periodically, anywhere from 10 to 21 days after booting the system, the Connection Manager will close all of its socket connections to the touch panels (and any connected EXB devices) and then refuse to accept any new touch panel connection requests. The system is still operating, and I can Telnet in to the master and see the log file, memory, queues, etc. Everything looks normal except that all of the socket connections to the touch panels are closed (dead sockets) and no touch panels or EXB devices can connect. In the log file, no abnormal events are logged just prior to the Connection Manager closing the first touch panel connection; it then quickly terminates all of the other touch panel connections until there are no panels or EXB devices online. The Link/Activity light on the Master is still blinking normally and the Status light flashes periodically.

In addition to the TCP connections to the master from touch panels and EXB devices being shut down and not permitted to reconnect, the system also has a handful of outgoing TCP connections to devices like the Elk M1 security panel, Lutron P5 processor, and Denon AVR receiver, and all of these outgoing connections are also terminated and not allowed to reconnect. The master will gladly accept incoming Telnet connections, and I can poke around the log file, statistics, and other information, but as I said before, it does not seem to be the result of any singular event based on an examination of the log file. System statistics for memory and message queues all indicate everything is normal -- except for the fact that no incoming or outgoing TCP connections will work any longer!
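
For anyone following along, outgoing connections like these are typically opened with IP_CLIENT_OPEN and re-opened from the OFFLINE/ONERROR handlers - roughly like the minimal sketch below (the device number, address, and port are placeholders, not the actual project code). In the failed state, these re-open attempts simply never succeed until the master is rebooted.

    (* Placeholder sketch only - dvElk, the address, and the port are examples *)

    DEFINE_DEVICE

    dvElk = 0:3:0                        // local IP socket used for the Elk connection

    DEFINE_CONSTANT

    CHAR cELK_IP[]  = '192.168.1.50'     // example address
    LONG nELK_PORT  = 2101               // example port

    DEFINE_FUNCTION fnOpenElk()
    {
        // ask the Connection Manager for an outgoing TCP socket
        IP_CLIENT_OPEN(dvElk.PORT, cELK_IP, nELK_PORT, IP_TCP)
    }

    DEFINE_START

    fnOpenElk()

    DEFINE_EVENT

    DATA_EVENT[dvElk]
    {
        ONLINE:  { SEND_STRING 0, 'Elk connection up' }
        OFFLINE: { WAIT 100 { fnOpenElk() } }   // retry ~10 s after a drop
        ONERROR: { WAIT 100 { fnOpenElk() } }   // retry after a failed open
    }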

Anyone experienced similar problems or have any troubleshooting recommendations?

Comments

  • ericmedley Posts: 4,177
    This is going back quite a way, but when I worked at a Resi dealer here in Charleston we found a consistent issue with the same behavior if the client had one of those cheap Linksys WG routers or wireless routers. We never found a fix apart from changing out the routers and switches, which did make the problem go away. The other solution was not as pretty. At the end of the day we really have no way to manage network connections at the layer 3 level. We cannot send the NIC a command to release the socket, nor can we tell an unconnected wireless panel to come back to life (apart from telnetting into it and rebooting it - I've done this). I'd suggest trying a different switch and/or router; the other alternative is to write a routine to watch for things falling offline and reboot the master if it gets bad (something along the lines of the sketch at the end of this comment). The 8400s were pesky when it comes to this sort of thing.

    As a side note: 20 panels on an NI is right at the top of my threshold of pain. I tend to keep it below that. It's not that it isn't "technically" doable; it's just taxing enough on the system that it can get buggy. That's my opinion, anyway.
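
    Something along these lines would do it - the panel device numbers, the one-minute check, and the 15-minute grace period below are just placeholders, not anything from a real job:

        (* Example watchdog sketch - device numbers and thresholds are placeholders *)

        DEFINE_DEVICE

        dvTP1 = 10001:1:0      // example panels
        dvTP2 = 10002:1:0
        dvTP3 = 10003:1:0

        DEFINE_CONSTANT

        LONG    TL_WATCHDOG    = 1
        INTEGER nGRACE_MINUTES = 15          // how long we tolerate zero panels online

        DEFINE_VARIABLE

        DEV dvPanels[] = { dvTP1, dvTP2, dvTP3 }
        VOLATILE INTEGER nPanelOnline[3]
        VOLATILE INTEGER nMinutesAllOffline
        VOLATILE LONG    lWatchdogTimes[] = { 60000 }   // check once per minute

        DEFINE_START

        TIMELINE_CREATE(TL_WATCHDOG, lWatchdogTimes, 1, TIMELINE_RELATIVE, TIMELINE_REPEAT)

        DEFINE_EVENT

        DATA_EVENT[dvPanels]
        {
            ONLINE:  { nPanelOnline[GET_LAST(dvPanels)] = 1 }
            OFFLINE: { nPanelOnline[GET_LAST(dvPanels)] = 0 }
        }

        TIMELINE_EVENT[TL_WATCHDOG]
        {
            STACK_VAR INTEGER i
            STACK_VAR INTEGER nAnyOnline

            FOR (i = 1; i <= LENGTH_ARRAY(dvPanels); i++)
            {
                IF (nPanelOnline[i]) nAnyOnline = 1
            }

            IF (nAnyOnline)
            {
                nMinutesAllOffline = 0
            }
            ELSE
            {
                nMinutesAllOffline++
                IF (nMinutesAllOffline >= nGRACE_MINUTES)
                {
                    SEND_STRING 0, 'Watchdog: no panels online - rebooting master'
                    REBOOT(0:1:0)            // reboot the local master
                }
            }
        }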
  • Reese Jacobs Posts: 347
    Eric -- thanks for the feedback. This site is using high-end Cisco switches, and the network is monitored 24/7 with a Fluke Optiview analyzer, which is reporting no issues whatsoever with the network. While I realize this may not eliminate the switches and routers as a problem, the interesting thing is that when the NI goes into this state, you can still Telnet to the Master, so the IP Connection Manager in Netlinx seems alive and able to process requests. I can check memory, queues, etc. and nothing appears abnormal except that all socket connections to the touch panels (as well as the EXB devices) have been closed and none of the panels or devices can reconnect to the master. You can even Telnet to the touch panels and check them out, so they are clearly on the network, accessible, and able to communicate fine with everything except the Master of course.

    On your side note, I am sure 20 panels is pushing the NI, but unfortunately there are quite a few ICSNet devices on this system, so the NX controllers are simply not an option. I understand the licensing and cost issue with respect to ICSNet, but not having an upgrade path beyond the NI-x100 series processors for sites with ICSNet devices is problematic. I have also been reluctant to upgrade to the v4 series firmware since the 64MB NI processor is already pushing the memory limit on this project. That is why I had a separate post trying to acquire an NI-3100/256 or NI-4100/256, so I could upgrade to the latest Duet firmware (even though the project is not using any Duet modules) to see if that helps. Based on what I have read in other posts it will not really help the problem, but it is worth a try. I do not have sufficient memory headroom to upgrade to v4 on the 64MB model.

    Thanks again.
  • John Nagy Posts: 1,734
    We've seen the connection refusal issue in several ways, nearly always resulting from network "irritation". The most frequent is when one of our dealers has put the NetLinx on the internet, either as a DMZ or with ports forwarded for remote access. The most obvious activity (though probably not the only port being attacked) occurs on telnet, where you can watch or log dozens of connections a minute from all over the planet (looking the incoming IPs up in WHOIS provides amazing variety). Eventually - days or weeks - the NetLinx gives up on making new connections. I have seen panels be the first to go; panels that are already connected continue, but if they go offline, they can't reconnect. Since panels ordinarily go offline briefly for internal cleanup every few days (highly variable), they won't get back in. Telnet, strangely, often still works. Then TELNET dies too, and for a short time STUDIO might still be able to connect. Then nothing. The solution for these is to get the NI off the internet, or onto a VPN, or at the very least to limit the range of accepted incoming IP addresses in the router.

    Other irritations can be local devices, including switches that are acting up and making/breaking connections. Most recently, at two sites, NetGear ProSafe switches have caused this with intermittent functioning (very hard to locate other than by substitution, and worse when the failure is long-cycle). The NI log sometimes gives a clue where to look, with a trail of IP connections dropped and made. We have seen this caused by a panel at the WIFI fringe, connecting and dropping endlessly. (We were unable to make the panel go offline when our techs were on site... repeatedly... then finally the customer showed them where he puts the panel when not in use - tucked neatly under the TV where it is shielded from sight... and from wifi!)

    Your circumstance may be something else, but in every similar issue we've seen (and there have been lots of them, including the early ENOVA end-point boxes that couldn't stay online!), a network issue was burning available connections at the NetLinx faster than it could clean them up. While we have not used V4 firmware a lot, I think we've seen a marginal improvement in network robustness with it. Yes, it takes a memory hit, but if you have 10 meg free now, you should still be able to run V4. Be sure your DUET memory is minimized to 3 MB (although this will impair the web server in the NI, it should survive with really slow response).

    Worst case, build in a reboot every 7 days.
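
    Something like this would cover it - the sketch just arms a single 7-day timeline at boot, and the numbers and names are examples; you would probably also want to gate it on the time of day so the reboot lands in a maintenance window:

        (* Example scheduled-reboot sketch - numbers and names are placeholders *)

        DEFINE_CONSTANT

        LONG TL_WEEKLY_REBOOT = 2

        DEFINE_VARIABLE

        // 7 days in milliseconds: 7 * 24 * 60 * 60 * 1000
        VOLATILE LONG lRebootTimes[] = { 604800000 }

        DEFINE_START

        TIMELINE_CREATE(TL_WEEKLY_REBOOT, lRebootTimes, 1, TIMELINE_RELATIVE, TIMELINE_ONCE)

        DEFINE_EVENT

        TIMELINE_EVENT[TL_WEEKLY_REBOOT]
        {
            SEND_STRING 0, 'Scheduled weekly maintenance reboot'
            REBOOT(0:1:0)   // master reboots; the timeline is re-armed in DEFINE_START on the next boot
        }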
  • vining Posts: 4,368
    To expand on what John just said, SSH port forwarding has also been reported to cause lockups after repeated hack attempts. So I would check whether telnet or SSH are being forwarded through your router's firewall, and if they are, either delete the forwards and go with a VPN, or again try limiting the IPs that the router's ACL will accept for forwarding. As a general rule I don't forward; I set the UDP BC rate to 0, turn off zero config, and do a few other things that escape me right now.
  • Reese Jacobs Posts: 347
    Thanks for the additional suggestions, guys. This particular NI master is on an internal LAN behind a commercial-grade Cisco firewall. No ports are being forwarded to the master, no master-to-master, SSH is disabled, the UDP broadcast rate is set to 0, zero config is OFF, queues and thresholds are tuned to AMX recommended specs, the network interfaces are set to 10 Mbps half duplex, and pretty much all of the other recommendations we have collected as a group over the years have been applied. The log file does not indicate any connection attempts or failures prior to the network connection problem, nor are there any devices going offline/online rapidly, which can cause problems - the system seemed perfectly normal at the time. The Fluke Optiview also does not record any specific network anomaly at the time of the NI master connection problem. Not quite sure what to look at next, but I will keep digging. If I were stressing the NI master with an excessive number of network connections, I could understand it. However, with 20 touch panels and 6-8 other TCP devices connected to the master, the NI should easily be able to handle the load. The system is well tuned with very little taking place in Mainline -- most everything is handled in events. CPU utilization is normal, and the input/output lights on the Master flash periodically but do not indicate that the Master is significantly busy with device communication.
  • vining Posts: 4,368
    When you reboot, is that just the master or the network too?

    Are the Cisco switches SMB SG300-xx or Catalyst class? I've had DHCP server issues with the SGs not assigning IPs to devices in the bind tables in earlier firmware revs, and I think they've been the gremlin causing other issues, where clients can't connect, that I'm still trying to figure out.
  • Reese Jacobs Posts: 347
    The reboot is the master only - the network seems fine. In fact, when the master is rebooted, all of the touch panels and EXB devices come back online without any issues. The touch panels and the EXB devices do not need to be power cycled or rebooted - only the Master. The network is fairly large: we are using three of the new SG550 managed switches along with a backbone switch to connect them. The touch panels and the EXB devices are all assigned static IP addresses, so DHCP is not an issue, at least with the Netlinx devices. All signs point to the Master - something in the IP Connection Manager implementation of Netlinx or a problem in the underlying VxWorks. My next step may be to go ahead and implement a Wireshark protocol decoder for ICSP and see if I can make any sense of the ICSP protocol messages from these devices to the Master. Much of the ICSP protocol is defined in the series of AMX patents that are published online, although I am sure there are new protocol requests/replies and options that have been implemented which are not in the patent filings.
  • John Nagy Posts: 1,734
    Side note: after years of recommending setting UDP BC RATE to 0, I now prefer to set it to the max of 300 seconds (5 minutes).
    On the occasion that you need to find the IP of a wayward NetLinx, and IPSCAN is either inconvenient or inconclusive, waiting on LISTEN in Studio for 5 minutes yields the info you want... with minimal annoyance to the network.
  • John Nagy Posts: 1,734
    Reese Jacobs wrote: »
    The reboot is the master only - the network seems fine. In fact, when the master is rebooted, all of the touch panels and EXB devices come back online without any issues.
    Note that this is exactly the condition I found in two recent installs where it was in fact the network switch, which worked while I watched and didn't require rebooting... until that one time when I was actually looking at the lights when the system failed... they all blinked in unison for about four minutes and then recovered. The NetLinx had given up all its sockets during the IP storm and didn't recover without a reboot. In one of these installs, we had already replaced the NetLinx because of the same good logic you suggest. We were wrong. So when I saw the same symptoms on another job, I immediately replaced the network switch in the rack... and it has been fine since.

    Your mileage may vary.
  • vining Posts: 4,368
    John Nagy wrote: »
    Note that this is exactly the condition I found in two recent installs where it was in fact the network switch, which worked while I watched and didn't require rebooting... until that one time when I was actually looking at the lights when the system failed... they all blinked in unison for about four minutes and then recovered. The NetLinx had given up all its sockets during the IP storm and didn't recover without a reboot. In one of these installs, we had already replaced the NetLinx because of the same good logic you suggest. We were wrong. So when I saw the same symptoms on another job, I immediately replaced the network switch in the rack... and it has been fine since.

    Your mileage may vary.
    This is kinda funny: my job with the network weirdness has Cisco SG300 1G switches (I'm sure the 550 is basically the same but 10G), and I was planning on swapping it out with a ProSafe switch, which I already bought - go figure. The SG switches have had a lot of issues since being released, a lot of stupid problems that you would never expect from a Cisco switch, but as most Cisco guys would say, these aren't "Cisco", they're SMB Cisco, which get the JV team of engineers. For a while you couldn't even bind a MAC to an IP if that client responded with a client ID instead of a MAC. You would have to connect a device, see how it responded, and then set up the binding with a MAC or client ID depending on what was received. I swear I would have been better off with a dumb switch from Best Buy, but then I couldn't do layer 3, VLANs, and SVI routing. I've spent the last three years complaining about them on the Cisco forums.
  • Thorleifur Posts: 58
    Over the last year or so, some of my customers' masters have been experiencing the same problem Reese describes.

    One of the problems was with an office here in Reykjavik (24 projectors, some QSC Basis DSPs, a few 7" touch panels, and other LAN-based bits and pieces). The master had been running the program for over 5 years with no problems and almost no reboots. Then all of a sudden the Ethernet part of the project started to act up. I put in a reboot once a week; the program worked for a few months, then acted up, so I put in a reboot every night. That worked for a few days, and then it acted up again, cutting the connections mid-day. I upgraded the firmware to 3.60 from something really old, but nothing helped. I looked at logs and even put in a computer running Wireshark to scan for something out of the ordinary. No storms or anything.
    I blamed the network guys for doing something/everything wrong, and they told me that I was to blame. They had upgraded their Cisco Catalyst system, so I just thought they had put in a bad filter. They did not admit to anything, and I had no proof.
    Something had to be done. I grabbed my test NX1200 and switched it out with the NI700, modifying the code to put in a timeline but otherwise using the original code with no reboots. The system was up for 2 weeks without a glitch. I then swapped my test NX for the customer's new NX1200. It has been 6 months now and no troubles.

    I don't know if I am experiencing faulty memory cards or something bad with the network part of the NI infrastructure, but this is the 4th project in a short period of time that has been faulty, all due to network problems. I have had the same problem with an NI2000 and a heavily loaded NI3100.
  • pdabrowski Posts: 184
    Thorleifur wrote: »
    I don't know if I am experiencing faulty memory cards or something bad with the network part of the NI infrastructure, but this is the 4th project in a short period of time that has been faulty, all due to network problems. I have had the same problem with an NI2000 and a heavily loaded NI3100.
    I doubt that the network upgrade would have been a contributing factor unless the new config has some overzealous port security. We are a 100% Cisco Catalyst site and have no real issues.

    You might want to look at the power supply into the NI masters. I have had entire sets of adjacent racks, each with an NI, fall over with what appeared to be a network issue thanks to power brownouts.

    The short power issue would get the network interface into a state where it would be active at layer 1 but wouldn't negotiate a connection with the switch. The lights on both the NI interface and the switch port would appear to be active.

    There are a bunch of firmware revisions over the last few years, on both the v3 and v4 NI firmware, that apparently resolve that type of issue, which typically presented itself for us after short power interruptions or drops.

    We're running the most recent v4 NI firmware and haven't seen the issue appear again. I am not sure whether the most recent v3 version has solved this one (I went from v3 to v4 before it was resolved for us), so take a look at the times these things drop off and compare them against any other device on site that may log power events.
  • Thorleifur Posts: 58
    I looked at the power supply and even changed it out on the project I wrote about. This is not a power issue, as the replacements are on the same supply as the broken ones, and some are also on UPSs. I also know this is not an issue with the network, as the NX works: same Cat cables, same ports used, so I would guess that leaves the NI.
  • John Nagy Posts: 1,734
    I had this same symptom arise today in my own home system. Panels dropping offline, then I couldn't get in via STUDIO either. Reboot and OK... but then I saw one panel flapping on and off. At the fringe of wifi, making and breaking connections for the last day, it appears to have used up enough resources to kill new connection options on port 1319, while telnet and ftp remained OK. (I replaced my access point yesterday; it appears to have less range than the outgoing one...)

    By the way, about power supplies and testing by substitution of the NetLinx... I've seen variation in how NIs react to low voltage, and some are fine far lower than others. Just saying to always measure under load (a bad supply can show 12 or more volts without a load and drop to 6 with an ordinary load)...