AMX controller locked up every couple weeks

yanbin · February 2011

Hi everyone, we are having a lock-up issue with an AMX system: every couple of weeks the controller locks up and needs a reboot to work again. This has been happening for the last 6 months. At the beginning we thought it was a controller issue, so we changed the controller (NI-3100), but the issue happened again. In this system the AutoPatch, iPort dock, and Lutron lighting use RS-232 control, and the rest are IR. All the AMX devices (panels, access points, controller) connect to a D-Link 8-port switch, which then connects to the house 3Com 48-port Ethernet switch. Has anyone had this kind of problem?

Thanks,

yanbin

ericmedley · February 2011

I have had some issues in the past with NI-2000s/3000s/4000s locking up, but not in the way that one might think. What would happen is that the master is not locked up per se. The internal functions work just fine (the RS-232s, IR, IOs). But the network card was dead. So, as far as the outside world is concerned, it's dead. No Modero touch panels could connect.

This seemed to happen with the cheaper routers/switches (particularly the old blue Linksys WG54-whatevers). The only fix was a reboot. I still have a couple of systems out there where the owner refuses to upgrade to some kind of pro-level router/switch. My fix there is to program in a nightly reboot. That has fixed the issue.

The real way to see if it's genuinely locked up is the status light is frozen and not blinking.

If this is going on, there are a multitude of possible causes. Bad or corrupt firmware files, problem with the logic board, power issues. The list can be very long.

Jorde_V · February 2011

yanbin wrote: »

Hi everyone, we are having a lock-up issue with an AMX system: every couple of weeks the controller locks up and needs a reboot to work again. This has been happening for the last 6 months. At the beginning we thought it was a controller issue, so we changed the controller (NI-3100), but the issue happened again. In this system the AutoPatch, iPort dock, and Lutron lighting use RS-232 control, and the rest are IR. All the AMX devices (panels, access points, controller) connect to a D-Link 8-port switch, which then connects to the house 3Com 48-port Ethernet switch. Has anyone had this kind of problem?

Thanks,

yanbin

Yes, we have 2 systems rolled out just like this. There are 2 projectors that occasionally return an 'empty' string ("$FF,$FF,$FF,$FF,$FF"), which eventually locks up the message queue. Or at least that's what I think. I haven't written the program myself so I'm not entirely positive, but AMX has verified that it shouldn't be the program causing it.

In the other system they do think it's the programming that's causing part of the malfunction, so we're rewriting that. At least they admit that it shouldn't be causing lock-ups though.

Edit:

Forgot to tell you it's the device controller locking up (or rather stuck in a loop) and it can't process the messages it receives.

John Nagy · February 2011

Echoing Eric's main point, probably 90% of "lockups" we see are caused by network issues. The NetLinx isn't dead, just not talking IP anymore. IR and G3 (AXLINK) devices keep working fine. No G4 panels, because they are IP.

The NetLinx IP services aren't the most robust; they've been improving in every firmware for years, but flaky networks can bring them down. I think it is not so much that the "network card" is dead, rather it seems the OS has given up talking through it.

The worst problems are those where an IP device drops offline and comes back quickly. Each time it does, it uses a new IP resource in the NetLinx, and if it happens faster than the NetLinx can clean up the dead ones, soon you have no sockets left to connect to, at which point the cleanup seems to get lost too. First you lose FTP, then IP control, then panels, and finally TELNET won't connect either. At which point a hard power reboot is the only "cure". BUT if you don't relieve the actual network cause, it's not a cure, only a treatment. It will happen again.
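One way to tell a network-side lockup from a truly frozen master is to check which TCP services still accept connections. Here is a minimal Python sketch of that check, run from a PC on the same network. The port numbers are assumptions that are not stated in this thread (FTP on 21, Telnet on 23, ICSP on the usual NetLinx default of 1319), and `probe_master` is a made-up helper name for illustration.

```python
import socket

# Assumed service ports (not from this thread): FTP 21, Telnet 23, ICSP 1319.
SERVICES = {"ftp": 21, "telnet": 23, "icsp": 1319}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_master(host, services=None, timeout=2.0):
    """Map each service name to whether its port accepted a connection."""
    services = SERVICES if services is None else services
    return {name: port_open(host, port, timeout) for name, port in services.items()}
```

Calling `probe_master("192.168.1.210")` (a hypothetical address) returns a dictionary of service name to True/False. If RS-232 and IR devices still respond on site but every entry is False, that points toward the IP-side failure described above rather than a frozen controller.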

We also see cheap routers as a recurring issue; the customer often thinks they know IP and the hardware they put in before us should be just fine, thank you, don't go stealing my money for things I know I don't need. About the second time we roll a truck on a T&M to reboot because their $30 router lost track of DHCP or DNS long enough to mess up the NetLinx, the $200 router we recommended starts looking like a bargain.

By the way, we don't use DHCP on the NetLinx, or panels, or anything. Use STATIC, we put them in high numbers above the DHCP range of the router, and put in hard DNS on everything. And the GOOGLE public DNS at 8.8.8.8 and secondary at 8.8.4.4 work like lightning. I generally put the router as DNS 1 and GOOGLE as 2 and 3 in the DNS lists.
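A quick way to sanity-check an addressing plan like this is to confirm that each static address sits inside the subnet but outside the router's DHCP pool. A small Python sketch follows; the subnet, the pool boundaries, and the function name `outside_dhcp_pool` are all hypothetical values chosen for illustration.

```python
import ipaddress


def outside_dhcp_pool(static_ip, pool_start, pool_end, network):
    """True if static_ip is a usable host in network and not inside the DHCP pool."""
    ip = ipaddress.ip_address(static_ip)
    net = ipaddress.ip_network(network)
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)

    # Must be on the subnet, and not the network or broadcast address.
    in_subnet = ip in net and ip not in (net.network_address, net.broadcast_address)
    # Must not collide with addresses the router may hand out.
    in_pool = start <= ip <= end
    return in_subnet and not in_pool


# Hypothetical plan: router hands out .100 to .199, AMX gear lives above that.
print(outside_dhcp_pool("192.168.1.210", "192.168.1.100", "192.168.1.199", "192.168.1.0/24"))
print(outside_dhcp_pool("192.168.1.150", "192.168.1.100", "192.168.1.199", "192.168.1.0/24"))
```

The first call reports a safe address and the second reports a collision with the pool.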

yanbin · February 2011

Thanks for all the replies. Since we changed the AMX controller and the problem remains, our next step will be putting in a router and Ethernet switch to isolate the AMX network from the house network. What brand of router and switch should I get? There are only 5 IP devices I need to connect to this network, such as access points, the controller, and a video camera server.
BTW, all the IP devices are using static IP addresses.

ericmedley · February 2011

yanbin wrote: »

Thanks for all the replies. Since we changed the AMX controller and the problem remains, our next step will be putting in a router and Ethernet switch to isolate the AMX network from the house network. What brand of router and switch should I get? There are only 5 IP devices I need to connect to this network, such as access points, the controller, and a video camera server.
BTW, all the IP devices are using static IP addresses.

We use Cisco, SonicWall, and Ruckus; there is also a business-class Linksys that's actually pretty okay.

John Nagy · February 2011

A number of our dealers use SonicWall and Pakedge with good results.

Netgear Pro line, the metal blue box ones, are a good value and pretty stable - the lowest grade I'd go for a customer. They are lifetime guaranteed, and we've had good luck with them too.

Consumer grade stuff works just fine, if you don't mind rebooting them a few times a year. Which is fine if it's your own house and you know that. Let's see, if each switch and router only locks up ONCE a year, a system will need a service call, what, six or so times a year? Just to power cycle a $20 box.

For a customer whose system just quit, they only know it quit and dammit they paid a lot for it, so get out here now and fix it! What's your time worth? Don't do it. Get better hardware.

ericmedley · February 2011

One other thing to consider: failure/lockup due to massive run-time errors.

bcirrisi · February 2011

I've had issues with D-Link recently. I'm using SonicWall, HP ProCurve, and Cisco with solid results. I can't say 100% that D-Link has been the problem, but jobs with D-Link on them are the only ones that give me issues... once every few weeks. I echo the comments on using business-class products (I only use the HP ProCurve switches, because of the fanless design), and I also put AMX on its own subnet and let the SonicWall route the data.

chill · February 2011

I can't really tell from the OP's message, but this might be a case of the NI device locking up. On a current project we had 17 NI-x100 controllers that were all randomly failing every few days, or weeks. An AMX engineer and I went to site Wednesday and Thursday (it's Thursday night as I type this) and upgraded everything to master firmware 3.60.447 and device firmware 1.30.4. He says that this firmware was developed for another site that had the same issue a couple of months ago, and that AMX Engineering is "very optimistic" that this will fix the problem. These are not official firmware releases, and are only available at the ftp site.

Note, the device firmware is what (supposedly) fixes this problem; but 1.30.4 device firmware REQUIRES 3.60.447 master firmware. So upgrade the master first, then the device. Couple of caveats:

- First, turn off as much ICSP bus traffic as possible. You can send_command foo,"'RXOFF'" to every port in the system, or I guess you could load an empty program to accomplish the same thing.
- Even having done that, it is still possible to brick your NI. I did it once three weeks ago, and again on Wednesday. In addition, the AMX engineer and I each had systems we thought we'd bricked, but thankfully those two were OK. The others? PO and overnight to the hotel; not pretty.
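The ordering rule above (master firmware first, then device firmware) can be written down as a small pre-flight check. This is only a Python sketch of the decision logic, assuming dotted numeric version strings; `next_step` is a made-up helper, and the older version numbers in the usage note are hypothetical.

```python
# Versions named in this thread.
REQUIRED_MASTER = (3, 60, 447)
TARGET_DEVICE = (1, 30, 4)


def parse_version(text):
    """Turn a dotted version string such as '3.60.447' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def next_step(master_version, device_version):
    """Decide what to flash next, never putting device firmware ahead of the master."""
    master = parse_version(master_version)
    device = parse_version(device_version)
    if master < REQUIRED_MASTER:
        return "upgrade master"
    if device < TARGET_DEVICE:
        return "upgrade device"
    return "done"
```

For example, `next_step("3.50.430", "1.20.7")` returns "upgrade master", and only once the master reports 3.60.447 does the function return "upgrade device".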

Godspeed!

Hedberg · February 2011

We've been having some "interesting" network issues with a couple sites/groups of sites. I think one of these may be related to the inexpensive dumb Linksys switches - maybe this problem is sort of as described in this thread.

We've got several installations with the same customer, and they all have NI-3100 masters with two wired Modero panels each. From time to time the master will lose connection with the touch panels. It always seems to happen when the panels go offline and come back online a couple of times (as described above). Today it happened after loading TP files: the master appeared to be working fine, but after loading the TP files to the TPs they didn't come back online. NetLinx Studio seemed to maintain its connection fine. Rebooted the master (soft) through Studio and all came back fine.

These are the only installations that we have had that experience with and they all use the inexpensive Linksys 5-port switch (no router on the network). I'm thinking it might be something with the switch, but maybe I should update the firmware on the master and the NI device. These are the only installs we have with the Linksys switches -- provided by the customer. I'm thinking of programming the master to reboot itself if the TPs both go offline and stay offline. Interesting, it never happens except when I'm on site loading TP files or programming or something. Maybe I need to have my chakra re-aligned.
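The "reboot if both TPs go offline and stay offline" idea is basically a watchdog with a grace period. Below is a Python sketch of just the decision logic, not NetLinx code; on a real master the inputs would come from the panels' online and offline events. The class name, panel names, and 300-second grace period are assumptions for illustration.

```python
class PanelWatchdog:
    """Track panel status and flag a reboot only after ALL panels stay offline."""

    def __init__(self, panels, grace_seconds=300.0):
        self.grace = grace_seconds
        self.online = {panel: True for panel in panels}
        self.all_offline_since = None  # time at which the last panel dropped

    def set_online(self, panel, is_online, now):
        """Record an online/offline event for one panel at time `now` (seconds)."""
        self.online[panel] = is_online
        if any(self.online.values()):
            # At least one panel is connected, so the master's IP side is alive.
            self.all_offline_since = None
        elif self.all_offline_since is None:
            self.all_offline_since = now

    def should_reboot(self, now):
        """True once every panel has been offline for at least the grace period."""
        return (
            self.all_offline_since is not None
            and now - self.all_offline_since >= self.grace
        )
```

The grace period matters because panels drop offline briefly during a TP file load; rebooting on the first offline event would make things worse, not better.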

I'm suspicious of the Linksys switches and am thinking about trying to get them replaced with blue-metal-box Netgear switches. We've got about 100 of the blue-metal-box Netgear switches and WAPs spread around Houston and have very few issues with them.

But, the blue box Netgear stuff is involved with the other problem we've been having and I'm going to start another thread about that as I think it's a power stability issue separate from the topic of this thread. Wouldn't want the thread Nazi to accuse me of hijacking.

John Nagy · February 2011

You are THINKING it might be the switch and this has gone on how long? You could sub in a "known good" switch in about 60 seconds... and you'd soon have a good indication if it was the switch. Just do it, there is no test better than substitution.

Jorde_V · February 2011

I've only had it the other way around: IP still working fine, but the device controller gets stuck in a loop. Here's what I've seen in these systems.

(0354894799) CI2CMessageQueue Failed to send message, retry: 0,  errno: 3d0004
(0354897798) CI2CMessageQueue Failed to send message, retry: 1,  errno: 3d0004
(0354898068) Optimizing tCypherConnRx messages with 500 messages in queue.
(0354898070) Optimization tCypherConnRx complete with 500 messages now in queue. 0 messages avoided.
(0354900067) CMessagePipe::Write tCypherConnRx (PipeReader) this=0x01C08BF8 has 500 Messages in queue (Deleting & Ignoring1).
(0354900797) CI2CMessageQueue Failed to send message, retry: 2,  errno: 3d0004
(0354900797) CMessagePipe::Read tCypherConnRx (PipeReader) is now Reading again
(0354903620) CIpSocketMan::ProcessPLPacket - ClientOpen handle already in use
(0354903620) CIpEvent::OnError 0:3:2
(0354903796) CI2CMessageQueue Failed to send message, retry: 0,  errno: 3d0004
(0354906795) CI2CMessageQueue Failed to send message, retry: 1,  errno: 3d0004
(0354909795) CI2CMessageQueue Failed to send message, retry: 2,  errno: 3d0004
(0354912794) CI2CMessageQueue Failed to send message, retry: 0,  errno: 3d0004
(0354882801) ci2cmessagequeue faile'
(0354894799) ci2cmessagequeue failed to send message, retry: 0,  errno: 3d0004'
(0354897798) ci2cmessagequeue failed to send message, retry: 1,  errno: 3d0004'
(0354898068) optimizing tcypherconnrx messages with 500 messages in queue.'
(0354898070) optimization tcypherconnrx complete with 500 messages now in queue.  0 messages avoided.'
(0354900797) ci2cmessagequeue failed to send message, retry: 2,  errno: 3d0004'
(0354900797) cmessagepipe::read tcypherconnrx (pipereader) is now reading again'
(0354903620) cipsocketman::processplpacket - clientopen handle already in use'
(0354903796) ci2cmessagequeue failed to send message, retry: 0,  errno: 3d0004'
(0354903796) ci2cmessagequeue failed to send message, retry: 0,  errno: 3d0004'
(0354915793) CI2CMessageQueue Failed to send message, retry: 1,  errno: 3d0004
(0354918793) CI2CMessageQueue Failed to send message, retry: 2,  errno: 3d0004
(0354921792) CI2CMessageQueue Failed to send message, retry: 0,  errno: 3d0004
(0354924791) CI2CMessageQueue Failed to send message, retry: 1,  errno: 3d0004
Line 87 (14:10:58)::CI2CMessageQueue Failed to send message, retry: 1, errno: 3d0004
Line 88 (14:11:00)::Optimizing tCypherConnRx messages with 500 messages in queue.
Line 89 (14:11:00)::Optimization tCypherConnRx complete with 500 messages now in queue.0 messages avoided.
Line 90 (14:11:00)::CMessagePipe::Write tCypherConnRx (PipeReader) this=0x01C08BF8 has 500 Messages in queue (Deleting & Ignoring1)
Line 91 (14:11:01)::CI2CMessageQueue Failed to send message, retry: 2,  errno: 3d0004
Line 92 (14:11:01)::CMessagePipe::Read tCypherConnRx (PipeReader) is now Reading again

So in this case the device controller is stuck in a loop: it is still sending its polling strings, but it cannot process anything that gets returned. (At least that's what it looks like if you check the diagnostics and the notifications.)
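For anyone wanting to spot this pattern in a saved diagnostics capture, here is a rough Python sketch that counts the CI2CMessageQueue failures per errno and reports the deepest queue size seen. The regular expressions are written against the lines pasted above and are an assumption about the format; `summarize` is a made-up helper name.

```python
import re
from collections import Counter

# Patterns based on the log lines pasted in this post (case-insensitive,
# because some captures show the same messages in lower case).
FAIL_RE = re.compile(
    r"CI2CMessageQueue Failed to send message, retry:\s*(\d+),\s*errno:\s*([0-9a-f]+)",
    re.IGNORECASE,
)
QUEUE_RE = re.compile(r"(\d+)\s+messages\s+(?:now\s+)?in queue", re.IGNORECASE)


def summarize(lines):
    """Count send failures per errno and track the largest reported queue depth."""
    failures = Counter()
    max_queue = 0
    for line in lines:
        fail = FAIL_RE.search(line)
        if fail:
            failures[fail.group(2).lower()] += 1
            continue
        queue = QUEUE_RE.search(line)
        if queue:
            max_queue = max(max_queue, int(queue.group(1)))
    return {"failures_by_errno": dict(failures), "max_queue_depth": max_queue}
```

A steady stream of failures with the same errno, combined with a queue depth stuck at its ceiling, matches the "still polling but not processing replies" behaviour described here.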

I'm not the only one having this issue, several of my colleagues whom I'm in contact with have the exact same issue. And this is happening on years old systems, I have yet to see it in one of the systems I have personally programmed, but there's no reason for the code that was working years ago to suddenly stop due to new firmware.

edit:

I've just been given the new firmware version chill was talking about; according to the release notes that came with it, it should fix this issue. It also explains why I haven't been having this issue, as I don't tend to poll a lot: the activity on systems I programmed is far less than what's necessary to cause these lock-ups.

I will put this firmware on the system that's been having this issue and let you know if this fixes it.
