Diagnosing a controller that won't respond to IP

NMarkRoberts · July 2007

I've got a problem with a controller that won't respond to IP so I had a chat with the local tech support and their advice was so thorough and usable that I thought I would record it for posterity.

It is possible for code to spin and lock up a controller in this way. (This I plan to test thoroughly as I can't see how, surely background processes are multithreaded.)

It is also possible for IP to lock up while the controller keeps going. This kills ping, telnet, browser and Modero touchpanel access. It may be caused by insufficient nonvolatile memory, so go look for a memory leak.

Can you connect via serial? If so do a "show mem" and a "msg on" to see what's happening.

Are you doing any IP stuff in the code? Is it possible you are using up all the available IP connections?

Sometimes some network switches get offended by UDP broadcasts, so disable it using "set udp bc rate" = 0.

Look for LED transmit activity on the front panel to tell you that the code is still running and look at the green Link Active LED on the network socket on the back to see if IP is working.

REBUILD_EVENT · July 2007

be sure to load the newest firmware to the master. there was an issue with IP with an older firmware earlier this year. the problems *should* be fixed

set duet mem higher than the 2mb. set it to 10mb or so. do this even if you wont use any duet code.

when accessible via RS, try "show log all"

DHawthorne · July 2007

It is absolutely possible for code to lock up a processor; I know because I've done it

. It may be multi-threaded, but the threads do not run concurrently, they are time-sliced, and there is no internal mechanism to prevent a thread from taking over (much like Windows). So, it faulty code in a thread runs amok, the slice never gets passed off to other processes, and you lock up.

Sometimes, the nature of the fault in the code is running out of memory. Since the "drive" space is part of the master's memory, too many overlarge files on the drive can deplete it. So can too much recursion (though usually you will get a runtime message if that is happening).

I haven't had IP processes lock a maser, but I have had them bog it down considerably, to the point where actions lagged so much it may as well have been locked. But usually, it's runaway code causing the IP connections to bog, not the other way around.

NMarkRoberts · July 2007

Still a puzzle

Here's an update. A given NI4000 has spontaneously gone blind to IP twice. No ping, telnet, http or touchpanel connection. (The touchpanels are happily pinging.) Transmit LEDs show that the code is still running and sending status requests to various gear. The client tells me that the Ethernet Link Active light is flashing(!) The problem is corrected by a reboot.

The firmware is up to date, and it's non-Duet. After a few days of monitoring it does not appear to be leaking memory, but I've coded a memory check that will email me a warning if it does.

As this master is 3,600km from here with minimal tech support onsite, I've also arranged for a spare NI to be installed alongside with a serial cable to the program port. That will allow me to talk serial to it if it happens again.

I'm thinking about writing some code that detects when it loses all IP and reboots itself.

Any ideas?

Darkside · July 2007

There are numerous routers/switches that can cause intermittent behavioral issues.

I guess what may be unclear is whether the NI is actually falling over, or the network is suddenly not liking the NI any more. Quite different, same result.

We had issues around a Netcomm unit that stopped TPs connecting to the master every now and then. We replaced the router, and haven't seen the problem since.

Can you temporarily create a closed network with, say, a linksys router and see if you still lose control? Seems like it is fairly regular and could rule out the network as a cause quite quickly.

John Gonzales · July 2007

We had a situation where we lost all communication with an entire system via WAN as well recently. The controller was an NI-4100, and was one of 6 masters in a multiple master environment. The other masters were fine, but we could not find the missing master connected in the URL list on the masters that were still functioning and still communicating via WAN. Telnet, HTTP and NS2 access via port 1324 were all down. In checking the router's DHCP client list remotely we could see the device was attached, active, and with the correct assigned IP address. I assumed the system was no longer functioning, but it was actually still controlling the local system. In other words the touchpanels that were assigned to that system (system 4) were communicating with the NI controller, and the controller was properly controlling devices that were attached via the serial, IR, I/O, and relay ports. We just had no access to it remotely. When we showed up to check it out the next day (fortunately we weren't 2200 miles away), we found everything functioning properly on site, and rebooting gave us remote access via WAN again.

We had also lost that same master on a previous occasion, exhibiting the same symptoms, but it came back up by itself the next morning about 8 hours after it went offline.

In our case, we had a module that was flooding the system with messages and we were watching the messages remotely until it stopped communicating suddenly. Fixing the module's "chattiness" has prevented it from happening since.

The point is, we lost remote communication while locally it was still functioning fine. If not a memory leak, could it be an excess amount of feedback being generated which was what appears to have happened in our case?

Back in the old days of less reliable gear, we used to plug the master controllers (Cres****, and AMX or Panja, or whatever it was called at the time

) into a timer like for Christmas lights and have the power reboot on the system at 3:30 every morning as a "just in case" measure.

--John

vining · July 2007

NMarkRoberts wrote:

I'm thinking about writing some code that detects when it loses all IP and reboots itself.

You could use some thing like the link below to periodically ping the master and if it doen't respond it will power cycle it.

http://www.digital-loggers.com/EPCR2.html

It's around $300.00 US has 8 controllable 120v outlet ports in 2 seperate power banks for 2 seperate circuits (2 cords). RS-232 isn't active yet but can be controlled via HTTP commands and has built in web server. There are many other products that can do auto ping reboots as well.

It's a simple way to be able to power cycle cable boxes from code as well.

DHawthorne · July 2007

I had a lot of trouble with a master locking up once upon a time due to a faulty IP-RS323 box that frequently shut down the entire system whenever it was negotiating a connection.

What I did was create a "dead man" timer on my master. Every 15 minutes, it sent an "I'm OK" message to another master in the system. IF more than 15 minutes went by without that message getting through, the second master would fire a relay that reset the power on the first, then send me an email that this had occurred so I could check up on it. I've since cleared up the problem that was causing the lockups, but I left the mechanism in place. Once or twice in the last few months it's reset the system due to back logged events from network issues and brownouts; each time the system recovered gracefully, and when I checked up, everything was running fine again. The client has yet to notice one of these resets.

I've thought about ow to do this on a single master system, and the idea I've come up with is something like an Altronics 6060 timer relay set to hold closed as long as it's pulsed periodically. This, however, wouldn't help a system that lost IP comm, but would help one that was completely locked up. A variant on the 2 master technique might be for the master to open an IP connection to itself, and to fire a timer on it's offline event that is reset by an online event. Again, if it times out, a relay kicks in to reset the master's power.

Brad.Odegard · July 2007

Re: REbooting Single Master when IP dead

This might be the solution to a single master reboot:

http://www.cpscom.com/gprod/apc.htm

The price is very good, and their products work as advertised

Brad

jweather · July 2007

DHawthorne wrote:

A variant on the 2 master technique might be for the master to open an IP connection to itself, and to fire a timer on it's offline event that is reset by an online event. Again, if it times out, a relay kicks in to reset the master's power.

Unfortunately, this doesn't test IP connectivity -- a master with its LAN port physically unplugged will still be able to use IP to connect to itself, because it will be routed over loopback and not via the network. You need another device out there to test that. If you want to get fancy, open an IP connection to your console and use the "ping" command to verify connectivity to the router or something that will reply. If that fails, trigger a reboot. It just might work...

Jeremy

patb · July 2007

NMarkRoberts wrote:

The firmware is up to date, and it's non-Duet. After a few days of monitoring it does not appear to be leaking memory, but I've coded a memory check that will email me a warning if it does.

That in itself is the problem. As mentioned in this thread - there was a bug that caused the ethernet port to lock up and I got it to happen quite frequently on more than one job I was doing. I'm sure someone could explain all of the intricate reasons, but the bottom line is that there was a problem that was fixed by the latest firmware (and yes that's the latest DUET firmware - not fixed in older firmware).

There is no need to reboot a master is everything is up to date and set up properly. Set all of the thresholds and memory to max and make sure you have the latest firmware. I had a job recently that had 50+ separate rooms each with their own master and along with establishing master-to-master connections between all systems I also was opening up a telnet session to a single MediaMatrix Nion from each master and keeping it open. Before the latest firmware update the masters would start dropping offline about 15 mintues after a reboot and none would stay online longer than 30 mintues. The programs kept running internally, but the ethernet port died. Upgrading the firmware and maxing out thresholds and memory fixed the problem and not a single master has experience an ethernet port lockup in months.

Diagnosing a controller that won't respond to IP

Comments