Diagnosing a controller that won't respond to IP
NMarkRoberts
Posts: 455
I've got a problem with a controller that won't respond to IP, so I had a chat with the local tech support, and their advice was so thorough and usable that I thought I would record it for posterity.
It is possible for code to spin and lock up a controller in this way. (This I plan to test thoroughly, as I can't see how; surely background processes are multithreaded.)
It is also possible for IP to lock up while the controller keeps going. This kills ping, telnet, browser and Modero touchpanel access. It may be caused by insufficient nonvolatile memory, so go look for a memory leak.
Can you connect via serial? If so do a "show mem" and a "msg on" to see what's happening.
Are you doing any IP stuff in the code? Is it possible you are using up all the available IP connections?
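If that's the suspect, the usual culprit is sockets that get opened but never closed again after an error. A minimal sketch of the defensive pattern (device numbers, addresses and the reconnect delay are all made up for illustration):

    DEFINE_DEVICE
    dvIPClient = 0:3:0          // local IP socket (port number is arbitrary)

    DEFINE_CONSTANT
    INTEGER nSERVER_PORT = 23   // hypothetical target port

    DEFINE_VARIABLE
    CHAR cServerIP[15] = '192.168.1.50'   // hypothetical device address

    DEFINE_START
    ip_client_open(dvIPClient.PORT, cServerIP, nSERVER_PORT, IP_TCP)

    DEFINE_EVENT
    DATA_EVENT[dvIPClient]
    {
        ONLINE:
        {
            // Connected; talk to the device here.
        }
        OFFLINE:
        {
            // Socket is closed and free again; reconnect after a delay
            // so a bouncing device can't churn through sockets.
            wait 100 ip_client_open(dvIPClient.PORT, cServerIP, nSERVER_PORT, IP_TCP)
        }
        ONERROR:
        {
            // A failed open can leave the local port tied up; close it
            // explicitly before the next attempt.
            ip_client_close(dvIPClient.PORT)
        }
    }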
Some network switches get offended by UDP broadcasts, so disable them by setting "set udp bc rate" to 0.
Look for LED transmit activity on the front panel to tell you that the code is still running and look at the green Link Active LED on the network socket on the back to see if IP is working.
Comments
Set Duet memory higher than the 2MB default; set it to 10MB or so. Do this even if you won't use any Duet code.
When accessible via RS-232, try "show log all".
Sometimes, the nature of the fault in the code is running out of memory. Since the "drive" space is part of the master's memory, too many overlarge files on the drive can deplete it. So can too much recursion (though usually you will get a runtime message if that is happening).
I haven't had IP processes lock a master, but I have had them bog it down considerably, to the point where actions lagged so much it may as well have been locked. But usually it's runaway code causing the IP connections to bog down, not the other way around.
Here's an update. A given NI4000 has spontaneously gone blind to IP twice. No ping, telnet, http or touchpanel connection. (The touchpanels are happily pinging.) Transmit LEDs show that the code is still running and sending status requests to various gear. The client tells me that the Ethernet Link Active light is flashing(!) The problem is corrected by a reboot.
The firmware is up to date, and it's non-Duet. After a few days of monitoring it does not appear to be leaking memory, but I've coded a memory check that will email me a warning if it does.
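For anyone curious, here's roughly how that self-check can be built: the master telnets to its own console over loopback, issues "show mem", and inspects the reply. This is a sketch, not my exact code; the socket number, the hourly interval, and especially the parsing are assumptions you'd adapt:

    DEFINE_DEVICE
    dvSelfConsole = 0:4:0    // local socket used to telnet to our own console

    DEFINE_CONSTANT
    LONG TL_MEMCHECK = 1

    DEFINE_VARIABLE
    LONG lMemCheckTimes[] = { 3600000 }   // check once an hour
    CHAR cMemReply[4096]

    DEFINE_START
    TIMELINE_CREATE(TL_MEMCHECK, lMemCheckTimes, 1, TIMELINE_RELATIVE, TIMELINE_REPEAT)

    DEFINE_EVENT
    TIMELINE_EVENT[TL_MEMCHECK]
    {
        cMemReply = ''
        ip_client_open(dvSelfConsole.PORT, '127.0.0.1', 23, IP_TCP)
    }

    DATA_EVENT[dvSelfConsole]
    {
        ONLINE:
        {
            send_string dvSelfConsole, "'show mem',13,10"
        }
        STRING:
        {
            cMemReply = "cMemReply, data.text"
            // Parse the free-memory figure out of the reply here. The
            // exact wording of the console output varies by firmware,
            // so capture it once on your own master and match on that.
            // If the number is below a threshold, send the warning
            // email (e.g. a raw SMTP session on another socket), then:
            // ip_client_close(dvSelfConsole.PORT)
        }
    }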
As this master is 3,600km from here with minimal tech support onsite, I've also arranged for a spare NI to be installed alongside with a serial cable to the program port. That will allow me to talk serial to it if it happens again.
I'm thinking about writing some code that detects when it loses all IP and reboots itself.
Any ideas?
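To sketch what I have in mind: periodically attempt a TCP connect to something that's always on the LAN, and reboot if it fails several times running. The probe address, port, retry count and timings below are all made up (and see Jeremy's loopback caveat further down before pointing the probe at the master itself):

    DEFINE_DEVICE
    dvProbe = 0:5:0             // local socket for the connectivity probe

    DEFINE_CONSTANT
    LONG    TL_PROBE   = 2
    INTEGER nMAX_FAILS = 4      // reboot after 4 straight failures (20 min)

    DEFINE_VARIABLE
    LONG    lProbeTimes[] = { 300000 }   // probe every 5 minutes
    INTEGER nFails

    DEFINE_START
    TIMELINE_CREATE(TL_PROBE, lProbeTimes, 1, TIMELINE_RELATIVE, TIMELINE_REPEAT)

    DEFINE_EVENT
    TIMELINE_EVENT[TL_PROBE]
    {
        // The router's web port stands in for "anything reliable on the LAN".
        ip_client_open(dvProbe.PORT, '192.168.1.1', 80, IP_TCP)
    }

    DATA_EVENT[dvProbe]
    {
        ONLINE:
        {
            nFails = 0
            ip_client_close(dvProbe.PORT)   // the handshake was all we wanted
        }
        ONERROR:
        {
            nFails++
            ip_client_close(dvProbe.PORT)
            if (nFails >= nMAX_FAILS)
                REBOOT(0:1:0)   // last resort: restart the master
        }
    }

One caveat: if the IP stack is wedged hard enough that even ONERROR never fires, only a hardware watchdog will catch it.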
I guess what may be unclear is whether the NI is actually falling over, or the network is suddenly not liking the NI any more. Quite different, same result.
We had issues around a Netcomm unit that stopped TPs connecting to the master every now and then. We replaced the router, and haven't seen the problem since.
Can you temporarily create a closed network with, say, a linksys router and see if you still lose control? Seems like it is fairly regular and could rule out the network as a cause quite quickly.
We had also lost that same master on a previous occasion, exhibiting the same symptoms, but it came back up by itself the next morning about 8 hours after it went offline.
In our case, we had a module that was flooding the system with messages and we were watching the messages remotely until it stopped communicating suddenly. Fixing the module's "chattiness" has prevented it from happening since.
The point is, we lost remote communication while locally it was still functioning fine. If it's not a memory leak, could it be an excess of feedback being generated? That appears to be what happened in our case.
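On the feedback point: the classic way to flood a master is unthrottled feedback in mainline. The standard throttle looks something like this (the panel device and channels are made up):

    DEFINE_DEVICE
    dvTP = 10001:1:0    // touch panel (hypothetical)

    DEFINE_VARIABLE
    INTEGER nSourceSelected

    DEFINE_PROGRAM
    // DEFINE_PROGRAM runs on every mainline pass, so unconditional
    // send_commands or send_strings here can flood the message queues.
    // The wait limits the feedback refresh to once every 0.2 s.
    wait 2
    {
        [dvTP, 1] = (nSourceSelected = 1)
        [dvTP, 2] = (nSourceSelected = 2)
    }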
Back in the old days of less reliable gear, we used to plug the master controllers (Cres****, and AMX or Panja, or whatever it was called at the time ) into a timer like for Christmas lights and have the power reboot on the system at 3:30 every morning as a "just in case" measure.
--John
http://www.digital-loggers.com/EPCR2.html
It's around $300.00 US, has 8 controllable 120V outlet ports in 2 separate power banks for 2 separate circuits (2 cords). RS-232 isn't active yet, but it can be controlled via HTTP commands and has a built-in web server. There are many other products that can do auto-ping reboots as well.
It's a simple way to be able to power cycle cable boxes from code as well.
What I did was create a "dead man" timer on my master. Every 15 minutes, it sent an "I'm OK" message to another master in the system. If more than 15 minutes went by without that message getting through, the second master would fire a relay that reset the power on the first, then send me an email that this had occurred so I could check up on it. I've since cleared up the problem that was causing the lockups, but I left the mechanism in place. Once or twice in the last few months it's reset the system due to backlogged events from network issues and brownouts; each time the system recovered gracefully, and when I checked up, everything was running fine again. The client has yet to notice one of these resets.
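In case it helps anyone copy the idea: the sending master just does a send_string to a virtual device on the second master every 15 minutes; the receiving side is the interesting half. A sketch of that half (device numbers, timings and the relay wiring are all assumptions):

    DEFINE_DEVICE
    vdvPeer  = 33001:1:0    // virtual device the first master sends 'IMOK' to
    dvRelays = 5001:4:0     // relay card wired through the first master's power

    DEFINE_EVENT
    DATA_EVENT[vdvPeer]
    {
        STRING:
        {
            if (find_string(data.text, 'IMOK', 1))
            {
                // Heartbeat received: push the deadline out 18 minutes.
                cancel_wait 'DEADMAN'
                wait 10800 'DEADMAN'    // WAIT counts in 0.1 s units
                {
                    pulse[dvRelays, 1]  // cycle power on the first master
                    // ...then send the notification email so someone
                    // checks up on it
                }
            }
        }
    }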
I've thought about how to do this on a single-master system, and the idea I've come up with is something like an Altronics 6060 timer relay set to hold closed as long as it's pulsed periodically (see the sketch below). This, however, wouldn't help a system that lost IP comm, but would help one that was completely locked up. A variant on the two-master technique might be for the master to open an IP connection to itself, and to fire a timer on its offline event that is reset by an online event. Again, if it times out, a relay kicks in to reset the master's power.
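The timer-relay variant would just need the code to keep kicking the relay, along these lines (the relay assignment and intervals are assumptions):

    DEFINE_DEVICE
    dvRelays = 5001:4:0    // relay wired to the hold-closed timer (assumption)

    DEFINE_CONSTANT
    LONG TL_KICK = 3

    DEFINE_VARIABLE
    LONG lKickTimes[] = { 30000 }   // kick every 30 s

    DEFINE_START
    TIMELINE_CREATE(TL_KICK, lKickTimes, 1, TIMELINE_RELATIVE, TIMELINE_REPEAT)

    DEFINE_EVENT
    TIMELINE_EVENT[TL_KICK]
    {
        // While the event engine is alive, the timer keeps getting
        // pulsed and holds power on; if the master locks hard, the
        // pulses stop and the timer cuts power.
        pulse[dvRelays, 1]
    }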
This might be the solution to a single master reboot:
http://www.cpscom.com/gprod/apc.htm
The price is very good, and their products work as advertised
Brad
Unfortunately, this doesn't test IP connectivity -- a master with its LAN port physically unplugged will still be able to use IP to connect to itself, because it will be routed over loopback and not via the network. You need another device out there to test that. If you want to get fancy, open an IP connection to your console and use the "ping" command to verify connectivity to the router or something that will reply. If that fails, trigger a reboot. It just might work...
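Jeremy's console-ping idea might look roughly like this: connect to the console over loopback (which, as he notes, works even when the LAN is down) and ask it to ping something that isn't on loopback. The reply text varies by firmware, so the success test is deliberately left open, and the addresses are assumptions:

    DEFINE_DEVICE
    dvConsole = 0:6:0    // local socket to our own telnet console

    DEFINE_START
    // Kick off one probe; in practice this goes in a repeating timeline.
    ip_client_open(dvConsole.PORT, '127.0.0.1', 23, IP_TCP)

    DEFINE_EVENT
    DATA_EVENT[dvConsole]
    {
        ONLINE:
        {
            // Ask the console to ping the router for us.
            send_string dvConsole, "'ping 192.168.1.1',13,10"
        }
        STRING:
        {
            // Inspect data.text for the console's success/failure wording
            // (capture it once on your own master to see the exact text).
            // On failure, bump a counter and REBOOT once it hits a limit.
        }
    }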
Jeremy
That in itself is the problem. As mentioned in this thread - there was a bug that caused the ethernet port to lock up and I got it to happen quite frequently on more than one job I was doing. I'm sure someone could explain all of the intricate reasons, but the bottom line is that there was a problem that was fixed by the latest firmware (and yes that's the latest DUET firmware - not fixed in older firmware).
There is no need to reboot a master if everything is up to date and set up properly. Set all of the thresholds and memory to max and make sure you have the latest firmware. I had a job recently that had 50+ separate rooms, each with its own master; along with establishing master-to-master connections between all systems, I was also opening a telnet session to a single MediaMatrix Nion from each master and keeping it open. Before the latest firmware update the masters would start dropping offline about 15 minutes after a reboot, and none would stay online longer than 30 minutes. The programs kept running internally, but the Ethernet port died. Upgrading the firmware and maxing out thresholds and memory fixed the problem, and not a single master has experienced an Ethernet port lockup in months.
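For anyone wondering where those settings live: from memory they are console commands on the master, along the lines of the following (names and ranges vary by firmware, so check "help" in your own telnet session):

    set threshold    - raises the interpreter's message threshold values
    set queue size   - raises the internal message queue sizes
    set duet mem     - raises Duet memory, per the earlier post

Each one prompts interactively for the new value and typically takes effect after a reboot.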