Connections and 3.00.316

DHawthorne · April 2005

Sigh, what a nightmare week...

OK, on top of my memory leak issue, I am now having trouble with inter-master connections. One of my ancillary masters was a 260/64, so I swapped it with the older, main master, which was just a 260. I was very careful to make sure I wiped the URL lists in both masters (though only one had one). But now the other masters in the system are continuously dropping offline. They come right back up, but the dropouts can cause up to a minute or so of unresponsiveness in the panels, and that is unacceptable.

I have three buildings, two of which have subsystems, and the third controls all the outdoor stuff, but has no subsystems. All the subsystems connect to both of the building masters, as does the third building master. Neither of the two central masters has anything in its URL list. This setup worked fine until I swapped out one of the central masters for a 260/64. Now I'm getting the offlines...

Anyone have any ideas?

DHawthorne · April 2005

Nothing like a fresh outlook on things... new day, and it occurred to me that I swapped masters around, and though you type in URL lists by IP number, the internal routing tables almost certainly store MAC addresses. It's possible some of the old routing tables still referenced a moved MAC and I introduced a circular route by moving things around. So I went in and wiped out all the URL lists. I re-entered them differently though - since I really only needed communications guaranteed between the central master and the remotes, I just put the IP for each of the remotes in the one master and left the rest blank. It's rock solid again, and I breathe a sigh of relief...
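
For anyone wanting to sanity-check a set of URL lists on paper before entering them, here is a minimal Python sketch. It treats each URL list entry as a directed link from one master to another and looks for a loop. That model is an assumption for illustration, not documented NetLinx routing behaviour, and the master names are hypothetical.

```python
def find_route_cycle(url_lists):
    """Return one loop of masters as a list, or None if there is no loop.

    url_lists maps a master name to the masters named in its URL list.
    Assumption: each URL list entry behaves like a one-way link.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in url_lists.get(node, []):
            seen = state.get(nxt, UNSEEN)
            if seen == IN_PROGRESS:
                # nxt is already on the current path, so we have looped back.
                return path[path.index(nxt):] + [nxt]
            if seen == UNSEEN:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(url_lists):
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


# Star layout: only the central master lists the remotes (hypothetical names).
star = {"central": ["remote1", "remote2", "remote3"],
        "remote1": [], "remote2": [], "remote3": []}

# A layout where the entries chase each other around in a circle.
loop = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

With the star layout the function finds no loop; with the second layout it reports the chain of masters that leads back to its starting point.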

DHawthorne · April 2005

I'm not out of the woods yet with this thing. There is still a relatively small memory leak that is related to the previous post. All I did by repairing the URL lists was rein in the rampant disconnect/reconnect issues. I still get some dropoffs, which I expect to see on an IP-based connection, but new memory is allocated whenever that happens, and the memory from the lost connection does not seem to be released under all circumstances. So I lose a few K every other day or so, instead of every other minute or so. But I'm still losing memory.

In addition, something very odd happened today - I was plugging away with some touch-up work and having no real issues; I moved to another part of the house to run a few tests, and found that all the panels in that area had become unresponsive, as if they had lost their connection. I had a Telnet window open, and it too had become sluggish to the point of near unresponsiveness. No amount of rebooting the master helped, until I logged into each of the remote masters and rebooted those. All was fine after that.

Fortunately, this kind of catastrophe is happening while I am on site and able to remedy it - but I would really like to know what the cause is so I can relax about it and not have to visit the house every day checking in on it.

DHawthorne · June 2005

I keep updating this thread not to bump it per se, but in hopes that eventually new data I dig up and report will trigger some new insight on someone's part and I can get to the bottom of it.

The system ran for about two weeks after my last post until I got a call from the customer claiming sluggish response from his touchpanels. I was neck deep in another project and under some deadline pressure, so all I could do was Telnet into it and reboot the main master. Everything came back up and ran fine for another couple of weeks, then I got another call that the entire system was locked up. I couldn't even Telnet into the main master. This time I was able to make a site visit, and found several of the local system masters had a lot of backed-up message queues and IP-related messages in the log, though nothing that reported on its own with a simple "msg on"; I had to do a "show log" to see any of it. I modified some feedback code that I had previously back-burnered to make it need less inter-master communication, reloaded, rebooted, and the system was up and running with no signs of any trouble. It did, however, take several reboots before everything came up properly.

This time, I added a heartbeat function to the main master. Once a minute, it messages another master in the system. That master resets a ten-minute timeline whenever it receives the heartbeat message. If that timeline ever runs down the entire 10 minutes without hearing from the main master, it pulses a relay that will kill the power completely on the main master and force a cold reboot. The master that initiates the reboot will e-mail me whenever this happens, so at that point I can log into the system and make sure everything comes back up properly.
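
For readers who want the logic of that watchdog spelled out, here is a rough Python sketch. It is not the NetLinx code running on the system; the class and method names are made up for illustration, and the relay pulse and e-mail are stood in for by a callback.

```python
import time


class HeartbeatWatchdog:
    """Tracks the last heartbeat and reports when the timeout has run down."""

    def __init__(self, timeout_s=600.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # ten minutes, as in the post
        self.clock = clock              # injectable so it can be tested
        self.last_beat = clock()
        self.tripped = False

    def beat(self):
        """Call when a heartbeat message arrives; restarts the countdown."""
        self.last_beat = self.clock()
        self.tripped = False

    def poll(self, on_timeout):
        """Call periodically; fires on_timeout once per missed-heartbeat episode."""
        expired = self.clock() - self.last_beat >= self.timeout_s
        if expired and not self.tripped:
            self.tripped = True         # do not keep pulsing the relay
            on_timeout()
            return True
        return False


def power_cycle_main_master():
    # Hypothetical stand-in for "pulse the relay and send the e-mail".
    print("heartbeat lost: power-cycling main master")
```

The monitoring side would call `beat()` whenever the once-a-minute message arrives and `poll(power_cycle_main_master)` on a short interval. The `tripped` flag keeps the relay from being pulsed repeatedly while the main master is still booting and has not yet resumed its heartbeat.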

I cannot begin to tell you how much I hate being forced to install such a routine, but I'm at my wit's end. I simply cannot have this system dropping dead every two or three weeks, and I'm not getting anywhere with discovering the source of the problem. I am reasonably sure it is a communications problem, and that a message or event queue is backlogging to the point of stalling the entire system. I cannot be certain whether there is a fault in some of my code, or whether this is inherent in the product itself and the complexity of the job brings it out. I strongly believe it is the latter, but not knowing for sure exactly what is happening, I wouldn't bet the farm on anything right now.

maxifox · June 2005

DHawthorne, may I suggest running a network sniffer for a while? Perhaps it will reveal something...

Cameron D · June 2005

Ethernet Problems.

Dave,
I have had problems with a few NI3000/4000s. The program was working in one old NetLinx with the same firmware as the new NI3000, but the new one kept dropping the links to my Tandberg codec, which is using the telnet port.
The other problem was that NetLinx Studio kept dropping the link to the problem NI3000 as well, after about 5 minutes. So I got on the phone to AMX, and they said to send them the Tandberg codec and the NI3000, so I did.

A few days later, problem fixed.

A new Ethernet board.

Cameron

DHawthorne · June 2005

I have run a packet sniffer, but this is an extremely busy network, and the problem only crops up once every few weeks. It's too much data to sift through. The customer is mirroring his corporate server from his house, and the network architecture was originally not intended for that; unfortunately, I don't have any way of isolating the NetLinx equipment from the rest at this point without a lot of re-wiring and hardware the customer is not willing to pay for.

All the masters but one, which is the least used and has not itself had issues, are NXI frames with ME260 processors (and one ME260/64). It's not an NI related issue.
