Connections and 3.00.316
DHawthorne
Posts: 4,584
in AMX Hardware
Sigh, what a nightmare week...
OK, on top of my memory leak issue, I am now having trouble with inter-master connections. One of my ancillary masters was a 260/64, so I swapped it with the older main master, which was just a 260. I was very careful to make sure I wiped the URL lists in both masters (though only one had one). But now the other masters in the system are continuously dropping offline. They come right back up, but the dropouts can cause up to a minute or so of unresponsiveness in the panels, and that is unacceptable.
I have three buildings, two of which have subsystems, and the third controls all the outdoor stuff but has no subsystems. All the subsystems connect to both of the building masters, as does the third building master. Neither of the two central masters has anything in its URL list. This setup worked fine until I swapped out one of the central masters for a 260/64. Now I'm getting the offlines...
Anyone have any ideas?
In addition, something very odd happened today - I was plugging away with some touch-up work and having no real issues; I moved to another part of the house to run a few tests, and found that all the panels in that area had become unresponsive, as if they had lost their connection. I had a Telnet window open, and it too had become sluggish to the point of near unresponsiveness. No amount of rebooting the master helped, until I logged into each of the remote masters and rebooted those. All was fine after that.
Fortunately, this kind of catastrophe is happening while I am on site and able to remedy it - but I would really like to know what the cause is so I can relax about it and not have to visit the house every day checking in on it.
The system ran for about two weeks after my last post until I got a call from the customer claiming sluggish response from his touchpanels. I was neck deep in another project and under some deadline pressure, so all I could do was Telnet into it and reboot the main master. Everything came back up and ran fine for another couple of weeks, then I got another call that the entire system was locked up. I couldn't even Telnet into the main master. This time I was able to make a site visit, and found that several of the local system masters had a lot of backed-up message queues and IP-related messages in the log, though nothing that reported on its own with a simple "msg on"; I had to do a "show log" to see any of it. I modified some feedback code that I had previously back-burnered so that it needs less inter-master communication, reloaded, rebooted, and the system was up and running with no signs of any trouble. It did, however, take several reboots before everything came up properly.
This time, I added a heartbeat function to the main master. Once a minute, it messages another master in the system. That master resets a ten-minute timeline whenever it receives the heartbeat message. If that timeline ever runs down the entire 10 minutes without hearing from the main master, it pulses a relay that will kill the power completely on the main master and force a cold reboot. The master that initiates the reboot will e-mail me whenever this happens, so at that point I can log into the system and make sure everything comes back up properly.
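For anyone wanting to picture the logic, here is a rough sketch of the watchdog side in Python rather than NetLinx; it is not the code running on site. The class name, the power_cycle and notify callbacks, and the choice to fire only once until the next heartbeat arrives are all assumptions made for illustration. Only the one-minute heartbeat and ten-minute timeout come from the description above.

```python
from dataclasses import dataclass
from typing import Callable

HEARTBEAT_INTERVAL_S = 60   # main master sends a heartbeat once a minute
WATCHDOG_TIMEOUT_S = 600    # watching master waits ten minutes before acting


@dataclass
class HeartbeatWatchdog:
    """Hypothetical model of the master that watches the main master."""

    timeout_s: float
    power_cycle: Callable[[], None]   # assumed hook: pulse the relay
    notify: Callable[[str], None]     # assumed hook: send the e-mail
    last_heartbeat: float = 0.0
    tripped: bool = False

    def on_heartbeat(self, now: float) -> None:
        # Every heartbeat restarts the countdown and re-arms the watchdog.
        self.last_heartbeat = now
        self.tripped = False

    def poll(self, now: float) -> bool:
        """Check the countdown; return True if a reboot was triggered."""
        if self.tripped:
            # Assumption: do not pulse the relay again until the main
            # master has come back and sent a fresh heartbeat.
            return False
        if now - self.last_heartbeat >= self.timeout_s:
            self.tripped = True
            self.power_cycle()
            self.notify("Main master missed heartbeats; power was cycled")
            return True
        return False


if __name__ == "__main__":
    actions = []
    watchdog = HeartbeatWatchdog(
        timeout_s=WATCHDOG_TIMEOUT_S,
        power_cycle=lambda: actions.append("relay pulsed"),
        notify=lambda text: actions.append("mail: " + text),
    )
    watchdog.on_heartbeat(now=0)
    watchdog.poll(now=HEARTBEAT_INTERVAL_S)   # heartbeat still recent
    watchdog.poll(now=WATCHDOG_TIMEOUT_S)     # ten silent minutes
    print(actions)
```

Passing the time in as an argument instead of reading a clock inside the class keeps the sketch easy to test. A real implementation would drive poll from a timer on the watching master.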
I cannot begin to tell you how much I hate being forced to install such a routine, but I'm at my wit's end. I simply cannot have this system dropping dead every two or three weeks, and I'm not getting anywhere with discovering the source of the problem. I am reasonably sure it is a communications problem, and that a message or event queue is backlogging to the point of stalling the entire system. I cannot be certain whether there is a fault in some of my code, or whether this is inherent in the product itself and the complexity of the job brings it out. I strongly believe it is the latter, but not knowing for sure exactly what is happening, I wouldn't bet the farm on anything right now.
Dave,
I have had problems with a few NI3000/4000s. The program was working in one old NetLinx master with the same firmware as the new NI3000, but the new one kept dropping the links to my Tandberg codec, which uses the Telnet port.
The other problem was that NetLinx Studio also kept dropping the link to the problem NI3000 after about 5 minutes. So I got on the phone to AMX, and they said to send them the Tandberg codec and the NI3000, so I did.
A few days later, problem fixed.
A new Ethernet board.
Cameron
All the masters but one, which is the least used and has not itself had issues, are NXI frames with ME260 processors (and one ME260/64). It's not an NI-related issue.