Netlinx Locks up
Thomas Hayes
in AMX Hardware
Hi everyone. I am having an unusual problem with an NXI with an ME260 that locks up. All the LEDs on the unit stop working and I lose all comm. A hard reboot corrects the problem for anywhere from several hours to several days. I have replaced the PS. I'm now looking at replacing the whole unit. Has anyone else seen this type of problem? I am wondering if a power fluctuation could be the problem.
When you say LEDs on the unit stop working, do you mean *ALL* LEDs (both on the ME260 and NXI), the LEDs on the NXI, or the LEDs on the ME260?
I've seen lockups occur, typically due to code issues. It got to a point where I could reproduce at will, but the code paths involved were too complex to easily isolate. I needed to write a "black box flight recorder" (implemented in syslog.axi, which is available on SourceForge in my NetLinx project) in order to track down what was occurring, and why. When this problem had occurred, information sent via SEND_STRING 0 wasn't transmitted (that stuff would have been sent, but the O/S got locked up first).
In this case, the ME260 itself was alive, but nothing else was.
Could this be a code issue? Have you tried to isolate?
You could try Tech Support; perhaps they've seen or heard of this.
Things to check: buffer overruns, non-volatile memory use (which is to say, a memory leak in your code using non-volatile memory). Some specific circumstances that have caused me trouble were re-entrant function calls that were processed too rapidly, and message queues to devices that have dropped temporarily off line and backed up. I suggest telnetting into your master and watching it for "xxx message pending" statements, which are a clue something is backing up, especially if it gets to the "interpreter suspended" point. You can go back a ways using the "show log /all" command as well. Once the problem is critical, you may not be able to telnet in at all, or it continually resets the telnet connection; you have to catch these things before they bring you down. If you get no clues via telnet, you can also turn on device notifications in NetLinx Studio and watch for unusual traffic.
I found a problem once in one of the few jobs we farmed to subcontractors that used one of the standard Marantz RS-232 modules: the coder wasn't using the power state feedback on his panel, so he apparently set both the on and off power state channels to 255. The Marantz module was testing the states, then using a send command to set each of the feedback channels to match what the module was reporting the power condition as. Since both on and off had the same channel, it was toggling this single channel hundreds of times a second, all using send commands. The system would run fine for a few days, then lock up solid. I found it by doing what I mentioned above: telnet showed me I was getting backups in the message queue, and the device notifications showed the send commands to that channel going out in massive quantities every second. All we needed to do was comment out the feedback section we weren't using, and all was well. The customer even had better home network performance when it was fixed, because this was an MVP-8400 and the extra traffic was bogging down his wireless network.
I've had similar issues when buffering RS-232 commands with too small a buffer. When the buffer got filled, I had unpredictable results. I have since gotten in the habit of testing the length of my buffers so that if something unforeseen backs them up, I can clear the buffer before it overruns. It's best never to need to do this of course, but lost data is better than a lockup, especially when the circumstances are beyond your control due to a flaky RS-232 implementation from a manufacturer (which is disconcertingly common).
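A minimal sketch of that buffer-length check, for illustration only. The device address, buffer size, and flush threshold are all assumptions, not values from anyone's system in this thread:

```netlinx
PROGRAM_NAME='buffer_guard_sketch'

DEFINE_DEVICE
dvSerial = 5001:1:0                 // hypothetical RS-232 port

DEFINE_CONSTANT
INTEGER BUF_SIZE      = 1000        // declared buffer size (assumed)
INTEGER BUF_HIGH_MARK = 800         // flush threshold, arbitrarily chosen below BUF_SIZE

DEFINE_VARIABLE
VOLATILE CHAR cRxBuf[BUF_SIZE]

DEFINE_START
CREATE_BUFFER dvSerial, cRxBuf

DEFINE_EVENT
DATA_EVENT[dvSerial]
{
    STRING:
    {
        // If unparsed data has piled up, drop it rather than risk an overrun.
        IF (LENGTH_STRING(cRxBuf) > BUF_HIGH_MARK)
        {
            SEND_STRING 0, "'dvSerial buffer backed up, clearing ', ITOA(LENGTH_STRING(cRxBuf)), ' bytes'"
            CLEAR_BUFFER cRxBuf
        }
        // normal parsing of cRxBuf would go here
    }
}
```

The log line to the master is there so a flush leaves a trace; a buffer that needs clearing regularly points at a parsing or device problem worth chasing separately.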
One other thing to look out for is WAIT_UNTIL statements, especially in communication with other devices. Use the timed version if you can, or make sure there is no possible circumstance in which they will simply wait forever.
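As an alternative to the timed version, one way to bound a WAIT_UNTIL is to pair it with an ordinary named WAIT that cancels it. This is only a sketch: the device address, the ready flag, and the device command are hypothetical.

```netlinx
PROGRAM_NAME='wait_until_timeout_sketch'

DEFINE_DEVICE
dvSerial = 5001:1:0                 // hypothetical RS-232 device

DEFINE_VARIABLE
VOLATILE INTEGER nDeviceReady       // hypothetical flag, set elsewhere when the device answers

DEFINE_FUNCTION fnSendWhenReady()
{
    WAIT_UNTIL (nDeviceReady) 'WAIT FOR READY'
    {
        CANCEL_WAIT 'READY TIMEOUT'             // condition met, timeout no longer needed
        SEND_STRING dvSerial, "'PWR ON', 13"    // hypothetical device command
    }
    WAIT 50 'READY TIMEOUT'                     // 50 tenths of a second = 5 seconds
    {
        CANCEL_WAIT_UNTIL 'WAIT FOR READY'      // give up instead of waiting forever
        SEND_STRING 0, "'dvSerial never reported ready, giving up'"
    }
}
```

Whichever of the two fires first cancels the other, so neither is left pending after a timeout.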
I suspect this does not apply in your case but we had a situation in a hotel application where some of our NI-2000 controllers used for local room control would lock up as you describe and go offline. As it turns out, there was a contact closure on the NI-2000 wired to a PCS for the television in the room. The PCS was not tuned properly and was changing open/close states so rapidly that the processor simply went offline and was not accessible from Studio nor the local Modero touchpanel. When we have a lockup situation or a processor that appears to be offline, this is one of the first things we check. As was true in your case, the programming was fine and nothing had changed in a long period of time but suddenly the system was not performing as expected.
Now, if only all devices provided discrete on/off commands, then PCS could go away.
I ran into the same thing last week with a system. A misadjusted PCS was sending so many Button_Events that the system would completely lock. This would occur randomly anywhere from a few hours to a few weeks. A hard reboot was the only fix.
We finally left a notebook at the job site with Studio running and both notifications and diagnostics turned on. The data logs told the story.
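If leaving a notebook on site is not practical, a rough event-rate counter in code might flag a chattering input before it takes the master down. This is a sketch under assumptions: the I/O device address, channel, and threshold are hypothetical, and on a master that is already flooded the log line may never get out.

```netlinx
PROGRAM_NAME='input_rate_watch_sketch'

DEFINE_DEVICE
dvIO = 5001:9:0                     // hypothetical I/O port the PCS is wired to

DEFINE_CONSTANT
LONG    TL_RATE            = 2      // timeline ID (arbitrary)
INTEGER MAX_PUSHES_PER_SEC = 20     // alarm threshold (assumed)

DEFINE_VARIABLE
VOLATILE LONG    lRateTimes[] = {1000}   // one-second window, in milliseconds
VOLATILE INTEGER nPushCount

DEFINE_START
TIMELINE_CREATE(TL_RATE, lRateTimes, 1, TIMELINE_RELATIVE, TIMELINE_REPEAT)

DEFINE_EVENT
BUTTON_EVENT[dvIO, 1]
{
    PUSH:
    {
        nPushCount++                // count every closure seen on input 1
    }
}

TIMELINE_EVENT[TL_RATE]
{
    IF (nPushCount > MAX_PUSHES_PER_SEC)
    {
        SEND_STRING 0, "'WARNING: ', ITOA(nPushCount), ' pushes on dvIO input 1 in the last second'"
    }
    nPushCount = 0                  // start the next one-second window
}
```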
i am on site right now with an NI-2000 which is doing the same thing.. locking up.., leds going out.. no DMS or Ethernet comms.
there are two DMSs, one 15" wired panel, one tuner on IR, one DVD on IR, a Suite 16 and serial to control lighting.
this system mostly worked in the office, although i could not use the SureCom 16 port switcher, i had to use a simple dumb 8 port hub otherwise things could lockup.
now on site, it will run for several hours, then stop. i've checked my logs to see that devices have dropped offline then back online during this stage, the lighting on serial port 1 coming online repeatedly (at least 10 times in a few seconds)
i have an 'offline' indicator on my 15" panels (just a button saying OFFLINE, it gets switched ON (transparent) by the netlinx controller when it goes online)
this offline indicator is lighting up, but sometimes i can still get control to occur, but only for a few minutes.
being the programmer of this equipment, i find myself having to try to defend my code, and yet all the while i do believe that the netlinx products are unstable. not one of the last six installations i've done have been trouble free. panels, controllers, IRS (ir extender) cards; they all seem be flaky.
i am now going to check all the power supplies, all the data cables and who knows what else. i dont believe it's in my program, i mean i am only writing for a house, i thought this netlinx stuff could deal with a campus.
it's all very tiring, i just wish i could have one installation where everything worked.
Don't mean to be rude but it probably is in the code. Have you taken a look at the system through Netlinx Device Notifications? My experience with a system locking up is that the cause more often than not will make a not-too-subtle appearance in the Diagnostics window. My last FUBR was trying continually to drive channel number 1025 high on 12 Modero touchpanels (don't ask). A previous FUBR was driving various channels on device 0:0:0 (again, don't ask). Both of these locked up the system, and once identified in Diagnostics were easily fixed.
Ian
Start by disconnecting every component from the Master including the Ethernet connection.
Use a freeware IP scanner such as (http://www.angryziber.com/ipscan/) to check the network for duplicate IP's. Hopefully you are in charge of the network router on your sub-net. You can always build your own sub-net just for the NetLinx components.
Start connecting the devices one at a time and check for system stability. The offending component(s) will eventually show itself.
There are a couple of things that are sure-fire NetLinx lockup problems. One is a filled-up Event queue: too many events to process. Two is an Ethernet network issue with IP conflicts. I have also seen problems with certain network switches.
I can't help you with each component, but at least this is a start.
i've just changed power supplies (the specs for the NI-2000 say a 6.5 amp power supply is needed, but it's rated at less than 1 amp) and the 15" panel and the controller. it's a point of concern at the moment - power that is.
the network ips are fine thanks, i am managing the network as well.
curious to see others have had problems with network switches.
as for the code, i've written one application that runs on various controllers (NI-700 and NXI-3000) and other panel configurations. pretty much the only thing that changes is the configuration files to describe a site, the core code is the same.
so, it's running mostly fine on other sites, not this one. but once again, i have to defend my code.
i see a mention of looking for too many events backlogging, i will keep an eye out for that. but again, i thought these cpus can deal with a campus, i am just trying to control a simple home. there is not that much going on.
at this stage, i am still thinking power supplies. i've now removed the ICSnet hub, with only two DMSs attached directly to the controller, and one 15" panel on its own supply.
i can't disconnect anything else, otherwise there would be nothing left to control the system with, and the customer won't like that
will keep posting updates as i work thru this.
thanks again to the repliers.
Also, don't put feedback that uses SEND_COMMANDs or SEND_STRINGs in the DEFINE_PROGRAM section. Simple channel and level updates are fine; the processor actually keeps track of those and won't send updates if they aren't needed. But commands and strings generate messages that need to be sent out and received by the device, and they use up a lot of processor ticks if you are sending a hundred of them per second. Put these kinds of updates in a TIMELINE and do them once a second or so. That's fine for most applications.
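A sketch of that arrangement: text feedback moved out of DEFINE_PROGRAM into a once-per-second TIMELINE, and sent only when it has changed. The panel address, button address, and variable names are hypothetical. The TEXT command form mirrors the one quoted elsewhere in this thread.

```netlinx
PROGRAM_NAME='timeline_feedback_sketch'

DEFINE_DEVICE
dvTP = 10001:1:0                    // hypothetical touch panel

DEFINE_CONSTANT
LONG    TL_FEEDBACK     = 1         // timeline ID (arbitrary)
INTEGER BTN_STATUS_TEXT = 10        // hypothetical text button address

DEFINE_VARIABLE
VOLATILE LONG lFeedbackTimes[] = {1000}   // fire once per second (milliseconds)
VOLATILE CHAR cStatus[32]                 // text the rest of the program keeps up to date
VOLATILE CHAR cLastSent[32]               // last text actually sent to the panel

DEFINE_START
TIMELINE_CREATE(TL_FEEDBACK, lFeedbackTimes, 1, TIMELINE_RELATIVE, TIMELINE_REPEAT)

DEFINE_EVENT
TIMELINE_EVENT[TL_FEEDBACK]
{
    // Only generate a message when the text has really changed.
    IF (cStatus <> cLastSent)
    {
        SEND_COMMAND dvTP, "'TEXT', ITOA(BTN_STATUS_TEXT), '-', cStatus"
        cLastSent = cStatus
    }
}
```

With the change check in place, an idle system sends nothing at all, which also keeps the Notifications window readable when hunting for unexpected traffic.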
One thing you can do to lower the load on the processor is to turn the broadcast off. Use terminal or telnet and type in "set udp bc rate" and follow the directions. This is only good if you DO NOT have any IP devices (MVP-8500 etc) hooked up. The original problem that I had was traced to a power issue that was being induced into the line by a contractor doing work in the room next door. Since he has finished (touch wood) the system has operated fine. Best of luck.
>>
i hear you on the notifications.
i do have a record of some weird events... here are the events...
Line 1 :: Feedback:On [0:0:1] - Channel 24 - 09:44:40
Line 2 :: Output Channel:On - From [0:0:1] - Channel 24 - 09:44:40
Line 3 :: Feedback:On [0:0:1] - Channel 44 - 09:44:40
Line 4 :: Output Channel:On - From [0:0:1] - Channel 44 - 09:44:40
Line 5 :: Feedback:On [0:0:1] - Channel 158 - 09:44:40
repeated about 40 times over two minutes.
i really am not sure why/where these events occur.
(NOTE: this is the system i changed power supply on a few days ago... at the time of writing it is still running.. now equal to best run so far)
the events have not occurred again for a few days now.
fingers crossed.
>>
as for using define_program, about the only code i use here is for some volume checking/controlling... and they have a counter so they only act on every 8th cycle, and only when the user has requested a volume change.
so yes, most of my SEND_x commands are event based only, not constantly trying to run in a loop.
NOTE: sometimes these system problems occur when no real-world events are taking place.
>>
as for the switch/hub problem, Mal and I were investigating an IR problem with the IRS (extender box). To eliminate possible areas of concern, we swapped out the NI-3000 for an NI-2000.
within a moment or two, the 2000 would no longer communicate over the switch. put an NI-700 in place and it worked. put the NI-3000 back and it worked.
this site installation has two wired CV7s and a wireless 7500, which all work with the switch.
so it left the NI-2000 as the only device that wouldn't work with the SureCom switcher.
>>
all installations so far have had an IP panel involved, so i cant turn broadcast off.
but i have to still wonder about the quality of a product that has IP support (that a customer has paid for) and the suggestion is to turn it off. i mean no offense at the suggestion, i just wonder
>>
i have come in to work this monday morning to hear of a system that crashed over the weekend, but this is an NI-3000.
i do think it's a temperature problem though. the system has been running for six weeks until we had two VERY hot days in a row. i seem to recall last time it crashed the days were also very hot.
>>
in summary, at this stage, environmental factors seem to be impacting on stability.
i think high temperatures in cupboards and power supplies that are unstable/under-rated could be the cause, although the switcher issue is an ongoing problem.
one slight other issue, modules that get installed. i dont really have control of module quality, so some things remain in question.
in some modules i have been using, i see several notifications during boot that are warnings/errors. but i cant do a thing about it.
so, have i gone far off topic ? i think all this is related, just finding where/why is the difficult part.
now, if i could get on with programming instead of troubleshooting, i'd be a lot happier.
Line 1 :: Feedback:On [0:0:1] - Channel 24 - 09:44:40
Line 2 :: Output Channel:On - From [0:0:1] - Channel 24 - 09:44:40
Line 3 :: Feedback:On [0:0:1] - Channel 44 - 09:44:40
Line 4 :: Output Channel:On - From [0:0:1] - Channel 44 - 09:44:40
Line 5 :: Feedback:On [0:0:1] - Channel 158 - 09:44:40
repeated about 40 times over two minutes.
In your code you are generating commands to device 0:0:1 to turn on those channel numbers. As mentioned previously, I have seen this before and it will crash the system.
Ian
the curious thing is, i am not generating these events.
these events haven't turned up again, they just appeared out the blue.
it was just luck that i happened to have notify logging on at the time.
i just looked thru my code. i have one reference to the number 158.. it's a label button on a web browser panel.
here is the code ...
WebServicePanel = 201:1:0
volatile integer spSunRiseDisplay = 158
send_command WebServicePanel,"'TEXT',
itoa(spSunRiseDisplay),'-',LEFT_STRING(SunRise,5)"
the number 158 is referenced in other modules... but i am trying to not get involved in other modules.. till i can prove everything else is ok.
at this time, i think a power supply problem caused erroneous data to be transmitted/received and that's how those events were created.
i will report in a few days how these latest power supplies are behaving.. been four days now
My solution for when I have had to use one is to create a virtual device, and send all of my commands to the virtual. The only reference to the actual web panel device is a DATA_EVENT that combines it with the virtual on an ONLINE event, and uncombines it on an OFFLINE event. The virtual will never go off line, and any messages to and from it will never be interrupted. Since the web device itself is only combined when the master knows it's online, it doesn't bog anything down when momentary network glitches happen.
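A sketch of the combine-on-online approach as described, assuming 201:1:0 for the web panel and 33201:1:0 for the virtual. Both numbers are used for the same purpose elsewhere in this thread; the button handler and page name are hypothetical.

```netlinx
PROGRAM_NAME='dynamic_combine_sketch'

DEFINE_DEVICE
dvWebPanel  = 201:1:0               // actual web panel
vdvWebPanel = 33201:1:0             // virtual device the program talks to

DEFINE_EVENT
DATA_EVENT[dvWebPanel]
{
    ONLINE:
    {
        COMBINE_DEVICES(vdvWebPanel, dvWebPanel)
    }
    OFFLINE:
    {
        UNCOMBINE_DEVICES(vdvWebPanel)
    }
}

// Everything else in the program references only the virtual device.
BUTTON_EVENT[vdvWebPanel, 1]
{
    PUSH:
    {
        SEND_COMMAND vdvWebPanel, "'PAGE-Main'"    // hypothetical page name
    }
}
```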
There had been some discussion of the Axlink bus and Com 1 sharing some processing duties in the NIxxxx. What are you doing with Com 1?
Since you have some Axlink devices, try to free up Com 1 and see what happens.
(******************************)
I have never seen any channel events on device 0, the Master. This does not seem correct. I hope you are not trying to use the Master port in your programming for anything other than Send_String 0.
(******************************)
As DHawthorne correctly states, you must use a virtual device for combining with the web control panel.
Define_Device
dvWebPanel = 201:1:0 // Actual
vdvWebPanel = 33201:1:0 // Virtual
(Right after the Define_Device section)
Define_Combine (vdvWebPanel,dvWebPanel)
(******************************)
Defective NIxxx do turn up occasionally, so don't rule that out.
(******************************)
The heat issue is an interesting one. Does anyone have the operating specifications?
Another suggestion:
You mentioned that many modules are using that particular channel? Perhaps by using the channel yourself in code you are causing other errors in the module. I'd recommend trying a different channel #. Or better yet, just remove the modules entirely and see what kind of results you get.
My .02
that's a good idea for systems hardware checking. the problem is that the NI3000 installation has been in use and did run for 6 weeks before crashing. i cant leave the box with no program in it for 6 weeks hoping it may crash and the customer has no system. as for an intermittent cable problem, i still have to question how it lasted for 6 weeks.
i hear what you're saying about it may be code related. As i mentioned in an earlier post, i am tired of having to defend my program (but still open to suggestions, like the Virtual device for the web panel - see below).
Netlinx is a high level programming language, in a way similar to Basic. It should not let me do harmful things to the system. I am only using the tools that have been given to me, it's not like i am hacking down to machine code level and trapping interrupts and making my own stacks blah blah. If what i write causes loops or routines that don't exit, then yes, that's on my head, but any errors like that show up straight away, not days or weeks down the track.
One of the key modules i use is for the lighting control system on Com1 (sorry Clements). The other modules are for the iMerge, the Suite 16 and DMS.
All these modules are essential to providing the customer with a product. Again, i cant run without these for days or weeks.
>>
Clements, yep, my only send to master is for logging events to the console.
>>
Hawthorne, thanks for the virtual suggestion. i am implementing that now.
a point though, this part of the system is not in use by either client. they don't even know it exists. the webpanel is for admin purposes, usually only used during install and only occasionally used after that by the support staff. so it should not have impacted at this stage.
the other part to this is... i keep a flag for when the panel comes online/goes offline, so none of the actual sending will occur if the device is not reported as online. in fact, it will only be sent to when a button on itself is pressed, so it has to be online before anything can happen.
>>
this is the only note i've found regarding temperatures....
from here,
http://www.amx.com/techsupport/PDNTechNote.asp?id=455
it says...
__________________________________________________
Our units are temperature rated, depending on whether they have a display or not.
For products that have a display, the temperature range is 0 - 40 degrees Celsius (32-104 F).
For products that do not have a display, the temperature range is 0 - 50 degrees Celsius (32-123 F).
The MTBF (Mean Time Between Failure) for all AMX products is 35,000 hours.
__________________________________________________
now, i had a thermometer in the cupboard of the NI3000 and found the temperature to be between 36 and 40c. there is a fair amount of equipment in a crowded space. three HDTV set top boxes also put out a lot of heat.
>>
generally, the controllers have been mostly solid out of the box. i've had to send one NI-3000 back so far, it had intermittent comms/power problems. the replacement has been running for months now - with my same core program in it, just reconfigured with text files. so as a bench mark, it's been a reliable installation.
thanks again to all.
this is now day five of the replacement power supplies on the NI-2000 system. this is the longest it's lasted so far.
I spent months trying to troubleshoot a Landmark system some years ago that was constantly locking up. The customer was about to have me rip it out. I even had the Landmark techs out from Utah, and they couldn't find the problem either. Then one day I just happened to be on site when the customer's AC compressors kicked on, and I noticed all the house lights dimmed. Sure enough, I went to my system, and it was locked up. Now, this job had a UPS on it, and my first thought was maybe it was bad. So I pulled it while everything else was down, and sure enough, no output. Turns out the installer, even though the piece had a HUGE orange sticker across the bottom of it, never connected the battery internally. Every time the power fluctuated, the stupid thing was actually shutting down because it had no battery. The UPS in that case was making my situation worse. I hooked up the battery, and all was well. I've been back to the house maybe a half dozen times now in 3 years, and even upgraded it to a NetLinx, but never have had a lockup since.
But more to the point, the lockups on that system began before we put the UPS in, and that was why we put it in; it just completely threw us off the track when it didn't fix the problem because it was improperly installed. Sometimes it pays to go back to things you have already done, to make certain they are doing what you thought.
Heh, all that longwinded reply just to say I think you might have hit the nail on the head going for the power supply. Bigger caps on the output and better regulation than what the PS2.4 can provide might have been your answer all along.
the site i'd been hanging on for the last week finally crashed yesterday.
i will provide a summary of what i found there below.
in the meantime, i have also checked further to try to find out what the $%^& is going on.
you may recall me mentioning the repeated Notifications as ...
NOTE: turn on All Devices/All Events
____________________________________________
Line 75321 :: Feedback:On [0:0:1] - Channel 24 - 16:19:04
Line 75322 :: Output Channel:On - From [0:0:1] - Channel 24 - 16:19:04
Line 75323 :: Feedback:On [0:0:1] - Channel 44 - 16:19:04
Line 75324 :: Output Channel:On - From [0:0:1] - Channel 44 - 16:19:04
Line 75325 :: Feedback:On [0:0:1] - Channel 158 - 16:19:04
Line 75326 :: Output Channel:On - From [0:0:1] - Channel 158 - 16:19:04
____________________________________________
i've pinned this down to the Suite 16 UI module.
i dont actually want to use the UI supplied, i have my own panel layout, so i added the UI module because the COMM module cant seem to operate without it (still looking into that).
that's the Suite 16, now the iMerge...
in Xiva-UI.axs at the end is DEFINE_PROGRAM
in there is a Wait 10 'UPDATE TIME' {blah blah}
below that is the lines...
IF (nBROWSE_MODE <> nMODE) { SEND_COMMAND dvTP, "'@SHO', nTXT_BTN[10],0" }
ELSE
IF (nMODE <> X_ALL_OFF AND nMODE <> X_PRESETS AND nMODE <> X_CD_TRAY){ SEND_COMMAND dvTP, "'@SHO', nTXT_BTN[10],1" }
now, the second (or last in the actual file) IF statement is constantly generating the SEND_COMMAND @SHO blah blah...
in this particular installation, more than one panel is meant to control the iMerge, but i have turned off all but one panel as it seemed to get out of control in no time, now i have an idea why.
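One possible way to quiet that kind of mainline traffic is to remember the last state sent and only issue the @SHO when it changes. This is a sketch, not the module author's code: the panel address, button table, and state variables are stand-ins, and nShowState is assumed to be computed elsewhere from the module's mode logic.

```netlinx
PROGRAM_NAME='sho_change_only_sketch'

DEFINE_DEVICE
dvTP = 10001:1:0                    // hypothetical panel (stand-in for the module's dvTP)

DEFINE_VARIABLE
// Stand-in for the module's button table; values are hypothetical.
VOLATILE INTEGER nTXT_BTN[] = {101, 102, 103, 104, 105, 106, 107, 108, 109, 110}
VOLATILE INTEGER nShowState             // wanted state: 0 = hide, 1 = show (set elsewhere)
VOLATILE INTEGER nLastShowSent = 255    // impossible value, forces the first send

DEFINE_PROGRAM
// Mainline runs constantly, so only send when the wanted state differs
// from what the panel was last told.
IF (nShowState <> nLastShowSent)
{
    SEND_COMMAND dvTP, "'@SHO', nTXT_BTN[10], nShowState"
    nLastShowSent = nShowState
}
```

A panel that drops off and comes back would miss the current state, so in practice nLastShowSent would probably need resetting to 255 in that panel's ONLINE event.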
by the way, Mal has pointed out that if you have a button set to 'Always On' it will ALWAYS generate a feedback event (20 or more a second). whether this creates a problem or not still needs to be confirmed. He pointed this out as we were investigating more systems.
summary of NI2000 condition when i last visited after it hung.
this is just the points as i was on site, haven't really tidied it up.
status, axlink icsp flashing 1 sec
nothing plugged into axlink
output light is totally inactive
serial port comm1 seems to still respond, and input led reacts to Cbus (lighting)
dms causes input led to react.
touch panel causes input led to react.
two dms on ics online but not controlling
touch panel working (can ping, vnc) but not controlling.
controller pingable
cant browse/telnet controller
telnet sort of connects. but no display, flat cursor.
diagnostic display shows telnet connection has occurred
can ftp
scan port reports only ftp and telnet and 1319, no port 80
online tree reports all devices there.
new panel could be uploaded via controller
controller url list and time could be interrogated
temperature around 32c
a (soft) reboot fixed it
my weekend is fast approaching, i am going home, this $%^& has given me a headache all week.
more in this ongoing saga next week.
i may have to eat some humble pie, but only a slice, as i'll have to share it round
there may be a programming issue here with all the data traffic i've mentioned above.
something Mal mentioned. It may be that trying to write to a D:P:S that doesn't exist causes constant retries with no timeout on the retries, they go on for ever (or maybe, until the system crashes).
so, with the TPD3, if it's in a combined devices group, and we try to write to a button higher than 255 in the combined group, the TPD3 panel can never reply.
with the Suite 16 and the iMerge, they are trying to write to buttons that i dont have in my panel.
these three situations fit the bill. it's a question of whether the continuous traffic (these events could end up being hundreds per second) would finally overflow the controller.
these situations have now been corrected and installed in three locations, and now there is zero data traffic, except for when it's expected. i haven't been on site to see the state of the output led, this was all done remotely.
will update when something else happens
I just read over your post and noticed you mentioned when devices are combined. Just a neat little tip I picked up a few weeks back from a top 10 programmer. Even though AMX recommends the use of 'combine' for devices, he told me that it can cause some problems. Instead he uses a DEV array with an assigned name. That way, any time he wants to talk to only one device, the others don't know. EX:
DEV dvTP[ ]={dvTP1,dvWTP1}
This way, only when you call dvTP will both devices act like they are combined; otherwise you can call dvTP1 without dvWTP1 changing.
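A slightly fuller sketch of the DEV array idea, with hypothetical panel addresses and page names. Events registered on the array fire for either panel, while commands can go to the whole array, to one named panel, or to whichever panel generated the event.

```netlinx
PROGRAM_NAME='dev_array_sketch'

DEFINE_DEVICE
dvTP1  = 10001:1:0                  // hypothetical wired panel
dvWTP1 = 10002:1:0                  // hypothetical wireless panel

DEFINE_VARIABLE
DEV dvTP[] = {dvTP1, dvWTP1}

DEFINE_EVENT
BUTTON_EVENT[dvTP, 1]               // button 1 on either panel
{
    PUSH:
    {
        // Sent to every panel in the array.
        SEND_COMMAND dvTP, "'PAGE-Main'"
    }
}

BUTTON_EVENT[dvTP, 2]               // button 2 on either panel
{
    PUSH:
    {
        // Sent only to the panel that was actually pressed.
        SEND_COMMAND BUTTON.INPUT.DEVICE, "'PAGE-Admin'"
    }
}
```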
i was a bit lazy, with all the other info, to get the exact detail, i just considered it to be within the same ball park.
thanks for the tip, will check.
the NXI-2000 that i made quiet (no notifications) on friday has crashed this morning.
just going onsite to look at it (cant get to it over internet) and will make notes when i get back.
still looking like a hardware problem, curious their net is down as well.