
Netlinx Locks up

Comments

  • today i've installed a UPS on the main touch panel.

    yes i know it probably should be on the 2000 controller, but because i have actually seen the panel lose IP for no reason, i am concerned about power to it.

    i have another UPS ready to put on the controller, but it will wait until i can prove the panel one way or the other.

    the UPS has its events patched into the controller, so if i have a panel power problem and the controller doesn't crash i will get a log report (a rough sketch of what i mean is at the end of this post).

    updates when they come.

    i keep making these notes for my own record as well as for anyone else going through these problems. if anyone is paralleling this work please feel free to comment, i appreciate knowing how others are going.
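
    for what it's worth, here's roughly how the UPS alarm contact is tied in. the device and channel numbers are placeholders for this sketch - use whichever IO/contact-closure input the UPS is actually wired to:

    DEFINE_DEVICE
    dvUPS_IO = 5001:17:0    // placeholder - the controller input port the UPS alarm contact is wired to

    DEFINE_EVENT
    CHANNEL_EVENT[dvUPS_IO,1]    // the UPS closes this contact when it sees a power problem
    {
        ON:
        {
            SEND_STRING 0, "'UPS reports power problem at ',TIME,' ',DATE"
        }
        OFF:
        {
            SEND_STRING 0, "'UPS reports power restored at ',TIME,' ',DATE"
        }
    }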
  • dbrady Posts: 30
    Hi Richard,

    Don't know why your system's locking up, but I have noticed a few things while using the iMerge module too. Apart from all the unnecessary messages it generates in the DEFINE_PROGRAM section, playlists don't work properly. Picking a song in a playlist will just start playing the album associated with that song and not the playlist. I managed to fix this by using a passthru command as shown below:

    CASE X_TRACK:
    {
        if (nSELECTED_DISC_TYPE == X_PLAYLIST)
        {
            // playlist selected: start the playlist, then jump to the chosen track via a passthru command
            SEND_COMMAND vdvDEVICE, "'PLAYLIST=',ITOA(nTP_ZONE),':',cSELECTED_DISC_ID"
            SEND_COMMAND vdvDEVICE, "'PASSTHRU=1:$SELECT$<TRACK><NUM>',ITOA(nMEM_OFFSET + nLIST_ITEM)"
        }
        else
        {
            // normal disc selection: pick the disc, then the track
            SEND_COMMAND vdvDEVICE, "'DISC=',ITOA(nTP_ZONE),':',uLIST[X_TRACK].cMEDIA_ID[nLIST_ITEM]"
            SEND_COMMAND vdvDEVICE, "'TRACK=',ITOA(nTP_ZONE),':',uLIST[X_TRACK].cID[nLIST_ITEM]"
        }
    }
    } //END SWITCH (nMODE)
    do_push(dvTp, button.input.channel + 19)    // push the matching forward-browse button so the track list shows without pressing the arrows
    } //END CASES 31-35
    CASE 36: // DECREMENT ALPHA SEARCH LETTER

    It goes in cases 31-35 of the button event. I also use a do_push for the corresponding forward-browsing button so you don't have to press the arrows to get to the tracks of a selected album.
    There are also some issues with the random and repeat buttons getting out of sync, but I haven't been able to resolve this yet. Hope this is useful.


    Sl
  • Thomas Hayes Posts: 1,164
    Okay, the system that started this whole thread just crashed again after running perfectly for almost 2 months. A manual reboot fixed everything, but I'm getting tired of walking 1/2 mile to the building to reset the &*% thing. To date I have:
    1-Replaced the power supply
    2-Added an AC surge/regulator.
    3-Changed programs.
    4-Changed out the NXI-ME260
    5-Moved the box for reduced heat.
    6-Banged head against hard object.
    I'm going back Monday to replace the last device, the TP.
    Will keep everyone informed if I find the rat in this issue.
  • thanks dbrady, i will take a look at that (when i get the time to program again :) ).

    there are other iMerge 'bugs' i've noticed, maybe i'll check and start a separate thread for that... one annoying thing that i've documented is when selecting a track number greater than one, the first few beats of the first song on the CD will play before the selected track plays.

    hi thomas. it's good news, and bad news. good that it failed, cos it supports all that we've been trying to prove here. bad that it had to happen of course.

    your AC surge/regulator could be replaced by a UPS; i'm still testing that myself so i can't really confirm it, but it's one more option left for you in the hardware department.

    just for the record, do we have your firmware version number, room temperature and any other connected devices, e.g. DMSs? (although one system i get crashes on does not have a DMS, just touch panels). oh, and maybe some idea of the network hardware/setup.

    ah well, the walk can be good on a nice day.
  • Hey Thomas,

    This may not be related to your installation, but reading your message reminded me of one of the more "out there" installation issues I've tripped across in the past.

    At a particular client, an Axcent 3 was having intermittent problems, locking up and/or not communicating with certain devices. Through some sequence of troubleshooting, we found that the system worked fine when the plasma screen's 232 cable was disconnected. Even after nuking the plasma control code from the program, the problems >still< occurred when the 232 was hooked up. After a very long period of time, we happened to discover that the plasma was on a different AC circuit than the Axcent 3's power supply. Normally that's not a bad thing, but in this case the circuit that one of the two was plugged into had the AC polarity reversed. (Or maybe they were on the same circuit and just an outlet was wired backwards.) When the ground of the plasma's 232 port was connected to a 232 port on the Axcent 3, a voltage difference between the two, introduced by the wiring fault, was clobbering the Axcent 3's circuitry to the point of wacky behavior. When I left the client, there was an orange extension cord running from the equipment rack across the floor to the plasma screen. I never heard if they got their wiring fixed, or just left the extension cord there. Since this was a spiffy-looking conference room, I'm inclined to think they got the AC wiring fixed... :)
    Originally posted by Thomas Hayes
    Here's a new one for a lock-up. Onsite to replace a projector. Downloaded the new code and everything worked but the projector. Reversed the 2-3 pins and still nothing (no Rx, but Tx). Checked the cable for continuity and it's good. Swapped ports on the NetLinx, same problem. Swapped the NetLinx and everything worked. Brought the old NetLinx back to the office and it works fine. Yes, I did try a hard reboot onsite. Odd thing that I noticed on the new box and my box in the office was that the device address for 5001 was missing. It took some time before I was able to get the box to change to 5001. The only thing I can figure is some sort of static discharge when the projector blew that spiked the system. Things that make you go HUMMMMMMMMMMMMMMMMMMMMMMMMMMM.
  • Thomas Hayes Posts: 1,164
    Hi Chip
    I have seen this in the past a few times and this is one reason that we spec that all our rooms and plugs are on a dedicated box and on the same phase. Good point and a hard problem to find.
  • damn right on the voltage difference issue. and across rs-232 devices too.

    that's one reason i sent an NI-3000 back (its replacement has been fine, although the user is tech savvy and may reboot the system without telling me, i'll check its logs).
    [update: checked the logs, been running since 12/12, just coming up to 7 weeks, not really any great benchmark yet].

    i clearly recall that one having ground voltage issues. i had to actually isolate the ground wire in a serial cable between the controller and the lighting control interface. i recall there were other side-effects... not sure what they all were now, but i did feel (though never measured) that voltage differences were the problem.

    the other side-effect i do recall is - you could never soft reboot. power had to be removed for 10 seconds (no less, i recall) and re-applied before the system could boot.

    now, i kind of think this issue is not directly related to the issue of long-term intermittent freezing, but it could be. it could also be an electrically faulty controller (hardware rather than software/firmware).
  • Thomas Hayes Posts: 1,164
    I was just onsite with this controller for almost 2 hrs and did I ever have a lot of weird sh** happen. The box dropped its 5001 device address and that took forever to get back. The IR ports refused to work and, to top it off, this is the first time that I have ever seen the LEDs flashing on the ICSNet and ICSHub connectors on the back.
  • Spire_Jeff Posts: 1,917
    Just curious, have you considered contacting Father Guido Sarducci for an exorcism?

    On a separate note, I have an NI-3000 that seems to be locking up after about 30 days. Everything appears to function properly, except the IP port. I am unable to connect to the processor via IP, the touch panels (wireless) are unable to connect to the master, but the web server functions fine. I have strong inclinations towards something in my programming, but reading this thread makes me wonder.

    I'll be interested to see what you find in this situation.

    Jeff
  • Thomas Hayes Posts: 1,164
    I thought it might be my code as well, Jeff, but it ran fine for 2 years prior. The only change was the projector last fall. It had a NEC MT-1040 and we installed the NEC GT-1150. Same basic code commands, and I added the rest. Oddly, this same program code is working fine everywhere else on campus. When it does lock up there are no errors shown in the log. HMMMMMMMMMMMMMMMMMM. Ghost in the code (finally had time to watch I, Robot this weekend).
  • Spire_Jeff Posts: 1,917
    I, Robot was a very interesting movie. I really liked the way the three rules were interpreted by the computer.

    One thought on your problem.... I noticed you replaced the processor card.... any chance the problem is in the cardframe?

    Jeff
    I actually did replace the NXI and ME260 as a complete piece. Today, if I get a chance and the room is free, I'm going back. The idea of the 3 laws was interesting because the robots would eventually realize that we (people) are not logical nor bound by them, and would therefore be forced to act as they did to protect us from ourselves. (Way too deep this early in the morning - late night of comparing AMX to Crest*** for an RFI that I'm involved with.)
  • Bad hardware

    A couple quick notes:

    The Tx but no Rx conundrum. Ran across this one myself just recently controlling a Kaleidescape server. I had run a patch 232 cord just to get the thing controllable and all was working just peachy. Our tech then ran another cable and we had this Tx but no Rx. Long story short, a quick call to AMX tech support found the culprit. In the future, never pin out all the pins on a cable, only the necessary ones. Our tech, with some amount of pride I might add, said that he ALWAYS terminates all the pins on his connectors, even if they aren't used. This can cause your 232 port to automatically drop into 485 mode if it sees any disturbances or voltages on those pins. Hence the Tx but no Rx. This in fact proved to be the case for this site.

    With the locking-up issue: if you have the opportunity, it's always nice to eliminate code as a suspect. I have run into a rare few occasions, but a few nonetheless, where erratic behaviour on either the AXlink bus or ICSNet was indeed bad hardware. My mentor always told me 9 times out of 10 it's code. But you can always test this theory by doing a CLEAN DISK, loading a bare-bones source and seeing if the lockup still occurs. Seems to me that if it were without a doubt bad hardware, the issue would happen with or without code.

    Just my .02

    p.s. on a better note, I know that I'm not the only one that is ecstatic that we finally have a forum to talk about these things. Long overdue IMO. And kudos to AMX for giving us a place to do some brainstorming!
  • hmmm
    Did the whole clean_disk, clean doc thing. Loaded everything up fresh. IR port #8 was dead so I swapped out to #9 for now. Updated all firmware. Now it's wait-and-see time.
  • the ups on the panel hasn't helped. going to try the ups on the controller next.
  • Re: Bad hardware
    Originally posted by Irvine_Kyle
    Our tech, with some amount of pride I might add, said that he ALWAYS terminates all the pins on his connectors, even if they aren't used. This can cause your 232 port to automatically drop into 485 mode if it sees any disturbances or voltages on those pins.

    Yeah, that one's a killer in the right instances... People making cables need to keep in mind that the DB-9 ports on the control systems aren't pinned out to the same standard as PCs or most AV devices. Pins 1, 4, 6 and 9 on most controllers carry the balanced TX+/- and RX+/- signals used for RS422 and RS485, while "standard" RS232 ports use those pins for other purposes. Hook 'em all up between the right gear and say "Hellllllooo!" to bad behavior.

    Learned that one the hard way with Gentner (Clear One) AP-800's and Axcent3's many years back. Installers used some off the shelf RS232 cables instead of making 'em from scratch. Nothing a pair of needle-nose pliers couldn't fix, but identifying the problem was "fun"...

    - Chip
    report from a 3000 system. the customer reported that the main panel was offline (i have a button programmed to display if the controller is lost).

    note that the CP4 panels still worked, it was the CV7 panel that was offline.

    so i swapped out the SureCom switcher, did NOT reboot netlinx, then the panel went online and everything worked.

    this is different to the 2000 system and another 3000. when they crash a reboot is required.

    a Netgear switch is now in place as the network hub. let's see how that goes.
  • the controller with the UPS on it died after a few days again.

    that's it. i throw my hands up. there is nothing left but faulty electronics.

    if you are having this problem, send the controller back.

    end of story.

    there is nothing that can be done. it's not programming, it's not power, it's not switches. it's amx electronics.

    someone prove me wrong.
  • ok, now on another system (3000) they also report that their panel has gone offline.

    on inspection, i find that the AXlink network still works.

    i replace the switcher (a Surecom) and everything works. no rebooting of netlinx controller required.

    i know i mentioned this in an earlier post, this is a confirmation of the problem.

    the NI-3000s and NXI-2000 have problems with SureCom switchers.

    who is to blame, i don't know, but i can't go blaming a generic switcher that works with other network devices too quickly.

    the solution - don't use the two together.

    i will create a separate thread on this issue, maybe someone has something to suggest.
  • Spire_Jeff Posts: 1,917
    Originally posted by RichardHamilton

    who is to blame, i don't know, but i can't go blaming a generic switcher that works with other network devices too quickly.


    Just a little insight on this. I have access to a high-end network tester that tests both the network wires and the network cards on a network. I have used this tester on a number of business networks. Nine times out of ten, the problems are caused by a few network cards that don't play well with the other network devices. You would be amazed at how far out of spec some manufacturers run when paired with other manufacturers. Now, this doesn't normally cause anything more than drastic reductions in network throughput, but some programs do experience erratic behavior even if the computer itself seems to function properly. I'm not saying that AMX is not to blame, I'm not saying the network switch manufacturer is to blame; I just wanted to point out that network problems do exist between manufacturers in other aspects of the networking world, they just don't get noticed as often as they should.

    Jeff
  • thanks Jeff for that.

    i've actually started another thread on this subject, it's at
    http://www.amxforums.com/showthread.php?s=&threadid=457

    i'll copy your post there and reply.
  • an update on the NXI-2000 and its performance

    it had been running for a week, then crashed this morning.

    now i think there has been an improvement in length of uptime, and the crash this morning may be a separate issue.

    how did i get it to run four days longer than it usually would?

    i ran the Queues and Thresholds patch. it's a patch from AMX to modify message queue sizes etc, and maybe most of you have heard of it. contact tech support if you need a copy.

    now here's my gotcha on this.

    i had run the patch a month before installing the equipment at the customer site. since the installation the system has been unstable.

    i finally got around to reviewing the queue sizes and found they had returned to defaults. so somewhere between the move from test installation to customer installation the settings had been dropped. and i was unaware of it. and i have no idea why.

    solution - i've now included the Queue and Threshold.axi in my application and run the check every time the system boots. if the check fails it will apply the patch it has to, reboot, and should continue on its merry way.

    now, that may address one area.
    but today, as the system had hung again, i still need to look closer at the situation.

    the UPSs i have installed did report that there was a power outage (still have to talk to customer to find out what happened). the UPSs seem to have protected the Master and Panel. but i think the DMSs were lost and the ethernet side of the system did not recover at all. (system was rebooted to restore full functionality). note the DMSs are powered from the controller, which is powered from the UPS. the switcher and hub are not protected.

    the customer rebooted the system before i could contact them or review the situation. i am just going off the log file right now.

    it would appear that something went wrong with the mains this morning, which the UPSs reported. Why it affected the netlinx system so badly i am not sure yet. i had hoped the UPSs would prevent all of that.

    so, one possible solution, and another pointer to possible causes, but still no answer.

    will update when i have more.
  • DHawthorne Posts: 4,584
    You have DMS keypads on this system? I don't recall seeing that in previous posts, but I would escalate them to prime suspect. There is something in the modules that drive DMSs in NetLinx that forces a wait for responses from them, and queues up repeated commands until it hears back. My experience is that this works well 98% of the time, but every now and then a DMS will drop offline or become unresponsive for unknown reasons, and then that queue backs up to the point of overflow. After that point, nothing gets through and the system locks up. Setting the message threshold helps this a lot (I believe the DMS module even lists some threshold settings that are appropriate for them). I generally crank it up to 1000 when DMS keypads are involved (the default is 50).

    The NetLinx modules for DMS rely heavily on SEND_COMMANDs to update the menu and keypad feedback. It's real easy for these to bog everything down, and I have had several projects where I had to make dramatic efforts to optimize these kinds of updates so the DMSs would not bring my system to its knees (a rough sketch of the kind of throttling I mean is at the end of this post). I consider the command and string messaging to be the Achilles' heel of NetLinx programming.

    I find it interesting that one of the already-developed DUET Java modules AMX will have available is for DMS keypads, and I suspect this is precisely why - they need a lower level control stream to be reliable.
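
    Here's a bare-bones example of the kind of throttling I mean. The virtual device and the command text here are placeholders, not the actual DMS module protocol - the point is simply to collapse bursts of feedback into a single SEND_COMMAND every half second or so:

    DEFINE_DEVICE
    vdvDMS = 33001:1:0    // placeholder virtual device for the DMS module

    DEFINE_VARIABLE
    VOLATILE INTEGER nMENU_DIRTY    // set whenever the menu/keypad feedback needs refreshing

    DEFINE_FUNCTION fnRequestMenuUpdate()
    {
        nMENU_DIRTY = 1
        WAIT 5 'DMS MENU UPDATE'    // a second call while this wait is pending does nothing, so a burst collapses into one update
        {
            if (nMENU_DIRTY)
            {
                nMENU_DIRTY = 0
                SEND_COMMAND vdvDMS, "'MENU-UPDATE'"    // placeholder - substitute the real module feedback command(s) here
            }
        }
    }

    Call fnRequestMenuUpdate() everywhere the code would otherwise fire a SEND_COMMAND per change, and the queue only ever sees one update per burst.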
  • okay, after reviewing the hiccup from yesterday, it really was a power outage.

    so the UPSs did their job, for a time. the customer didn't actually reboot the system - the UPS batteries had run out, so it power-cycled on its own.

    back to watching for the Queue size changes i mentioned earlier and the effect it has.

    i hear you on the DMSs. there are only two in the installation, and they are not used to any huge extent. i do track the temperature events to provide zone information. they seem to behave quite predictably.

    i have also seen queues backlogging, but they seem to clear fairly quickly.

    will keep an eye out though.
    another week later and still looking fairly stable.

    the UPS reported a 1-second power outage (which may very well have been less than 1 second).

    that's the sort of thing i've been trying to trap, finally have it.

    the fraction of a second that power was disturbed could very well have disrupted a panel and controller.

    so a big thumbs up to UPS protection, and using the Queue_and_Threshold_Sizes.axi

    i still recommend including the Queue_and_Threshold_Sizes.axi in the main application.
    Take the first entry call of the axi and put it into your main/define_start, placed before any of your other code runs (a rough sketch is below).

    Make a log entry if it actually has to change anything during a reboot, so you can see whether the queue sizes etc. had been reset.
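
    something along these lines is what i mean. fnCheckQueueSizes() is just a placeholder name for this sketch - use whatever the real entry call in the axi from AMX is - and i'm assuming it can report whether it actually had to change anything:

    #INCLUDE 'Queue_and_Threshold_Sizes.axi'

    DEFINE_START

    // placeholder: stands in for the axi's entry call, returning non-zero when the
    // defaults were found and the patched queue/threshold values had to be re-applied
    if (fnCheckQueueSizes())
    {
        SEND_STRING 0, "'queue/threshold sizes were back at defaults - patch re-applied, rebooting master'"
        REBOOT(0:1:0)    // reboot the master so the new sizes take effect
    }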
    Sounds promising, Richard. I remember the Axcent III (?) and older panels having a similar problem in which the panel would drop its memory, or the Axcent would likewise. It was the result of a negative-going power spike. Perhaps a better/redesigned PS with some big *** caps would help to reduce any power ripples?
  • well, looking good now after 4 weeks. still no crashes.

    from a system that failed every few days (after installation. it had worked on the bench earlier) to operating non-stop for a month.

    Thomas' suggestion re memory loss may be quite right. I believe that the UPS has prevented any more of those glitches. And having the queue size check run with each reboot should ensure it gets corrected if the error occurs again.

    May not even need a UPS if the glitches are acceptable and always corrected.

    this should be my last note on this. Thanks to all for the help. and to Thomas for raising the issue.
  • Thomas Hayes Posts: 1,164
    Sounds great, Richard!
    I am becoming more and more of a believer in UPSs. Almost 5 weeks ago we had a brown-out followed by a spike. It shut down everything on campus. All my AMX boxes came back online (had to manually reboot 2 Axcent IIIs) but we lost a Cre***** PRO2. The unit completely fried (KFC could not have done better). Seems the demand for clean, constant power is really becoming a challenge for power companies.

    P.S. By 'clean' I meant noise-wise instead of environment-wise. Not saying that the latter is also not important.
  • DHawthorne Posts: 4,584
    I tell my sales team to always include a UPS sufficient for every microprocessor-based device in the job. It has less to do with outright outages (after all, your stereo is not the primary concern when the lights are out) than power fluctuations. I am located in a fairly congested area, and it seems to me the demand on the local power grid is many times only nominally met, and any unexpected circumstances result in minor fluctuations that don't bother the fridge much, but put control systems on the fritz. Though, come to think of it, the situation is no better in more remote areas, though probably for different reasons. The heart of the matter is the huge, mostly decentralized power grid system in use in most of the country. The ability to introduce various power sources from multiple points has the drawback of not always being able to provide consistent supply without fluctuations; likewise, the switchovers from source to source are rarely clean - all manner of nasty spikes and dips are introduced to the grid and propagated. I can easily see a time where power supplies themselves are going to have to take this into account and provide stabilization - the situation is only going to get worse as alternative power sources are introduced to the grid, especially those that draw on environmental sources (like windmills, solar panels) and will need to be able to switch out when circumstances require it (wind dies down, overcast skies, etc.). Similar issues, and even worse ones, occur when the client has a local generator for power-failure situations - the switching on those things is horribly noisy, and I have seen them take down microprocessor-based systems that otherwise would have been fine.

    Heh, all that verbiage to say, "yeah, put a UPS on the job." :)