Why do R4s and Zigbee gateways fall offline so much?
a_riot42
Posts: 1,624
I am getting strange behavior with these devices: they bounce online/offline at random intervals even when no one is using the system. Sometimes every gateway will bounce online/offline at the exact same second. Buffers are underutilized as I am very conservative about feedback, and when the devices stay online they work great. But the random online/offline events really play havoc with reliability. Firmware is the latest available, and I get no errors to indicate what might be going on. Has anyone else seen these types of things with these devices?
Paul
Comments
Sounds like you might have interference of some kind, possibly from a WAP or a neighbor's WAP. We had an install where we were getting the worst interference on the 2.4 GHz band. We did some investigation and found the neighboring golf course had a GPS system on each of their golf carts and a high-power transceiver antenna 4 blocks from the house. The whole property was getting baked with a wideband carrier on 2.4 GHz. He had to abandon WiFi and Zigbee in his house altogether.
Can RF interference take down a Zigbee gateway? I would have thought that it would just cause delays and a nonresponsive remote.
Paul
Well, any interference on the analog portion of the exchange can cause issues. Zigbee is a pretty narrow-band setup. But if some big elephant of a signal (like WiFi N at 2.4 GHz) waltzes in, it's going to cause trouble. Think of trying to listen to your car FM radio between towns. You get 2 or 3 stations picket-fencing in on each other. When they're in, they sound great. But then the other station busts in and the first instantly goes away.
I understand how RF interference works, but I don't understand how it could make a gateway go offline. I can see an R4 going offline if the interference is so bad that it assumes it can no longer communicate, but the gateway is hardwired to the controller, so the controller and gateway should never experience any communication problem that I can see.
Paul
I didn't read the 2nd post well enough. The first post was more generic and I was following that thread. Sorry about that.
R4s also fall offline from too much data. I've taken to rewriting some of my chatty modules to exclude R4s from unnecessary feedback.
No problem. If the occasional R4 dropped offline I wouldn't worry about it too much. But I am seeing all gateways in a system drop offline simultaneously (exact same timestamp) and then immediately come back online, sometimes with the same timestamp as well. I just can't envision what could do that. I know there was a bug in the time server that caused a similar situation, but I have all that turned off until I can figure out what's going on.
Paul
I wish it was that. I do minimal feedback to R4s and the online event only turns on a channel or two. The buffers are all well below the threshold of 300, like 42, 75 etc. At this point I have to figure it's the home network, but unfortunately the only things that seem to be affected are the R4s/ZGWs. The touch panels seem fine, and with no errors or indication of what is taking them offline, it's anyone's guess what could be the problem. The wireless system is an older Meru. AMX was at the site as well, and other than changing the Zigbee channels (which actually seemed to make it worse, but could be coincidence) couldn't find any issues in need of attention. There is some RF in the area but the channels were configured to be as far away as possible, and the home isn't in a very densely populated area. I haven't seen an R4 go offline due to hammering it with commands, so if the buffer is overfilled to the point that it goes offline, does it produce some kind of error message to let you know?
Paul
Yes, good point. I have gone through the system with a fine-tooth comb and haven't found any conflicts, but it's one of those projects where a former employee built the network, and they may have done something stupid somewhere along the way in a setting in a switch or router, or have a bad Ethernet cable somewhere. They also didn't provide very good ventilation, and the networking gear gets hotter than it should, which could cause these problems as well. As a programmer I can only control the programming and influence the networking, but I have found that networking can be a source of a lot of problems, and skilled networking people are as rare as hen's teeth, at least in the home automation business. I hope that's what it turns out to be, as I have a lot of time and work invested in the remotes and don't want to have to stop using them.
Paul
The gateway can buffer 300 commands per R4 before it will disconnect FROM THE R4 (NOT from the NETLINX). The way you know is that the R4 goes unresponsive for about 30 seconds and you can see the R4's device address (NOT the gateway) go offline and back online in TELNET/Diagnostics. You can look in the web server for the gateway and see the traffic report that shows the high water mark for commands buffered is 300. If you see 300, your R4 went offline at some point. You can reset the page to see the new buffer state. Our tests show if you send about 400 commands "at once", about 100 "get through" while the buffer builds and 300 get backed up, and the connection between the gateway and R4 will go down. Send about 350, you'll see the high buffer at about 250 and all will have been dispatched and the connection stays up.
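The numbers in the post above can be turned into a rough back-of-the-envelope check. This is only a sketch in Python (not NetLinx). It assumes that roughly 100 commands are dispatched while a burst is arriving, which is the figure reported from the tests above, and it treats 300 as the cutoff. The function names and the fixed drain figure are made up for illustration, and real behaviour will vary by site.

```python
# Rough model of the per-R4 buffer during a single burst of commands.
# Assumption: about 100 commands "get through" while the burst arrives,
# as reported in the tests described above. This is not a measured constant.
BUFFER_LIMIT = 300           # cutoff reported for the current firmware
DRAINED_DURING_BURST = 100   # approximate figure taken from the post above


def peak_buffer(commands_sent: int, drained: int = DRAINED_DURING_BURST) -> int:
    """Estimate the high-water mark for one burst, clamped to 0..BUFFER_LIMIT."""
    return min(BUFFER_LIMIT, max(0, commands_sent - drained))


def burst_overflows(commands_sent: int, drained: int = DRAINED_DURING_BURST) -> bool:
    """True if the burst would be expected to reach the cutoff and drop the R4."""
    return commands_sent - drained >= BUFFER_LIMIT


for n in (200, 350, 400):
    print(n, peak_buffer(n), burst_overflows(n))
```

Under these assumptions a burst of 350 peaks at about 250 and survives, while a burst of 400 reaches 300 and drops the link, which matches the two cases described in the post.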
We initially instituted a slowdown queue for commands since we need more than 350 commands at times for our system, and found mixed results. The best result was found by clumping the commands with a break. We now send 40 commands from a queue (at "full speed"), pause for about .3 seconds, lather, rinse, repeat until the required commands are gone. The gateway seems to have trouble both dispatching and buffering at the same time; the pause lets it send what's buffered more quickly. We actually think this "break" in the commands to the gateway from the code results in FASTER painting on the R4's even when there are fewer commands, like 200 or so, that would normally be safe to send "at once".
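The "40 commands, pause, repeat" pattern described above might look roughly like this. It is a minimal Python sketch, not the poster's NetLinx code: `send_in_batches`, the `send` callback and the injectable `sleep` are all hypothetical names, and the defaults of 40 commands and 0.3 seconds are simply the figures quoted in the post.

```python
import time
from typing import Callable, Iterable


def send_in_batches(
    commands: Iterable[str],
    send: Callable[[str], None],
    batch_size: int = 40,
    pause_s: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send commands at full speed in groups of batch_size, pausing between groups.

    Returns the number of commands sent.
    """
    sent = 0
    for cmd in commands:
        # Pause before starting the next group, never after the final command.
        if sent and sent % batch_size == 0:
            sleep(pause_s)
        send(cmd)
        sent += 1
    return sent


# Example with stand-ins for the real transport and the real delay.
delivered = []
pauses = []
total = send_in_batches(
    (f"cmd{i}" for i in range(100)), delivered.append, sleep=pauses.append
)
print(total, len(pauses))
```

Both the batch size and the pause are parameters because, as noted later in the thread, the best values vary by environment and may need adjusting on site.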
Now that's all about keeping the link up between the gateway and the R4's. WE DO NOT EVER SEE THE GATEWAYS GO OFFLINE (as seen from the NETLINX over the IP network) regardless of the overflow or lack thereof to the R4's.
As mentioned, the gateways are IP hardwired network devices. If you see a pair blink off and on at once, I pretty firmly suggest that you look for a NETWORK ISSUE one way or another, not the gateway, the R4's, or your data traffic.
It could be conflicts in addresses, a switch, a cable, a router... but it's not likely the AMX hardware or your program... IMHO and experience.
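One way to separate a network-wide event from a single misbehaving device is to group offline events by timestamp and look for instants shared by several gateways. The sketch below is Python with made-up sample data. The `(timestamp, device)` tuple format and the device names are assumptions, since the thread does not show how the master reports these events.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def simultaneous_drops(
    events: Iterable[Tuple[str, str]], min_devices: int = 2
) -> Dict[str, List[str]]:
    """Map each timestamp to the devices that went offline at that instant.

    Only timestamps shared by at least min_devices distinct devices are kept.
    """
    by_time: Dict[str, Set[str]] = defaultdict(set)
    for timestamp, device in events:
        by_time[timestamp].add(device)
    return {t: sorted(d) for t, d in by_time.items() if len(d) >= min_devices}


# Hypothetical offline events: two gateways drop together, one drops alone.
events = [
    ("02:40:21", "gateway-1"),
    ("02:40:21", "gateway-2"),
    ("09:17:41", "gateway-1"),
]
print(simultaneous_drops(events))
```

Timestamps that appear in the result point toward something upstream of the gateways (switch, cable, router, addressing), while drops that never line up are more likely to be per-device problems.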
I could make all the ZGWs go briefly offline simultaneously by changing the time server settings. I never figured out why, but I assume it had something to do with the way they interpreted NTP packets. This was fixed in the latest firmware, but sometimes instead of being fixed a problem is just more deeply hidden. It makes me wonder if there are other broadcasts from users' devices that can have a similar effect that you will see only randomly.
Paul
Perhaps you have something producing a UDP storm on the network, like a VoIP phone system or a multicast media player.
What version of firmware (and for which device, gateway or R4) fixes the bug of the R4s dropping offline when the time is updated?
I'm fighting some R4 offline/online issues as well. My hunch (or hope?) is that it is a network issue, but we've yet to pinpoint the problem. 14 R4s has surely amplified any issues a system with only a few R4s might experience.
On a jobsite we've had several gateways & R4s go offline (gateways first, then R4s); doing a reboot via the page did not fix them, they had to be physically unplugged. Updating to the latest firmware and some code changes seems to have fixed it.
Yes that was something I was wondering about. If a device or two start sending broadcast messages, I don't know how the gateways will respond. I can test that and see by creating a storm on a local network and see if it takes them offline. I prefer to have the control system on a different lan than everything else but that isn't how the former network fellow did things at this remote site unfortunately.
Paul
This would be contrary to our experience and probably impossible to check if true. If maxing the buffers in the gateway-to-R4 path were able to cause a reboot or offline of the gateway, you'd lose the web connection to see the gateway traffic, and the buffers would be reset, so you could not see the evidence either live or when the gateway came back online. Since you clearly CAN see the evidence in the live web view of the gateway even while the R4 drops connection due to maxing out, you are still connected on the LAN to the gateway; it has not rebooted or disconnected. So the Netlinx-to-gateway link should be uninterrupted too...
We found that 6 R4's per gateway was the most we could keep reliable and suggest no more than 5. This is regardless of repeaters (which if anything make still more traffic). We have several installations working with 6 per gateway. In two locations where they had 8 and 9, they never got satisfactory results and removed ALL the R4's rather than add a gateway and distribute the load, due to the fact that full roaming would not be possible. They didn't experiment with dual gateways PLUS repeaters.
I thought perhaps I was having the issue of too many commands being sent to the R4s when they come online, but I added them all up and I'm only around 150 per R4 on startup.
If it's something you can reproduce, then watching the queue is worth a shot and could be fixed with code.
Sort of. I believe it's when you look at the command buffer, it lists the R4s - whichever hits 300 first wins and gets to reboot the gateway. There is a command to bump it up to 500, but I forget what it is.
This can be clearly viewed in the web interface to the gateway which stays up and connected to the LAN while R4's come and go. Look at the log - the log that would be cleared by a gateway reboot - and you can see the history of the on and offline of the R4's attached to it. The disconnects will be logged by reason for disconnect, and some may say buffer overflow.
Here's a sample from my gateway.
1012 12/29/10 2:30:11 AM ICSP Join - Device Connected 00-0D-6F-00-00-0D-05-86 10006
1013 12/29/10 2:40:21 AM ICSP Leave - Device Connection Lost 00-0D-6F-00-00-0D-05-86 10006
1014 12/29/10 3:03:27 AM ICSP Join - Device Connected 00-0D-6F-00-00-0D-05-86 10006
1015 12/29/10 9:17:41 AM Zigbee Leave - Overflow Disconnect 00-0D-6F-00-00-0D-05-86 10006
1016 12/29/10 9:17:41 AM ICSP Leave - Device Connection Lost 00-0D-6F-00-00-0D-05-86 10006
1017 12/29/10 9:51:23 AM ICSP Join - Device Connected 00-0D-6F-00-00-0D-05-86 10006
Note entry 1015, the Overflow Disconnect: the R4 went offline for an OVERFLOW, induced by forcing the buffer to hit 300 through unwisely rapid forced redraws of a page. If the gateway had rebooted at this point, the log would have been cleared and it would be impossible to see this. Not to mention that I was watching before, during and after the event in my web browser, meaning the web server in the gateway stayed up the whole time. Connected, not rebooting. By the way, my connection log goes back more than 90 days, meaning the gateway has not rebooted EVER since I put it back in service after using it at CEDIA. It lists a number of overflow events that reflect various torture-testing during that period.
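If the connection log is copied out of the gateway's web page as plain text, the overflow entries can be picked out with a short script. This is a sketch in Python that assumes every line follows the layout of the sample above (sequence number, date, time, event text, 8-byte address, device number). The regular expression and the names `LogEntry`, `parse_log` and `overflow_events` are illustrative, not an AMX format specification.

```python
import re
from typing import List, NamedTuple

# Layout inferred from the sample lines above; other firmware may differ.
LINE_RE = re.compile(
    r"^(?P<seq>\d+)\s+"
    r"(?P<date>\d{1,2}/\d{1,2}/\d{2})\s+"
    r"(?P<time>\d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"(?P<event>.+?)\s+"
    r"(?P<eui>(?:[0-9A-Fa-f]{2}-){7}[0-9A-Fa-f]{2})\s+"
    r"(?P<device>\d+)$"
)


class LogEntry(NamedTuple):
    seq: int
    date: str
    time: str
    event: str
    eui: str
    device: int


def parse_log(text: str) -> List[LogEntry]:
    """Parse log text, silently skipping lines that do not match the layout."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            entries.append(
                LogEntry(int(m["seq"]), m["date"], m["time"],
                         m["event"], m["eui"], int(m["device"]))
            )
    return entries


def overflow_events(entries: List[LogEntry]) -> List[LogEntry]:
    """Keep only entries whose event text mentions an overflow."""
    return [e for e in entries if "overflow" in e.event.lower()]


SAMPLE = """\
1014 12/29/10 3:03:27 AM ICSP Join - Device Connected 00-0D-6F-00-00-0D-05-86 10006
1015 12/29/10 9:17:41 AM Zigbee Leave - Overflow Disconnect 00-0D-6F-00-00-0D-05-86 10006
1016 12/29/10 9:17:41 AM ICSP Leave - Device Connection Lost 00-0D-6F-00-00-0D-05-86 10006
"""

for entry in overflow_events(parse_log(SAMPLE)):
    print(entry.seq, entry.date, entry.time, entry.device)
```

Run against the three sample lines, this prints only entry 1015 for device 10006.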
You get 300 buffered commands PER R4, not per gateway. You can have several R4's on the gateway and monitor them all, and see one go to 300 and drop while the others continue. But as mentioned, get too many per gateway, and the dispatch to the R4's slows, meaning more likely that the buffers are climbing.
We presently send about 200 commands to each R4 when they come online, and various functions send 100 or more per event. If you push relentlessly on a button that paints a screen, say the HOME PAGE button, you can stack well over 1000 commands in seconds. Without some kind of handling, this will make the R4 drop from the gateway...
BUT OVERFLOWS WILL NOT REBOOT THE GATEWAY. At least not mine, not the one at our showroom, or any of the dozens of systems using R4's that our dealers have placed. We've worked extensively with AMX product management and engineering and spent hundreds of lab hours stress testing the R4's (and our wits) to fully understand and make the damn things respect our a-thor-i-tee. Your mileage ought not vary.
The 300 limit is a ZIGBEE COMMAND BUFFER limitation. It only affects the communication in ZIGBEE, between the gateway and a specific R4.
http://amx.com/techsupport/techNote.asp?id=922
The 300 number may have been the pre-pro version and may very well have been per device, not per gateway, hence the ?. There used to be a Zigbee calculator or something for determining max messages based on the number of devices and repeaters assigned to a particular gateway. If I recall, the number of messages dropped by a large percentage for each repeater added, depending on how many hops were required to get to the gateway from the end device. I haven't used repeaters since my initial dreams were shattered and I realized a single gateway with a multiple-repeater single mesh network wasn't practical for a reliable system. Again, this was pre-pro, and since then I've only designed systems without repeaters, but that may change when the need arises.
I'm not sure what benefit the pro version really offers over the non pro version since all I've read is that the pro provides better routing which to me implies a multiple repeater system.
One thing in the tech note that puzzles me is #2 below:
How does a repeater help distribute the load? It's still all funneling through the gateway unless it can dump data quickly to the repeater and make the repeater do the buffering for the traffic routed its way.
If this were true then maybe a repeater per R4 behind a single gateway could be a good thing and could speed things up or improve reliability?
//added
This technote that discusses the role of repeaters doesn't seem to suggest they help shed the load of gateways.
http://amx.com/techsupport/techNote.asp?id=921
The 300 is the current CUT OFF threshold; AMX suggests you don't approach it. In the old firmware, the cutoff was 150, and the suggested max was 75. That didn't work out for nearly anyone, and the new firmware (now over a year old) upped it. We find that anything under 300 will be fine, but as discussed, user actions can cause unexpected loads and take you way over, so your code needs to be prepared. We tried using the queue devised by AMX Australia, but found it helped only some.
After too many failures and an unwillingness to dumb down the device to being a light-up R1, we have a 4 prong approach to control the data flow to R4's:
* We don't try to do as much on the R4 as on a G4. Graphics and options are reduced, but on the small screen, aren't much missed. All actual G4 panel functionality was retained.
* A command queue slows the overall delivery pace to the gateway.
* We pause every 40 commands sent to each R4 to give the gateway a chance to dispatch while not buffering. The pause time and the command count interval is variable and can be adjusted on site per UI since performance can vary by environment.
* The Netlinx code command queue itself is circular and holds only 450 commands per UI. If more than that accumulate before the queue sends them on to the gateway, we send only the most recent, since the earlier ones would be obsolete and overwritten by whatever came later anyway.
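The last bullet, a bounded per-UI queue that keeps only the most recent commands, can be sketched as follows. This is Python rather than NetLinx, and `PerUiQueues`, `push` and `pop_batch` are hypothetical names. The 450 limit and the batch of 40 are the figures from the list above. The sketch relies on `collections.deque(maxlen=...)`, which discards the oldest item when a new one is appended to a full deque.

```python
from collections import deque
from typing import Deque, Dict, List

MAX_QUEUED = 450  # per-UI queue size mentioned in the list above


class PerUiQueues:
    """One bounded queue per UI; when a queue is full the oldest command is dropped."""

    def __init__(self, max_queued: int = MAX_QUEUED) -> None:
        self._max = max_queued
        self._queues: Dict[int, Deque[str]] = {}

    def push(self, ui: int, command: str) -> None:
        q = self._queues.get(ui)
        if q is None:
            q = self._queues[ui] = deque(maxlen=self._max)
        q.append(command)  # a full deque silently discards its oldest entry

    def pop_batch(self, ui: int, count: int = 40) -> List[str]:
        """Remove and return up to count of the oldest surviving commands for a UI."""
        q = self._queues.get(ui)
        if not q:
            return []
        return [q.popleft() for _ in range(min(count, len(q)))]


queues = PerUiQueues()
for i in range(500):          # 50 more than the queue can hold
    queues.push(10006, f"cmd{i}")
first = queues.pop_batch(10006)
print(first[0], len(first))
```

After 500 pushes the 50 oldest commands are gone, so the first batch starts at `cmd50`. Dropping the oldest entries is only safe under the assumption stated in the post, namely that later commands overwrite whatever the earlier ones would have drawn.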
Yeah, it's a lot of effort to make them do all they can. But keeping commands under 100 wasn't realistic for us. Even with all the above, you can still overload it if you get silly.
Under the old firmware, the downtime in an overflow was about 45 or more seconds, really tragically long from a user perspective. Now it is under 20 seconds when it DOES occur, still bad but if your code makes sure it's rare, it's not too bad, especially since it will typically occur only with user foolishness (pressing the SOURCES button 10 times in 1 second).
The theory about reducing load with more repeaters probably comes from the MESH nature of the network: theoretically ANY Zigbee device can be an RF bridge/repeater to another Zigbee device that is otherwise out of range of the gateway. If you were to make one R4 relay data to another, the first one would have to handle twice the data. From all our testing and what the logs report, the R4's don't relay data, so this is likely to be an incorrect or outdated recommendation.
But I don't have much experience with the new repeaters; there could be some buffering in them that could help, but I can't see how it would do much differently, and might offer yet another place to overflow. It would be nice to be wrong on this.
Maybe someone has more real field experience with the PRO repeaters?
Remember, IP is part VOODOO, and Zigbee is the love child IP had after a drunken new year's party with X10. Expect the unexpected.
I only contribute in this matter with such detail because we went eye to eye with AMX about our product not supporting the R4 for over 2 years as it was simply untenable with the old firmware. The new firmware finally gave hope to the promise of this attractive and inexpensive panel substitute... so we made it work. And I share this pain and joy only in hopes that others need not retrace all our steps and hundreds of hours of false starts and mixed successes.
Each gateway has its own Netlinx Device Number and duplicate device numbers behave just like IP conflicts.
Yes they are unique. I don't think you could get two gateways with the same ID online at the same time. As soon as one came online the gateway with the duplicate ID would immediately get changed to a virtual device number I think.
Paul
Don't rule out your cabling. I had an R4 that was dropping out regularly (not often, mind you, but regularly) whose gateway was literally 3' away from the PoE network switch that was feeding it. Signal strength was great on the transmit, but kinda flaky on the receive, so I thought I would switch channels to see if that helped (by the way, the R4 itself was only 5-6 feet away). I could not get into the web interface on the gateway no matter how hard I tried. I called AMX, and they sent me an RA, but when I went back with a replacement, the new one wouldn't work either. So I moved the gateway to another location, where it worked fine. I didn't think my switch had an issue, since the other devices on it were OK (mainly IP cameras), so entirely on a hunch, I replaced the 3' cable going from the switch to the gateway. It's been fine since.
One of the biggest problems I have with cabling is that my installers often make their own from bulk CAT5 (well, CAT6 nowadays) with crimped-on connectors. They test each one, but I can't get it into their heads that a basic network cable tester just checks continuity, and even getting a link light on your devices does NOT mean the cable is fine. 99% of their crimps are perfect, but the one that was done in a hurry after a queue of several dozen is the one that comes back and bites us in the butt. But, and this is a big one, even factory-molded cables can have issues, especially the freebies that come with network devices. My point is, if you can, use a real network tester on your cabling to make sure it doesn't have too much capacitance, or dropped packets, or whatever. If you can't, just try swapping cables to see if there is any change. You can beat your head against the wall a long time looking for the problem elsewhere (I lost nearly two days on the example cited above).