Yesterday I did some more tests.
Setting communication to 10 Mbps half duplex did not help.
I opened the NI-3000 and found that it is all surface-mount chips; I do not have access to such equipment anymore, so I closed the controller back up.
The only other thing I could do was put in a longer Ethernet cable, so I removed the regular cable and put in a 50' cable, and for the first time the system stayed connected for more than 12 hours. If that solves the problem I will leave the cable in (not pretty, but I still prefer that to resetting the controller).
I have been unable to get to the house where my troubled system resides, but I talked a handyman on site through rebooting the master. Now that I have communications re-established, I was able to check the logs, and I found the following troubling entry:
<10:07:17> (0073234335) SSH connection accepted 122.227.30.35:33412 socket=4098
This is the last entry in the log before it dropped offline, and it raises two questions:
(1) Why the heck is the master accepting an SSH connection from an Asian IP?
(2) Is this what is causing lockups, or is it coincidence?
Since my logger depends on an IP connection, I can't be sure whether the master is locking up entirely, but I don't believe it is; as near as I can tell, it's just losing the network. The log definitely stops there, at least in this instance, and I don't have multiple dropouts to check whether the SSH message is present every time. But if that is the case, it points away from a hardware problem and more towards a firmware response to this suspicious connection.
It could be anybody on the planet trying to access your network.
The best thing to try is a wireless router (or a router plus an access point) that is not connected to the outside world.
That way you will know for sure where the problem is coming from.
Not possible; one of the main points of the system is access to the outside world, it's how the customer checks the home status when he's not in occupancy. I did plug the hole in the router settings that allowed it (someone got lazy and forwarded a range of ports to cover FTP and Telnet instead of doing them individually, which incidentally included the standard SSH port), and for good measure turned off the SSH port in the master. I'm just baffled that a supposedly secure protocol was allowed to connect without authentication ... and I have to wonder what it gave them access to. It might be a complete red herring as far as the network lockup, but I'm wondering now ...
(1) Why the heck is the master accepting an SSH connection from an Asian IP?
Since the chips are probably made in China, and that's where that IP comes from, I would think engineering should look into this and make sure supplied parts weren't shipped with back doors, if they do indeed come from the same region.
We are getting away from the subject; this should probably be another thread.
Are you sure that it actually connected?
Maybe he tried so hard to get to the device (similar to the famous ping of death) that the NI-3000 simply cut itself off from the network.
Back on the subject: my controller is still up and running (almost 24 hours now) using the longer cable.
My NI-3000 is still up and running after 2 1/2 days, so putting in a 50' patch cable does the trick.
So Dave, and others that have the same problem, you can try this.
I do not know what the minimum length to solve the problem would be, but I do know that 25' was not enough.
I will put the NI-3000 back in the rack now.
That's pretty similar to what Dave posted previously in this thread.
DHawthorne wrote:
I'd be interested to know if it's a short cable issue ... because I did have that problem years ago with a D-Link switch. In that case it was the panels ... connected directly, they wouldn't work, but if I coiled up 30' or so of Cat 5 and coupled it to the lines, they were fine. Without my extensions, the lines were about 20'.
I can see issues with less than 1 meter between devices, but I think that's a reflection problem; less than 30' causing issues is very odd. Maybe the added resistance of the wire lowers the voltage enough to work. Maybe some devices output voltages above spec while the AMX threshold is set at or below spec? Maybe AMX needs to make their threshold a little more forgiving to accommodate devices that output signals a little hotter than they're supposed to? Maybe it's just a coincidence and it's something else entirely?
For the record, the last time I had a short cable issue, I was able to take my extenders out when I changed the switch. I was told at the time that it had something to do with the automatic speed negotiation going on in the switch; it apparently got confused by the responses in the AMX NICs, and the long cable made it less sensitive ... so at least in that specific instance, it wasn't necessarily the AMX implementation, but the switch.
And, to further contribute data on the original post, so far plugging the firewall hole that allowed SSH has stabilized my problem NI-3000. That leads me to believe whatever or whoever made that connection left it open, and the master eventually lost use of the Ethernet port because of it. It's hard to be certain though, since I also had to put a nightly reboot routine in because my customer entirely lost patience. This is what I did the last time I had a similar problem, and that system had no SSH nonsense going on. So it may also be that any kind of persistent IP connection might contribute to the issue (both systems have an IP controlled media server, though both are different brands).
Most of what has been discussed jibes with what I have heard in the past about resolving the issue. The auto-negotiation change (i.e. switching from auto to 100/full) has resolved some issues; I cannot remember the exact reason, so I will have to pick the brain of one of my engineers and see if he remembers what the issue was. I cannot state why a longer cable works sometimes. It comes down to the signal strength of the router/switch and/or the receive sensitivity on either end. In other words, the master may be more sensitive to the signal level the router/switch is putting out, or the router/switch may be sensitive to the signal level we are putting out. As I said, we have measured the signal levels and we are within spec. In most cases I would think the signal is too hot as opposed to too weak; thus the longer run makes it work.
If someone has a master that exhibits this behavior and a longer/shorter cable doesn't help and everything you've tried is unsuccessful I'd like to help get you an advance exchange so I can take a look at the master. If there is an issue I'd like to understand what it is.
I can see less than 1 meter between devices but I think that's a reflection issue?
My background is mainly PC and network, and the problem with a cable shorter than 1 meter is not reflection but the time it takes to detect a collision (the same problem occurs when the cable is longer than 100 meters, or about 330').
I have tried setting the controller to 10/half, 100/full, and everything else, and it did not help. The only thing that made it work was to put in a patch cable longer than 25' (actually 50').
I've replaced my troubled system with an NI-3100 and have the one that keeps dropping out on my test bench. I can't duplicate the system environment completely, but I hope I can duplicate the dropouts well enough to collect the data needed to fix this permanently.
I can report this much: the SSH thing was a red herring. The master reports a connection even if an SSH client only gets as far as the login request (which I suppose makes sense, in hindsight). I can "open" as many as 6 SSH sessions and leave them open to no effect; after six, the master refuses additional connections. But they don't seem to be related to the lockups; the one time the SSH connection happened coincidentally with a lockup was just that ... coincidental.
Maybe I spoke too soon. It just locked up on me. A couple of things for engineering: I've been assuming all along that it was just the network locking up, because I saw RS-232 activity on site ... but I wasn't able to confirm it because the handyman shut off the breaker by mistake before I could connect. Now that I have it on my bench, it's for certain a total lockup of the master, not just the network; I can't even reconnect via the programming RS-232 port. So the master is locked up, but the controller portion is not.
I also have to perhaps take back the SSH being a red herring. I left my half-dozen SSH connections open on the login, and saw no problem, and one by one they started to time out. When the second one timed out, my master locked up. I'm going to let it run a while without SSH connections, and see how long it lasts. With them open, it didn't go as long as an hour.
I once saw this and it ended up being a very small crack in the circuit board. When it would get warm enough, or get enough of a bump, it would lock up deader than a bucket of doorknobs.
Anything is possible at this point. It's been running fine for the last 2 hours, as I continue to try to track it down before calling for an RA. It's hard to say whether the SSH connection is a coincidence at this point; I just have to let it run till it stops.
You might try popping the top and pushing on the circuit board in places and see if it dies.
The system ran fine on my desk for three days. So I started opening up SSH connections ... after about a half dozen, I got a "too many connections" message, and the master locked up. So I think that's pretty definitive: when overburdened with IP connection requests, the master freezes. I tried it again and got the "too many connections" message without the lockup, so it doesn't happen every time, but when it does, it's during a connection attempt or a disconnection when it gives up on an attempt (I had it fail once on a connection timeout). I think SSH in and of itself is a red herring, unless it just uses more overhead; it is the IP connectivity in general.
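A minimal sketch of that kind of test, for anyone who wants to try to reproduce it on a bench master (never a customer's production system): open several plain TCP connections to the master's SSH port, hold them short of authenticating, then drop them. The address below is a placeholder, and this only exercises the TCP accept/banner stage rather than a real SSH key exchange or login prompt, so it is an approximation of the procedure described above, not the exact one.

```python
# Rough reproduction sketch: open several raw TCP connections to the
# master's SSH port, hold them at the pre-login stage, then drop them.
# MASTER_IP is a placeholder, not an address from the original system.
import socket
import time

MASTER_IP = "192.168.1.10"   # placeholder address for the NI master
SSH_PORT = 22
ATTEMPTS = 8                 # a few more than the ~6 the master will accept

sockets = []
for i in range(ATTEMPTS):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((MASTER_IP, SSH_PORT))
        banner = s.recv(256)             # SSH identification string, if any
        print(f"connection {i}: accepted, banner={banner!r}")
        sockets.append(s)
    except OSError as exc:
        print(f"connection {i}: refused/failed ({exc})")

# Hold the stalled sessions open for a while, mimicking clients parked
# before authentication, then close them all and watch the master.
time.sleep(120)
for s in sockets:
    s.close()
```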
When running normally, this particular system has 4 persistent IP connections: two internal device monitors, a ReQuest unit, and a Homeworks system. In addition to that, it goes out to check the weather once an hour, checks the external IP once every 15 minutes, checks a OneWire temperature probe once every five minutes, and connects to IP cameras on demand. That's a total of 8 IP connections if they are all active at once; we're supposed to be able to handle 20. But I think it's more an issue of when they are actually connecting or disconnecting ... the overhead in those processes is what locks it down if too much is happening. The NI-3100 would seem to be more sensitive to it than other masters (perhaps simply because it's slower).
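For what it's worth, if those periodic checks all fire relative to the top of the hour (an assumption on my part; the real program may stagger them), the outbound connect events line up several times an hour. A quick back-of-the-envelope check:

```python
# Count the minutes in an hour where more than one of the periodic jobs
# described above would open a connection at the same time, assuming all
# of them are aligned to the top of the hour (an assumption, not a fact
# about the actual program). Intervals are in minutes.
intervals = {"weather": 60, "external_ip": 15, "onewire_probe": 5}

for minute in range(60):
    due = [name for name, period in intervals.items() if minute % period == 0]
    if len(due) > 1:
        print(f"minute {minute:02d}: {', '.join(due)} connect together")
```

That gives four minutes every hour where at least two connections start together, and one where all three do, which would fit the idea that it's the bursts of connect/disconnect overhead rather than the steady-state connection count that matters.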
For the record, the 2100, 3100, and 4100 have the exact same processor in them and all run at the same speed; the only difference is the number of control ports and/or expansion slots. I don't think what you have encountered is a speed issue with the master. I am somewhat excited that you have a way of duplicating it; that can go a long way in helping us resolve the issue. I will have one of my engineers test the timeout theory on the SSH port. If you end up sending your master in, I'd like you to work with your tech support representative and make sure it gets flagged to come to engineering. If not, it will go through our repair process and that (obviously) wouldn't help me find the problem. If you have a test program that will generate the problem, it would be good to load it on the master with source so I can extract it when the master comes in. Thanks for digging into this. It could be a big help.
I was talking about the difference between a 4000 and a 4100. But it would seem the speed issue is in fact irrelevant, as the "new" master failed this weekend as well. I'm back to square one, though I still believe it is related to IP communications.
I am desperate for even a short-term band-aid here. This was a perfectly reliable system until I had that power center failure and upgraded the firmware. The power center cannot be a factor, since the master has been replaced, and any hardware damage due to the power issue would be sitting on my desk, not in the customer's house. So it has to be a combination of the new firmware and the demands of the running program (which are not strenuous at all, though there are a fair number of IP devices). This is a remote location, and the customer relies on this system to check household conditions, and every time he has attempted to do so since the turn of the year, it has not been working. I am looking very, very bad right now to this customer, my boss is livid ... and has only recently been convinced not to drop AMX in favor of Savant.
So now what do I do? (Rhetorical question, but I'm open to suggestions ...)
Was just made aware of this thread and have been trying to scan through all of the responses...Exactly what firmware version is being tested? The latest master firmware (3.50.430) has some additions for extended security features. Also at that time the SSH session interface was analyzed to solve a problem we were seeing in the field with rapid hits to SSH. The modifications seemed to make the interface more reliable.
Dave,
I can tell you from experience, Savant is definitely NOT a silver bullet. It has its issues, and the downtime / the response to getting a down system back up is a real problem.
I can also say that any major savings up front are quickly eaten up at the back end of the project and after the project ends.
Just thought you might like to know... and then pass this on to your boss.
e
Read through all the posts now, and I don't have a lot to add other than make sure you are running the latest master firmware (3.50.430). Aside from the SSH attacks that we've seen at one or two other sites without firewalls, some time ago we saw systems where the sheer volume of network traffic was locking up the OS's IP stack. This wasn't necessarily traffic destined for the master, but it still must be processed by the master and discarded. We've seen issues with the IP stack running out of processing buffers. When this occurs, we've seen the entire network stack lock up and fail to respond to anything, including pings; only a reboot will clean up the system. To date, we've incorporated all known fixes from the OS vendor. These fixes were primarily part of FW v3.41.414 and appeared to solve the problem, at least the problem we could recreate in-house.
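One way to get a rough feel for how much of that not-destined-for-the-master traffic exists on the segment is to count broadcast and multicast frames from a laptop on the same switch (on a switched network those are essentially the frames every port, including the master's, has to see and discard). A small diagnostic sketch, assuming the third-party scapy package and admin privileges; this is just an illustration of the idea, not anything AMX provides:

```python
# Count broadcast and multicast frames for one minute on the LAN segment,
# as a rough proxy for the traffic the master must process and discard.
# Requires scapy (pip install scapy) and admin/root privileges to sniff.
from scapy.all import sniff, Ether

counts = {"broadcast": 0, "multicast": 0, "other": 0}

def classify(pkt):
    if Ether not in pkt:
        return
    dst = pkt[Ether].dst.lower()
    if dst == "ff:ff:ff:ff:ff:ff":
        counts["broadcast"] += 1
    elif int(dst.split(":")[0], 16) & 1:   # multicast bit set in first octet
        counts["multicast"] += 1
    else:
        counts["other"] += 1

sniff(prn=classify, store=False, timeout=60)   # capture for 60 seconds
print("frames seen in one minute:", counts)
```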
Have you tried rolling back the firmware to see if it is indeed an issue with the newer version? Didn't you replicate this in your shop? If not, and this house is in the boonies, maybe they have a sporadic internet connection, and when the system periodically attempts to connect to the internet it hangs until timeout, and while it's hanging something else occurs and pushes it over the edge.
Have you tried periodically pinging something on the WAN side to test for internet connectivity, and setting a flag to enable/disable those periodic weather checks, etc.? Log the results in your logging program and see if there's an issue there. When the customer can't log into the system, can he still log into the house VPN, if available?
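Something along those lines could be sketched like this, written as a PC-side Python illustration rather than NetLinx, with the host, port, interval, and function names all arbitrary stand-ins: test WAN reachability with a short TCP connect, set a flag, and have the periodic jobs consult the flag before going out.

```python
# Illustrative WAN-connectivity gate: check reachability of a well-known
# outside host, record the result in a flag, and skip the periodic
# outbound checks while the connection is down. Host/port/interval are
# arbitrary choices for the sketch, not values from the original system.
import socket
import time

wan_is_up = False

def check_wan(host="8.8.8.8", port=53, timeout=3.0):
    """Return True if a short TCP connection to an outside host succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def maybe_run_weather_check():
    """Stand-in for the hourly weather lookup; skipped while offline."""
    if not wan_is_up:
        print(time.strftime("%H:%M:%S"), "WAN down - skipping weather check")
        return
    print(time.strftime("%H:%M:%S"), "WAN up - would run weather check here")

if __name__ == "__main__":
    for _ in range(3):                     # a few iterations for illustration
        wan_is_up = check_wan()
        print(time.strftime("%H:%M:%S"), "WAN reachable:", wan_is_up)
        maybe_run_weather_check()
        time.sleep(60)                     # the real system has its own scheduler
```

Logging that flag alongside the master's own log would at least show whether the dropouts line up with periods when the outside connection was flaky.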
One last thing: is this master still on an open network (i.e. no firewall)? Some early posts alluded to this (random SSH hits). In most cases where we've seen these unexplained lockups, it's been on open networks where there is no control over what's hitting the master's IP interface. As I stated in my last post, we've incorporated all known fixes from the OS vendor for that version of their IP stack, but that doesn't mean there aren't vulnerabilities. It's strongly advised to put the master/network behind a firewall and only open up those ports that are needed (SSH, HTTPS, etc.).
It's running 3.50.430. I don't recall what it was on previously, when it was OK, but it had been at least a year since it was last updated. SSH may contribute to the problem, and I can certainly force the lockup with SSH, but it clearly occurs under other circumstances as well, since the current master locked up with that port blocked.
The plot sickens. I was trying to chase this further in house. I can consistently crash a master by opening a bunch of SSH connections and letting them time out. So I tried reverting to an older firmware version to see if it still happens ... and yes, it does, so the issue is not firmware dependent. However, and here is where it gets really bad, while I was playing around with this, I not only crashed my demo 3000, but bricked it. It will no longer reboot at all and I just sent for an RA.
For the sake of providing a caution to anyone following this saga, the trashing of my NI-3000 had nothing directly to do with the lockup. In the course of my experiments, I had downgraded the master firmware, but neglected to downgrade the device firmware. It didn't cause a problem right away, but when I locked the master up and had to do a cold boot, the firmware mismatch kicked in and bricked the unit. Apparently (so I find after talking to several tech support people and getting engineering involved), the latest device firmware can send out some messages the older master firmware doesn't know what to do with. So if you ever need to downgrade master firmware, make sure you also downgrade the device.
This concerns me (understatement?). Did you attempt to upgrade the device-side firmware at any time? What does "brick" mean exactly? Master won't boot? Can't talk to the device side? For clarification, there is a dependency on the new device-side firmware -- it requires the latest master firmware (x.x.430 or later) to help resolve some serial port issues. Note that this does NOT mean it won't work; the master side has NO dependency on the device side, at least none that we are aware of. We are trying to figure out how downgrading the master side would cause the issue you are seeing. It doesn't make sense to us. Not to say it isn't possible, since you just did it; we are just trying to understand what exactly happened and what steps you took to get to that point. Any further info you can provide, such as the exact sequence of steps, would be useful. Thanks.
At some point in the past, the master (NI-3000) was upgraded to 3.50.430, and the device to 1.20.7. In my attempts to induce a lockup, which were successful in this configuration, I wanted to see if the lockup still occurred in earlier firmwares. So I downgraded the master firmware to 3.30.371 without touching the device firmware. The firmware changes and subsequent soft reboots caused no issue. When I again induced the lockup, I had to do a power cycle; after the cold boot, the master would no longer complete its boot cycle. All communications ports were inactive, and the front panel lights were as follows: LINK off, STATUS lit solid, OUT/IN both lit solid. The lockups were induced by opening as many SSH connections as the master permitted, then letting them time out on the login screen.
I spoke to two tech support people. Both initially reiterated what you just said, that it shouldn't have mattered. The second person (I can PM you a name if you like) put me on hold and spoke with someone else, then came back and said what I posted here: device firmware 1.20.7 had probably sent some commands the master didn't understand when initializing and caused it to become unresponsive. The piece is on its way back to you as we speak (I can PM you the SRA if you want to intercept it).
It's entirely possible there are other factors involved, including pure dumb luck ... but that's always the case with these things, isn't it? That particular master, being my demo unit, gets reset and power cycled more than I like to think about, and it never caused a problem until I had that device mismatch in there.