Rebooting master

Danny Campbell · March 2006

I have a customer that has asked to have an automatic reboot of the masters on a scheduled basis.

They have one system that locks up after about 3 months even after the threshold and queues code has been added in. Now they want to reboot this and all of the other systems every day.

While this may address the symptoms for the one system, I feel uneasy about doing this every day, and on all systems. (100+).

Does anyone have any feeling (pro or con) about doing this?

Danny

[Deleted User] · March 2006

Danny Campbell wrote:

While this may address the symptoms for the one system, I feel uneasy about doing this every day, and on all systems. (100+).

Does anyone have any feeling (pro or con) about doing this?

Danny

I had a similar issue. Auto reboots are out of the question in my opinion. What I ended up doing was to put the problematic master on a power strip with "switched" outlets and a front-access button to switch the power on and off. The master was the only thing plugged into it. If the system went down, locked up, etc., the client would have to go push the button. They complained at first, but now it's not an issue given how rarely it malfunctions.

Thomas Hayes · March 2006

If it is only 1 master giving you a problem, why not just reboot it? If all the masters talk together, then you could program them so that once this master goes 'offline', a command is sent to a switched AC unit to do a hard reboot.

jjames · March 2006

Out of curiosity, is it an NI-x000 series master? If so, is there anything plugged into port 1? I've seen firsthand that chatty devices on port 1 can lock up the system. If this is the case, try moving whatever is on port 1 to another port.

Just a thought . . .

Danny Campbell · March 2006

Yes it is a loaded NI-4000. Port 1 controls the Toshiba DLP cube. Definitely not too chatty. Turn it on, turn it off. I don't even need to change the input on it.

Danny

DHawthorne · March 2006

I recently solved (I hope) a very longstanding and irritating issue with a master that needed occasional reboots - like once every 3 days or so. The culprit was an RS-232 to Ethernet converter I was using. If the device lost its connection, the NetLinx did not always recognize it in a timely manner, and while it was in this not-connected-but-not-knowing-it-was-not-connected state, messages would pile up in the queue until things got sluggish and eventually failed. I ditched the converter and all has been well since (running over a month without the problem). So check into anything that uses an IP connection. One of the telltales, as far as I am concerned, was that when trying to reconnect to this device, I got a "device in use" message, even though I had previously gotten a "Connection failed" message. It seemed to hover in an intermediate state for a long time instead of disconnecting cleanly.

But anyway, the solution I had for reboots is that I put a timeline in that reset another timeline on a second master. If the second master didn't hear from the first, it fired a relay that rebooted it. That way, it only rebooted if there was a hang of some sort, or enough sluggish behaviour that it timed out.

A single-master solution along the same lines would be to put a timer relay on the power to the master, and use a heartbeat timeline and one of the onboard relays (or an IO port) to pulse it frequently enough that it stays closed all the time. If the pulses stop for any reason, the relay times out and resets the power. In effect, a deadman switch.
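Both variants above come down to the same retriggerable-timer logic: pulses keep the timer alive, and a missed deadline fires the power cycle. A minimal sketch of that logic follows, written in Python for illustration rather than NetLinx. The class name, method names, and the timeout value are hypothetical, and the actual relay wiring is left out.

```python
class DeadmanTimer:
    """Retriggerable timer: satisfied only while heartbeat pulses keep arriving.

    Hypothetical illustration; the timeout is an assumed value, not a
    NetLinx or relay setting taken from the thread.
    """

    def __init__(self, timeout):
        self.timeout = timeout    # seconds allowed between pulses
        self.last_pulse = None    # time of the most recent heartbeat
        self.resets = 0           # how many power cycles have been fired

    def pulse(self, now):
        """Heartbeat from the monitored master."""
        self.last_pulse = now

    def check(self, now):
        """Call periodically; returns True when a power cycle should fire."""
        if self.last_pulse is None:
            self.last_pulse = now   # start counting from the first check
            return False
        if now - self.last_pulse > self.timeout:
            self.resets += 1
            self.last_pulse = now   # re-arm so the master has time to boot
            return True
        return False
```

On a second master, `check` would run from a repeating timeline and a True result would close the reboot relay. With a hardware timer relay the same behaviour lives in the relay itself, so it still works when the only master is the one that hung.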

frthomas · March 2006

I think there is a reboot command available as well (check the technotes) that could be tried. You could schedule it to run every month (if the thing locks up every three months).

Fred

frthomas · March 2006

KennyProgram wrote:

Auto reboots are out of the question in my opinion.

Why? I mean, reading your post, it's apparently not the reboot that is the issue, it's having it happen automatically. In what way is a manual switch better?

Just curious

Fred

[Deleted User] · March 2006

Don't get me wrong, I was simply implying that rebooting on a schedule, regardless of whether there is a malfunction, was out of the question in my mind. Putting logic in code to trigger a reboot if there is a problem is fine. The solution I used was the manual approach because there was only 1 master in the system and there was no other way to trigger the reboot.

Why reboot when not needed?

And, even with scheduled reboots, the system may still lock up in between, thus not really accomplishing anything.

If the lockup is consistent then there are other issues to be resolved...

Joe Hebert · March 2006

Danny Campbell wrote:

I have a customer that has asked to have an automatic reboot of the masters on a scheduled basis.

They have one system that locks up after about 3 months even after the threshold and queues code has been added in. Now they want to reboot this and all of the other systems every day.

While this may address the symptoms for the one system, I feel uneasy about doing this every day, and on all systems. (100+).

Does anyone have any feeling (pro or con) about doing this?

To be quite blunt, and with no disrespect intended to the person making the decision, I would have a big problem accepting such a drastic overreaction. Reboot 100 systems once a day because 1 system locks up once every 3 months? Ouch! Talk about use a bigger hammer?

I can understand the client's expectation that the system should function every time it's put to use. But I have a hard time swallowing auto reboots. I'm not going out on any limb by saying I would much rather focus on getting to the root of the problem than covering it up by reboots. Sounds too Micro$oftish to me.

I expect more from an AMX system.

I'm sure you've already been down this path so don't feel obligated to answer any or all of these questions that are crossing my mind:

1) Are all 100 systems programmed exactly alike with the same hardware?
2) Is there any disk I/O performed on the system that locks up?
3) When you say locked up, does the output and/or input LED lock on solid?
4) Can you telnet into the locked up system?
5) What kinds of devices are in the system? Any IP? Lots of RS-232?
6) Are there many reoccurring TIMELINEs?
7) Have you tried monitoring messages to see if memory is leaking or any run-time errors occur?

If there is a 3 month pattern of failure then something must be wrong and more rocks need to be uncovered...

DHawthorne wrote:

In effect, a deadman switch.

I've heard of dead man's curve but never dead man's switch so naturally it had to be googled. Interesting background.

http://en.wikipedia.org/wiki/Dead_man's_switch

DHawthorne · March 2006

I would consider any type of auto reboot a temporary measure just to ensure a functioning system while the real problem is worked out. It's all very well to say you shouldn't do it, but the bottom line is always: will the customer be happy with a system that becomes unresponsive while you try to work out what is causing it? The answer is, only in a very non-critical application, and only if you are able to jump immediately every time there is a problem. It's just another tool in your toolbox, and you have to know when to use it, and when less drastic measures are more appropriate. I've only used it once, but that one time it really saved me a great amount of grief with the customer - he never even noticed the reboots, and eventually I found the problem and they were no longer necessary. As far as he was concerned, I fixed it the day I installed auto-reboots, not the day (months later) I actually really fixed it. But you do need to be careful to follow through, or it could very well come back to bite you in the butt.

Another thing, now that I am thinking about it, is I like to put i!-EquipmentMonitor in my jobs when I can to fire off an e-mail whenever a system starts up. That will alert me to potential problems, and in the above case, it alerted me when there was a reboot.

Hedberg · March 2006

Dave Hawthorne wrote

The culprit was an RS-232 to Ethernet converter I was using. If the device lost its connection, the NetLinx did not always recognize it in a timely manner

Do you remember which one of these devices that was? There seems to be a bunch of them out there, ranging from the very inexpensive to the quite expensive. I've been wondering how reliable they are in general and, when the inevitable interruption occurs, how good they are at regaining connections.

DHawthorne · March 2006

Hedberg wrote:

Do you remember which one of these devices that was? There seems to be a bunch of them out there, ranging from the very inexpensive to the quite expensive. I've been wondering how reliable they are in general and, when the inevitable interruption occurs, how good they are at regaining connections.

It was a VLinx from www.bb-elec.com. I don't believe the problem was with the device, to be honest, but with the NetLinx's detection of its connection. I don't rule out my code implementation of the connection process either; it was slapped together quickly, and pulled out again just as quickly when I was able to put a wired serial connection in its place.
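One possible guard against messages piling up while a link sits in limbo between connected and disconnected is to cap the outbound queue and stop queuing once the device has been silent for too long. The sketch below is Python for illustration only, not the NetLinx code discussed here. The class name, the queue limit, and the silence timeout are all hypothetical.

```python
from collections import deque


class GuardedSendQueue:
    """Outbound queue that refuses to grow while the link looks dead.

    Hypothetical illustration; max_pending and silence_timeout are
    made-up values, not NetLinx settings.
    """

    def __init__(self, max_pending=50, silence_timeout=30.0):
        self.max_pending = max_pending
        self.silence_timeout = silence_timeout
        self.pending = deque()
        self.last_heard = None   # time of the last data received from the device
        self.dropped = 0         # messages refused instead of queued

    def heard_from_device(self, now):
        """Record that the device sent something, proving the link is up."""
        self.last_heard = now

    def link_is_alive(self, now):
        """Trust the link only if the device spoke within the timeout."""
        if self.last_heard is None:
            return False
        return (now - self.last_heard) <= self.silence_timeout

    def send(self, message, now):
        """Queue a message; return True if queued, False if dropped."""
        if not self.link_is_alive(now) or len(self.pending) >= self.max_pending:
            self.dropped += 1
            return False
        self.pending.append(message)
        return True
```

The point of the design is that liveness is judged by what the device actually sends back, not by what the connection layer reports, so a stale "connected" state cannot keep the queue filling.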

Danny Campbell · March 2006

I don't mind answering the questions. Here goes....

1) Are all 100 systems programmed exactly alike with the same hardware?

No. This is the only one that locks up and it is a one-of-a-kind system that is naturally used by the top execs. I've given a brief description of it at the bottom of this post.

2) Is there any disk I/O performed on the system that locks up?

No.

3) When you say locked up, does the output and/or input LED lock on solid?

It has been so long that I can't remember. I do remember that the LEDs for the RS232 ports are all dead. No RX or TX.

4) Can you telnet into the locked up system?

No.

5) What kinds of devices are in the system? Any IP? Lots of RS-232?

No IP devices, but it does use IP to get to a database for a dialing directory for the video conferencing system. Lots of RS-232. All 7 ports on the NI-4000 and three COMM-2 cards.

6) Are there many reoccurring TIMELINEs?

I use three timelines in the main program. Two begin at startup and repeat constantly. The third is used as a shutdown timer. If there are no button presses for two hours after 5:30PM, the system shuts down. If a button is pressed, the timeline is killed and restarted if it is between 5:30PM and 7:00AM. There are also 10 modules in use. A few I had written but most are from AMX, so I don't know what they are doing.
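The shutdown-timer rule is easy to get subtly wrong because the 5:30PM to 7:00AM window wraps past midnight. Here is one reading of the rule as a Python sketch, not the actual NetLinx timeline code. The function names are hypothetical, and it assumes the two-hour count starts at 5:30PM or at the last button press, whichever is later.

```python
from datetime import datetime, time, timedelta

WINDOW_START = time(17, 30)       # 5:30 PM, from the post
WINDOW_END = time(7, 0)           # 7:00 AM, from the post
IDLE_LIMIT = timedelta(hours=2)   # two hours without a button press


def in_after_hours_window(t):
    """True between 5:30 PM and 7:00 AM; the window wraps past midnight."""
    return t >= WINDOW_START or t < WINDOW_END


def window_opened_at(now):
    """Most recent 5:30 PM at or before now."""
    start = datetime.combine(now.date(), WINDOW_START)
    if now < start:
        start -= timedelta(days=1)
    return start


def should_shut_down(now, last_press):
    """True when it is after hours and the idle timer has run two hours."""
    if not in_after_hours_window(now.time()):
        return False
    # Assumption: the timer starts at 5:30 PM or the last press, whichever is later.
    timer_start = max(last_press, window_opened_at(now))
    return now - timer_start >= IDLE_LIMIT
```

Under this reading, a room last touched at 4:00PM shuts down at 7:30PM, and a press at 11:30PM pushes the shutdown to 1:30AM.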

7) Have you tried monitoring messages to see if memory is leaking or any run-time errors occur?

I've suggested that we do this before we do any crazy rebooting. My biggest problem is that now whenever anything odd happens, they cry "lockup" and someone reboots the system without doing any diagnostic work at all. They will not leave the system in the locked state long enough for me to get there, and I'm only 15 minutes away.

This is a video conferencing room that has 8 cameras, the codec, eight 38" DLP cubes, one 60" DLP cube, a Zandar video processor, two Sierra A/V switches, a DSS receiver, a DVD player, a VHS player, and a couple of Vortex EF systems. The design was to have an auto tracking ability built into the system so one of the four eye-level cameras would point to whoever was speaking based on the input from one of 12 table-mounted microphones. After a few seconds of quiet, or when several people are speaking at once, the view would switch to a quad-view of the entire room using the Zandar and the wall cameras. This is controlled by the Vortex, and does cause more RS-232 I/O than the standard systems. However, this same feature is used in one other unique room which does not have any lockup issues. Did I also mention two separate speaker systems? Everything is RS-232 controlled. Basically, the table is a giant square with nothing in the center. The DLPs are mounted with two along each side (on the inside), so the participants can look down slightly to see the far side of the video conference plus the other monitor for the near side.

With the exception of the Zandar, all of the equipment used in this room is used in some of the other rooms. Much of the equipment used in this room is used in another room that uses 8 DLP cubes in a videowall arrangement, has fewer cameras, but adds a Yamaha surround receiver and a Barco into the mix. This is the same one that does the camera tracking trick that has not had a lockup.

I'm sure that there is some kind of memory leak type of issue that is killing the system, but there is no way to reproduce the problem on demand.

I believe that I've talked them out of rebooting all systems on a schedule, but I also believe that every morning someone goes in and reboots this one system. In fact, the reason I started this thread was to gather ammunition on why they should not do auto-reboots.

DHawthorne · March 2006

Load my logging module, it will keep a persistent log on the master that will survive reboots. If you turn on page flip feedback, and sprinkle your code with some judicious SEND_STRING 0 statements, you can probably pin down the sequence of events that leads up to the problem. The module is in this thread. You can FTP the master to retrieve the log.
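For anyone wondering what a reboot-surviving log amounts to, the general idea is an append-only file that is flushed on every write. The Python sketch below illustrates that idea only. It is not the module mentioned above, and the function name, file path, and size limit are hypothetical.

```python
import os
import time


def log_event(path, message, max_bytes=64 * 1024):
    """Append a timestamped line; roll the file over once it reaches max_bytes.

    Hypothetical illustration; max_bytes is an assumed limit.
    """
    if os.path.exists(path) and os.path.getsize(path) >= max_bytes:
        os.replace(path, path + ".old")   # keep one previous generation
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{stamp} {message}\n")
        f.flush()
        os.fsync(f.fileno())   # push to storage so a sudden lockup loses little
```

Calling something like `log_event("master.log", "page flip: Main")` at each point of interest leaves a trail whose last lines show what the system was doing just before it hung.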

Danny Campbell · March 2006

Dave,

Thanks. That's what I'm trying to get them to let me do. In fact, I downloaded your module a few days ago with this in mind.
