Card slots not coming online

jjames · March 2010

I get your point Dave, but I've seen this problem for the past 3 years. This is not a little glitch that rarely gets noticed. I see this every single time I deploy a 4100. I still cannot believe it's not officially fixed.

Spire_Jeff · March 2010

jjames wrote: »

I get your point Dave, but I've seen this problem for the past 3 years. This is not a little glitch that rarely gets noticed. I see this every single time I deploy a 4100. I still cannot believe it's not officially fixed.

Just curious, but do you call into tech support every time this happens? Or at this point do you just deal with it and move on?

Jeff

jjames · March 2010

Spire_Jeff wrote: »

Just curious, but do you call into tech support every time this happens? Or at this point do you just deal with it and move on?

Jeff

I've called in several times in the past. I did mention it recently when I called in about another issue. I asked if it was supposed to have been fixed in the latest firmware update, and he had said "yes", so I told him that it still wasn't fixed. But for now, I just deal with it.

The problem is that it's so random, that I'm sure it's difficult to know what to fix.

DHawthorne · March 2010

jjames wrote: »

I get your point Dave, but I've seen this problem for the past 3 years. This is not a little glitch that rarely gets noticed. I see this every single time I deploy a 4100. I still cannot believe it's not officially fixed.

I hear you, and you are right, they need to fix it. I'm just pointing out you can't rightly apply it to the entire organization. They have a lot of stuff that is rock solid, and has been for years. I have done more than my share of ranting and raving over their lapses myself (mainly over Duet), but it doesn't stop me from recognizing the lapses are an exception, not the rule.

Spire_Jeff · March 2010

I just had a meeting with AMX Tech support regarding a really random lockup problem I am having and they were/are being extremely helpful. At the meeting, we were talking about some of the problems that have been discussed on the forums and one of them was not even on tech support's radar (odd delays on 3101). They reminded me that most of them don't follow the forums on a regular basis and that if the issues are not called into tech support, AMX probably doesn't know that the problem exists. I also think that part of the tracking involves how many people are reporting the problem and how frequently the problem occurs.

Reporting the problem every time it occurs can help move it up in priority, but more importantly, it allows details to be gathered about the system configuration and logged messages. This data can be very helpful in determining how to make the error occur and it can also help narrow down where the problem is occurring. Think of it this way... how much easier is it to help a client that says "When I push the Guide button on my cable box, nothing happens." versus "My TV is Broken."?

Just a little insight I realized during my meeting,
Jeff

P.S.
I used to have the problem with the 4100 cards not coming online about 3 out of 10 times. Since the last couple of firmware upgrades and the telnet command that was mentioned earlier, it has not happened again.

mpullin · March 2010

Spire_Jeff wrote: »

how much easier is it to help a client that says "When I push the Guide button on my cable box, nothing happens." versus "My TV is Broken."?

On the other hand "My TV is Broken" is more helpful than "there is no picture or sound" when the issue is "the remote has run out of batteries"

jjames · March 2010

Oh I hear you, and trust me - in about 6 years of doing this stuff, I have just under 300 calls to tech support. All of which, I can say with 98% certainty, were about bugs or needing an RMA. It has never been because I didn't know how something worked or because I needed coding support. Now, a few calls may have been with the R4 anomalies, but c'mon - that's not my fault, is it?

Point is - I completely agree: calling in is very important. This issue though . . . I've accepted it because I know how to deal with it and it doesn't seem like it's going to be fixed any time soon. Just like we all had to make our own work-around for the brokenness of GET_LAST forever until now, I'm content with my work-around with this problem. It's a bit aggravating, but oh well.

DHawthorne wrote: »

I hear you, and you are right, they need to fix it. I'm just pointing out you can't rightly apply it to the entire organization. They have a lot of stuff that is rock solid, and has been for years. I have done more than my share of ranting and raving over their lapses myself (mainly over Duet), but it doesn't stop me from recognizing the lapses are an exception, not the rule.

I agree. Here's my point of view and stance: AMX is such a great company - that when there are problems, they are much more noticeable. AMX in general has done such a great job with their products, that when a less-than-stellar piece comes out, I'm very bummed. Or when a terrible revision of Studio comes out (2.7), it angers me. Our (everyone here who actually sells AMX) clients are typically high-rollers, and they are expecting the best. So they turn to us for guidance and are confident that we know what the best is, so we provide them a solution and that solution is AMX. So when there is a hiccup on AMX's part, it potentially makes us look bad to the client who has put their faith in us to make the best decisions possible. I think this is why I get so aggravated sometimes: when you're used to near perfection and get less than that - it can put a thorn in your side quite easily.

Auser · March 2010

GSLogic wrote: »

This will help in SOME situations.

Force the master to wait before loading data_event online data
TELNET: set device holdoff on
query: get device holdoff

Whenever I've seen this issue (including the job I'm working on now) the systems have been fairly complex with lots going on when the controller reboots. The fix for me has always been to delay (or increase the delay of) the code that runs as the control system reboots. With less required of the master at that point in time, the ICSNET devices always seem to come online reliably.

That said, I tend to use self-rolled modules where I know what processing is occurring in DEFINE_START, etc. - I don't use many AMX provided modules where I have no visibility/control over what runs at boot time.

This behaviour does beg the question as to why the ICSNET devices don't come online once the processor utilisation drops to a lower level. I'm sure that the control system log would give clues, but I've never had the time to look into it.
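
A minimal sketch of the delayed-startup idea described above, assuming a hypothetical function fnDeferredStartup() that holds whatever initialization would otherwise run directly in DEFINE_START. The function name, the WAIT name and the 30-second figure are illustrative assumptions, not values anyone in this thread has confirmed:

```netlinx
PROGRAM_NAME='Deferred_Startup_Sketch'

DEFINE_CONSTANT

// Hypothetical delay. WAIT times are in tenths of a second, so 300 = 30 seconds
INTEGER STARTUP_DELAY = 300

// Hypothetical placeholder: move the heavier startup work here
// (initial SEND_COMMANDs, device polling, and so on)
DEFINE_FUNCTION fnDeferredStartup()
{
    SEND_STRING 0,"'Deferred startup code running'"
}

DEFINE_START

// Keep DEFINE_START itself light so the master has less to do
// while the ICSNET devices are trying to come online
WAIT STARTUP_DELAY 'DEFERRED_STARTUP'
{
    fnDeferredStartup()
}
```

How long a delay is needed will presumably depend on how much the rest of the program does at boot, so treat the value as something to tune per system.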

DHawthorne · April 2010

Just got bit by this one, on a 4100 running 3.41.414. Going to upgrade the firmware, and see what happens, since clearly an older version doesn't mean you won't have the problem.

vining · July 2010

Any update on a fix for this, new firmware or anything? It definitely is a real pain and seems like a roll of the dice on whether the NI4100 cards (slots) will come online. Most of my code already delays at start up so I don't know what else to do on my end except create a var to store an online flag for the cards, wait 5 minutes after start up and, if my online flag isn't set, reboot and continue until they come online. That and some duct tape should keep the system running smooth.

jjames · July 2010

I'd agree that 5 minutes would be a long wait, which is why I wait 30 seconds.

But to answer your question, I don't think a fix has been released. And to also add - it's not just 4x00s, it's any card frame. I've run NI-3100s with several IRS-4s and COM2s around the house and lo and behold, they don't appear online sometimes.

Putting in that flag has become standard.
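
A minimal sketch of the flag-and-reboot work-around described above, for a single device. The device address, the names and the debug message are assumptions for illustration; the 30-second wait follows the figure mentioned in this post:

```netlinx
PROGRAM_NAME='Card_Online_Flag_Sketch'

DEFINE_DEVICE

// Hypothetical address. Substitute the card or slot device you need to watch
dvCard = 1021:1:0

DEFINE_CONSTANT

// WAIT times are in tenths of a second, so 300 = 30 seconds
INTEGER CARD_CHECK_WAIT = 300

DEFINE_VARIABLE

// Set by the ONLINE handler below, cleared again if the card drops offline
VOLATILE INTEGER nCardOnline = 0

DEFINE_START

// After the wait, reboot the master if the card never reported online
WAIT CARD_CHECK_WAIT 'CARD_ONLINE_CHECK'
{
    IF (!nCardOnline)
    {
        SEND_STRING 0,"'Card not online after startup wait, rebooting master'"
        REBOOT(0:0:0)
    }
}

DEFINE_EVENT

DATA_EVENT[dvCard]
{
    ONLINE:
    {
        nCardOnline = 1
    }
    OFFLINE:
    {
        nCardOnline = 0
    }
}
```

Note that this sketch has no limit on the number of reboots, so a card that has genuinely failed would keep the master rebooting until the check is removed.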

John Nagy · July 2010

Thresholds and settings

4 years ago or more, the NetLinx Firmware product manager/guru looked over the things we were taking for granted in our quite complex software, listened to some of our issues (which included AXLINK and ICSNET items not coming online reliably), and gave us notes on what to do. While I don't understand most of them, we haven't seen -any- of those issues since.

What our code does is riff through the following items and check to see each value. If it is NOT as expected, we set it, then set a reboot flag for when the list completes. So, every Netlinx reboots an extra time - ONCE - then these values are set forever...

I'm the first to admit this is as much superstition as science. But it came from "the man" at the time, and it had the result we wanted, so this remains in our code today. Might it work for you? Let us know... comments welcome.

// Check and reset Interpreter Threshold to 2000
CHECK_THRESHOLD_SIZE(INTERNAL_THRESHOLD_INDEX_INTERPRETER,2000,"'Interpreter'");
// Check and reset Lontalk Threshold to 50
CHECK_THRESHOLD_SIZE(INTERNAL_THRESHOLD_INDEX_LONTALK,50,"'Lontalk'");
// Check and reset IP Threshold to 600
CHECK_THRESHOLD_SIZE(INTERNAL_THRESHOLD_INDEX_IP,600,"'IP'");
// Check and reset Interpreter Queue Size to 3000
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_INTERPRETER,3000,"'Interpreter'");
// Check and reset Notification Queue Size to 3000
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_NOTIFICATION_MGR,3000,"'Notification Manager'");
// Check and reset Connection Manager Queue Size to 3000
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_CONNECTION_MGR,3000,"'Connection Manager'");
// Check and reset Route Manager Queue Size to 200
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_ROUTE_MGR,200,"'Route Manager'");
// Check and reset Device Manager Queue Size to 500
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_DEVICE_MGR,500,"'Device Manager'");
// Check and reset Diagnostic Manager Queue Size to 500
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_DIAGNOSTIC_MGR,500,"'Diagnostic Manager'");
// Check and reset TCP Transmit Threads Queue Size to 600
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_TCP_TX,600,"'TCP Transmit Threads'");
// Check and reset IP Connection Manager Queue Size to 500
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_IPCONNECTION_MGR,500,"'IP Connection Manager'");
// Check and reset Message Dispatcher Queue Size to 500
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_MESSAGE_DISPATCHER,500,"'Message Dispatcher'");
// Check and reset Axlink Transmit Queue Size to 3000
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_AXLINK_TX,3000,"'Axlink Transmit'");
// Check and reset PhastLink Transmit Queue Size to 3000
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_PHASTLINK_TX,3000,"'PhastLink Transmit'");
// Check and reset ICSNet Transmit Queue Size to 500
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_ICSPLONTALK_TX,500,"'ICSNet Transmit'");
// Check and reset ICSP 232 Transmit Queue Size to 500
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_ICSP232_TX,500,"'ICSP 232 Transmit'");
// Check and reset UDP Transmit Queue Size to 500
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_ICSPIP_TX,500,"'UDP 232 Transmit'");
// Check and reset NI Device Manager Queue Size to 0
CHECK_QUEUE_SIZE(INTERNAL_QUEUE_SIZE_INDEX_NI_DEVICE,0,"'NI Device Manager'");

yuri · July 2010

John, isn't that from the queue and threshold include?

vining · July 2010

Yeah, that's line for line from the Q & T .axi and all the values are identical to the version I'm running and I obviously still have issues so there's got to be something else but what?

jjames wrote: »

I'd agree that 5 minutes would be a long wait, which is why I wait 30 seconds. 

But to answer your question, I don't think a fix has been released. And to also add - it's not just 4x00s, it's any card frame. I've run NI-3100s with several IRS-4s and COM2s around the house and lo and behold, they don't appear online sometimes.

Putting in that flag has become standard.

Really? You actually do flag and reboot? Did you try the duct tape approach? I was kinda kidding about doing this, but if there's nothing else to do, this sure beats parts of the system not working after a long power outage where the UPS can't maintain. Hmmm, maybe John will remember something else the guru did.

Jimweir192 · July 2010

I thought the Q&T include was superseded by a firmware revision some time ago... or am I just dreaming?

jjames · July 2010

Yes, we actually do put in the flag and reboot. There's nothing worse than hearing "Jay! The TV isn't turning on!!!!" after I just made a change in code and we're sitting in the client's house testing it all out . . . only to find that all of my auxiliary devices (COM2s, IRS4s) are offline and that a simple reboot fixes it. So now, it's standard. (If I remember to put it in - LOL!)

Spire_Jeff · July 2010

The queue and threshold sizes don't fix it in my experience. If you look at the buffer sizes during boot, you will see that in most cases, the queues are not even close to full. On large programs, this might not be the case, but overflowing queues will cause other problems. On large programs, I have found that increasing the queue sizes and thresholds along with delaying startup code and randomizing its execution is necessary.

I recently had to use the SET DEVICEHOLD ON command so that the processor does not let devices know that it is online until it is actually ready to deal with the devices. I was running a large program and when the processor was rebooted with 20 something touch panels and Zigbee gateways all online and ready to join, the processor would occasionally choke at boot.

Hope this helps,
Jeff

vining · July 2010

jjames wrote: »

Yes, we actually do put in the flag and reboot. There's nothing worse than hearing "Jay! The TV isn't turning on!!!!" after I just made a change in code and we're sitting in the client's house testing it all out . . . only to find that all of my auxiliary devices (COM2s, IRS4s) are offline and that a simple reboot fixes it. So now, it's standard. (If I remember to put it in - LOL!)

Do you stop rebooting after x number of failed attempts to get the cards online? I can see a card crapping out, and with this reboot program in the code the system would constantly reboot until that section of the code is pulled.

Where's this SET DEVICEHOLD ON command?

jjames · July 2010

I haven't gotten that sophisticated yet - though I should. When we first ran into this issue, I actually had it doing exactly that, doing all sorts of popups to warn the user of an issue and to contact us, etc. etc. I've gotten lazy in my old age - haha!

DEVICE HOLDOFF [ON/OFF] is a telnet command. To view the current state, use GET DEVICE HOLDOFF.

vining · July 2010

I looked in the telnet commands but I was looking for "SET xxxx" so I didn't scroll up to the top of the list.

I wrote a quick .axi file to handle verification of the card slots that will only make 2 attempts before giving up. Since I included this in my code and sent it to a remote system, the darn thing hasn't failed to come online, so I haven't been able to test the code, but it should work. It could use some pop-ups and/or an email notification to the office if it continues to fail and the max attempts are reached. That's for another day, since doing this half *** repair to fix the card slot issue wasn't on my list of things to do today, so now I get to play catch-up to get done what I was supposed to have been working on.

PROGRAM_NAME='VAV_CardSlots_Verify'

DEFINE_DEVICE

//defined in main
#IF_NOT_DEFINED NI4100_CARD_SLOTS
dvCardSlot_1_1		= 1021:1:0 ;
dvCardSlot_1_2          = 1021:2:0 ;
dvCardSlot_2_1          = 1022:1:0 ;
dvCardSlot_2_2          = 1022:2:0 ;
dvCardSlot_3_1		= 1023:1:0 ;
dvCardSlot_3_2          = 1023:2:0 ;
dvCardSlot_4_1          = 1024:1:0 ;
dvCardSlot_4_2          = 1024:2:0 ;
#END_IF

DEFINE_CONSTANT //WAIT TIME TO CHECK FOR ONLINE STATUS AND REBOOT IF REQUIRED

INTEGER CARDSLOTS_NUM_DEVS	= 8 ;	
INTEGER CARDSLOTS_MAX_REBOOTS	= 2 ;	
INTEGER CARDSLOTS_VERIFY_WAIT	= 900 ; //1-1/2 MINUTES

DEFINE_VARIABLE //CARD SLOT ARRAY, DEBUG, ONLINE & REBOOT VARS

VOLATILE INTEGER nCardSlots_Debug = 1 ;
#WARN 'nCardSlots_Debug = 1 in VAV_CardSlots_Verify.axi'

VOLATILE DEV dvCardSlot_Arry[CARDSLOTS_NUM_DEVS] = 
		    {
		    dvCardSlot_1_1,
		    dvCardSlot_1_2,
		    dvCardSlot_2_1,
		    dvCardSlot_2_2,
		    dvCardSlot_3_1,
		    dvCardSlot_3_2,
		    dvCardSlot_4_1,
		    dvCardSlot_4_2
		    }

VOLATILE INTEGER   nCardSlots_Online[CARDSLOTS_NUM_DEVS] = {0,0,0,0,0,0,0,0} ;
PERSISTENT INTEGER nCardSlots_Reboots = 0 ;

DEFINE_FUNCTION fnCardSlots_DeBug(CHAR iStr[])

     {
     if(nCardSlots_Debug)
	  {
	  STACK_VAR CHAR cCopyStr[1024] ;
	  STACK_VAR INTEGER nLineCount ;
	  
	  cCopyStr = iStr ;
	  
	  nLineCount ++ ;
	  WHILE(length_string(cCopyStr) > 100)
	       {
	       SEND_STRING 0,"'CardSlot Verify (',itoa(nLineCount),'): ',get_buffer_string(cCopyStr,80)" ;
	       nLineCount ++ ;
	       }
	  if(length_string(cCopyStr))
	       {
	       SEND_STRING 0,"'CardSlot Verify (',itoa(nLineCount),'): ',cCopyStr" ;
	       }
	  }
   
     RETURN ;
     }
     
DEFINE_FUNCTION fnCardSlots_Verify() 

     {
     if(nCardSlots_Reboots < CARDSLOTS_MAX_REBOOTS) //'<' (not '<=') so no more than CARDSLOTS_MAX_REBOOTS reboots are attempted
	  {
	  STACK_VAR INTEGER i ;
	   
	  for(i = 1 ; i <= CARDSLOTS_NUM_DEVS ; i++)
	       {
	       if(!nCardSlots_Online[i])
		    {
		    fnCardSlots_DeBug("'ONE OR ALL CARDS OFFLINE! First failed index position = ',itoa(i),', REBOOTING (attempt ',itoa(nCardSlots_Reboots),'). :DEBUG <',ITOA(__LINE__),'>'") ;
		    
		    nCardSlots_Reboots++ ;
		    REBOOT(0:0:0) ;
		    
		    RETURN ;
		    }
	       }
	  fnCardSlots_DeBug("'ALL CARDS ONLINE! No Reboot required! Number of attempts required = ',itoa(nCardSlots_Reboots),'. :DEBUG <',ITOA(__LINE__),'>'") ;
	  //nCardSlots_Reboots = 0 ; //this will clear only if it passes. Subsequent prog uploads will not attempt reboots if this fails
	  }
     else
	  {
	  fnCardSlots_DeBug("'ONE OR ALL CARDS OFFLINE! Maximum Reboot attempts exceeded.  Aborting further attempts! :DEBUG <',ITOA(__LINE__),'>'") ;
	  }
	  
     nCardSlots_Reboots = 0 ;//this will start again after next prog upload or reboot. 
	  
     RETURN ;
     }
     
DEFINE_START 

WAIT CARDSLOTS_VERIFY_WAIT 'CARDSLOTS_VERIFY'
     {
     fnCardSlots_Verify() ;
     }
     
DEFINE_EVENT   //DATA_EVENT [dvCardSlot_Arry]

DATA_EVENT [dvCardSlot_Arry]
     
     {
     ONLINE:
	  {
	  STACK_VAR INTEGER nDev_Indx ;
	  
	  nDev_Indx = GET_LAST(dvCardSlot_Arry) ;
	  fnCardSlots_DeBug("'ONLINE. Index Position-',itoa(nDev_Indx),', D:P:S-',fnDEV_TO_STRING(DATA.DEVICE),'. :DEBUG <',ITOA(__LINE__),'>'") ;
	  nCardSlots_Online[nDev_Indx] = 1 ;
	  }
     OFFLINE:
	  {
	  STACK_VAR INTEGER nDev_Indx ;
	  
	  nDev_Indx = GET_LAST(dvCardSlot_Arry) ;
	  fnCardSlots_DeBug("'OFFLINE. Index Position-',itoa(nDev_Indx),', D:P:S-',fnDEV_TO_STRING(DATA.DEVICE),'. :DEBUG <',ITOA(__LINE__),'>'") ;
	  nCardSlots_Online[nDev_Indx] = 0 ;
	  }
     }

Spire_Jeff · July 2010

device holdoff

Just an update as I am in the process of using this right now. The correct commands seem to be:

get device holdoff

and

device holdoff on

This does not work: set device holdoff on.

Jeff

jjames · July 2010

FYI - I've had no luck with DEVICE HOLDOFF as I'm using it at another job and I'm still running into this issue from time to time. I've heard that changing the Lontalk value could help - I forget which way though.

m.Berner · July 2012

Firmware v3.60.453 on an NI-4100, same issue: the card slots occasionally fail to come online after a reboot.

Trying the device holdoff command.

Manuel

mpullin · July 2012

jjames wrote: »

FYI - I've had no luck with DEVICE HOLDOFF as I'm using it at another job and I'm still running into this issue from time to time. I've heard that changing the Lontalk value could help - I forget which way though.

I always set the Lontalk threshold to the max. Its default is 50, max is 2000. I've never heard an explanation for why the default is still so low.

vining · July 2012

mpullin wrote: »

I always set the Lontalk threshold to the max. Its default is 50, max is 2000. I've never heard an explanation for why the default is still so low.

I don't recall this setting, is that in the queue and threshold .axi or through telnet?

mpullin · July 2012

vining wrote: »

I don't recall this setting, is that in the queue and threshold .axi or through telnet?

Telnet: set threshold

vining · July 2012

mpullin wrote: »

Telnet: set threshold

it's also in the Q&T.axi

CHECK_THRESHOLD_SIZE(INTERNAL_THRESHOLD_INDEX_LONTALK,50,"'Lontalk'");
