Home AMX User Forum AMX General Discussion

Offline and Online messages

Hi, i have a situation where we have 7 wireless units connecting to an NI-4100 running firmware v3_41_422.

There are actually many masters on this jobs site and all 8 of them connect to this NI-4100 via wired Lan, all of them point to the NI-4100 expect in one case the NI-4100 points to the other master. All the master use "Route Mode Direct" when connecting to another master. The NI-4100 having the Problem is System2

Anyway i get a lot of OFFLINE/ONLINE errors happening on the job site recently. I don't know what the issue is. Even the wired LAN devices seem to be dropping offline (not just the wireless). At first it was occuring about every 3 minutes some random device would drop offline and come back online, but it seems to have stabalized to about 3 times an hours. Its odd, and i'm looking for advice here, what could be wrong, on the surface the client rarely notices a device doesnt seem to respond, but largely the system seems to work. Below are some of messages i get in telnet from the NI-4100, the MVP-8400 and the MVP-5150s


*************************************************************************************************
*************************************************************************************************
*************************************************************************************************
NI-4100 TELNET
*************************************************************************************************
*************************************************************************************************
*************************************************************************************************

Welcome to NetLinx v3.41.422 Copyright AMX LLC 2008
>ip status
IP Status
NetLinx IP Connections
TCP Client connected on IP Port 23 LocalPort=17 IP=10.2.52.40 (socket=42)
TCP Client connected on IP Port 2001 LocalPort=12 IP=10.2.52.20 (socket=46)
TCP Client connected on IP Port 2001 LocalPort=13 IP=10.2.52.24 (socket=47)
TCP Client connected on IP Port 2001 LocalPort=14 IP=10.2.52.28 (socket=48)
TCP Client connected on IP Port 2001 LocalPort=15 IP=10.2.52.32 (socket=49)
TCP Client connected on IP Port 2001 LocalPort=16 IP=10.2.52.36 (socket=50)
TCP Client connected on IP Port 1632 LocalPort=10 IP=10.2.52.49 (socket=53)
TCP Client connected on IP Port 2001 LocalPort=11 IP=10.2.52.16 (socket=45)
TCP Client connected on IP Port 23 LocalPort=19 IP=10.2.52.44 (socket=41)

>show tcp
Show TCP List
The following TCP connections exist(ed):
1: IP=10.2.52.19:1370 Socket=24 (Active)
Last[MsgsInQ=0 RecvTime=0168704413 SendTime=0168704411]
2: IP=10.2.52.41:3602 Socket=25 (Active)
Last[MsgsInQ=0 RecvTime=0168704406 SendTime=0168704517]
3: IP=10.2.52.39:3602 Socket=26 (Active)
Last[MsgsInQ=0 RecvTime=0168704406 SendTime=0168704517]
4: IP=10.2.52.52:1026 Socket=27 (Active)
Last[MsgsInQ=0 RecvTime=0168705122 SendTime=0168705122]
5: IP=10.2.52.30:34305 Socket=29 (Active)
Last[MsgsInQ=0 RecvTime=0168704407 SendTime=0168704517]
6: IP=10.2.52.43:1061 Socket=31 (Active)
Last[MsgsInQ=0 RecvTime=0168704407 SendTime=0168704518]
7: IP=10.2.52.38:34417 Socket=33 (Active)
Last[MsgsInQ=0 RecvTime=0168704413 SendTime=0168704518]
8: IP=10.2.52.46:1026 Socket=34 (Active)
Last[MsgsInQ=0 RecvTime=0168704414 SendTime=0168704519]
9: IP=10.2.52.10:1319 Socket=32 (Active)
Last[MsgsInQ=0 RecvTime=0168708111 SendTime=0168708111]
10: IP=10.2.52.22:43857 Socket=36 (Active)
Last[MsgsInQ=0 RecvTime=0168704408 SendTime=0168704519]
11: IP=10.2.52.18:60448 Socket=39 (Active)
Last[MsgsInQ=0 RecvTime=0168704409 SendTime=0168704519]
12: IP=10.2.52.34:58684 Socket=40 (Active)
Last[MsgsInQ=0 RecvTime=0168704409 SendTime=0168704520]
13: IP=10.2.52.26:35642 Socket=51 (Active)
Last[MsgsInQ=0 RecvTime=0168704408 SendTime=0168704520]
14: IP=172.16.6.197:1145 Socket=43 (Active)
Last[MsgsInQ=0 RecvTime=0168704422 SendTime=0168704520]

>msg stats
Message Stats

Messages since boot:
ICSP messages rx from <10.2.52.19> = 73922
ICSP messages rx from <10.2.52.41> = 33695
ICSP messages rx from <10.2.52.39> = 33722
ICSP messages rx from <10.2.52.52> = 34647
ICSP messages rx from <10.2.52.30> = 375
ICSP messages rx from <10.2.52.43> = 33750
ICSP messages rx from <10.2.52.38> = 8747
ICSP messages rx from <10.2.52.46> = 32250
ICSP messages rx from <10.2.52.10> = 145124
ICSP messages rx from <10.2.52.22> = 1378
ICSP messages rx from <10.2.52.18> = 10338
Interpreter messages = 368612
ICSP messages rx from <10.2.52.34> = 2725
ICSP messages rx from <10.2.52.26> = 372
ICSP messages rx from <172.16.6.197> = 513

Collecting message statistics over 10 seconds. Please wait...

Messages in last 10 seconds:
ICSP messages rx from <10.2.52.19> = 4
ICSP messages rx from <10.2.52.41> = 2
ICSP messages rx from <10.2.52.39> = 2
ICSP messages rx from <10.2.52.52> = 4
ICSP messages rx from <10.2.52.30> = 2
ICSP messages rx from <10.2.52.43> = 2
ICSP messages rx from <10.2.52.38> = 2
ICSP messages rx from <10.2.52.46> = 2
ICSP messages rx from <10.2.52.10> = 5
ICSP messages rx from <10.2.52.22> = 2
ICSP messages rx from <10.2.52.18> = 2
Interpreter messages = 0
ICSP messages rx from <10.2.52.34> = 2
ICSP messages rx from <10.2.52.26> = 2
ICSP messages rx from <172.16.6.197> = 2

>show buffers max
Show Buffers

Thread TX RX Queued
---- ---- ----
Axlink 0
ICSNet 2 0 0-Sent=NO Waiting=NO
PhastLink 0 0 0-Sent=NO Waiting=NO
UDP 0 0-Sent=NO Waiting=NO
IPCon Mgr 0
Lontalk 0
LontalkHost 0

Con Manager 0
Interpreter 0
Device Mgr 0
Diag Mgr 0
Msg Dispatc 0
Cfg Mgr 0
Route Mgr 0
Notify Mgr 0
Java Router 0
---- ---- ----
Total 2 0 0 GrandTotal 2
>show buffers
Show Buffers

Thread TX RX Queued
---- ---- ----
Axlink 0
ICSNet 0 0 0-Sent=NO Waiting=NO
PhastLink 0 0 0-Sent=NO Waiting=NO
UDP 0 0-Sent=NO Waiting=NO
IPCon Mgr 0
Lontalk 0
LontalkHost 0

Con Manager 0
Interpreter 0
Device Mgr 0
Diag Mgr 0
Msg Dispatc 0
Cfg Mgr 0
Route Mgr 0
Notify Mgr 0
Java Router 0
---- ---- ----
Total 0 0 0 GrandTotal 0
>show mem
Display Memory
Volatile Free : 25392208/67108864 (largest free block in bytes/max physical)
NonVolatile Free: 1036814/1047536 (bytes free/max physical)
Disk Free :504913920/512196608 (bytes of free space/max physical)
Duet Memory Free : 2711856 (bytes)
Partition 1 - 2711856 (bytes)
Total Collections - 8
Average Time Between Collections - 18083683ms
Partition 2 - <UNKNOWN>
>cpu usage
CPU usage = 99.36% (30 sec. average = 98.84%, 30 sec. max = 99.85%)
>show mem
Display Memory
Volatile Free : 25392208/67108864 (largest free block in bytes/max physical)
NonVolatile Free: 1036814/1047536 (bytes free/max physical)
Disk Free :504913920/512196608 (bytes of free space/max physical)
Duet Memory Free : 2711856 (bytes)
Partition 1 - 2711856 (bytes)
Total Collections - 8
Average Time Between Collections - 18083683ms
Partition 2 - <UNKNOWN>
>show log all

Message Log for System 2 Version: v3.41.422
Entry Date/Time Object
Text


"CICSPTCP Rx connection to 10.2.52.26 has been closed locally or by peer"

"Closing connection to 10.2.52.26 due to transmission failure to system 2 device 0"

"Closing connection to 10.2.52.30 due to transmission failure to system 2 device 10014"

"Closing connection to 10.2.52.26 due to transmission failure to system 2 device 10013"

"Closing connection to 10.2.52.22 due to transmission failure to system 2 device 10012"

"Closing connection to 10.2.52.34 due to transmission failure to system 2 device 0"

"Closing connection to 10.2.52.34 due to transmission failure to system 2 device 10015"

"Closing connection to 10.2.52.38 due to transmission failure to system 2 device 0"

"Closing connection to 10.2.52.38 due to transmission failure to system 2 device 10016"

"Closing connection to 10.2.52.18 due to transmission failure to system 2 device 10011"

etc... etc... There are many Offline and Online messages that go with the lines above, but they are redundant.



*************************************************************************************************
*************************************************************************************************
*************************************************************************************************
Telnet Info from one of the MVP-8400s
*************************************************************************************************
*************************************************************************************************
*************************************************************************************************

11: 01-01-1970 THU 00:01:01 SocketError
Socket connect attempt failed: No route to host (errno=0x94)

>show connection stats

Connection Statistics
Total Last 15 Minutes
===== ===============
ICSP Messages: 1040263 received 4781 received
1040257 processed 4781 processed
6 dropped 0 droppped


Blink Messages: 31735 received 158 received
9 missed 0 missed


IP Statistics:

RX packets:297736 errors:0 dropped:0 overruns:0 frame:0
TX packets:225209 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:68678575 (65.4 Mb) TX bytes:16145201 (15.3 Mb)


Wireless Statistics:

Mode:Managed Frequency:2.437 GHz Access Point: 00:1A:1E:D0:B7:81
Link Quality=70/100 Signal level=84/255 Noise level=170/152
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:1205418
Tx excessive retries:36981 Invalid misc:84154 Missed beacon:0





*************************************************************************************************
*************************************************************************************************
*************************************************************************************************
Telnet Info from one of the MVP-5150s
*************************************************************************************************
*************************************************************************************************
*************************************************************************************************

ICSP Messages: 545314 received 302 received
545193 processed 302 processed
121 dropped 0 droppped


Blink Messages: 250900 received 141 received
145 missed 0 missed


IP Statistics:

RX packets:813411 errors:0 dropped:0 overruns:0 frame:0
TX packets:825983 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:195389452 (186.3 MiB) TX bytes:103906823 (99.0 MiB)


Wireless Statistics:

Mode:Managed Frequency:2.437 GHz Access Point: 00:1A:1E:D0:B7:81
Link Quality:76/100 Signal level:-49 dBm Noise level:-122 dBm
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0


>show connection log

Connection Log
Entry Date/Time Event
Text
1: 08-07-2009 FRI 17:56:22 MasterUrlConnect
Connected to (Sys=2) Master 10.2.52.45 (URL Mode)

2: 08-07-2009 FRI 17:56:21 SocketClosure
ICSPTCP End

3: 08-07-2009 FRI 17:56:18 MasterUrlConnect
Disconnected from (Sys=0) Master 10.2.52.45 (URL Mode)

4: 08-07-2009 FRI 17:56:18 MasterUrlConnect
Disconnected from (Sys=0) Master 10.2.52.45 (URL Mode)

5: 08-07-2009 FRI 17:56:18 SocketError
ICSP TCP Socket read error: iBytesRead==0

6: 08-07-2009 FRI 17:56:18 IcspMsgTimeout
Closing connection due to ICSP Message timeout

7: 08-07-2009 FRI 17:56:15 BlinkMissTimeout
Blink Messages Missed

8: 08-07-2009 FRI 17:56:01 BlinkMissTimeout
Blink Messages Missed

9: 08-07-2009 FRI 09:14:00 MasterUrlConnect
Connected to (Sys=2) Master 10.2.52.45 (URL Mode)

10: 08-07-2009 FRI 09:14:00 SocketClosure
ICSPTCP End

Comments

  • DHawthorneDHawthorne Posts: 4,584
    It could be any IP device causing the problem. Making IP connections seems to tie up the processor considerably while it is actually connecting, which in turn can cause timeouts with other connections, etc.

    The first thing I would check is to make double sure there are no master-to-master circular connections. If that's fine, you have to start looking at anything else making an IP connection (panel, media server,whatever applies) and see if any of them might be dropping off too frequently. Lastly, there is a possibility of an exceptionally busy piece of code causing timeouts. The latter two can be a real pain in the nethers to find; it seems to me I am more likely to figure such issues out chasing hunches than by hard logic.
  • DHawthorne wrote: »
    It could be any IP device causing the problem. Making IP connections seems to tie up the processor considerably while it is actually connecting, which in turn can cause timeouts with other connections, etc.

    The first thing I would check is to make double sure there are no master-to-master circular connections. If that's fine, you have to start looking at anything else making an IP connection (panel, media server,whatever applies) and see if any of them might be dropping off too frequently. Lastly, there is a possibility of an exceptionally busy piece of code causing timeouts. The latter two can be a real pain in the nethers to find; it seems to me I am more likely to figure such issues out chasing hunches than by hard logic.

    The audio guy said that AMX was supposedly connecting to his peavey program 3 times a second a month back. I never saw anything in AMX that would indicate this though, it could be the same issue of Network devices falling offline. Since the system does it even when not being used and if it's not Net devices causing the problem, I assume then it's something in mainline. Well, i'll look it over. Thanks for your help.
  • Spire_JeffSpire_Jeff Posts: 1,917
    Was this working fine on a previous version of firmware? I have a couple of similar situations, but I am not sure that they are exactly the same.

    I had one job in which I upgraded firmware (because I needed something in the new firmware) and the job went to the crapper shortly there after. I have since removed all of the new code and that didn't fix the problem (it would lock up after a couple days, but show no signs of memory leaks and have 1 message stuck in the interpreter queue). I have since downgraded the firmware and I am in the process of waiting to see if the problem occurs again. While I was running the 422 firmware, I did see some of the errors you mentioned. At one point, a different processor on the job locked up completely and had to be hard rebooted.

    On the job I am at currently, everything was working fine, then last week, everything fell apart. I am still not sure if this is a code issue or a hardware/firmware/ip conflict issue, but recently, I have been having problems with code that had been functioning properly on past jobs. I have yet to come up with concrete evidence as to the cause, but there is a small part of me that is leaning towards blaming the firmware. Unfortunately, I seemed to be the only one having issues according to tech support. Let me know if you find the cause, or if your code was running fine on a previous firmware version.

    Jeff
  • After making some changes, It seems to be working ok so far. This situation seems more of a "gets worse over time" issue.
    But after taking Hawthornes suggestion i went through all my mainline code and puts waits of 1 tenth of seconds on everything. Seems to work way more stably now. CPU usage is down and i haven't gotten and disconnects yet. Something in that mainline must have been running too much i think. I'm going to wait a day or 2 before i celebrate, but seems to be working better now. (After upgrading the firmware the system did lock up the next day, even though it had been running ok for weeks. But i don't know if it's really the firmware, i think the mainline code was a little too much. But again, i'll wait and see.)
  • Well, so far putting waits in mailing seemed to help a little bit, but the offline/online issue started occuring again. I set the master to "ethernet mode 10 half" instead of auto and it drastically cut down the number of Offline/Online events. Although there has been 1 Offline event today over the past few hours, i'm ok with that.
    Another strange thing is the CPU usage goes from 64% at startup and gradually grows over time of a few hours of the system idle'ing. It's at like 98% now and i think that's really weird. Why would it grow? It's like a memory leak, except... in the form of a processor leak. (I've got a piece of code in the program that says the processor has never dropped below running mainline 6070 times a minute, so the cpu definetly seems to be able to keep up). Anyway, i don't know how accurate the CPU usage indicator is because it seems to only change once ever hour or so. I was thinking maybe the code isn't processing named waits correctly and it's building them up and eventually clogging up the processor, but that's just me guessing. I'll see if anything new has happened tomorrow.

    And yes... the only time the system has locked up was about a day after i installed the 422 firmware, it only happened once. But i had no choice but upgrading, since the previous firmware didn't handle virtual feedback correctly, when using master to master.
  • viningvining Posts: 4,368
    If you telnet into the master and type in the spy command I believe that will give you the breakdown of the CPU usage. Not sure what good that will do ya and btw my office's master usage runs similar to yours. I don't know what it starts off as though but I would assume it would be low and then build as things come online.
  • DHawthorneDHawthorne Posts: 4,584
    While you're at it, do a "show mem" command periodically when you tell net in to check if there is a memory leak. "Getting worse over time" sure suggests that.
  • Spire_JeffSpire_Jeff Posts: 1,917
    I was in the same boat with virtuals not working correctly, so I had to upgrade to 422 as well. In my case, I did not run the CPU usage stuff as I was under the impression that it is a slight CPU hog in and of itself. I figured I would see noticeable slowdowns in the response on touch panels if there was a major CPU glitch. I did however monitor the memory usage as I have created more than one memory leak in my time. The memory did not budge. Not even after 5 days. (The java mem usage did shrink slightly on the 5th day, but I think that was due to a garbage collection).

    I did have to scale back some feedback to the panels, but this is already handled in time lines for the most part. The original problem was caused by an RF security sensor that was on the edge of the RF range. The sensor started failing and was generating a lot of update traffic. After I determined this to be a cause, I eventually disconnected the security panel completely and removed all associated code. The system still locked up a week later tho. This was all code that had been running fine for at least 8 months.

    I am wondering if they may have made some efficiency improvements in 422 firmware that allow more data to be sent which in turn could flood communication to the point of lockups? Maybe the processor can send stuff to the network faster than the network can negotiate and transmit?

    Jeff
  • Well, it's not a Memory Leak, i already check that. And over the past day only 2 Offline/Online events, which is pretty good. Seems like the "Ethernet Mode 10 half" has done the best so far is solving the Offline/Online issue.

    If anyone knows how to read the Spy Log, attached is what the spy log looked like when the system was having most of it's problems. I noticed that on the first pass of the spy log, almost all the Stats where low numbers but after letting it run like 7 times in a row, the numbers got bigger and bigger. Attached is the log after letting it run for about 7 times.
Sign In or Register to comment.