Server loses connection

Hi,
I have 30+ systems running very similar code. Two of them, both in network segments known as 'very busy', show a really odd behavior.
These controllers run 2-4 modules, each controlling one projector (beamer), one Global Caché unit, and in some cases one scaler. The UI is provided by the customer; I open a server and the UI connects as a client.
The problem is that sometimes (at intervals between 4 hours and 3 days) the connection is lost and reconnection does not work.
Logging the console messages shows the following:
(0000111918) (Reader=tSocketMan writer=tInterpreter)- CMessagePipe::Max = 25
(0000118272) (Reader=tSocketMan writer=tInterpreter)- CMessagePipe::Max = 50
(0000119066) Optimizing tSocketMan messages with 50 messages in queue.
(0000119067) Optimization tSocketMan complete with 50 messages now in queue. 0 messages avoided.
(0000121065) CMessagePipe::Write tSocketMan (PipeReader) this=0x00D9ECC0 has 50 Messages in queue (Deleting & Ignoring1)

Unfortunately, no offline event occurs. But I can run a 'ping-pong' watchdog: I send an ASCII 'ping' and get an immediate 'pong!' response.
When I send a ping I start a counter, which is reset after a pong is received. When the counter reaches 9, I execute an ip_server_close, wait some time (I tried intervals between 3 and 50 seconds), and reopen the server.
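The watchdog described above can be modeled as a small state machine. This is an illustrative Python sketch, not NetLinx code; `close_server` and `open_server` are hypothetical stand-ins for ip_server_close and the reopen step:

```python
# Illustrative model of the ping-pong watchdog described above.
# NOT NetLinx code: close_server/open_server are hypothetical stand-ins
# for ip_server_close and the delayed ip_server_open.

class PingPongWatchdog:
    THRESHOLD = 9  # missed pongs before the server is recycled

    def __init__(self, close_server, open_server):
        self.missed = 0
        self.close_server = close_server
        self.open_server = open_server

    def on_ping_sent(self):
        """Called each time a ping goes out (every 2-3 seconds here)."""
        self.missed += 1
        if self.missed >= self.THRESHOLD:
            self.close_server()   # ip_server_close equivalent
            # ... the real system waits 3-50 s here before reopening ...
            self.open_server()
            self.missed = 0

    def on_pong_received(self):
        """Any pong proves the link is alive; reset the counter."""
        self.missed = 0
```

The point of the counter is that a single lost pong is tolerated; only a sustained silence (9 missed pongs) triggers the close/reopen cycle.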

What happens is: after cleanup of this CMessagePipe, the server remains connected as reported by the console command ip status.
Closing the client on the other side does not change anything; unplugging the network cable or shutting down the other side does not help either. So far I have not found a way to reconnect to the server without rebooting.
Repeated attempts to close the server also have no effect.

Here is a report with ports 10001 and 10002 locked up:
IP Status
NetLinx IP Connections
TCP Server Listener on IP Port 9999 (backlog=1) (socket=27)
TCP Server waiting on IP Port 9999 LocalPort=10
TCP Server connected on IP Port 10002 LocalPort=12 IP=131.130.194.177 (socket=42)
TCP Server connected on IP Port 10001 LocalPort=11 IP=131.130.194.172 (socket=39)
TCP Client connected on IP Port 4998 LocalPort=26 IP=131.130.194.178 (socket=30)
TCP Client connected on IP Port 4998 LocalPort=25 IP=131.130.194.173 (socket=31)
TCP Client connected on IP Port 4999 LocalPort=17 IP=131.130.194.178 (socket=33)
TCP Client connected on IP Port 4999 LocalPort=19 IP=131.130.194.174 (socket=32)
TCP Client connected on IP Port 5000 LocalPort=18 IP=131.130.194.178 (socket=36)
TCP Client connected on IP Port 5000 LocalPort=16 IP=131.130.194.173 (socket=37)
TCP Client connected on IP Port 4999 LocalPort=21 IP=131.130.194.179 (socket=38)
TCP Client connected on IP Port 4999 LocalPort=15 IP=131.130.194.173 (socket=28)
TCP Client connected on IP Port 4352 LocalPort=24 IP=131.130.194.180 (socket=41)
TCP Client waiting on IP Port 4352 LocalPort=23

Sometimes the onerror routine is executed, reporting error 17 (Local port not open), while ip status reports a connected device (the correct one).

Any ideas?

Regards,
Alexander

Comments

  • champ Junior Member Posts: 261
    I was hoping someone else would answer this; here is some info that will hopefully point you in the right direction.
    Read Tech note 824 http://www.amx.com/techsupport/techNote.asp?id=824 - Queues and Thresholds explained.
    There are commands available in Telnet to assist in the tuning process. The command “show max buffers” displays the highest value the queues have reached. The command “show buffers” displays the current value in the queues. The command “set threshold” is used to change the upper limit of the thresholds. The command “set queue size” is used to change the upper limit of the queues.

    Read the tech note carefully and consider modifying the threshold sizes.
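    As a rough mental model of the queue/threshold relationship the tech note describes (a toy sketch, not AMX's actual implementation): the threshold is the point at which the writing thread is throttled so the reader can drain the queue before it hits its hard size limit.

```python
# Toy model of a queue with a throttling threshold (not AMX's code):
# the producer is refused (i.e. the writer thread stalls) once the
# threshold is reached, so the reader gets a chance to catch up.
from collections import deque

class ThrottledQueue:
    def __init__(self, threshold):
        self.q = deque()
        self.threshold = threshold

    def try_write(self, msg):
        """Return True if accepted, False if the producer must stall."""
        if len(self.q) >= self.threshold:
            return False  # over threshold: writer throttled
        self.q.append(msg)
        return True

    def read(self):
        """Reader drains the queue, releasing the writer again."""
        return self.q.popleft() if self.q else None
```

    With this picture, "make the queue sizes long and the threshold sizes short" means the writer stalls early, long before the queue itself can overflow.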
  • alexander77 Junior Member Posts: 17
    Hi Champ,

    thx for the reply.
    Here is a log from earlier:
    38: 09-21-2013 MON 13:49:05 Unknown
    CMessagePipe::Write tSocketMan (PipeReader) this=0x00D9ED30 has 50 Messages in queue (Deleting & Ignoring1)
    39: 09-21-2013 MON 13:49:03 Unknown
    Optimization tSocketMan complete with 50 messages now in queue. 0 messages avoided.
    40: 09-21-2013 MON 13:49:03 Unknown
    Optimizing tSocketMan messages with 50 messages in queue.
    41: 09-21-2013 MON 13:49:03 Unknown
    (Reader=tSocketMan writer=tInterpreter)- CMessagePipe::Max = 50
    42: 09-21-2013 MON 13:49:01 Unknown
    (Reader=tSocketMan writer=tInterpreter)- CMessagePipe::Max = 25
    43: 09-21-2013 MON 13:48:57 Interpreter
    Exiting TCP Read thread - closing this socket for local port 11
    44: 09-21-2013 MON 13:48:56 Interpreter
    Exiting TCP Read thread - closing this socket for local port 12
    45: 09-21-2013 MON 13:48:52 ConnectionManager
    Memory Available = 38767296 <131104>
    46: 09-21-2013 MON 13:48:40 Unknown
    DDD.handleDynamicDevice(): new DPD - 131.130.194.42

    It was recorded on a Saturday with no system activity; only the ping-pong should occur. This happens every 2 seconds on the older program versions and every 3 seconds on the newer versions.
    But the message pipe was filled with 25 messages within 4 seconds. There is nothing in the program that generates anything at that rate, especially with the rooms in standby.

    I've checked the queue sizes at that time, here is the report:
    >set queue size

    -- This will set the maximum message queue sizes for several threads. --
    -- Use caution when adjusting these values. --
    Set Queue Size Menu

    1. Interpreter (factory default=2000,currently=3000)
    2. Notification Manager (factory default=2000,currently=3000)
    3. Connection Manager (factory default=2000,currently=3000)
    4. Route Manager (factory default=400,currently=400)
    5. Device Manager (factory default=1500,currently=2000)
    6. Diagnostic Manager (factory default=500,currently=500)
    7. TCP Transmit Threads (factory default=600,currently=600)
    8. IP Connection Manager (factory default=800,currently=800)
    9. Message Dispatcher (factory default=1000,currently=1000)
    10. Axlink Transmit (factory default=800,currently=3000)
    11. PhastLink Transmit (factory default=500,currently=3000)
    12. ICSNet Transmit (factory default=500,currently=500)
    13. ICSP 232 Transmit (factory default=500,currently=500)
    14. UDP Transmit (factory default=500,currently=500)
    15. NI Device (factory default=500,currently=500)
    Enter choice or press ESC:
    >
    No limit was set to 25 or 50.

    I've also checked the buffers, the reading was:

    >get device traffic
    Cypher TX:942 AVG TX:647 RX:2500 AVG RX:1890
    >show buffers
    Show Buffers

    Thread TX RX Queued
    ---- ---- ----
    Axlink 0
    UDP 0 0-Sent=NO Waiting=NO
    IPCon Mgr 0

    Con Manager 0
    Interpreter 0
    Device Mgr 0
    Diag Mgr 0
    Msg Dispatc 0
    Cfg Mgr 0
    Route Mgr 0
    Notify Mgr 0
    Java Router 0
    ---- ---- ----
    Total 0 0 0 GrandTotal 0

    I've also checked the cpu usage. Here is a report after a reboot

    Welcome to NetLinx v4.1.373 Copyright AMX LLC 2012
    >cpu usage
    CPU usage = 0.00% (30 sec. average = 1.98%, 30 sec. max = 39.27%)
    >time
    17:58:49
    >cpu usage
    CPU usage = 0.00% (30 sec. average = 1.98%, 30 sec. max = 39.27%)
    >cpu usage
    CPU usage = 0.00% (30 sec. average = 1.98%, 30 sec. max = 39.27%)
    >cpu usage
    CPU usage = 0.00% (30 sec. average = 1.98%, 30 sec. max = 39.27%)
    >time
    17:59:12
    >cpu usage
    CPU usage = 0.00% (30 sec. average = 1.98%, 30 sec. max = 39.27%)
    >msg on all
    Extended diagnostic information messages turned on.
    >time
    17:59:46
    >cpu usage
    CPU usage = 0.00% (30 sec. average = 1.98%, 30 sec. max = 39.27%)
    >cpu usage
    CPU usage = 0.00% (30 sec. average = 1.98%, 30 sec. max = 39.27%)
    >cpu usage
    *** CPU usage statistics are not available at this time. The data collection ***
    *** process is being starved of execution, either due to the program running ***
    *** on the master or as a result of too many commands being entered on the CLI. ***
    >time
    18:00:14
    >

    The network is a managed network with verification by MAC address. Is there anything I should ask the IT department that could help me solve this problem?
  • champ Junior Member Posts: 261
    In the absence of a silver bullet, here is a machine-gun barrage of possible fixes.

    The Device Manager's queue size must be at least as large as the Interpreter threshold.
    When adjusting, make the queue sizes large and the threshold sizes small. Under heavy data traffic this will cause the interpreter to stall rather than overload the queues.

    The issue is often caused by network disconnections. When you disconnect a cable you don't get an offline event so the master continues to generate messages which back up.
    Updating feedback too fast can really back up a queue, so move your feedback into a 1-second timeline.

    Make sure you are clearing all buffers by handling all incoming messages; don't just leave unknown messages sitting in buffers.

    Check your URL lists; if you have created a loop with URL lists you will get strange network issues.
    Make sure you have a valid gateway set in the master, and if there is no gateway then don't define one.

    Turn UDP broadcast off, see tech note 544 for details http://www.amx.com/techsupport/techNote.asp?id=544

    Avoid events with button arrays of over 200 positions.

    Avoid using rebuild_event within half a second of the last one; this typically happens if you put rebuild_event in multiple modules.

    Try unplugging everything except the network and power; I saw a post where somebody had a serial cable causing a CMessagePipe::Max issue.

    Try doing a search on these forums for "CMessagePipe::Max" and see what has fixed other people's problems.

    When you finally solve your problem please let us know.
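    The "move your feedback into a 1-second timeline" advice above amounts to coalescing state changes and flushing only the latest value per channel once per tick. A hedged Python sketch of that pattern (`send_update` is a hypothetical transport callback, not a NetLinx API):

```python
# Sketch of rate-limited feedback: instead of transmitting every state
# change immediately, dirty channels are coalesced and flushed once per
# tick (e.g. a 1-second timeline). send_update is a hypothetical
# stand-in for whatever actually pushes feedback to the UI.

class FeedbackCoalescer:
    def __init__(self, send_update):
        self.pending = {}          # channel -> latest value
        self.send_update = send_update

    def set_state(self, channel, value):
        """Record a change; nothing is transmitted yet."""
        self.pending[channel] = value

    def flush(self):
        """Called once per timeline tick: send only the latest values."""
        for channel, value in sorted(self.pending.items()):
            self.send_update(channel, value)
        self.pending.clear()
```

    However fast the program changes state, the transmit queue now sees at most one message per channel per tick.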
  • alexander77 Junior Member Posts: 17
    A lot of things to check ...

    So let's start:
    DM and Interpreter are both set to 1500 (default), so this should be OK.

    The URL list is clear, no M2M connections.
    UDP broadcasts are at 0, Zeroconf is turned off.
    The longest button array is 17 buttons.
    rebuild_event is not used.
    Unplug everything except network and power: this is the default configuration. All controllers are located in the network cabinets; only in one special case is the controller located in the AV rack.

    There are some interesting posts when searching for CMessagePipe.

    When I count the updates and compare that with the time it took to reach the 50-message limit, this is most likely the source of the messages.

    Now let's look at the problem step by step:
    For some unknown reason the connection between server and client is lost without generating an offline event.
    The controller must be aware of this in some way, otherwise the message pipe would not fill up.
    When a clean-up of the message pipe is done, all server connections are lost (or all server connections are lost at the beginning; this is unclear). But not all IP traffic is lost; Telnet still works.
    The workaround does not work: an ip_server_close seems not to be executed, since ip status still reports an active connection. Sending data to that port results in an error 17.

    So the triggering issue that starts filling the pipe is unknown and must be found. This is most likely a network issue. I may be able to sort out some possible sources; the customer has a spare controller, which I can configure with only the servers on duty, no clients. This would rule out issues as described with an IP-to-serial converter. Let's see what happens then.
    The other thing is the server_close issue. If this happens for any reason, why is it impossible to kill the servers and restart them?

    If that works, the customer might not even notice that there are issues on his system; theoretically these reconnections could happen within seconds.
    The dirty method to work around it is to count the missing pongs and, after reaching a threshold, initiate a reboot.

    Is there a way to monitor a message pipe fill-up under program control, so I can stop feedback generation and initiate a server_close and reopen before the internal cleanup happens?

    I really miss something like a hardware- (or firmware-) based status feedback for these IP connections, coming from the lower layers of the protocol. The self-generated feedback based on online and offline events does not work properly. And after receiving an error 14 ('unknown') or error 2 ('general error'), is the connection online or offline?
  • champ Junior Member Posts: 261
    I've never tried reading a data event on 0:0:0 or 0:1:0, maybe that will catch diagnostics.

    Try changing the IP address of the offending masters; I have had strange network issues which went away after doing this. There could be some rogue device spamming junk to the IP address.

    Do a wireshark trace and see if there is anything spamming the network with multicast traffic.

    When you try the spare controller, leave a master on the network without any devices connecting and see if it gets CMessagePipe notifications. This could give you a clue whether the problem is caused by unrelated network traffic or by the AMX system.

    Ask the IT admin to block everything except port 1319 and the ports your UIs are connecting on.
    Maybe to minimise open ports you could make all UIs connect on the same port. NetLinx accepts multiple sockets on a single port as long as you define enough servers on the port.

    You said the client created the UIs, could they have a feedback loop or be sending too much data?
    Could you have a race condition with feedback between a level on a UI and the processor?

    Maybe put a router before your NetLinx. I know the IT admin won't like it, but you'll be able to rule out the issue coming from their switch, and it will give you absolute control of what traffic makes it to the master.

    I hope you solve this soon because I'm running out of suggestions.
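    The point above that one listening port can carry several client sockets is standard TCP behavior; in NetLinx terms it corresponds to opening several servers on the same IP port. A minimal Python illustration (the port is chosen by the OS and the client count is arbitrary):

```python
# Minimal demonstration that a single listening port accepts multiple
# simultaneous client sockets; in NetLinx this corresponds to defining
# several servers on the same IP port.
import socket

def run_demo(n_clients=3):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))     # 0 = let the OS pick a free port
    srv.listen(n_clients)          # backlog holds pending connections
    port = srv.getsockname()[1]

    # All clients connect to the SAME port before any accept() runs.
    clients = [socket.create_connection(("127.0.0.1", port))
               for _ in range(n_clients)]

    # Each accept() yields a distinct socket on the one listening port.
    accepted = [srv.accept()[0] for _ in range(n_clients)]

    for s in clients + accepted + [srv]:
        s.close()
    return len(accepted)
```

    Each accepted socket is independent, which is why a single shared UI port still needs one server definition per expected client.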
  • Joe Hebert Junior Member Posts: 2,154
    I have 30+ systems running very similar code. 2 of them, both in network segments known as 'very busy', show a really odd behavior...
    The problem is, that sometimes (intervals between 4 hrs and 3 days) the connection is lost and reconnection does not work.

    Have you talked to AMX Technical Support? There is a hotfix available for the NI-x masters and one of the items listed as a change is:

    - Fixed Ethernet driver to prevent lockups in high traffic networks

    I don’t know if the hotfix will solve your issue but it may be worth a shot.

    Note: If you plan on using Telnet you have to enable SSH. I had SSH disabled (the port was set to 0) and after I applied the hotfix Telnet didn't work anymore. It took a while to figure out the correlation between the two. I reported the SSH/Telnet bug and that should be fixed at a later date if it hasn't been already.
  • alexander77 Junior Member Posts: 17
    Hi Joe,
    Thx for the reply. In short: no, I did not contact support this time. The reason is simple: this spring I had a similar problem and many contacts with support, but was unable to solve it. The problem went away because I could move the system into another network segment, and as you surely know, never touch a running system. So I left it as it is.
    What I've done here: I used the spare system (located in another segment) and ported the program, without any change in behaviour. I tried both available versions of the master firmware (3.60.453 and 4.1.373), no change.
    Having no other ideas, I rewrote the program to run on a different brand of controller. This turned out to be the trigger for solving the primary problem: the error log there pointed to a faulty network device that flooded the network with nonsense messages. It is not a permanent solution, because this type of controller reboots up to four times an hour, but the error log did help.
    It leaves a bitter taste not being able to diagnose the problem with onboard resources, and that entries are missing in the error log system.
    And there are unanswered questions. Remember:

    I really miss something like a hardware (or firmware) based status-feedback of these ip-connections, coming from lower layers of the protocol. These self-generated feedbacks based on online and offline events do not work properly. And after receiving an error 14 ('unknown') or err 2 ('general error'), is the connection online or offline?

    I would like to add: what is an unknown error (which condition generates this error type), and what is a general error?

    Anyway, many thanks for the help I got, I will contact the support on monday.

    Cheers,

    Alexander

    Edit: Hot fix is for NI2100, 3100 and 4100 only, I run the NI700. Let's see what the support has at their hands ...
  • Joe Hebert Junior Member Posts: 2,154
    Hot fix is for NI2100, 3100 and 4100 only, I run the NI700. Let's see what the support has at their hands ...
    The *- Fixed Ethernet driver to prevent lockups in high traffic networks* hotfix is available for the NI-700. It's version 4.1.388. That was over a month ago so there may be an even newer hotfix by now.
  • champ Junior Member Posts: 261
    This is not a problem isolated to AMX-brand controllers.
    Handling excessive network traffic is very difficult for firmware writers, as the NIC must read the header of every single data packet to determine whether the packet is intended for it or not; this is a major reason why networks are segmented, so it annoys me that this is somehow considered the fault of AMX.
    The IT administrator is responsible for monitoring network traffic and ensuring a network is not suffering; they have tools for this. Do not blame AMX controllers for failing to diagnose what an IT administrator cannot.
    Isolate the NI from the network with another router and only NAT the user interfaces through the firewall; then your problems will go away.

    I'd be interested in seeing some reports of the quantity of network traffic that has caused this issue.
    I would imagine AMX would have done tests when writing firmware to combat it; I'd like to see if they came up with any metrics on the point where this problem starts to occur.
  • alexander77 Junior Member Posts: 17
    Hi Champ,

    First: yes, this problem is not isolated to AMX controllers, absolutely correct.

    But looking back at how all this began is important. It can always happen that connections are lost. The question is: how can I detect that, and can it be corrected? As you can read from the earlier posts, it looks like there is no answer to that question. Having a hotfix may decrease the number of situations where this happens, but the problem itself is not solved.

    The faulty device is logically connected to this controller as a client. The other brand reported 'sending packet to xxx failed'. I would expect a similar message from my NI-700. And before anyone asks: yes, 'onerror' is an integral part of every connection I establish. But that error logging remained clear.
    I can offer a short summary of the IT admins' monitoring of the faulty device:

    FastEthernet0/6 is up, line protocol is up (connected)
    Hardware is Fast Ethernet, address is 2037.06e0.2786 (bia
    2037.06e0.2786)
    MTU 1500 bytes, BW 10000 Kbit, DLY 1000 usec,
    reliability 255/255, txload 1/255, rxload 1/255
    Encapsulation ARPA, loopback not set
    Keepalive set (10 sec)
    Half-duplex, 10Mb/s, media type is 10/100BaseTX
    input flow-control is off, output flow-control is unsupported
    ARP type: ARPA, ARP Timeout 04:00:00
    Last input never, output 00:00:01, output hang never
    Last clearing of "show interface" counters never
    Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops:
    1018904
    Queueing strategy: fifo
    Output queue: 0/40 (size/max)
    5 minute input rate 0 bits/sec, 0 packets/sec
    5 minute output rate 8000 bits/sec, 9 packets/sec
    57760937 packets input, 4020292085 bytes, 0 no buffer
    Received 603762 broadcasts (0 multicasts)
    0 runts, 0 giants, 0 throttles
    0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
    0 watchdog, 0 multicast, 0 pause input
    0 input packets with dribble condition detected
    211100069 packets output, 18764586878 bytes, 0 underruns
    0 output errors, 3388 collisions, 1 interface resets
    0 babbles, 0 late collision, 0 deferred
    0 lost carrier, 0 no carrier, 0 PAUSE output
    0 output buffer failures, 0 output buffers swapped out

    Fa0/6 connected 398 a-half a-10
    10/100BaseTX

    This device is running at 10 Mbit half duplex, so nothing that triggers an alarm in a Gigabit network. In this particular case I cannot blame the IT admins for not doing their job.

    You posted some days ago:
    The issue is often caused by network disconnections. When you disconnect a cable you don't get an offline event so the master continues to generate messages which back up.

    What the master does is build up the message pipe and later do a 'cleanup', which drops at least all server connections (why all? If one cable is disconnected, only one connection should be involved). And again, the problem is here:
    an ip_server_close seems not to be executed, since ip status still reports an active connection. Sending data to that port results in an error 17.
    This is in my eyes a 'dirty situation' and can happen under different circumstances.
    Replacing that faulty device solves my immediate problem, but without any further action I have a 'ticking bomb'.

    My hope is that there will soon be a hotfix for the server_close and reconnect problem, since rebooting the controller is not a good solution. And a more accurate online feedback for all IP connections.

    Cheers,

    Alexander
  • alexander77 Junior Member Posts: 17
    Joe Hebert wrote: »
    The *- Fixed Ethernet driver to prevent lockups in high traffic networks* hotfix is available for the NI-700. It’s version 4.1.388. That was over a month ago so there may be an even newer hotix by now.

    I got the 4.1.400 firmware a few minutes ago. Thx for the hint.

    Very interesting to read the readme.txt:

    Fixed master lockup involving very fast device online/offline cycles
    Fixed Ethernet driver to prevent lockups in high traffic networks
    Fixed an Ethernet driver lockup issue that could occur in high ethernet
    traffic networks that was introduced in hotfix version v4.1.388.
    Fixed an TCP/IP socket lockup issue with IP_SERVER when a client ungracefully
    disconnects from the Master.

    Looks like the devs are reading the forum ...

    Cheers,

    Alexander
  • alexander77 Junior Member Posts: 17
    Still not stable

    Sorry to say, but the server still sometimes hangs, without any possibility to stop and restart it.

    Problem is not fixed.

    For testing the new firmware I added the faulty device on Friday evening and checked the behaviour. No change.
  • Gordon.Fr Junior Member Posts: 4
    The same thing...

    Hi, alexander77, is there any update on this topic?

    It seems that we have faced the same problem...