Buffers/queues overflow problem

Gordon.Fr Posts: 4

April 2014 in AMX Technical Discussion

Hello everybody!

We have run into a very complicated problem with AMX controllers and iPads... To describe it, I'll post my letters to AMX and TPControl here. If you have any ideas or guesses, we would appreciate any help...

Hello, AMX team!

We need your help and assistance to resolve a very complicated issue we have run into.

It is very important for us to find a solution as soon as possible... We have a hard deadline from our customer at the beginning of next week... So we would appreciate any help.

We've got 15+ systems, cascade topology, 10 iPads. Our system works fine for a few days, then it becomes unstable:
Some masters lose their M2M connections, e.g.:
(
Line 49 (06:59:50):: ICSPTCPRx14::CloseSocket: Closing Socket
Line 50 (07:00:14):: CICSPTCP Rx connection to 192.168.19.20 has been closed locally or by peer
),
some of them start losing and finding their iPads (online/offline events in diagnostics repeat one after another continuously), and some of them miss events (pushes, ONs/OFFs, etc.).

There is no known sequence that leads to this behavior. We have spent a lot of time watching the system, trying to reproduce the problem and to localize it. Here are some of our observations:

We have found that while somebody is using the system, the Grand Total in 'show max buffers' grows, and when nobody is using it the Grand Total stays the same.
We have found that when the system has crashed, one of the queues has overflowed (it could be Interpreter, Device mgr, or IpCon mgr) and the Grand Total is often more than ~4000-5000.
We have found that these buffers grow mostly when an iPad reconnects to its master. There are a couple of Wi-Fi access points across the house, and when an iPad switches between them, it reconnects to its master controller.
We have found that rebooting only the 'overflowed' controllers from NetLinx improves things for a short time, after which the buffers overflow again.
We have found that the iPads sometimes freeze. Even iOS itself can freeze. The surest fix in such cases is to reboot the iPad (Home+Power button), but in most cases restarting TPControl alone is enough.
We have found that a power reboot of all controllers and all iPads(!) makes the system 'fresh', so it works for another couple of days.

So we have some guesses about the reasons for this behavior.
1. We think that this problem is caused by communication between the iPads and the controllers.
2. We think that there is some kind of memory leak at the moment an iPad reconnects to its master.
3. We think that the size of the TPDesign project is somehow related to this problem.

We are relying on your help and expertise.

Tech Information

NI3100 firmware 4.1.404 / 1.3.10
NI700 firmware 3.6.453 / 1.3.10
NetLinx Firmware 3.3.1.525
TpControl, TpTransfer, iOS - the latest

The iPad GUI consists of 40 static pages and 344 popup pages. The GUI project size is 33.901 MB.
iPads are connected only to NI3100 masters.

See the attached logs.

NI3100 systems: 3,6,8,11,12,13,14,17,18,19,20
NI4100 systems: 15
NI700 systems: 2,5,9,10

In the logs you can see:
1. Logs right after reboot, before any of the iPads is turned on
2. Logs after some tests performed with iPad 10707 (this iPad is connected to system 20)
3. Logs after some more tests performed with iPad 10707 and iPad 10705 (this iPad is connected to system 19)
NOTE: Pay attention to 'show max buffers' for .19 and .20
We will send additional logs after the next round of tests...
---

Thank you very much.

Sincerely,
Vadim ...
Programmer ...

Hello, TPControl team!

We've got an AMX-based system with 15+ controllers, and 10 iPads to control it. The TPDesign project consists of 40 static pages and 344 popup pages. The project size is 33.901 MB.

The problem is that everything works fine at first, but after a couple of hours of testing the system, TPControl starts to misbehave:

textures, buttons or backgrounds can disappear, some of them can become transparent, and finally it freezes completely, sometimes making even iOS unresponsive to button presses. Sometimes the iOS home screen becomes 'invisible': all icons disappear. The only way out in these cases is to hard reboot the iPad by pressing the Home+Power buttons.

The second problem is that we have tried to reproduce this situation on a standalone controller with 1 iPad... and we have failed...

The third problem is that we've got controllers with overflowing queues, and it looks like this overflowing somehow correlates with the TPControl malfunction. It seems as if a malfunctioning iPad is somehow feeding the system with some kind of wrong data...

I have to note that we have failed to find any sequence that reproduces the crash. It seems that there is no specific sequence. It looks like TPControl crashes after some kind of 'total amount of use'.

I also have to say that, as it seems to me, the TPControl malfunction strongly correlates with the number of pages in the project: more pages in the project gives more frequent crashes, fewer pages makes crashes less frequent, but they still occur (ranging, for example, from 1-2 hours to 3-5 days).

It is very important for us to find a solution as soon as possible... We have a hard deadline from our customer at the beginning of next week... So we would appreciate any help.

We are relying on your help and expertise,
Sincerely,
Vadim

Tech Information

We have different models of iPads, here they are:
MD371B/A
MC769LL/A
MD515B/A

TPControl Application 2.4.5.0
TPTransfer 1.3.5.3
iOS version 7.0.1

AMX NI3100 Masters 3.60.453

I have to note that we have already updated the AMX firmware and the TPControl, TPTransfer, and iOS versions. The attached logs were gathered from the updated system.

24.04.2014 Logs 1 start.zip 32.5K

24.04.2014 Logs 2 some tests.zip 263K

25.04.2014 Log 3 some more tests.zip 137.6K

Comments

ericmedley Posts: 4,177

April 2014

How do you have your Master to Master configured? Have you run a terminal session and watched with the MSG ON ALL command to check for runtime errors? The system you describe should not be taxing anything. I've seen masters handle more TP and M2M connections.

vining Posts: 4,368

April 2014

Since you're using a cascade topology, are your masters set to route mode direct or are they still in route mode normal? It should be direct on "all" masters for cascade.

http://www.amx.com/techsupport/PDFs/919.pdf

Gordon.Fr Posts: 4

April 2014

Thank you for your responses!

You can find the URL lists and route modes attached.

Vining: thank you, they are all in direct mode...

Ericmedley: thank you, we will try to use MSG ON ALL... I'm afraid that this problem is not only about the number of TPs, but also about their type (iPads) and the GUI project size (in terms of pages).

url & route mode.zip 5.8K

DHawthorne Posts: 4,584

April 2014

Queues overflowing is an indication that something is sending too many messages for the master to process. In itself, it is not the problem, it is a symptom. It's possible it's a network problem with devices disconnecting and re-connecting at a rapid rate, because going online does generate a lot of messages for a UI device. However, I would open up the notifications on the system and make very sure there isn't too much activity being generated by the program. A run-time error that doesn't actually break the program can flood the processor. A recursive function can tie it down so other things don't get processed in a timely fashion. A module that eats up volatile memory can do it too. A channel that goes on and off too frequently ... text fields that get updated unnecessarily (i.e., they haven't changed) can do it. And you can get a fair indication of these if you watch the notifications. Don't assume that if your program does what you want that it is actually running properly.

I was at a demo presentation a few years back at my local AMX distributor (they aren't around anymore), and I connected to their master just to see what it was doing, and there was an endless stream of runtime errors. Because it was a tiny demo system, it didn't affect operations, but it would have killed a big system. You might simply be overtaxing the processor with little things you can get away with in a smaller job.

Of course, all this is might and maybe. Just a suggestion of another direction to look.
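
To illustrate the point about text fields being updated when nothing has changed, here is a minimal sketch in Python rather than NetLinx. It only shows the idea of remembering the last value sent per panel and address and skipping repeats. The class name `FeedbackCache`, the `send` callback, and the method names are all hypothetical and not part of any AMX API.

```python
class FeedbackCache:
    """Remember the last text sent to each (device, address) and skip repeats.

    `send` is a hypothetical callable(device, address, text) standing in for
    whatever actually delivers the update to the panel.
    """

    def __init__(self, send):
        self._send = send
        self._last = {}  # (device, address) -> last text sent

    def update(self, device, address, text):
        """Send `text` only if it differs from the last value sent.

        Returns True if a message was sent, False if it was suppressed.
        """
        key = (device, address)
        if key in self._last and self._last[key] == text:
            return False  # unchanged: no message, no queue load
        self._last[key] = text
        self._send(device, address, text)
        return True

    def forget(self, device):
        """Drop cached values for one device.

        Intended for when a panel comes back online and needs a full refresh.
        """
        for key in [k for k in self._last if k[0] == device]:
            del self._last[key]
```

Calling `forget` from the panel's online handler matters here: after a reconnect the panel may have lost its state, so the next update must go out even if the value is the same as before.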

Gordon.Fr Posts: 4

May 2014

DHawthorne, thank you very much for your response.

I have checked the system for almost all of the programming errors you mentioned. In most cases it was OK. But I've got some questions; could you please tell me a little bit more?

What do you mean by 'module that eats volatile memory'? What should I do to prevent it? If this is the case, does 'show mem' show it?
'show mem' shows me that at least 30% of memory is free now.

What do you mean by 'can flood the processor'? Could the processor somehow 'accumulate' anything bad? Because, as I've said before, we have a system which works fine for some period, and then it becomes sluggish or stops working. I have to mention that I don't see any run-time errors while the program is running.

We have the "Axonix MediaMax Module for AMX", which sends a lot of media data when activated. Besides this, we have more than 10 iPorts, which can also send media data. Finally, we have 10 Logitech players, which also send media data to the panels through the controller.

It seems that the problem lies in this media data traffic. I believe that soon we will have a chance to run some more tests. But what I still don't understand is:

Why does the system work fine for some period? What does it accumulate? And why?

Thank you very much for your help

DHawthorne Posts: 4,584

May 2014

Show mem will display 4 types of memory. NonVolatile is the second listing, and is the one I'm talking about. I had a personal memory glitch and typed volatile, but I meant non-volatile. There is less of it than any other type, and it is the default for a declared variable, so if the modules declare a lot of variables without specifying "volatile," you can easily run out and the processor will just stop working. You basically have to add the keyword VOLATILE in front of all your variables to prevent it (unless there is a specific reason for them to be non-volatile, which means they will keep their value after a reboot.) But usually this is going to be obvious right away, not after a period of time.

I agree that it looks like it may be your media data, and I don't have an easy answer except those modules may have to be optimized so they only request data when it's really needed, not all the time. If you didn't write them yourself, you are going to have to go back to whoever did.
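
One common way to cut this kind of traffic is to coalesce updates: keep only the newest value per field and pass it on at most once per interval. The sketch below is in Python, not NetLinx, and is only an illustration of that idea. The names `Coalescer`, `push`, `flush_if_due`, and `min_interval` are hypothetical and do not come from any of the modules mentioned in this thread.

```python
import time


class Coalescer:
    """Keep only the latest value per key and forward at most once per interval.

    `send` is a hypothetical callable(key, value) standing in for the real
    transport. `clock` is injectable so the behaviour can be tested.
    """

    def __init__(self, send, min_interval=0.5, clock=time.monotonic):
        self._send = send
        self._min_interval = min_interval
        self._clock = clock
        self._pending = {}       # key -> newest value not yet forwarded
        self._last_flush = None  # time of the last flush that sent something

    def push(self, key, value):
        """Record a new value; a newer value overwrites an older pending one."""
        self._pending[key] = value
        self.flush_if_due()

    def flush_if_due(self):
        """Forward pending values if the interval has elapsed.

        Returns the number of values forwarded. Call this from a periodic
        timer as well, so the last pending value is not left waiting.
        """
        now = self._clock()
        if (self._last_flush is not None
                and now - self._last_flush < self._min_interval):
            return 0
        count = len(self._pending)
        for key, value in self._pending.items():
            self._send(key, value)
        self._pending.clear()
        if count:
            self._last_flush = now
        return count
```

With this shape, a burst of ten title changes inside one interval costs the master one or two messages instead of ten, at the price of the panel seeing the final value slightly later.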
