Memory leak with 2.31.139?

DHawthorne · April 2005

I just discovered something that has distressed me a bit, and I'm hoping to get some feedback on it. It may very well be related to some of the issues noted in the huge thread on NetLinx lockups.

I've got a big project with 22 Modero panels; 4 of them are MVP-8400s, and the rest are CV-12s. There are 7 NetLinx masters in the system: 3 serve as central masters, one for each of the buildings on the estate, and 4 are in local theaters. One of those building masters has the task of coordinating the rest and keeping everything talking to each other. I try to load the code for specific devices on the master that directly controls them, but all the panels can access any of it, so the panels are referenced in the device lists on all of the masters.

OK, that's the background material.

One of my early modules (this system has evolved quite a bit in the two years it's been running) didn't really work very well, so I decided to bite the bullet and rewrite it. It controls 3 Escient Fireballs on the system; anyone familiar with them knows they're chatty devices when running. So, in my rewrite, I took pains to optimize panel refreshes and communications to keep traffic to a minimum. When I went to load the new module, however, I found I was short on non-volatile memory. So I went through all the code and declared VOLATILE every variable that didn't specifically need to be persistent. After a bit of back-and-forth, I wound up with about 1.5M volatile free and 350K non-volatile, more than enough headroom, so I thought. Everything seemed to be behaving, so I left it there. As an aside, though I'm not sure it has any bearing, I upgraded all my masters to firmware rev. 2.31.139, except for the sole Duet-capable one, to which I loaded 3.00.316. I also updated all the panels to the latest firmware available for each.

That was last week (Thursday, to be exact). Today (Tuesday) I went back for an unrelated update and noticed some things seemed to have become unresponsive. For example, I could connect to my main master via Telnet, but NetLinx Studio would not connect in debug mode. So I checked the memory: non-volatile had stayed steady, but volatile memory was down to 8200! I had to reboot just to be able to load new code.

There is something very fishy here; something is eating up memory. I have no disk writes going on. I'm only using the virtual hard drive to store a Homeworks database to read button labeling information from, and that never changes. Nothing does any actual writes except an Ademco alarm module (not my own; downloaded from AMX) that stores the password on the drive. An FTP list confirms there are no other files on the drive.

I am completely at a loss here, and Tech Support only recommends I load the queue and threshold include file to optimize those. I really think there is a memory leak, though, and in an interpreted language that should never happen. One other change, now that I think of it: all this started after the release of Studio 1.2. Before that, this system ran without any hitches I couldn't explain by plain buggy code; there were certainly no memory dropouts like the ones I am seeing now. I left the site two hours before posting this; when I left, there was 1.4M of volatile memory. As of this moment, there is 951K. When I started this post, there was 998K, and there have been no intervening error messages or online/offline events; my Telnet session was open the entire time. Until I resolve this, I am going to have to reboot this system daily to prevent it from locking up.

Follow-up since the original post in the Studio forum:

The Queue_and_Threshold_Sizes include file did not affect this problem, but monitoring the system closely on site, I noticed this in the system log:

  1: 04-20-2005 WED 11:28:21 ConnectionManager
     Memory Available = 935448 <314016>
  2: 04-20-2005 WED 11:18:14 ConnectionManager
     Memory Available = 1249464 <10120>
  3: 04-20-2005 WED 11:14:40 Interpreter
     CIpEvent::OnLine 10001:11:4
     CIpEvent::OnLine 10001:11:4

And earlier in the day:

  1: 04-20-2005 WED 10:44:31 Interpreter           
     CIpDiag::CloseSession  32001:1:1
  2: 04-20-2005 WED 10:44:23 Interpreter           
     CIpDiag::OpenSession  32001:1:1
  3: 04-20-2005 WED 10:44:22 Interpreter           
     CIpEvent::OnLine 32001:1:1
  4: 04-20-2005 WED 10:44:22 ConnectionManager     
     Memory Available = 1512304 <28896>
  5: 04-20-2005 WED 10:29:09 ConnectionManager     
     Memory Available = 1541200 <698620>
  6: 04-20-2005 WED 10:29:09 ConnectionManager     
     Memory Available = 2239820 <174200>
  7: 04-20-2005 WED 10:29:09 ConnectionManager     
     Memory Available = 2414020 <73328>

Looks to me like the connection manager is at fault. Notice there are no error messages and no messages pending, yet there is a big drop between #1 and #2 in the first log, and between #5 and #6 in the second, though every entry shows a loss of some degree.
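As a cross-check of the readings above, here is a small sketch, assuming the bracketed number on each ConnectionManager line is the change since the previous report. The logged values are consistent with that reading (for example, 1249464 - 935448 = 314016). The script parses the second log and compares each bracketed figure with the difference from the next-older reading. The function names are hypothetical; this is not an AMX tool.

```python
import re

# Matches ConnectionManager lines such as "Memory Available = 935448 <314016>".
# Treating the bracketed number as "change since the previous report" is an assumption.
PATTERN = re.compile(r"Memory Available = (\d+) <(\d+)>")


def parse_readings(log_text):
    """Return (available, reported_delta) pairs in log order (newest first)."""
    return [(int(a), int(d)) for a, d in PATTERN.findall(log_text)]


def check_deltas(readings):
    """Pair each reading with the drop from the next (older) reading.

    Returns (available, reported_delta, computed_drop) tuples; computed_drop
    is None for the oldest entry, which has nothing older to compare against.
    """
    rows = []
    for i, (avail, delta) in enumerate(readings):
        older = readings[i + 1][0] if i + 1 < len(readings) else None
        drop = older - avail if older is not None else None
        rows.append((avail, delta, drop))
    return rows


# Entries 4-7 of the second log, copied from the post.
LOG = """
  4: 04-20-2005 WED 10:44:22 ConnectionManager
     Memory Available = 1512304 <28896>
  5: 04-20-2005 WED 10:29:09 ConnectionManager
     Memory Available = 1541200 <698620>
  6: 04-20-2005 WED 10:29:09 ConnectionManager
     Memory Available = 2239820 <174200>
  7: 04-20-2005 WED 10:29:09 ConnectionManager
     Memory Available = 2414020 <73328>
"""

if __name__ == "__main__":
    for avail, delta, drop in check_deltas(parse_readings(LOG)):
        print(avail, delta, drop)
```

For every entry that has an older neighbour, the bracketed figure matches the computed drop exactly, which supports reading it as a per-report change rather than a running total.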

DHawthorne · April 2005

Reverting to firmware 2.31.136 did not resolve the memory loss. I really hate it when a bunch of changes go in all at once and you can't figure out which is your culprit...

In any case, I'm back to thinking it somehow has something to do with my addition of all the VOLATILE keywords. Something is allocating memory and not giving it back...and it's related to the connection manager.

kennyann · April 2005

I am not sure whether this will help or not. I installed a system with two masters, one for the whole house and the other for the theater system. I installed AMX's alarm module; I cannot remember which one, but I think it could be the Ademco one. The system kept locking up after two or three days, and I could not figure out why. So we decided to hook up a computer to log all the information from the main controller, and found out that the AMX module was creating a problem. AMX changed it, and that apparently fixed our problem. It could be an issue with that module. The system was installed sometime in August 2004.

Kenny A

cwpartridge · April 2005

FYI, the connection manager simply checks the largest free memory block periodically and reports if it changes by some amount. In other words, it is only responsible for polling the largest free memory block and reporting it. I doubt it is the responsible party for this. There is little that can be inferred from this amount of info; I believe you will have to dig a little deeper. Sorry I can't be of more help.

Chuck
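To make that description concrete, here is a minimal sketch of a reporter that behaves the way Chuck describes: poll the largest free block, and log a line only when it has moved by at least some threshold since the last logged value. The function name, the threshold, and the choice to compare against the last reported value rather than the last sample are all assumptions; this is not AMX's firmware logic. The samples are simulated, not taken from the logs.

```python
from typing import Iterable, List, Optional, Tuple


def report_changes(samples: Iterable[int], threshold: int) -> List[Tuple[int, int]]:
    """Hypothetical model of the connection manager's memory reporting.

    Each sample is one poll of the largest free block. A (value, change) pair
    is emitted only when the value has moved by at least `threshold` bytes
    since the last *reported* value. The first poll just sets the baseline.
    """
    reports: List[Tuple[int, int]] = []
    last_reported: Optional[int] = None
    for value in samples:
        if last_reported is None:
            last_reported = value  # baseline; nothing logged yet
            continue
        change = abs(value - last_reported)
        if change >= threshold:
            reports.append((value, change))
            last_reported = value
    return reports


if __name__ == "__main__":
    # Simulated polls with an assumed 5000-byte threshold.
    print(report_changes([100000, 99000, 90000, 89500, 120000], 5000))
    # [(90000, 10000), (120000, 30000)]
```

Comparing against the last reported value means a slow drift of small losses still shows up once it adds up to the threshold. Also, because the reading is the largest free block rather than total free memory, it can drop from fragmentation alone, even when nothing has actually been leaked.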

DHawthorne · April 2005

cwpartridge wrote:

FYI, the connection manager simply checks the largest free memory block periodically and reports if it changes by some amount. In other words, it is only responsible for polling the largest free memory block and reporting it. I doubt it is the responsible party for this. There is little that can be inferred from this amount of info; I believe you will have to dig a little deeper. Sorry I can't be of more help.

Chuck

Oh bleh. I was never a big fan of the terse error reporting, and this reinforces that feeling. There should at least be a verbose mode...

DHawthorne · April 2005

If this is a firmware issue at all, it's been around a while; even reverting to .135 did not make it go away. But I am strongly leaning away from the thought that it is in the firmware at all. It's possible I always had this going on but never hit the wall with it, so to speak, but I somewhat doubt that is the case. I'm back to thinking it's a compiler problem: an interpreted language should not have memory leaks, period, even if the programming is, shall we say, deficient. Not that I believe that is true either.

DHawthorne · May 2005

Upon upgrading the master on this system to an ME-260/64 and installing firmware rev. 3.00.316, I find I still have memory disappearing over time. It's far less frequent, and the new master provides a lot more headroom, but I still have a problem that requires a reboot periodically. I strongly suspect it is in the socket management. It's not consistent, but I have noticed diagnostic messages indicating a memory allocation change when I restart a panel connected to the system. Since that same panel had previously been connected and in use, I fail to see why it should need additional memory on a restart. Yet it does not happen every time, and I can't rule out that the memory message was just a random coincidence with the restart.
