Memory Leak...

DHawthorne · April 2005

I just discovered something that has distressed me a bit, and I'm hoping to get some feedback on it. It may very well be related to some of the issues noted on the huge thread on NetLinx lockups.

I've got a big project with 22 Modero panels; 4 of them are MVP-8400's, and the rest CV-12's. There are 7 NetLinx masters in the system: 3 are sort of central masters for each of the buildings on the estate, and 4 are in local theaters. One of those building masters has the task of coordinating the rest and keeping everything talking to each other. I try to load the code for specific devices on the master that is directly responsible for controlling them, but all the panels can access any of it, so the panels are referenced in the device lists on all of them.

OK, that's the background material.

One of my early modules (this system has evolved quite a bit in the two years it's been running) didn't really work very well, so I decided to bite the bullet and re-write it. It controls 3 Escient Fireballs on the system - anyone familiar with them knows it's a chatty device when running. So, in my re-write, I took pains to optimize panel refreshes and communications to keep the traffic to a minimum. When I went to load the new module, however, I found I was short on non-volatile memory. So I went through all the code and specified VOLATILE on all my variables that didn't specifically need to be persistent. After a bit of back-and-forth, I wound up with about 1.5M volatile free, and 350K nonvolatile - more than enough headroom, so I thought, and everything seemed to be behaving, so I left it there. As an aside, though I'm not sure it has any bearing, I upgraded all my masters to firmware rev. 2.31.139, except for the sole Duet capable one, to which I loaded 3.00.316. I also updated all the panels to the latest firmware available for each.

That was last week - Thursday, to be exact. Today (Tuesday) I went back for an unrelated update, and noticed some things seemed to become unresponsive. For example, I could connect to my main master via Telnet, but NetLinx Studio would not connect in debug mode. So I checked the memory - non-volatile has stayed steady, but the volatile memory was down to 8200! I had to reboot to even be able to load new code.

There is something very fishy here - something is eating up memory. I have no disk writes going on; I'm only using the virtual hard drive to store a Homeworks database to read button labeling information from, but that never changes. Nothing does any actual writes with the exception of an Ademco alarm module (not my own, downloaded from AMX) that stores the password on the drive. An FTP list confirms there are no other files on the drive.

I am completely at a loss here, and Tech Support only recommends I load the queue and threshold include file to optimize those. I really think there is a memory leak though - and in an interpreted language this should never happen. One other change, now that I think of it, is that all this started after the release of Studio 1.2. Before that, this system ran without any hitches I couldn't explain by plain buggy code; there were certainly no memory dropouts like I am seeing now. I left the site two hours before posting this - when I left, there was 14.M volatile memory. As of this moment, there is 951K. When I started this post, there was 998K, and there have been no intervening error messages or online/offline events; my Telnet session was open the entire time. Until I resolve this, I am going to have to reboot this system daily to prevent it locking up.

[Deleted User] · April 2005

"As an aside, though I'm not sure it has any bearing, I upgraded all my masters to firmware rev. 2.31.139, except for the sole Duet capable one, to which I loaded 3.00.316. I also updated all the panels to the latest firmware available for each."

Is the master that is locking up daily the one with the 3.00.316 firmware?

DHawthorne · April 2005

dvalosek wrote:

"As an aside, though I'm not sure it has any bearing, I upgraded all my masters to firmware rev. 2.31.139, except for the sole Duet capable one, to which I loaded 3.00.316. I also updated all the panels to the latest firmware available for each."

Is the master that is locking up daily the one with the 3.00.316 firmware?

It isn't locking up daily; it takes more like a week to run out of memory. I'm just resetting daily because things stop working. But no, it's not the 3.0 master with the problem. That is a newer master (the job was started before the 260/64's were available - matter of fact, even before the 260's), and currently is in an ancillary location. I intend to switch them out though, so it becomes the central one.

An update: since my original post, 12 hours ago, it has dropped another 10K of volatile memory, down to 941K.

DHawthorne · April 2005

This is looking more and more like a firmware issue than a Studio issue, so I'm going to repost it there on the advice of one of the NS2 developers.

DHawthorne · April 2005

I've ruled out the firmware; I reverted back to 2.31.135 and still had a leak.

And I've fixed the problem too (I think...).

After racking my brain, I went through all my individual modules and removed any recursive CALL's. It was once my habit to use them to parse buffers, since recursion would clear the buffer out very quickly, and there really should be no reason for it to cause trouble in an interpreted language as long as you take reasonable care that there is an absolute exit path from the recursion. Typically, I would use the token delimiter: test for it, and if it existed, recurse the call to process the next token. As soon as the buffer was empty, the recursion would cease, and this should happen very quickly...my buffers weren't big in the first place. But I took them out anyway, and just put the same test in mainline: check the buffer for a delimiter, and if one exists, parse the token. It's fractionally slower on a chatty device, but not enough to worry about with the current processor power (I did, by the way, need to do this with a routine on an old pre-260 master once-upon-a-time; mainline did not clear the buffer fast enough, but recursing the token parsing did). It would seem, however, that the master was not always releasing the stack memory allocated for each call when it exited.
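
For anyone comparing the two approaches, here is a minimal sketch of the idea. It is written in Python rather than NetLinx, purely as an illustration; the function names, the carriage-return delimiter, and the `handle` callback are assumptions made up for the example and are not taken from the actual modules. Both versions pull complete tokens off the front of a buffer and leave any partial token behind; the only difference is that the loop version never nests calls.

```python
DELIM = "\r"  # assumed token delimiter; the real modules may use another


def parse_recursive(buffer, handle):
    """Recursive style: process one token, then call itself for the rest."""
    head, sep, rest = buffer.partition(DELIM)
    if not sep:
        return buffer  # exit path: no complete token left in the buffer
    handle(head)
    return parse_recursive(rest, handle)


def parse_loop(buffer, handle):
    """Loop style: the same delimiter test, but no nested calls."""
    while True:
        head, sep, rest = buffer.partition(DELIM)
        if not sep:
            return buffer  # leftover partial token, kept for the next pass
        handle(head)
        buffer = rest


if __name__ == "__main__":
    seen = []
    leftover = parse_loop("PLAY\rSTOP\rPAU", seen.append)
    print(seen, repr(leftover))
```

The mainline version described above amounts to running the body of that loop once per pass instead of draining the buffer in one go, which is why it is slightly slower on a chatty device.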

Anyway, my memory leak is gone, or at least the most egregious one...I still see some drops, but they seem to be related to the message queue; I'm not getting constant 11k dropoffs like I did before removing the recursive calls.

Now the question is, why did this start happening all of a sudden? I can only surmise it was introduced with Studio 1.2, or when I converted most of my variable space to VOLATILE. I'm going to leave that one to AMX; this is a live system, and I can't experiment on it. I'm happy just to have it not crashing and burning when it runs out of memory.

DHawthorne · April 2005

After clearing up the connection issues mentioned in another thread, all unexplained memory dropouts have ceased. It would seem the leak was caused by bad routing in the master-to-master communications. Clearing up the routing tables and re-entering them did not, however, regain the memory that was lost previously; it just kept it from eroding further. I am loath to reboot the system remotely at this point to reclaim that memory; it will keep until I am on site. But I have had a Telnet session open all day on this system remotely, and not a hiccup. No further ConnectionManager messages showing memory use, no errors, no online/offline events, and a periodic /show system confirms all the devices are still online.

dchristo · September 2005

I experienced what I believe to be a memory leak this past weekend with an NI-700, which I traced to the use of STACK_VAR's in functions. Here's the setup:

I have a module which opens an IP connection to the device. There are 30 instances of the module in the main program. Each message sent to the device involves two functions, each with one STACK_VAR. The parsing of the response is another function with one STACK_VAR. So each message and response should create (and destroy) 3 STACK_VAR's. Over the 30 modules, this equals 90 STACK_VAR's. The devices are polled every 1.5 seconds to keep the connection alive.

I was seeing 10-13K of memory drop every 5 seconds or so (viewed in Terminal, with 'msg on'). After quite a bit of troubleshooting, I removed the STACK_VAR's and defined the variable in DEFINE_VARIABLE section. All memory drops stopped.

It seems like STACK_VAR's don't always release the memory when leaving the function.
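
As a rough sanity check on those numbers, the arithmetic can be sketched like this (in Python, just as a calculator). The module count, STACK_VAR count, polling period, and 5-second window come from the description above; treating "K" as 1024 bytes, assuming every poll gets a response, and assuming every STACK_VAR leaks equally are all assumptions of the sketch, not measured facts.

```python
INSTANCES = 30            # module instances (from the post)
STACK_VARS_PER_POLL = 3   # two send functions plus one parse function
POLL_INTERVAL_S = 1.5     # polling period in seconds (from the post)
WINDOW_S = 5.0            # approximate window between observed drops
BYTES_PER_K = 1024        # assumption: "K" means 1024 bytes


def allocations_per_window():
    """STACK_VAR allocations across all instances during one window."""
    per_poll = INSTANCES * STACK_VARS_PER_POLL
    return per_poll * WINDOW_S / POLL_INTERVAL_S


def bytes_per_allocation(drop_k):
    """Bytes lost per allocation if every STACK_VAR leaked equally."""
    return drop_k * BYTES_PER_K / allocations_per_window()


if __name__ == "__main__":
    print(allocations_per_window())
    print(round(bytes_per_allocation(10), 1), round(bytes_per_allocation(13), 1))
```

Under those assumptions that is about 300 allocations per window and roughly 34 to 44 bytes lost per allocation, which is at least the right order of magnitude for a small variable that is not being released.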

--D

Spire_Jeff · September 2005

dchristo wrote:

I was seeing 10-13K of memory drop every 5 seconds or so (viewed in Terminal, with 'msg on'). After quite a bit of troubleshooting, I removed the STACK_VAR's and defined the variable in DEFINE_VARIABLE section. All memory drops stopped.

As soon as I have some time, I will try to troubleshoot this. I make extensive use of STACK_VARs in some of my code and this could explain some of the weird problems I've been having.

Thanks,

Jeff
