Memory Leak...
DHawthorne
Posts: 4,584
I just discovered something that has distressed me a bit, and I'm hoping to get some feedback on it. It may very well be related to some of the issues noted in the huge thread on NetLinx lockups.
I've got a big project with 22 Modero panels; 4 of them are MVP-8400's, and the rest CV-12's. There are 7 NetLinx masters in the system; 3 are sort of central masters for each of the buildings on the estate, and 4 are in local theaters. One of those building masters has the task of coordinating the rest and keeping everything talking to each other. I try to load the code for specific devices on the master that is directly responsible for controlling them, but all the panels can access any of it, so the panels are referenced in the device lists on all of them.
OK, that's the background material.
One of my early modules (this system has evolved quite a bit in the two years it's been running) didn't really work very well, so I decided to bite the bullet and re-write it. It controls 3 Escient Fireballs on the system - anyone familiar with them knows it's a chatty device when running. So, in my re-write, I took pains to optimize panel refreshes and communications to keep the traffic to a minimum. When I went to load the new module, however, I found I was short on non-volatile memory. So I went through all the code and specified VOLATILE on all my variables that didn't specifically need to be persistent. After a bit of back-and-forth, I wound up with about 1.5M volatile free, and 350K nonvolatile - more than enough headroom, so I thought, and everything seemed to be behaving, so I left it there. As an aside, though I'm not sure it has any bearing, I upgraded all my masters to firmware rev. 2.31.139, except for the sole Duet capable one, to which I loaded 3.00.316. I also updated all the panels to the latest firmware available for each.
That was last week - Thursday, to be exact. Today (Tuesday) I went back for an unrelated update, and noticed some things had become unresponsive. For example, I could connect to my main master via Telnet, but NetLinx Studio would not connect in debug mode. So I checked the memory - non-volatile had stayed steady, but the volatile memory was down to 8200! I had to reboot to even be able to load new code.
There is something very fishy here - something is eating up memory. I have no disk writes going on; I'm only using the virtual hard drive to store a Homeworks database to read button labeling information from, but that never changes. Nothing does any actual writes with the exception of an Ademco alarm module (not my own, downloaded from AMX) that stores the password on the drive. An FTP list confirms there are no other files on the drive.
I am completely at a loss here, and Tech Support only recommends I load the queue and threshold include file to optimize those. I really think there is a memory leak though - and in an interpreted language this should never happen. One other change, now that I think of it, is that all this started after the release of Studio 1.2. Before that, this system ran without any hitches I couldn't explain by plain buggy code; there were certainly no memory dropouts like I am seeing now. I left the site two hours before posting this - when I left, there was 1.4M volatile memory. As of this moment, there is 951K. When I started this post, there was 998K, and there have been no intervening error messages or online/offline events; my Telnet session was open the entire time. Until I resolve this, I am going to have to reboot this system daily to prevent it locking up.
Comments
Is the master that is locking up daily the one with the 3.00.316 firmware?
An update: since my original post, 12 hours ago, it has dropped another 10K of volatile memory, down to 941K.
And I've fixed the problem too (I think...).
After racking my brain, I went through all my individual modules and removed any recursive CALLs. It was once my habit to use them to parse buffers since it would clear the buffer out very quickly, and there really should be no reason for it to cause trouble in an interpreted language as long as you take reasonable care that there is an absolute exit path from the recursion. Typically, I would use the token delimiter; test for it, and if it existed, recurse the call to process the next token. As soon as the buffer was empty, the recursion would cease, and this should happen very quickly...my buffers weren't big in the first place. But I took them out anyway, and just put the same test in mainline: check the buffer for a delimiter, and if it exists, parse the token. It's fractionally slower on a chatty device, but not enough to worry about with the current processor power (I did, by the way, need to do this with a routine on an old pre-260 master once upon a time; mainline did not clear the buffer fast enough, but recursing the token parsing did). It would seem, however, that the master was not always releasing the stack memory allocated for each call when it exited.
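As an illustration only, the mainline approach described above could look something like the sketch below. The device number, buffer size, carriage-return delimiter, and the ProcessToken() handler are all placeholders invented for the example; none of them come from the actual Fireball module.

```
DEFINE_DEVICE
dvFireball = 5001:1:0                // hypothetical serial port

DEFINE_VARIABLE
VOLATILE CHAR cFireballBuf[2000]     // filled by CREATE_BUFFER (size is a guess)
VOLATILE CHAR cToken[255]            // global scratch string for one token

// Hypothetical handler; the real token parsing would go here.
DEFINE_FUNCTION ProcessToken(CHAR cMsg[])
{
    SEND_STRING 0, "'Fireball token: ', cMsg"
}

DEFINE_START
CREATE_BUFFER dvFireball, cFireballBuf

DEFINE_PROGRAM
// At most one token per pass through mainline, no recursive CALL.
// Assumes $0D is the token delimiter.
IF (FIND_STRING(cFireballBuf, "$0D", 1))
{
    cToken = REMOVE_STRING(cFireballBuf, "$0D", 1)
    ProcessToken(cToken)
}
```

Each pass removes a single token (delimiter included) from the front of the buffer, so a burst of traffic is drained over several mainline passes instead of in one recursive chain.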
Anyway, my memory leak is gone, or at least the most egregious one...I still see some drops, but they seem to be related to the message queue; I'm not getting constant 11K dropoffs like I did before removing the recursive calls.
Now the question is, why did this start happening all of a sudden? I can only surmise it was introduced with Studio 1.2, or when I converted most of my variable space to VOLATILE. I'm going to leave that one to AMX; this is a live system, and I can't experiment on it. I'm happy just to have it not crashing and burning when it runs out of memory.
I have a module which opens an IP connection to the device. There are 30 instances of the module in the main program. Each message sent to the device involves two functions, each with one STACK_VAR. The parsing of the response is another function with one STACK_VAR. So each message and response should create (and destroy) 3 STACK_VAR's. Over the 30 modules, this equals 90 STACK_VAR's. The devices are polled every 1.5 seconds to keep the connection alive.
I was seeing 10-13K of memory drop every 5 seconds or so (viewed in Terminal, with 'msg on'). After quite a bit of troubleshooting, I removed the STACK_VAR's and defined the variables in the DEFINE_VARIABLE section instead. All memory drops stopped.
It seems like STACK_VAR's don't always release the memory when leaving the function.
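A rough sketch of that kind of change, using invented names (dvDevice, cCmd, SendPoll, and the 'POLL' string are hypothetical and not taken from the real module):

```
DEFINE_DEVICE
dvDevice = 0:3:0                 // hypothetical local IP port

DEFINE_VARIABLE
VOLATILE CHAR cCmd[100]          // was a STACK_VAR inside SendPoll()

// Before (hypothetical): scratch string allocated on every call.
// DEFINE_FUNCTION SendPoll()
// {
//     STACK_VAR CHAR cCmd[100]
//     cCmd = "'POLL', $0D"
//     SEND_STRING dvDevice, cCmd
// }

// After: same logic, but the scratch string lives in DEFINE_VARIABLE.
// Opening the IP connection is omitted from this sketch.
DEFINE_FUNCTION SendPoll()
{
    cCmd = "'POLL', $0D"
    SEND_STRING dvDevice, cCmd
}
```

The trade-off is that the variable is now shared by every call within that module instance, so a function can no longer assume it starts out empty.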
--D
As soon as I have some time, I will try to troubleshoot this. I make extensive use of STACK_VARs in some of my code and this could explain some of the weird problems I've been having.
Thanks,
Jeff