Memory leak with 2.31.139 ?
DHawthorne
Posts: 4,584
in AMX Hardware
I just discovered something that has distressed me a bit, and I'm hoping to get some feedback on it. It may very well be related to some of the issues noted in the huge thread on NetLinx lockups.
I've got a big project with 22 Modero panels; 4 of them are MVP-8400s, and the rest are CV-12s. There are 7 NetLinx masters in the system: 3 are sort of central masters for each of the buildings on the estate, and 4 are in local theaters. One of those building masters has the task of coordinating the rest and keeping everything talking to each other. I try to load the code for specific devices on the master that is directly responsible for controlling them, but all the panels can access any of it, so the panels are referenced in the device lists on all of them.
OK, that's the background material.
One of my early modules (this system has evolved quite a bit in the two years it's been running) didn't really work very well, so I decided to bite the bullet and rewrite it. It controls 3 Escient Fireballs on the system - anyone familiar with them knows it's a chatty device when running. So, in my rewrite, I took pains to optimize panel refreshes and communications to keep the traffic to a minimum. When I went to load the new module, however, I found I was short on non-volatile memory. So I went through all the code and specified VOLATILE on all my variables that didn't specifically need to be persistent. After a bit of back-and-forth, I wound up with about 1.5M volatile free and 350K non-volatile - more than enough headroom, or so I thought. Everything seemed to be behaving, so I left it there. As an aside, though I'm not sure it has any bearing, I upgraded all my masters to firmware rev. 2.31.139, except for the sole Duet-capable one, on which I loaded 3.00.316. I also updated all the panels to the latest firmware available for each.
That was last week - Thursday, to be exact. Today (Tuesday) I went back for an unrelated update, and noticed some things seemed to have become unresponsive. For example, I could connect to my main master via Telnet, but NetLinx Studio would not connect in debug mode. So I checked the memory - non-volatile had stayed steady, but the volatile memory was down to 8200! I had to reboot to even be able to load new code.
There is something very fishy here - something is eating up memory. I have no disk writes going on; I'm only using the virtual hard drive to store a Homeworks database to read button labeling information from, and that never changes. Nothing does any actual writes with the exception of an Ademco alarm module (not my own, downloaded from AMX) that stores the password on the drive. An FTP listing confirms there are no other files on the drive.
I am completely at a loss here, and Tech Support only recommends I load the queue and threshold include file to optimize those. I really think there is a memory leak though - and in an interpreted language this should never happen. One other change, now that I think of it: all this started after the release of Studio 1.2. Before that, this system ran without any hitches I couldn't explain by plain buggy code; there were certainly no memory dropouts like I am seeing now. I left the site two hours before posting this - when I left, there was 14.M volatile memory. As of this moment, there is 951K. When I started this post, there was 998K, and there have been no intervening error messages or online/offline events; my Telnet session was open the entire time. Until I resolve this, I am going to have to reboot this system daily to prevent it locking up.
Follow up since original post in the Studio forum:
The Queue_and_Threshold_Sizes include file did not affect this problem, but monitoring the system closely on site, I noticed this in the system log:
1: 04-20-2005 WED 11:28:21 ConnectionManager Memory Available = 935448 <314016>
2: 04-20-2005 WED 11:18:14 ConnectionManager Memory Available = 1249464 <10120>
3: 04-20-2005 WED 11:14:40 Interpreter CIpEvent::OnLine 10001:11:4 CIpEvent::OnLine 10001:11:4
And previously in the day:
1: 04-20-2005 WED 10:44:31 Interpreter CIpDiag::CloseSession 32001:1:1
2: 04-20-2005 WED 10:44:23 Interpreter CIpDiag::OpenSession 32001:1:1
3: 04-20-2005 WED 10:44:22 Interpreter CIpEvent::OnLine 32001:1:1
4: 04-20-2005 WED 10:44:22 ConnectionManager Memory Available = 1512304 <28896>
5: 04-20-2005 WED 10:29:09 ConnectionManager Memory Available = 1541200 <698620>
6: 04-20-2005 WED 10:29:09 ConnectionManager Memory Available = 2239820 <174200>
7: 04-20-2005 WED 10:29:09 ConnectionManager Memory Available = 2414020 <73328>
Looks to me like the connection manager is at fault. Notice there are no error messages and no messages pending, yet there is a big drop between #1 and #2 in the first log, and between #5 and #6 in the second, though all of the entries show losses of some degree.
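For anyone who wants to check the numbers, here is a rough Python sketch (not NetLinx, and not anything AMX provides) that pulls the "Memory Available" readings out of log text like the above and compares the drop between consecutive readings with the value in angle brackets. The function names and the reading of the bracketed value as "bytes lost since the previous reading" are my own assumptions, based only on the arithmetic in these excerpts.

```python
import re
from datetime import datetime

# Matches entries such as:
# "04-20-2005 WED 11:28:21 ConnectionManager Memory Available = 935448 <314016>"
LINE_RE = re.compile(
    r"(\d{2}-\d{2}-\d{4}) \w{3} (\d{2}:\d{2}:\d{2}) ConnectionManager "
    r"Memory Available = (\d+) <(\d+)>"
)

def parse_memory_entries(log_text):
    """Return (timestamp, available, bracket_value) tuples, oldest first.

    Assumes the log lists the newest entry first, as in the excerpts above.
    """
    entries = []
    for m in LINE_RE.finditer(log_text):
        date, time, available, bracket = m.groups()
        stamp = datetime.strptime(f"{date} {time}", "%m-%d-%Y %H:%M:%S")
        entries.append((stamp, int(available), int(bracket)))
    entries.reverse()  # reverse rather than sort, so same-second entries keep log order
    return entries

def drops(entries):
    """Return (timestamp, computed_drop, bracket_value) for consecutive readings."""
    result = []
    for (_, prev_avail, _), (stamp, avail, bracket) in zip(entries, entries[1:]):
        result.append((stamp, prev_avail - avail, bracket))
    return result

# The two excerpts from the post, newest entry first.
LOG_LATE = """
1: 04-20-2005 WED 11:28:21 ConnectionManager Memory Available = 935448 <314016>
2: 04-20-2005 WED 11:18:14 ConnectionManager Memory Available = 1249464 <10120>
3: 04-20-2005 WED 11:14:40 Interpreter CIpEvent::OnLine 10001:11:4
"""

LOG_EARLY = """
4: 04-20-2005 WED 10:44:22 ConnectionManager Memory Available = 1512304 <28896>
5: 04-20-2005 WED 10:29:09 ConnectionManager Memory Available = 1541200 <698620>
6: 04-20-2005 WED 10:29:09 ConnectionManager Memory Available = 2239820 <174200>
7: 04-20-2005 WED 10:29:09 ConnectionManager Memory Available = 2414020 <73328>
"""

for name, text in (("early", LOG_EARLY), ("late", LOG_LATE)):
    for stamp, computed, bracket in drops(parse_memory_entries(text)):
        print(name, stamp.time(), "computed drop:", computed, "bracket:", bracket)
```

In both excerpts the computed drop between consecutive readings matches the bracketed value on the newer entry, so the bracketed number appears to be the amount lost since the previous report. That is only an inference from these few lines, not something documented.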
Comments
In any case, I'm back to thinking it somehow has something to do with my addition of all the VOLATILE keywords. Something is allocating memory and not giving it back, and it's related to the connection manager.
Kenny A
Chuck