Parse web data
davidv
Posts: 90
I am trying to see in DEBUG what data is actually coming into the parser, but I can't see any values. Is this code structured correctly?
PROGRAM_NAME='Tides'

DEFINE_DEVICE
dvDevice1 = 0:19:0

DEFINE_CONSTANT
CRLF[2] = {$0D,$0A}

DEFINE_VARIABLE
cZip[5] = '02134'
char cBuf1[99999]
char Tide_cWeather_Buffer[65535]
char Tide_cWeather_Trash[65535]
char cClient_Connection_Status

DEFINE_FUNCTION Tide_Client_Connection(STARTUP)
{
    IP_CLIENT_OPEN(dvDevice1.Port,'tidesandcurrents.noaa.gov',80,1)
    WAIT 10
        Tide_Grab_Data()
}

DEFINE_FUNCTION Tide_Grab_Data()
{
    SEND_STRING dvDevice1,"'GET /geo/index.jsp?location=',cZip,CRLF"
    SEND_STRING dvDevice1,"'Host: tidesandcurrents.noaa.gov',CRLF"
    //SEND_STRING dvDevice1,"'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.5',CRLF"
    //SEND_STRING dvDevice1,"'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',CRLF"
    SEND_STRING dvDevice1,"CRLF"
}

DEFINE_FUNCTION Tide_Parse_Data(char cBuff1[65535])
{
    LOCAL_VAR char cTrash1[65535]
    LOCAL_VAR char cTemp1[65535]
    LOCAL_VAR char cWork1[65535]
    LOCAL_VAR integer j
    LOCAL_VAR integer z
    j = 0
    (*
    if(FIND_STRING(cBuff1,'Predicted Tides:',1))
    {
        cTrash1 = REMOVE_STRING(cBuff1,'Predicted Tides:',1)
        cTemp1 = REMOVE_STRING(cBuff1,'Predicted Tides:',1)
    }
    *)
}

DEFINE_START
Tide_Client_Connection(cBuf1)

DEFINE_EVENT
DATA_EVENT[dvDevice1]
{
    ONLINE:
    {
        SEND_STRING 0,"'You are connected to the tide page'"
    }
    STRING:
    {
        if(FIND_STRING(Tide_cWeather_Buffer,'Predicted Tides:',1))
        {
            Tide_cWeather_Trash = REMOVE_STRING(Tide_cWeather_Buffer,'Predicted Tides:',1)
            Tide_Parse_Data(Tide_cWeather_Buffer)
        }
    }
    OFFLINE:
    {
        SEND_STRING 0,"'You have disconnected from the tide page'"
    }
}
Comments
I need tide info based on ZIP code location. This is the only place I could find tide information; I couldn't find any other source, unless you know of a good website. I don't.
Will that work?
I'm not worried so much about parsing as I am curious about how to watch the data come in through the STRING event into a buffer.
Is my code correct?
Or are you using buffer=DATA.TEXT somewhere?
BTW: I'd use CREATE_BUFFER in this case since it's made to handle larger hunks of data.
Or you can go into the Netlinx.AXI and modify data.text
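A minimal sketch of the CREATE_BUFFER approach (device and variable names assumed from the posted code):

DEFINE_START

// CREATE_BUFFER ties the variable to the device port, so every
// incoming string from the socket is appended to the buffer
// automatically; no manual copying from DATA.TEXT is needed.
CREATE_BUFFER dvDevice1, Tide_cWeather_Buffer

Tide_Client_Connection(cBuf1)

The STRING handler then only has to scan Tide_cWeather_Buffer for the markers it cares about.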
Since there is a lot of very efficient Java code available for parsing XML/HTML, I recently started using duet modules for more complex tasks like parsing data structures.
I've attached a duet module I recently wrote for XML parsing. I changed it in order to parse HTML in a very basic way. You are welcome to add your specific parsing logic in the HtmlParserWorker class (the module is not encrypted). In its current state the module simply extracts every HTML tag together with its content and sends it to the controller (as a STRING). Have a look at the sample program and you'll see how it works.
Patrick
Edit: for some reason the system didn't let me attach the module file. Instead I put it on my website: http://www.wepcom.ch/amx/HtmlParser.AXW
Define it as a buffer in DEFINE_START for your IP device port and try again.
Eric wrote: You could use DATA.TEXT and append it to a large var (non-buffer), since the maximum MTU that can be returned per the Ethernet specs is 1500 bytes and DATA.TEXT is sized for 2048. Unless it is a very small web page, you will have to append it to a large var sized big enough for your needs in order to hold a full return prior to parsing. Typically you want to wait for some specific string or ending tag to make sure you have the complete data you're looking for before sending it to your parsing function.
Go to this thread: http://www.amxforums.com/showthread.php?t=4406&highlight=MTU
There's some good info on DATA.TEXT vs CREATE_BUFFER that might be helpful. There is also complete code for scraping web pages that you can cut and paste into your own. In the DEFINE_CONSTANT section you'll need to put in the beginning search string and the ending search string; it will start adding the returned DATA.TEXT to a var when it finds the "beginning string" and stop when it finds the "ending string", which then triggers the parsing function, which you'll have to fill in to match your needs. Right now it's just an empty function. It was set up to scrape Google weather just for testing, so you will need to replace that data w/ your NOAA stuff.
It's already configured to collect data larger than 15999 bytes in case you have a large section of the web page that needs to be held in the var before parsing. Using this code lets you size the holding var just big enough to hold the data between the beginning and ending strings, whereas if you create a buffer you'll need to size it large enough to hold everything up to the ending string. Sometimes you just need a couple hundred bytes in the middle or at the end, so it's considerably more efficient.
I used to be in the create a buffer camp but this thread converted me.
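A rough sketch of that begin/end-marker approach (the marker strings and variable names here are assumptions for illustration, not the actual code from the linked thread):

DEFINE_CONSTANT
cStartTag[] = 'Predicted Tides:'   // begin capturing at this string
cEndTag[]   = '</table>'           // stop capturing at this string

DEFINE_VARIABLE
char cCapture[20000]
integer bCapturing

DEFINE_EVENT
DATA_EVENT[dvDevice1]
{
    STRING:
    {
        cCapture = "cCapture,DATA.TEXT"   // append each incoming packet
        if(!bCapturing && FIND_STRING(cCapture,cStartTag,1))
        {
            // discard everything up to and including the start marker
            REMOVE_STRING(cCapture,cStartTag,1)
            bCapturing = 1
        }
        if(bCapturing && FIND_STRING(cCapture,cEndTag,1))
        {
            Tide_Parse_Data(cCapture)     // complete chunk received
            cCapture = ''
            bCapturing = 0
        }
    }
}

The holding var only ever contains the slice between the two markers, which is why it can be sized much smaller than a full-page buffer.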
Thanks guys
I forgot to use a create_buffer in the define start.
It's working great now.
Thanks again
Hi Patrick,
I downloaded your HtmlParser and it works great. The problem is that for some addresses I get the following error: ch.wepcom.htmlparser.dr1_0_0.Htmlparser: PARSEURL: IOExecption for: http://weather.yahoo.com/Nea
The complete address I put in your module is the following:
SEND_COMMAND vdvHtmlParser, 'PARSEURL-http://weather.yahoo.com/Nea-Ionia-Greece/GRXX0012/forecast.html?unit=c'
If I use the following command it works ok:
SEND_COMMAND vdvHtmlParser, 'PARSEURL-http://weather.yahoo.com/forecast/GRXX0004_c.html'
I don't know where this error comes from.
Since I cannot edit your module, I thought I'd let you know in case there is something you can fix.
Best Regards,
Costas Theoharis
Hi
Sorry for the late answer - I enjoyed some days off over X-mas and new year...
Attached you'll find a fixed version of the module, including source code. If you have access to a duet development platform, you are welcome to make any further changes as you need them. In fact, the module was mainly meant as a sample to show the possibility of reusing existing Java code in NetLinx for common tasks. However, if you can use it as is, that's even better.
Patrick
PS: the problem was the '-' in the URL, which was treated as a command/parameter separator. A similar situation could still happen when using ',' in the URL.
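To illustrate the failure mode (hypothetical parsing code, not the module's actual source): a handler that treats the next '-' after the command name as the end of the parameter will cut the URL short at the first hyphen, which matches the "IOExecption for: http://weather.yahoo.com/Nea" error above.

// Naive tokenizer using '-' as the command/parameter separator:
char cCmd[2048]
char cParam[2048]
cCmd = 'PARSEURL-http://weather.yahoo.com/Nea-Ionia-Greece/GRXX0012/forecast.html'
REMOVE_STRING(cCmd,'-',1)                              // strips 'PARSEURL-'
cParam = LEFT_STRING(cCmd,FIND_STRING(cCmd,'-',1)-1)   // leaves only 'http://weather.yahoo.com/Nea'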
Hi Patrick,
Thanks for your reply. I will test it and let you know if it's ok. I think it's a great tool for parsing HTML pages. Unfortunately I'm good in Netlinx code but not so good in Java (I haven't been into it yet). Maybe it's a good opportunity to start. Which editor do you use for programming such modules?
Best Regards,
Costas Theoharis
I'm using the Cafe Duet IDE. This is an Eclipse-based programming environment from AMX, which includes some additional plugins for developing duet modules. However, Cafe Duet is not free. You can find more information about licensing on the AMX web pages and in this forum.
The Eclipse IDE is an extremely productive programming environment for coding Java. When it comes to code refactoring or code completion, it is by far superior to NS2; at least that's how I see it.
Regards,
Patrick
Patrick,
Thanks for your reply. One last thing (I hope) I want to ask is whether you can increase the array index (I think it's currently 500 lines) so that I can grab larger sites with a lot of info, such as stock prices (a size of 2000 or 3000 should be OK). Right now, after line 501 I get the following error:
Line 1776 (16:08:08):: line:926 tag::A text::$A91999-2000$C4$E7$EC$EF$F3$E9$EF$E3$F1$E1$F6$E9$EA$FC$F2 $CF$F1$E3$E1$ED$E9$F3$EC$FC$F2 $CB$E1$EC$F0$F1$DC$EA$E7 $C1.$C5.
Line 1777 (16:08:08):: Ref Error ^STOCKINFO Index to large Line=125
Line 1778 (16:08:08):: CopyString (Reference) - Error 2 S=0x1011 D=0x0000
Sorry for my requests; you don't actually have to do this, but until I learn how to compile and edit your module I don't have an alternative, and your code is working really well for me.
Thanks again.
Regards,
Costas Theoharis
I did some tests with long files and long tags today, but I wasn't able to reproduce the error you described. What URL did you try to parse?
The error message doesn't look like a problem with the java code; uncaught errors in java modules result in lengthy error stack outputs. Do you use the sample netlinx program I provided in the package? If not, can you post your netlinx code as well?
Best regards,
Patrick
PS: Since the further steps to isolate the problem are probably not of public interest, you can also post your e-mail address in a private message to me.