
Parse web data

I am trying to see in DEBUG what data is actually coming into the parser, but I can't see any values. Is this code structure correct?



PROGRAM_NAME='Tides'

Define_device
dvdevice1 = 0:19:0

define_constant
CRLF[2]= {$0D,$0A}

Define_variable
cZip[5] = '02134'
char cbuf1[99999]
char Tide_cWeather_Buffer[65535]
char tide_cWeather_Trash[65535]
char cClient_Connection_Status

Define_Function Tide_Client_Connection(STARTUP)
{
Ip_Client_Open(dvDevice1.Port,'tidesandcurrents.noaa.gov',80,1)

Wait 10
Tide_Grab_Data()
}
Define_Function Tide_Grab_Data()
{
Send_String dvDevice1,"'GET /geo/index.jsp?location=',cZip,CRLF"
Send_String dvDevice1,"'Host: tidesandcurrents.noaa.gov',CRLF"
//Send_String dvDevice1,"'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.5',CRLF"
// Send_String dvDevice1,"'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',CRLF"
Send_String dvDevice1,"CRLF"
}

Define_Function Tide_Parse_Data(Char cBuff1[65535])
{
Local_Var Char cTrash1[65535]
Local_Var Char cTemp1[65535]
Local_Var Char cWork1[65535]
Local_Var Integer j
Local_Var Integer z

j=0
(*

If(Find_String(cBuff1,'Predicted Tides:',1))
{
cTrash1 = Remove_String(cBuff1,'Predicted Tides:',1)
cTemp1 = Remove_String(cBuff1,'Predicted Tides:',1)
}*)
}

Define_Start
Tide_Client_Connection(cbuf1)

Define_Event
Data_Event[dvDevice1]
{
Online:
{
send_string 0,"'You are connected to the tide page'"
}
String:
{
If(Find_String(tide_cWeather_Buffer,'Predicted Tides:',1))
{
tide_cWeather_Trash = Remove_String(tide_cWeather_Buffer,'Predicted Tides:',1)
Tide_Parse_Data(tide_cWeather_Buffer)
}
}
Offline:
{
send_string 0,"'You have disconnected from the tide page'"
}
}

Comments

  • jjames Posts: 2,908
    Tell me you're trying to read the XML data from NOAA and not HTML.
  • davidv Posts: 90
    I am trying to get the HTML data. I looked at the source file and I am trying to parse certain data out to get some tide information. What did I do wrong?
  • jjames Posts: 2,908
    Parsing XML would be 10x easier than HTML - that's all. I just know how difficult HTML parsing is, and whenever there's an XML option I always opt for that. Take a look here, and select which XML feed you want. It'll probably be a bit more involved to get the XML - but once you do, it'll be much easier.
  • davidv Posts: 90
    I need tide information

    I need tide info based on Zip code location. I couldn't find any. This is the only place I could find tide information. Unless you know of a good website. I don't.
  • davidv Posts: 90
    No, I need times. I just need help figuring out how to get data parsed and stored into a buffer.

    I'm not worried so much about parsing as I am curious to figure out how to watch the data come in through the string into a buffer.

    Is my code correct?
  • ericmedley Posts: 4,177
    I don't see anywhere where you load the data into the buffer. Is there a 'CREATE_BUFFER' command that we cannot see?

    Or are you using buffer=DATA.TEXT somewhere?

    BTW: I'd use CREATE_BUFFER in this case since it's made to handle larger hunks of data.

    Or you can go into the Netlinx.AXI and modify data.text
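
    For illustration, the fix being hinted at here is one line in DEFINE_START (a sketch only, reusing the device and buffer names from the code above, not the original poster's exact program):

    Define_Start
    Create_Buffer dvDevice1, Tide_cWeather_Buffer   // every string the socket receives is now appended here
    Tide_Client_Connection(cbuf1)

    With the buffer bound to the IP port, the Find_String/Remove_String checks in the STRING handler finally have data to operate on.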
  • duet module for parsing html

    Since there is a lot of very efficient code for parsing xml/html available in java, I recently started to use duet modules for more complex tasks like parsing data structures.

    I've attached a duet module I recently wrote for xml parsing. I changed it in order to parse html in a very basic way. You are welcome to add your specific parsing logic in the HtmlParserWorker class (the module is not encrypted). In its current state the module simply extracts any html tag together with any content and sends it to the controller (as STRING). Have a look at the sample program and you'll see how it works.

    Patrick

    Edit: for some reason the system didn't let me attach the module file. Instead I put it on my website: http://www.wepcom.ch/amx/HtmlParser.AXW
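
    For anyone trying the module: wiring a Duet module into NetLinx follows the usual pattern of a virtual device plus DEFINE_MODULE, with the parsed output arriving as STRING events. The sketch below only assumes the names involved (check the sample program in the package for the real module name and device number):

    Define_Device
    vdvHtmlParser = 41001:1:0   // virtual device the Duet module is bound to

    DEFINE_MODULE 'HtmlParser_dr1_0_0' mdlHtmlParser(vdvHtmlParser)

    Define_Event
    Data_Event[vdvHtmlParser]
    {
        String:
        {
            // each extracted tag plus its content arrives here as one string
            Send_String 0, "'HTML: ',Data.Text"
        }
    }

    A page would then be requested with something like:
    Send_Command vdvHtmlParser, 'PARSEURL-http://tidesandcurrents.noaa.gov/geo/index.jsp?location=02134'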
  • vining Posts: 4,368
    Eric is right! You don't have the buffer Tide_cWeather_Buffer assigned as a buffer to your IP port so it will never hold your incoming data in your DATA_EVENT's string handler.

    Define it as a buffer in define_start for your IP device port and try again.

    Eric wrote:
    BTW: I'd use CREATE_BUFFER in this case since it's made to handle larger hunks of data.
    You could use data.text and append it to a large var (non-buffer), since the maximum MTU that can be returned in accordance with the Ethernet specs is 1500 and data.text is sized for 2048. Unless it is a very small web page, you will have to append it to a large var sized big enough for your needs in order to hold a full return prior to parsing. Typically you want to wait for some specific string or ending tag to make sure you have the complete data you're looking for before sending it to your parsing function.
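
    A bare-bones version of that data.text approach looks something like this (a sketch only; the variable size and the ending string are placeholders you would adjust for the page in question):

    Define_Variable
    char cPage[20000]   // sized for the portion of the page you actually need

    Define_Event
    Data_Event[dvDevice1]
    {
        String:
        {
            cPage = "cPage,Data.Text"              // append each packet (<= 2048 bytes) as it arrives
            If(Find_String(cPage,'</html>',1))     // wait for a known ending string before parsing
            {
                Tide_Parse_Data(cPage)
                cPage = ''                         // clear for the next request
            }
        }
    }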
  • vining Posts: 4,368
    David,

    Go to this thread: http://www.amxforums.com/showthread.php?t=4406&highlight=MTU

    There's some good info on data.text vs create_buffer that might be helpful. There is also complete code for scraping web pages that you can cut and paste into your code. In the define_constant section you'll need to put in the beginning search string and the ending search string, and it will start adding the data.text data returned to a var when it finds the "beginning string" and stop when it finds the "ending string", which then triggers the parsing function, which you'll have to fill in to match your needs. Right now it's just an empty function. It was set up to scrape Google weather just for testing, so you will need to replace that data with your NOAA stuff.

    It's already configured to collect data larger than 15999 bytes in case you have a large section of the web page that needs to be held in the var before parsing. Using this code will allow you to size the holding var just big enough to hold the data between the beginning and ending strings, whereas if you create a buffer you'll need to size it large enough to hold everything up to the ending string. Sometimes you just need a couple hundred bytes in the middle or at the end, so it's considerably more efficient.

    I used to be in the create a buffer camp but this thread converted me.
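
    The skeleton of that begin/end technique is roughly the following (the marker strings and the 16000-byte size are placeholders; the linked thread has the complete, tested version):

    Define_Constant
    char BEGIN_STR[] = 'Predicted Tides:'   // start collecting once this is seen
    char END_STR[]   = '</table>'           // stop and parse once this arrives

    Define_Variable
    char cCapture[16000]
    integer nCapturing

    Define_Event
    Data_Event[dvDevice1]
    {
        String:
        {
            If(!nCapturing && Find_String(Data.Text,BEGIN_STR,1))
            {
                nCapturing = 1
            }
            If(nCapturing)
            {
                cCapture = "cCapture,Data.Text"
                If(Find_String(cCapture,END_STR,1))
                {
                    Tide_Parse_Data(cCapture)   // holds only the section between the markers
                    cCapture = ''
                    nCapturing = 0
                }
            }
        }
    }

    Sizing cCapture for just that section is what makes this cheaper than a full-page CREATE_BUFFER.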
  • davidv Posts: 90
    yo

    Thanks guys
    I forgot to use a create_buffer in the define start.

    It's working great now.

    Thanks again
  • wengerp wrote: »
    Since there is a lot of very efficient code for parsing xml/html available in java, I recently started to use duet modules for more complex tasks like parsing data structures. [...]

    Hi Patrick,
    I downloaded your HtmlParser and it works great. The problem is that for some addresses I get the following error: ch.wepcom.htmlparser.dr1_0_0.Htmlparser: PARSEURL: IOExecption for: http://weather.yahoo.com/Nea

    The complete address I put in your module is the following:
    SEND_COMMAND vdvHtmlParser, 'PARSEURL-http://weather.yahoo.com/Nea-Ionia-Greece/GRXX0012/forecast.html?unit=c'

    If I use the following command it works ok:
    SEND_COMMAND vdvHtmlParser, 'PARSEURL-http://weather.yahoo.com/forecast/GRXX0004_c.html'

    I don't know where this error comes from.
    Since I cannot edit your module, I thought I'd let you know in case there is something you can fix.

    Best Regards,
    Costas Theoharis
  • wengerp Posts: 29
    Since I cannot edit your module, I thought I'd let you know in case there is something you can fix.

    Hi
    Sorry for the late answer - I enjoyed some days off over X-mas and new year...
    Attached you'll find a fixed version of the module, including source code. If you have access to a Duet development platform you are welcome to make any further changes as you need them. In fact, the module was mainly intended as a sample to show that existing Java code can be reused in NetLinx for common tasks. However, if you can use it as is, that's even better.

    Patrick

    PS: the problem was the '-' in the URL, which was treated as a command/parameter separator. A similar situation could still happen when using ',' in the URL.
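
    Until the module escapes those characters itself, a simple guard on the NetLinx side could look like this (purely illustrative; the function name is made up):

    Define_Function CheckAndParseUrl(char cUrl[])
    {
        If(Find_String(cUrl,'-',1) || Find_String(cUrl,',',1))
        {
            Send_String 0, "'HtmlParser: URL contains - or , and will be cut at the separator: ',cUrl"
        }
        Else
        {
            Send_Command vdvHtmlParser, "'PARSEURL-',cUrl"
        }
    }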
  • HTML Parser

    Hi Patrick,
    Thanks for your reply. I will test it and let you know if it's OK. I think it's a great tool for parsing HTML pages. Unfortunately I'm good with NetLinx code but not so good with Java (I haven't gotten into it yet). Maybe it's a good opportunity to start. Which editor do you use for programming such modules?

    Best Regards,
    Costas Theoharis
  • wengerp Posts: 29
    Which editor do you use for programming such modules?

    I'm using the Cafe Duet IDE. This is an Eclipse-based programming environment from AMX, which includes some additional plugins to develop Duet modules. However, Cafe Duet is not free. You can find more information about licensing on the AMX web pages and in this forum.

    The Eclipse IDE is an extremely productive programming environment for coding Java. When it comes to code refactoring or code completion tasks, it is by far superior to NS2; at least that's how I see it.

    Regards,
    Patrick
  • Parse web data

    Patrick,
    Thanks for your reply. Another last (I hope) thing I want to ask is whether you can increase the array index (I think it is now 500 lines) so that I can grab larger sites with a lot of info, such as stock prices (a size of 2000 or 3000 will be OK, I think). Right now, after line 501 I get the following error:

    Line 1776 (16:08:08):: line:926 tag::A text::$A91999-2000$C4$E7$EC$EF$F3$E9$EF$E3$F1$E1$F6$E9$EA$FC$F2 $CF$F1$E3$E1$ED$E9$F3$EC$FC$F2 $CB$E1$EC$F0$F1$DC$EA$E7 $C1.$C5.
    Line 1777 (16:08:08):: Ref Error ^STOCKINFO Index to large Line=125
    Line 1778 (16:08:08):: CopyString (Reference) - Error 2 S=0x1011 D=0x0000

    Sorry for my requests; you don't actually have to do that, but until I learn how to compile and edit your module I don't have an alternative, and your code is working really well for me.
    Thanks again.

    Regards,
    Costas Theoharis
  • wengerp Posts: 29
    Hi again

    I did some tests with long files and long tags today, but I wasn't able to reproduce the error you described. What URL did you try to parse?

    The error message doesn't look like a problem with the java code; uncaught errors in java modules result in lengthy error stack outputs. Do you use the sample netlinx program I provided in the package? If not, can you post your netlinx code as well?

    Best regards,
    Patrick

    PS: Since the further steps to isolate the problem are probably not of public interest, you can also post your e-mail address in a private message to me.