HTML interpreter

vining · September 2006

Has anyone else noticed that when you retrieve an HTML source from a URL, the NetLinx processor does a poor job interpreting it?

If you go to a URL and view the site's source in an editor like Notepad++ (free from SourceForge), you get nice clean source code displayed, but if you open a client session from NetLinx and load it into a buffer, there are considerable interpreter errors, which makes it one pain in the a$$ to parse.

From buffer:

<td width="24" class="content_txt" valign="top"><b>95</b></td>CRLF
<!-- Check if there is a note associated with the channel -->CRLF
<td width="215" class="content_txCRLF
1001CRLF
t">CRLF
PPV The Hot Network (Adult)*<b>(Note 1)</b>CRLF

Viewed source in NotePad++:

<td width="24" class="content_txt" valign="top"><b>95</b></td>CRLF
<!-- Check if there is a note associated with the channel -->CRLF
<td width="215" class="content_txt">CRLF
CRLF
PPV The Hot Network (Adult)*<b>(Note 1)</b>CRLF

I really didn't intentionally pick this particular channel. I swear!

There are numerous examples of interpretation errors that NetLinx is adding into the URL's source, making it a real pain to parse. One thing I'm not showing here is the "TAB" character "9", which was used profusely in the HTML.

Is it a timing thing, and the CRLF is appended prematurely? What's 1001? Where does that come from? Maybe line feed (LF) "10" and start of heading (SOH) "01"?

I had to spend half my time writing the code and the other half compensating for the errors. Any fixes?
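One possibility worth ruling out, offered here as an assumption rather than anything confirmed in this thread: a hexadecimal number sitting alone between two CRLFs in the middle of the markup is what HTTP/1.1 chunked transfer encoding looks like on the wire. Each chunk is preceded by its size in hex and followed by a CRLF, and hex 1001 is 4097 bytes. A browser strips that framing before showing the source; a raw socket buffer does not. The sketch below is in Python rather than NetLinx, and the function name `dechunk` is made up for illustration. It only applies if the response headers actually say `Transfer-Encoding: chunked`.

```python
def dechunk(body: bytes) -> bytes:
    """Remove HTTP/1.1 chunked framing from a raw response body.

    `body` is everything after the blank line that ends the headers.
    Hypothetical helper for illustration only.
    """
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError("truncated chunk header")
        # The size line is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:eol].split(b";")[0].strip(), 16)
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that trails the chunk data
    return bytes(out)


raw = b'a\r\ncontent_tx\r\n3\r\nt">\r\n0\r\n\r\n'
print(dechunk(raw))
```

If this is the cause, sending the request as HTTP/1.0 is another way to avoid the framing, since chunked encoding is an HTTP/1.1 feature.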

The parsing results after much aggravation!

<!--####### BEGIN LINEUP ########-->
<ch>2</ch><BASIC>WCBS New York (CBS)
<ch>3</ch><BASIC>WFSB Hartford (CBS)
<ch>4</ch><BASIC>WNBC New York (NBC)
<ch>5</ch><BASIC>WNYW New York (FOX)
<ch>6</ch><BASIC>WVIT New Britain (NBC)
<ch>7</ch><BASIC>WABC New York (ABC)
<ch>8</ch><BASIC>WTNH New Haven (ABC)
<ch>9</ch><BASIC>WWOR Secaucus (UPN)
<ch>10</ch><BASIC>WEDW Bridgeport (PBS)
<ch>11</ch><BASIC>WPIX New York (WB)
<ch>12</ch><BASIC>News 12 Connecticut
<ch>13</ch><BASIC>WNET New York (PBS)
<ch>14</ch><BASIC>WSAH Bridgeport (IND)
<ch>15</ch><BASIC>WLNY Riverhead (IND)
<ch>16</ch><BASIC>WNJU Linden (Telemundo)
<ch>17</ch><BASIC>WFUT Newark (Telefutura)
<ch>18</ch><BASIC>WXTV Paterson (Univision)
<ch>19</ch><BASIC>WRNN Kingston (IND)
<ch>20</ch><BASIC>WTXX Waterbury (WB)
<ch>21</ch><BASIC>WLIW Plainview (PBS)
<ch>22</ch><BASIC>WNYE New York (PBS)
<ch>23</ch><BASIC>WPXN New York (i)
<ch>24</ch><BASIC>Cablevision Channel Guide
<ch>25</ch><BASIC>WTIC Hartford (FOX)
<ch>26</ch><BASIC>Travel Channel
<ch>27</ch><BASIC>Discovery Channel
<ch>28</ch><BASIC>The Learning Channel
<ch>29</ch><BASIC>Food Network
<ch>30</ch><BASIC>Home & Garden TV
<ch>31</ch><BASIC>Disney Channel
<ch>32</ch><BASIC>Cartoon Network
<ch>33</ch><BASIC>Nickelodeon
<ch>34</ch><BASIC>TV Land
<ch>35</ch><BASIC>ESPN2
<ch>36</ch><BASIC>ESPN
<ch>37</ch><BASIC>TNT
<ch>38</ch><BASIC>USA Network
<ch>39</ch><BASIC>TBS
<ch>40</ch><BASIC>FX
<ch>41</ch><BASIC>Spike TV
<ch>42</ch><BASIC>WE: Women's Entertainment
<ch>43</ch><BASIC>AMC
<ch>44</ch><BASIC>Bravo
<ch>45</ch><BASIC>Lifetime
<ch>46</ch><BASIC>A&E
<ch>47</ch><BASIC>The History Channel
<ch>48</ch><BASIC>Sci-Fi Channel
<ch>49</ch><BASIC>ABC Family
<ch>50</ch><BASIC>Comedy Central
<ch>51</ch><BASIC>E! Entertainment TV
<ch>52</ch><BASIC>VH1
<ch>53</ch><BASIC>MTV
<ch>54</ch><BASIC>BET
<ch>55</ch><BASIC>MTV2
<ch>56</ch><BASIC>Fuse
<ch>57</ch><BASIC>Animal Planet
<ch>58</ch><BASIC>Court TV
<ch>59</ch><BASIC>SoapNet
<ch>60</ch><BASIC>SportsNet New York
<ch>61</ch><BASIC>News 12 Traffic & Weather
<ch>62</ch><BASIC>The Weather Channel
<ch>63</ch><BASIC>CNN Headline News
<ch>65</ch><BASIC>C-SPAN
<ch>66</ch><BASIC>C-SPAN2
<ch>67</ch><BASIC>FOX News Channel
<ch>68</ch><BASIC>MSNBC
<ch>69</ch><BASIC>CNBC
<ch>70</ch><BASIC>CNN
<ch>71</ch><BASIC>MSG Network
<ch>72</ch><BASIC>FSN NY
<ch>73</ch><BASIC>YES Network
<ch>74</ch><BASIC>Speed Channel
<ch>75</ch><BASIC>Game Show Network
<ch>76</ch><BASIC>Turner Classic Movies
<ch>77</ch><BASIC>Public Access
<ch>78</ch><BASIC>Educational Access
<ch>79</ch><BASIC>Government Access
<ch>80</ch><BASIC>QVC
<ch>81</ch><BASIC>Home Shopping Network
<ch>82</ch><BASIC>ShopNBC
<ch>83</ch><BASIC>CT-N
<ch>84</ch><BASIC>Religious Programming/Leased Access/Local Programming
<ch>88</ch><BASIC>Starz
<ch>89</ch><BASIC>Encore
<ch>90</ch><BASIC>HBO
<ch>91</ch><BASIC>Cinemax
<ch>92</ch><BASIC>Showtime
<ch>93</ch><BASIC>The Movie Channel
<ch>94</ch><BASIC>Independent Film Channel
<ch>95</ch><BASIC>PPV The Hot Network (Adult)
<ch>96</ch><BASIC>HBO2
<ch>98</ch><BASIC>Showtime Too
<ch>99</ch><BASIC>Flix
<ch>100</ch><BASIC>MOVIEplex
<ch>104</ch><BASIC>BBC World News
<ch>106</ch><BASIC>The Golf Channel
<ch>107</ch><BASIC>Pay Per View
<ch>108</ch><BASIC>Pay Per View
<ch>109</ch><BASIC>Playboy TV (Adult)
<ch>110</ch><IO>iO Digital Channel Guide
<ch>111</ch><IO>Bloomberg TV
<ch>112</ch><IO>C-SPAN 3
<ch>113</ch><IO>EuroNews
<ch>114</ch><IO>WABC Plus
<ch>115</ch><IO>New England Cable News
<ch>117</ch><IO>Eyewitness News Now
<ch>118</ch><IO>WNBC Weather Plus 
<ch>119</ch><IO>WNBC 4.4
<ch>120</ch><IO>Discovery Kids
<ch>121</ch><IO>Toon Disney
<ch>122</ch><IO>Nicktoons TV
<ch>123</ch><IO>Noggin 
<ch>124</ch><IO>Nickelodeon GAS
<ch>131</ch><IO>Kids Thirteen
<ch>132</ch><IO>Thirteen World
<ch>133</ch><IO>WLIW Create
<ch>140</ch><IO>ESPN Classic
<ch>141</ch><IO>ESPNEWS
<ch>142</ch><IO>Fox Soccer Channel
<ch>148</ch><IO>NBA TV
<ch>160</ch><IO>The Biography Channel
<ch>161</ch><IO>History International
<ch>162</ch><IO>National Geographic Channel
<ch>170</ch><IO>The Science Channel
<ch>171</ch><IO>Discovery Times Channel
<ch>172</ch><IO>Discovery Home Channel
<ch>173</ch><IO>Military Channel
<ch>175</ch><IO>G4 videogame tv
<ch>179</ch><IO>LOGO
<ch>180</ch><IO>Oxygen
<ch>181</ch><IO>ShopNBC
<ch>182</ch><IO>Jewelry Television
<ch>184</ch><IO>The Tube
<ch>185</ch><IO>BET Jazz
<ch>186</ch><IO>VH1 Classic
<ch>187</ch><IO>CMT
<ch>188</ch><IO>MTV Hits
<ch>189</ch><IO>VH1 Soul
<ch>190</ch><IO>Fox Movie Channel
<ch>191</ch><IO>Hallmark Channel
<ch>192</ch><IO>Sundance Channel
<ch>195</ch><IO>MTV Espanol
<ch>196</ch><IO>Fox Sports en Espanol
<ch>197</ch><IO>mun2
<ch>198</ch><IO>Telemundo Puerto Rico
<ch>300</ch><IO>HBO on Demand
<ch>301</ch><IO>HBO Signature
<ch>302</ch><IO>HBO Family
<ch>303</ch><IO>HBO Comedy
<ch>304</ch><IO>HBO Zone
<ch>305</ch><IO>HBO Latino
<ch>306</ch><IO>HBO West
<ch>307</ch><IO>HBO2 West
<ch>308</ch><IO>HBO Signature West
<ch>309</ch><IO>HBO Family West
<ch>320</ch><IO>Showtime on Demand
<ch>321</ch><IO>Showtime Showcase
<ch>322</ch><IO>Showtime Extreme
<ch>323</ch><IO>Showtime Beyond
<ch>324</ch><IO>Showtime Next
<ch>325</ch><IO>Showtime Family Zone
<ch>326</ch><IO>Showtime Women
<ch>327</ch><IO>Showtime West
<ch>328</ch><IO>Showtime Too West
<ch>329</ch><IO>Showtime Showcase West
<ch>341</ch><IO>Starz Cinema
<ch>342</ch><IO>Starz Kids & Family
<ch>343</ch><IO>Starz Edge
<ch>344</ch><IO>Starz InBlack
<ch>345</ch><IO>Starz West
<ch>351</ch><IO>Encore Action
<ch>352</ch><IO>Encore Mystery
<ch>353</ch><IO>Encore Westerns
<ch>354</ch><IO>Encore Love
<ch>355</ch><IO>Encore Drama
<ch>356</ch><IO>Encore Wam
<ch>357</ch><IO>Encore West
<ch>370</ch><IO>Cinemax on Demand
<ch>371</ch><IO>ActionMAX
<ch>372</ch><IO>MoreMAX
<ch>373</ch><IO>ThrillerMAX
<ch>374</ch><IO>WMAX
<ch>375</ch><IO>@MAX
<ch>376</ch><IO>5 StarMAX
<ch>377</ch><IO>Outer MAX
<ch>378</ch><IO>Cinemax West
<ch>380</ch><IO>TMC Xtra
<ch>381</ch><IO>TMC West
<ch>382</ch><IO>TMC Xtra West
<ch>401</ch><IO>Sports Packages
<ch>408</ch><IO>OLN
<ch>430</ch><IO>NBA TV
<ch>431</ch><IO>NBA League Pass &reg; Preview
<ch>500</ch><IO>On Demand
<ch>502</ch><IO>Free On Demand
<ch>503</ch><IO>Disney Channel on Demand
<ch>506</ch><IO>here! On Demand
<ch>507</ch><IO>Anime Network on Demand
<ch>508</ch><IO>IFC in Theaters On Demand
<ch>513</ch><IO>Howard Stern On Demand
<ch>515</ch><IO>Adult On Demand
<ch>516</ch><IO>Playboy TV On Demand
<ch>517</ch><IO>Too Much for TV On Demand
<ch>604</ch><IO>MSG Sports Desk
<ch>605</ch><IO>Optimum Autos
<ch>606</ch><IO>Optimum Homes
<ch>610</ch><IO>Games
<ch>612</ch><IO>News 12 Interactive
<ch>620</ch><IO>Move n Match Puzzles
<ch>627</ch><IO>fuse Interactive
<ch>631</ch><IO>Hollywood.com TV
<ch>632</ch><IO>Broadway.com TV
<ch>652</ch><IO>FX Preview Channel
<ch>700</ch><HD>Hi-Def On Demand
<ch>701</ch><HD>IN HD
<ch>702</ch><HD>CBS HD
<ch>704</ch><HD>NBC HD
<ch>705</ch><HD>FOX HD
<ch>707</ch><HD>ABC HD
<ch>709</ch><HD>My9 HD
<ch>711</ch><HD>CW HD
<ch>713</ch><HD>Thirteen HD
<ch>715</ch><HD>YES HD
<ch>720</ch><HD>MSG Network HD
<ch>721</ch><IO>WLIW Digital
<ch>725</ch><HD>FSN NY HD
<ch>730</ch><HD>INHD2/SportsNet New York HD
<ch>736</ch><HD>ESPN HD
<ch>736</ch><HD>ESPN HD
<ch>737</ch><HD>TNT in HD
<ch>740</ch><HD>Starz HD
<ch>744</ch><HD>Universal HD
<ch>750</ch><HD>HBO HD
<ch>760</ch><HD>Showtime HD
<ch>770</ch><HD>Cinemax HD
<ch>780</ch><HD>The Movie Channel HD
<ch>801</ch><IO>Music Choice Channels
<ch>900</ch><IO>iO Upgrades
<ch>901</ch><IO>Order Optimum Online
<IO><td height="10" colspan="2">

DHawthorne · September 2006

I'm curious how you are getting the buffer data from the NetLinx. Are you using a STRING event handler, or parsing a buffer made by CREATE_BUFFER directly?

If you are using a STRING handler, don't. They are fine for single lines of data from a device if they aren't too long, but not reliable for large bursts of data. The problem is that they are timing sensitive, and may cut off a string too early, or not at all.
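The idea can be sketched outside NetLinx: append every incoming fragment to one buffer and only hand complete lines to the parser, so it does not matter where the network happened to split the data. This is a minimal Python illustration, not NetLinx code, and the class name `LineBuffer` is hypothetical.

```python
from typing import List


class LineBuffer:
    """Accumulates arbitrary fragments and releases only complete CRLF lines."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, fragment: bytes) -> List[bytes]:
        # Join first, then split, so a CRLF cut across two fragments still counts.
        self._buf += fragment
        *lines, self._buf = self._buf.split(b"\r\n")
        return lines


lb = LineBuffer()
print(lb.feed(b"<td>95</"))       # nothing complete yet
print(lb.feed(b"td>\r\n<b>"))     # first line is released
print(lb.feed(b"x</b>\r\n"))      # second line is released
```

The incomplete tail always stays in the buffer until the rest of the line arrives, which is the property a per-event string handler does not give you.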

It may also be a transmission problem. In the process of viewing a web site's source, you have passed it through your browser and then the notepad app, either or both of which could have cleaned up spurious errors.

vining · September 2006

It's a created buffer. I seldom if ever use data.text unless modifying a module that already uses it.

DHawthorne · September 2006

I haven't done much with processing HTML, but in the little that I have, I have never seen that. It's definitely dropping characters. I really am inclined to think it's more of a connection/transmission problem, and the other apps are just cleaning it up, or have a better connection process and are getting it cleaner in the first place. It looks a lot like the kind of effect you might get even in a browser to a remote site that loses a packet here and there ... in the right place, the loss can goof up the entire page, but in some spots it's completely unnoticeable.

vining · September 2006

I ran the W3C and WDG HTML markup validators on the URL. The W3C validator listed 834 errors or warnings, and the WDG didn't give a total, but it was an equally long list. Viewing source through IE and Firefox looks good though. I remember back in the day when you got something to work properly in IE, it often wouldn't work properly in Mozilla and vice versa.

HTML has always been a pain, and it would be easier, quicker and cheaper to update the master "overall" channel list manually and then disseminate it via FTP rather than automatically converting the list posted by Cablevision on their website. Which wouldn't exactly be automatic anyway. It would need to be proofed for errors, and any time they decided to update their website I would most likely have to go through this misery again to come up with a parser.

Spire_Jeff · September 2006

vining wrote:

HTML has always been a pain

I agree, but if you talk to PianoDisc (OPUS 7 Piano Control manufacturer), they seem to think that HTML and, more importantly, FLASH are the control system interface of the future. The control of their product was designed to support the best control systems like M$ Media Center.

They said that the HTML and Flash interfaces provide much better control than the ancient and outdated RS232 interface option. Just had to vent a little.

But, also consider this a warning that we may in fact start seeing products emerge with only HTML interface options available for control.

Jeff

Joe Hebert · September 2006

XML

vining wrote:

Has anyone else noticed that when you retrieve an HTML source from a URL, the NetLinx processor does a poor job interpreting it ... There are numerous examples of interpretation errors that NetLinx is adding into the URL's source, making it a real pain to parse. One thing I'm not showing here is the "TAB" character "9", which was used profusely in the HTML.

NetLinx doesn't interpret the data it receives via a GET, it merely captures the raw stream straight from the server. NetLinx doesn't see it as HTML source, it's just data. The browser is the one doing the interpreting. I've done my fair share of trolling the web and I feel NetLinx does an excellent job.

That said, I shy away from scraping pages now (for reasons you already stated) and try to find XML/RSS feeds instead so that I can count on getting data that's consistent in format. Maybe you can find one for your channel listings.

I did a quick Google and found XMLTV:

http://sourceforge.net/projects/xmltv/

Has anyone ever experimented with this? Looks interesting.

DHawthorne · September 2006

HTML is really just a formatting tool for marking up text. What you pull into your NetLinx is just text, with text codes embedded in it to show how it ought to be displayed. Your NetLinx isn't going to know or care, on that level, what that text says, or whether it is HTML, XML, or console output from a terminal. It's all text to your NetLinx.

There is nothing inherently wrong with HTML as a format. The problem with it has always been the implementation. There are a dozen "standards," and screwball ways of getting each one to do things it was never intended to do. People mix up tags and syntax, and think because their browser doesn't barf on the spot that it's OK, when, in fact, it's very much not OK, and the browser just chose to ignore all the silliness. You have automated HTML generators (like anything by Microsoft that outputs HTML) that bloat the code beyond any reasonable measure; you have any one of a dozen or so scripting languages that can be mixed and matched within the code; you have embeds and plugins. You have extraordinary workarounds and kludges to make something that works fine in one browser work at all in another ... I could go on and on and on. On my personal web site, I have a png image with a transparent background - it renders fine in Firefox, but the only way I could get the transparency to work in IE was to run a script within my IMG tag. What kind of stupidness is that?

So, grabbing HTML from a web site is a crap shoot. It's next to impossible to be certain that what you get is what you expect, and the structure can be so loose as to be unusable. However, if the originator uses a strict set of guidelines and actually adheres to them, it should not be a problem to parse and interpret.

But this is also why XML was developed. Its entire purpose is to provide a regular and predictable format for including data within the HTML superset. Any mission-critical data in HTML, IMO, really needs to be XML.

vining · September 2006

Joe Hebert wrote:

NetLinx doesn't interpret the data it receives via a GET, it merely captures the raw stream straight from the server.

Yes, in hindsight I chose my words poorly. IE and Firefox are interpreters, and NetLinx simply sees it as text data.

After doing a little testing with the validators it became obvious that the original URL source is most likely to blame and not NetLinx. It's just that first impressions would lead one to think otherwise.

In regard to format and rendering, if my memory serves me, Firefox is true to the W3C standards while IE, well, let's just say they want to be the standard and W3C should conform to them.

As Dave has mentioned, most browsers are very forgiving of scripting errors. Sure, some applets may not run, but generally you'll get most of it displayed correctly.

I downloaded XMLTV, but it appears from reading the READMEs that the listing has to be in XML to begin with, like their example of the Replay/TV site http://www.myreplaytv.com/.

I agree that HTML parsing will never be consistent, but that's because of constant changes made for visual appearance's sake, or content and format changes, not because the HTML format is flawed. Most HTML may be written poorly, but as long as the URL doesn't get a facelift and I compensate for the poor code, it will work fine and work consistently.

In this instance it's for in-house use, to render the current list by parsing it into a comma-delimited format and loading it into a structure. It can be viewed in Excel, converted to XML, whatever. Then after proofing it would be placed on an FTP server, probably an in-house NetLinx master, and then sent via FTP to clients or just left in the master for clients to automatically retrieve. Because HTML is subject to change at any time for whatever reason, it has to be proofed before being placed in a file for automatic dissemination.

I'd like to say I'm doing this because I'm lazy, but the reality is it would have been easier to generate the list and make updates the old-fashioned way, by hand. But sometimes doing what's easy is just boring, and every challenge is a learning experience.
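As a rough illustration of that step, here is a Python sketch (not the NetLinx code used for the lineup earlier in the thread) that reads lines in the intermediate `<ch>N</ch><TIER>Name` form, loads them into a structure, and writes comma-delimited text. The names `Channel`, `parse_lineup` and `to_csv` are made up for the example, and the column headings are an assumption.

```python
import csv
import io
import re
from typing import List, NamedTuple


class Channel(NamedTuple):
    number: int
    tier: str   # BASIC, IO or HD in the lineup shown earlier
    name: str


# One intermediate line looks like: <ch>95</ch><BASIC>Some Channel Name
LINE_RE = re.compile(r"<ch>(\d+)</ch><(\w+)>(.*)")


def parse_lineup(text: str) -> List[Channel]:
    channels = []
    for line in text.splitlines():
        m = LINE_RE.fullmatch(line.strip())
        if m:  # comments and leftover markup without <ch> are skipped
            channels.append(Channel(int(m.group(1)), m.group(2), m.group(3).strip()))
    return channels


def to_csv(channels: List[Channel]) -> str:
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["channel", "tier", "name"])
    writer.writerows(channels)  # csv quotes any name that contains a comma
    return out.getvalue()


sample = (
    "<!--####### BEGIN LINEUP ########-->\n"
    "<ch>2</ch><BASIC>WCBS New York (CBS)\n"
    "<ch>118</ch><IO>WNBC Weather Plus \n"
    '<IO><td height="10" colspan="2">\n'
)
print(to_csv(parse_lineup(sample)))
```

Lines that do not match the pattern are dropped silently here; for a list that has to be proofed by hand anyway, logging them instead might be the better choice.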

DHawthorne · September 2006

One of the beauties of XML is that you can leave your data structure untouched, and fiddle with the HTML that displays it as much as you like. Any facelifts, changes, or graphical do-dads you stick in there won't affect the XML at all. You do, however, have to add another layer to extract the data for display purposes, which is why it isn't used any more in web sites than it is. However, I would venture to guess that a great many script-based sites with dynamic data have some manner of XML underneath them entirely in the background. If you could get at that structure, it would be a great deal easier to deal with.
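To make the contrast with scraping concrete, here is a small Python sketch of reading channel data from XML. The element and attribute names (`lineup`, `channel`, `number`, `tier`) are invented for the example; no such feed is known to exist for this lineup.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure, for illustration only.
SAMPLE = """<lineup>
  <channel number="2" tier="BASIC">WCBS New York (CBS)</channel>
  <channel number="700" tier="HD">Hi-Def On Demand</channel>
</lineup>"""


def load_lineup(xml_text: str) -> dict:
    """Map channel number to a (tier, name) tuple."""
    root = ET.fromstring(xml_text)
    return {
        int(ch.get("number")): (ch.get("tier"), (ch.text or "").strip())
        for ch in root.iter("channel")
    }


lineup = load_lineup(SAMPLE)
print(lineup[700])
```

Nothing in this code depends on how the page is laid out, so a visual facelift of the site would not break it as long as the data structure stays the same.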
