Extract data from web page.

I've got this type of info on a web page:

----------------------------------------------------------------------------
--------------------------------------------
<tr height="25">
<td nowrap class="odd" align="center"><img
src="/forums/images/icon_topic_new.gif" width=14 height=14 alt='New Topic'
border=0></td>

<td nowrap class="odd" align="center">&nbsp;</td>

<td nowrap class="odd" align="center">&nbsp;</td>
<td width="85%" class="even" align="left"><font class="new-row"><a
href="topic.asp?tid=106110">
Quality ebay auction</a>&nbsp;</font>
<font class="sub-row">in General&nbsp;/&nbsp;The Lounge</font><font
class="sub-row"><br>Started 7/15/2005 - pages <a
href="topic.asp?tid=106110">1</a> - last posted by <a
href="profile.asp?action=view&id=Shandy" onmouseover="window.status='Show
the authors profile'; return true;" onmouseout="window.status=''; return
true;">Shandy</a></font></td>
<td width="15%" class="even" valign="middle" align="left"><font
class="new-row"><a href="profile.asp?action=view&id=DiscoInferno"
onmouseover="window.status='Show the authors profile'; return true;"
onmouseout="window.status=''; return true;">DiscoInf<BR>erno</a></font></td>
<td nowrap class="odd" valign="middle" align="center"><font
class="new-row">9</font></td>
<td nowrap class="odd" valign="middle" align="left">
<font class="new-row">7/15/2005<br>
<font class="sub-row">5:02:16 PM</font></font></td>
</tr>
----------------------------------------------------------------------------
--------------------------------------------

It's a table which shows the latest posts of a forum. I'd like to pull out
the following information:
Topic: Quality ebay auction
Original poster: DiscoInferno
Started: 7/15/2005
Last Post By: Shandy
Last Post Date: 7/15/2005 5:02:16 PM

This *type* of information is repeated down the web page although the data
will change.
.....

and I want to do this with the whole page/table. Should I use RegEx to get
the data or simply do a string search when I download the page's source into
my application?
--
|
+-- Thief_
|
Nov 21 '05 #1
Hi,

Here is a start. It uses a regex to extract links.
' At the top of the file:
Imports System.Diagnostics
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

' Inside a method:
Dim wc As New WebClient
Dim sr As New StreamReader(wc.OpenRead("http://news.google.com/"))

' Any double-quoted attribute value, e.g. the URL in href="...".
Dim regLink As New Regex("""(?<url>[^""]*)""")
' The text between ">" and "<", i.e. the link title.
Dim regTitle As New Regex(">(.*?)<")
' A whole anchor written exactly as <a href="...">...</a> on one line.
Dim regHref As New Regex("<a href=""(.*?)"">(.*?)</a>")

Try
    Dim strHtml As String = sr.ReadToEnd()

    For Each m As Match In regHref.Matches(strHtml)
        ' Print the URL without its surrounding quotes.
        For Each mLink As Match In regLink.Matches(m.Value)
            Trace.WriteLine(String.Format("Link {0}", mLink.Groups("url").Value))
        Next

        ' Print the title without the enclosing ">" and "<".
        For Each mTitle As Match In regTitle.Matches(m.Value)
            Trace.WriteLine(String.Format("Title {0}", mTitle.Groups(1).Value))
        Next
    Next
Catch ex As Exception
    ' Report the failure instead of silently swallowing it.
    Trace.WriteLine("Extraction failed: " & ex.Message)
Finally
    ' Release the stream and the client even if extraction fails.
    sr.Close()
    wc.Dispose()
End Try

Good resource for Regular Expression Examples.

http://www.regexlib.com/DisplayPatte...4&categoryId=8
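
Applied to the forum row from the question, the same regex approach might look roughly like the sketch below. It is only a sketch: the pattern, the module name ForumRowSketch, the function ExtractRows and the abbreviated SampleRow are all made up for illustration and fitted to the single row that was posted, so a real page will probably need the pattern adjusted.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module ForumRowSketch

    ' Hypothetical pattern fitted to the one sample row in the question;
    ' real pages may need it loosened or tightened.
    Private Const RowPattern As String = _
        "topic\.asp\?tid=\d+""\s*>\s*(?<topic>[^<]+?)\s*</a>" & _
        ".*?Started\s+(?<started>\d{1,2}/\d{1,2}/\d{4})" & _
        ".*?last posted by\s*<a[^>]*>(?<lastBy>[^<]+)</a>" & _
        ".*?profile\.asp\?action=view&id=(?<poster>[^""]+)""" & _
        ".*?(?<lastDate>\d{1,2}/\d{1,2}/\d{4})\s*<br>" & _
        "\s*<font[^>]*>(?<lastTime>[^<]+)</font>"

    ' Abbreviated copy of the posted row (icon cells and mouseover
    ' attributes left out), used only to try the pattern.
    Public ReadOnly SampleRow As String = _
        "<tr height=""25"">" & vbLf & _
        "<td width=""85%"" class=""even""><font class=""new-row""><a" & vbLf & _
        "href=""topic.asp?tid=106110"">" & vbLf & _
        "Quality ebay auction</a>&nbsp;</font>" & vbLf & _
        "<font class=""sub-row""><br>Started 7/15/2005 - pages <a" & vbLf & _
        "href=""topic.asp?tid=106110"">1</a> - last posted by <a" & vbLf & _
        "href=""profile.asp?action=view&id=Shandy"">Shandy</a></font></td>" & vbLf & _
        "<td width=""15%"" class=""even""><font class=""new-row""><a" & vbLf & _
        "href=""profile.asp?action=view&id=DiscoInferno"">DiscoInf<BR>erno</a></font></td>" & vbLf & _
        "<td nowrap class=""odd""><font class=""new-row"">9</font></td>" & vbLf & _
        "<td nowrap class=""odd"">" & vbLf & _
        "<font class=""new-row"">7/15/2005<br>" & vbLf & _
        "<font class=""sub-row"">5:02:16 PM</font></font></td>" & vbLf & _
        "</tr>"

    ' Returns one line per matched row, in the order:
    ' topic | original poster | started | last post by | last post date and time.
    Public Function ExtractRows(ByVal html As String) As List(Of String)
        ' Singleline lets "." cross the line breaks inside a row.
        Dim rx As New Regex(RowPattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        Dim rows As New List(Of String)
        For Each m As Match In rx.Matches(html)
            rows.Add(String.Format("{0} | {1} | {2} | {3} | {4} {5}", _
                m.Groups("topic").Value.Trim(), _
                m.Groups("poster").Value, _
                m.Groups("started").Value, _
                m.Groups("lastBy").Value.Trim(), _
                m.Groups("lastDate").Value, _
                m.Groups("lastTime").Value.Trim()))
        Next
        Return rows
    End Function

    Sub Main()
        For Each row As String In ExtractRows(SampleRow)
            Console.WriteLine(row)
        Next
    End Sub

End Module
```

The original poster is read from the id= parameter of the second profile link rather than from the link text, because the page splits the visible name with a <BR>. If one row deviates from the expected layout, a lazy ".*?" can run on into the next row, so it is worth checking the results against the page.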

Ken

Nov 21 '05 #2
Parsing an HTML file:

MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>

- or -

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

- or -

SgmlReader 1.4
<URL:http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>

If the file read is in XHTML format, you can use the classes contained in
the 'System.Xml' namespace for reading information from the file.
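
For the XHTML case, a minimal sketch with 'System.Xml' could look like the following. The module name XhtmlTableSketch, the function ReadTopics and the tiny SampleXhtml table are assumptions made for illustration, not the real forum page. A real XHTML document usually declares the XHTML namespace (the XPath then needs an XmlNamespaceManager) and may use entities such as &nbsp; that fail to load unless the DTD is available.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module XhtmlTableSketch

    ' Hypothetical, well-formed stand-in for the forum table: no namespace
    ' declaration, no &nbsp; entities, ampersands written as &amp;.
    Public Const SampleXhtml As String = _
        "<table>" & _
        "<tr>" & _
        "<td><a href=""topic.asp?tid=106110"">Quality ebay auction</a></td>" & _
        "<td><a href=""profile.asp?action=view&amp;id=DiscoInferno"">DiscoInferno</a></td>" & _
        "</tr>" & _
        "</table>"

    ' Returns the text of every link in a table cell whose href starts
    ' with "topic.asp".
    Public Function ReadTopics(ByVal xhtml As String) As List(Of String)
        Dim doc As New XmlDocument()
        ' LoadXml throws an XmlException if the markup is not well-formed.
        doc.LoadXml(xhtml)

        Dim topics As New List(Of String)
        For Each link As XmlNode In doc.SelectNodes("//td/a[starts-with(@href, 'topic.asp')]")
            topics.Add(link.InnerText.Trim())
        Next
        Return topics
    End Function

    Sub Main()
        For Each topic As String In ReadTopics(SampleXhtml)
            Console.WriteLine(topic)
        Next
    End Sub

End Module
```

The row posted in the question is not well-formed XML (unquoted attributes, unclosed <br>), so LoadXml would reject it as it stands; for a page like that, one of the parsers listed above would be needed first.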

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Nov 21 '05 #3
