C# and reading websites; parsing HTML

Is it possible to write a function like the following:

string ReadURL(string URL)
{
....
}

The idea is that it fetches the URL given by the parameter and
returns the HTML code as a string, for example:

string websiteContents;

websiteContents = ReadURL("http://www.microsoft.com");

processHTMLCode(websiteContents);
....

Are there also functions that can parse the HTML code in a given string?

Hans Kamp.
Nov 15 '05 #1
"Justin Rogers" <Ju****@games4dotnet.com> wrote in message news:<#z**************@tk2msftngp13.phx.gbl>...
First, it is very simple to get the contents of an URL but you have to do
some extra work...

// Requires: using System.IO; using System.Net;
string contents = null;
string url = "http://www.microsoft.com";
HttpWebRequest wreq = (HttpWebRequest) WebRequest.Create(url);
// Each using block disposes its object on exit, so no explicit Close() calls are needed.
using (HttpWebResponse wresp = (HttpWebResponse) wreq.GetResponse()) {
    using (StreamReader sr = new StreamReader(wresp.GetResponseStream())) {
        contents = sr.ReadToEnd(); // read the whole response body as text
    }
}

You could easily place that inside of a function call to make it a bit
easier. You also need to be aware of encodings, but for the most part the
StreamReader should handle that for you.
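As a rough sketch of the "place that inside of a function" suggestion, the snippet might be wrapped as shown here. The names PageFetcher, ResolveEncoding and ReadUrl are made up for illustration, and the fallback to UTF-8 when the server names no usable character set is an assumption, not something the post prescribes.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Hypothetical helper: map the charset named in the response headers to an
    // Encoding, falling back to UTF-8 (an assumption) when it is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrEmpty(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers send the name in quotes, so strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Downloads the page at the given URL and returns its body as a string.
    public static string ReadUrl(string url)
    {
        HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
        using (Stream body = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(body, ResolveEncoding(response.CharacterSet)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call would then look like the one in the original question: string websiteContents = PageFetcher.ReadUrl("http://www.microsoft.com");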
Aha, thanks. The use of "using" (other than for importing namespaces at the
beginning of a C# source file) is a bit new to me. If I understand it
correctly, it declares and initializes the part between ( and )
immediately after "using", and tries to execute the statements between {
and }. Exceptions are suppressed, but if one occurs, the initialized
variable is disposed. Is that correct?
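On the question above: a using statement does not suppress exceptions. It behaves roughly like a try/finally block, so Dispose() runs on the way out and any exception still travels on to the caller. The small sketch below illustrates this; Probe and UsingDemo are hypothetical names invented for the example.

```csharp
using System;

// Hypothetical disposable type that records whether Dispose() was called.
public sealed class Probe : IDisposable
{
    public bool Disposed { get; private set; }

    public void Dispose()
    {
        Disposed = true;
    }
}

public static class UsingDemo
{
    // Returns true when the exception escaped the using block
    // AND the probe had been disposed by then.
    public static bool ExceptionEscapesButProbeIsDisposed()
    {
        Probe probe = new Probe();
        bool escaped = false;
        try
        {
            using (probe)
            {
                throw new InvalidOperationException("simulated failure");
            }
        }
        catch (InvalidOperationException)
        {
            escaped = true; // the using block did not swallow the exception
        }
        return escaped && probe.Disposed;
    }

    // Roughly what a using statement amounts to when written out by hand.
    public static void ExpandedByHand()
    {
        Probe probe = new Probe();
        try
        {
            // statements from the body of the using block go here
        }
        finally
        {
            if (probe != null)
            {
                probe.Dispose();
            }
        }
    }
}
```

So if an exception should be handled rather than passed on, an explicit try/catch is still needed around (or inside) the using block.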
Now for parsing the HTML, you have
two options. You can either custom parse the HTML using regular expressions
or you can try to load it into an XML DOM.
Are there URLs that explain that?
If you know the site is XHTML-compliant,
then you won't have any problems loading it into an XML DOM. Many
sites that have converted to ASP.NET actually return XHTML-compliant code,
so good luck with whatever site you are trying to parse.
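To make the two options concrete, here is a minimal sketch that pulls the page title out of a string both ways. TitleExtractor and its method names are hypothetical, and the regular expression is only meant for simple cases, not as a general HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class TitleExtractor
{
    // Option 1: custom parsing with a regular expression.
    // Tolerates sloppy HTML, but only finds what the pattern describes.
    public static string TitleViaRegex(string html)
    {
        Match match = Regex.Match(
            html,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }

    // Option 2: load the markup into an XML DOM.
    // Only works for well-formed (XHTML-style) input; LoadXml throws
    // an XmlException otherwise.
    public static string TitleViaXmlDom(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);
        XmlNodeList titles = doc.GetElementsByTagName("title");
        return titles.Count > 0 ? titles[0].InnerText.Trim() : null;
    }
}
```

With well-formed input both methods return the same text. With something like an unclosed <br> tag the regular expression still finds the title, while the DOM approach fails with an XmlException.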


I want to parse a forum site. To be more specific: there is an attempt
there to count from 1 to 10,000,000, and from the rate at which messages
are posted to the forum I want to calculate the estimated finishing date.
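The estimate itself is simple arithmetic once two readings of the counter are available. The sketch below assumes a constant posting rate between and after the two readings. CountingEstimate, EstimateFinish and all parameter names are hypothetical, and how the counts are scraped from the forum is left out.

```csharp
using System;

public static class CountingEstimate
{
    // Linear extrapolation from two observations of the counter.
    // firstCount/firstTime and lastCount/lastTime are sample readings;
    // target is the number the thread is counting towards.
    public static DateTime EstimateFinish(
        long firstCount, DateTime firstTime,
        long lastCount, DateTime lastTime,
        long target)
    {
        if (lastTime <= firstTime)
        {
            throw new ArgumentException("The second reading must be later than the first.");
        }
        if (lastCount <= firstCount)
        {
            throw new ArgumentException("The count must have increased between the readings.");
        }

        double postsPerDay = (lastCount - firstCount) / (lastTime - firstTime).TotalDays;
        double daysLeft = (target - lastCount) / postsPerDay;
        return lastTime.AddDays(daysLeft);
    }
}
```

For instance, with a count of 1,000 on 1 January and 2,000 on 11 January the rate is 100 posts per day, so a target of 3,000 would be reached on 21 January.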

Hans Kamp.
Nov 15 '05 #2
BTW, I discovered that
http://www.3dmirc.com/apron/tutorial...5/tutorial.htm gives
useful steps on how to parse an XML file. I think this is also useful for
parsing an HTML file, since HTML can be considered an XML
application.

Hans Kamp.
Nov 15 '05 #3
Only if it is XHTML, anyway; some HTML is not XML-compliant and will likely cause
errors.
The mshtml DOM may be of use otherwise (reference Microsoft.mshtml.dll in
your references dialog).
Nov 15 '05 #4

"Daniel O'Connell" <on******@comcast.net> wrote in message
news:VS63b.277982$uu5.61641@sccrnsc04...
Only if it is XHTML, anyway; some HTML is not XML-compliant and will likely cause errors.
The mshtml DOM may be of use otherwise (reference Microsoft.mshtml.dll in
your references dialog).


How do you do that with C#Builder?

I have done some programming with TreeViews and XML Documents:

private void showXmlNodeAtTreeNode(XmlNodeList xnl, TreeNode tn)
{
    for (int i = 0; i < xnl.Count; i++) // visit every node in this part of the XML document
    {
        XmlNode xn = xnl[i]; // take the next node
        XmlNodeType nodeType = xn.NodeType; // determine its type
        if (nodeType == XmlNodeType.Element) // is it an element?
        {
            // Add its name to the tree view and keep the TreeNode that Add returns:
            // tn.Nodes[i] is the wrong node as soon as an earlier XML node was skipped.
            TreeNode child = tn.Nodes.Add("Element: " + xn.Name);
            showXmlNodeAtTreeNode(xn.ChildNodes, child); // add the XML child nodes to this node
        }
        else if (nodeType == XmlNodeType.Text) // is it text?
        {
            tn.Nodes.Add("Text: " + xn.InnerText); // yes? then add it to the node
        }
    }
}

private void parseButton_Click(object sender, System.EventArgs e)
{
    XmlDocument xd = new XmlDocument();

    xd.LoadXml(xmlBox.Text); // load the text from a MultiLine EditBox

    xmlView.Nodes.Clear(); // clear the TreeView

    xmlView.Nodes.Add("Start"); // add "Start" at the root of the tree

    // add the XML child nodes to the first TreeView node, and do that using recursion
    showXmlNodeAtTreeNode(xd.ChildNodes, xmlView.Nodes[0]);
}

It could work with HTML, but it is very strict: a small HTML syntax error can
crash the program, because no exceptions are caught.

I mention it so that other newbies get an idea of:
- how an XML document is parsed;
- how the XML nodes are read;
- how the treeview nodes are programmed.
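Regarding the remark above that a small syntax error crashes the program: one possible way to guard the LoadXml call is to catch the XmlException it throws for malformed input. This is only a sketch. SafeXmlLoader and TryLoad are hypothetical names, and what the caller does with the error text (a message box, a status bar) is left open.

```csharp
using System;
using System.Xml;

public static class SafeXmlLoader
{
    // Tries to parse the markup. On success, returns true and the document.
    // On malformed input, returns false and a message saying where parsing stopped.
    public static bool TryLoad(string markup, out XmlDocument document, out string error)
    {
        document = new XmlDocument();
        try
        {
            document.LoadXml(markup);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            document = null;
            error = "Line " + ex.LineNumber + ", position " + ex.LinePosition + ": " + ex.Message;
            return false;
        }
    }
}
```

In the button handler, the tree would then be filled only when TryLoad returns true, and the error text shown to the user otherwise.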

Hans Kamp.
Nov 15 '05 #5
"Daniel O'Connell" <on******@comcast.net> wrote in message news:<yO*******************@rwcrnsc52.ops.asp.att. net>...
Some HTML will not work in an XML parser, because elements aren't closed or
attributes aren't handled properly, which will fail in standard XML readers.
Other bits inline.


I have noticed (possibly wrongly) that newer versions of HTML (4.0, I
believe) can have the modifier "strict" at the beginning, and
then they have to conform to the XML syntax.
It could work with HTML, but it is very strict: a small HTML syntax error can
crash the program, because no exceptions are caught.

I mention it so that other newbies get an idea of:
- how an XML document is parsed;
- how the XML nodes are read;
- how the treeview nodes are programmed.


i don't precisely understand what you mean here


It partly has to do with my own behaviour in newsgroups with a
teaching/learning purpose like this one.

For me there are two ways of finding the answer to a specific
question. I can start a thread and wait for the answers that others give.
Or I can lurk in the older threads, look for the questions and
read the replies to them.

Maybe others have the same attitude. I mean, if others want to know
how to parse XML (although not perfectly at this moment) and how to
add nodes to a TreeView, they can lurk in this thread and learn how
these things are programmed.

Hans Kamp.
Nov 15 '05 #6

I am not too sure (I am not an HTML expert), but I know SOME HTML will parse
OK. XHTML surely. The problem is that you can't really rely on whatever site
you want necessarily supporting a version of HTML that works.

I prefer to read through the newsgroups myself. Surprisingly, I have only
posted about 3 questions to the groups in the last year, all of which I
ended up answering myself, either through reading back or by discovering my
own bug before anyone else did.

I just didn't quite understand the reasoning behind your post; I do now,
lol.

Nov 15 '05 #7
