C# and reading websites; parsing HTML

Is it possible to write a function like the following:

string ReadURL(string URL)
{
....
}

The idea is that it fetches the URL given by the parameter and returns the
page's HTML code as a string, for example:

string websiteContents;

websiteContents = ReadURL("http://www.microsoft.com");

processHTMLCode(websiteContents);
....

Are there also functions that can parse the HTML code in a given string?

Hans Kamp.
Nov 15 '05 #1
"Justin Rogers" <Ju****@games4dotnet.com> wrote in message news:<#z**************@tk2msftngp13.phx.gbl>...
First, it is very simple to get the contents of a URL, but you have to do
some extra work...

// Requires: using System.IO; using System.Net;
string contents = null;
string url = "http://www.microsoft.com";
HttpWebRequest wreq = (HttpWebRequest) WebRequest.Create(url);
using (HttpWebResponse wresp = (HttpWebResponse) wreq.GetResponse()) {
    using (StreamReader sr = new StreamReader(wresp.GetResponseStream())) {
        // Read the whole response body as text.
        contents = sr.ReadToEnd();
    }
    // Leaving each using block disposes (and so closes) the reader and the
    // response, so no explicit Close() calls are needed.
}

You could easily place that inside a function to make it a bit
easier. You also need to be aware of encodings, but for the most part the
StreamReader should handle that for you.
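
As a rough illustration of wrapping the quoted snippet in a function, here is a minimal sketch of the ReadURL function asked for in the first post. The class name PageFetcher and the helper ReadAll are made up for this example, and the UTF-8 fallback is an assumption, not something stated in the thread. Newer .NET versions generally steer towards HttpClient instead of HttpWebRequest; the sketch stays with the API used in the quote.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Reads a whole stream as text. The fallback encoding is used only when
    // the stream does not start with a byte-order mark.
    public static string ReadAll(Stream stream, Encoding fallback)
    {
        using (StreamReader reader = new StreamReader(stream, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical wrapper around the snippet quoted above.
    public static string ReadURL(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            // CharacterSet comes from the Content-Type header. It may be empty
            // or name an unknown encoding, so fall back to UTF-8 (assumption).
            Encoding encoding = Encoding.UTF8;
            try
            {
                if (!string.IsNullOrEmpty(response.CharacterSet))
                    encoding = Encoding.GetEncoding(response.CharacterSet);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 fallback.
            }
            return ReadAll(response.GetResponseStream(), encoding);
        }
    }
}
```

A call such as PageFetcher.ReadURL("http://www.microsoft.com") would then return the page source. Network failures surface as WebException and are deliberately not caught in this sketch.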

Aha, thanks. The use of "using" (other than for importing namespaces at the
beginning of a C# source file) is a bit new to me. If I understand it
correctly, it declares and initializes the part between ( and )
immediately after "using", and then tries to execute the statements between {
and }. Exceptions are suppressed, but if one occurs, the initialized
variable is disposed. Is that correct?
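
On the question above: as the C# language specification describes it, a using statement behaves like try/finally, so exceptions are not suppressed. They still propagate to the caller, but only after Dispose has run. A small sketch to check this, where Tracked and UsingDemo are hypothetical names introduced only for the demonstration:

```csharp
using System;

// A trivial disposable resource that remembers whether Dispose was called.
public sealed class Tracked : IDisposable
{
    public bool Disposed;
    public void Dispose() { Disposed = true; }
}

public static class UsingDemo
{
    // Returns true when the exception escaped the using block AND the
    // resource had been disposed by the time it was caught.
    public static bool Run()
    {
        Tracked resource = new Tracked();
        bool escaped = false;
        try
        {
            using (resource)
            {
                throw new InvalidOperationException("boom");
            }
        }
        catch (InvalidOperationException)
        {
            escaped = true; // the using statement did not swallow it
        }
        return escaped && resource.Disposed;
    }

    // Roughly what the using statement in Run() expands to.
    public static bool RunExpanded()
    {
        Tracked resource = new Tracked();
        bool escaped = false;
        try
        {
            try
            {
                throw new InvalidOperationException("boom");
            }
            finally
            {
                if (resource != null) resource.Dispose();
            }
        }
        catch (InvalidOperationException)
        {
            escaped = true;
        }
        return escaped && resource.Disposed;
    }
}
```

Both methods return true, which is why code that needs to survive a failed download still has to wrap the using blocks in its own try/catch.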

Now for parsing the HTML, you have
two options. You can either custom parse the HTML using regular expressions
or you can try to load it into an XML DOM.

Are there URLs that explain that?

If you know the site is XHTML
compliant then you won't have any problems loading it into an XML DOM. Many
sites that have converted to ASP.NET actually return XHTML-compliant code,
so good luck with whatever site you are trying to parse.
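
The two options can be sketched roughly as follows. ParseDemo and both method names are invented for illustration, the title-extraction pattern is just one possible regular expression, and the DOM variant assumes well-formed XHTML without a DOCTYPE (a DOCTYPE can make the parser try to resolve the DTD).

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class ParseDemo
{
    // Option 1: pull the <title> text out with a regular expression.
    // Tolerates sloppy HTML, but only finds what the pattern anticipates.
    public static string TitleByRegex(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    // Option 2: load well-formed XHTML into an XML DOM and query it.
    // local-name() sidesteps the XHTML namespace in the XPath expression.
    public static string TitleByDom(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed
        XmlNode title = doc.SelectSingleNode("//*[local-name()='title']");
        return title == null ? null : title.InnerText.Trim();
    }
}
```

The regular expression is the more forgiving choice for pages of unknown quality; the DOM gives real tree navigation but only for markup that is valid XML.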


I want to parse a forum site. To be more specific: there is a thread that
attempts to count from 1 to 10,000,000, and from the rate at which messages
are posted to the forum I want to calculate the estimated finishing date.
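
The estimate itself is a linear extrapolation: take two observations of the count, derive the posting rate, and divide the remaining count by that rate. A sketch, assuming two (time, count) pairs have already been extracted from the forum pages; CountEstimator and its parameter names are hypothetical.

```csharp
using System;

public static class CountEstimator
{
    // Extrapolates linearly from two observations to the moment the count
    // would reach 'target', assuming the posting rate stays constant.
    public static DateTime EstimateFinish(DateTime firstTime, long firstCount,
                                          DateTime lastTime, long lastCount,
                                          long target)
    {
        double elapsedSeconds = (lastTime - firstTime).TotalSeconds;
        long posted = lastCount - firstCount;
        if (elapsedSeconds <= 0 || posted <= 0)
            throw new ArgumentException(
                "Need two observations where both time and count increase.");

        // remaining / rate, written as remaining * elapsed / posted to avoid
        // an intermediate division.
        double secondsLeft = (target - lastCount) * elapsedSeconds / posted;
        return lastTime.AddSeconds(secondsLeft);
    }
}
```

With made-up numbers: if the count went from 1,000 to 2,000 in exactly one day, a target of 10,000 leaves 8,000 posts at 1,000 per day, so the estimate lands eight days after the second observation.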

Hans Kamp.
Nov 15 '05 #2
BTW, I discovered that
http://www.3dmirc.com/apron/tutorial...5/tutorial.htm gives
useful steps on how to parse an XML file. I think this is also useful for
parsing an HTML file, since HTML can be considered an XML
application.

Hans Kamp.
Nov 15 '05 #3
That holds if it is XHTML, anyway. Some HTML is not XML-compliant and will
likely cause errors.
The mshtml DOM may be of use otherwise (add a reference to Microsoft.mshtml.dll
in your references dialog).
Nov 15 '05 #4

"Daniel O'Connell" <on******@comcast.net> wrote in message
news:VS63b.277982$uu5.61641@sccrnsc04...
If it is xhtml, anyway, some HTML is not xml compliant and will likely cause errors.
the mshtml DOM may be of use otherwise. (reference Microsoft.mshtml.dll in
your references dialog)


How do you do that with C#Builder?

I have done some programming with TreeViews and XML documents:

// Requires: using System.Windows.Forms; using System.Xml;
private void showXmlNodeAtTreeNode(XmlNodeList xnl, TreeNode tn)
{
    // Walk every XML node in the list.
    foreach (XmlNode xn in xnl)
    {
        if (xn.NodeType == XmlNodeType.Element) // is it an element?
        {
            // Add its name to the tree view and keep the new tree node, so
            // the recursion attaches the XML child nodes to it. Indexing
            // tn.Nodes[i] goes wrong as soon as a skipped node (for example
            // the XML declaration or a comment) shifts the indices.
            TreeNode child = tn.Nodes.Add("Element: " + xn.Name);
            showXmlNodeAtTreeNode(xn.ChildNodes, child);
        }
        else if (xn.NodeType == XmlNodeType.Text) // is it text?
        {
            tn.Nodes.Add("Text: " + xn.InnerText); // then add it to the node
        }
    }
}

private void parseButton_Click(object sender, System.EventArgs e)
{
    XmlDocument xd = new XmlDocument();
    xd.LoadXml(xmlBox.Text); // load the text from a multiline edit box

    xmlView.Nodes.Clear(); // clear the TreeView
    xmlView.Nodes.Add("Start"); // add "Start" at the root of the tree

    // Recursively add the XML child nodes under the first TreeView node.
    showXmlNodeAtTreeNode(xd.ChildNodes, xmlView.Nodes[0]);
}

It could work with HTML, but the XML parser is very strict. A small HTML syntax
error can crash the program, because no exceptions are caught.
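
One way to keep a malformed page from crashing the program is to catch XmlException around LoadXml. A minimal sketch; SafeXml and TryLoad are names invented here and are not part of the posted program:

```csharp
using System;
using System.Xml;

public static class SafeXml
{
    // Returns the parsed document, or null (with a description in 'error')
    // when the markup is not well-formed XML.
    public static XmlDocument TryLoad(string markup, out string error)
    {
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            error = null;
            return doc;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition point at the offending markup.
            error = "Line " + ex.LineNumber + ", position " + ex.LinePosition
                    + ": " + ex.Message;
            return null;
        }
    }
}
```

For instance, "<p>one<br>two</p>" is rejected because the br element is never closed, while "<p>one<br />two</p>" loads. In the button handler the error text could be shown in a message box instead of letting the exception escape.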

I mention it so that other newbies get an idea of:
- how an XML document is parsed;
- how the XML nodes are read;
- how the treeview nodes are programmed.

Hans Kamp.
Nov 15 '05 #5
"Daniel O'Connell" <on******@comcast.net> wrote in message news:<yO*******************@rwcrnsc52.ops.asp.att.net>...
some HTML will not work in an xml parser, because elements aren't closed or
attributes aren't handled properly, which will fail in standard xml readers
other bits inline


I have noticed (possibly wrongly) that newer versions of HTML (4.0, I
believe) can have the modifier "strict" at the beginning, and
then they have to follow the XML syntax.

It could work with HTML but it is very strict. A small HTML syntax error can
crash the program, because no exceptions are caught.

I mention it so that other newbies get an idea of:
- how an XML document is parsed;
- how the XML nodes are read;
- how the treeview nodes are programmed.


i don't precisely understand what you mean here


It partly has to do with my own behaviour in newsgroups with a
teaching/learning purpose like this one.

For me there are two ways of finding the answer to a specific
question. I can start a thread and wait for the answers that others give.
Or I can lurk in the older threads, look for the questions and
read the answers that were posted in reply.

Maybe others have the same attitude. I mean, if others want to know
how to parse XML (although not perfectly at this moment) and how to
add nodes to a TreeView, they can lurk in this thread and learn how
these things can be programmed.

Hans Kamp.
Nov 15 '05 #6

"Hans Kamp" <in**@hanskamp.com> wrote in message
news:b3**************************@posting.google.com...
"Daniel O'Connell" <on******@comcast.net> wrote in message news:<yO*******************@rwcrnsc52.ops.asp.att.net>...
some HTML will not work in an xml parser, because elements aren't closed or
attributes aren't handled properly, which will fail in standard xml readers
other bits inline


I have noticed (possibly wrongly) that newer versions of HTML (4.0, I
believe) can have the modifier "strict" at the beginning, and
then they have to follow the XML syntax.


I am not too sure (I am not an HTML expert), but I know SOME HTML will parse
OK. XHTML surely. The problem is that you can't really rely on whatever site
you want to parse necessarily serving a version of HTML that works.

I prefer to read through the newsgroups myself. Surprisingly, I have only
posted about 3 questions to the groups in the last year, all of which I
ended up answering myself, either through reading back or discovering my own
bug before anyone else did.

I just didn't quite understand the reasoning behind your post; I do now,
lol.

Nov 15 '05 #7
