parsing html files

Does anybody know of a way to parse HTML files when it is unknown what
the file will look like? I need to extract the <title> element from a
group of pages, where some pages may not be titled. There is no .NET
object available that I can see. Does anybody know of any controls
available for purchase? Thanks...

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Nov 15 '05 #1
How about just parsing the raw HTML and looking for the <title> tag?
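
A minimal sketch of that suggestion, not taken from the original posts. It assumes the opening tag is written exactly as <title> with no attributes, and the names TitleScanner and FindTitle are made up for illustration:

```csharp
using System;

public static class TitleScanner
{
    // Returns the text between <title> and </title>, or null if either tag is missing.
    public static string FindTitle(string html)
    {
        const string open = "<title>";
        const string close = "</title>";

        // Case-insensitive search, since HTML tag names may be upper or lower case.
        int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;            // page has no title
        start += open.Length;

        int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;              // unterminated title element

        return html.Substring(start, end - start).Trim();
    }
}
```

Calling TitleScanner.FindTitle on a page without a title returns null, so untitled pages in the group can be skipped or given a default.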

--
--itai
Nov 15 '05 #2
Philip,

If all you are looking for is the title, then I would recommend using
Regular Expressions. It will simply perform better. If you need more
information from the object model, then I would use COM interop and create
an instance of MSHTML.HTMLDocument. This will allow you to load a document
into the object, and access the DOM.

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com
Nov 15 '05 #3
Hi Philip

If it's only the title, I would just search for the <title> element as
Itai suggested. However, if you need more flexibility, there is a
library on gotdotnet that converts HTML to an XML DOM. With it, you
can easily use XPath (I don't have the link).

Regards, Philipp

Nov 15 '05 #4
The project is SgmlReader by Chris Lovett (clovett). You can find it at:
http://www.gotdotnet.com/Community/U...4-C3BD760564BC
--
mikeb
Nov 15 '05 #5
You have two options:
1- Use a regular expression.
2- Convert the HTML into XHTML, load that document into an XmlDom, and
check for the title tag.

Regular Expression:

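// Requires: using System.Text.RegularExpressions;
// Singleline lets '.' span line breaks; the lazy (.*?) stops at the first </title>.
Match title = Regex.Match(html, @"<title\b[^>]*>(.*?)</title>",
RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Pages without a <title> give title.Success == false; strTitle is then empty.
string strTitle = title.Success ? title.Groups[1].Value.Trim() : string.Empty;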

For a converter that turns the HTML into XHTML, see the link below:
http://www.eggheadcafe.com/articles/20030317.asp

Use that library to convert your document into XHTML, then load the
converted document into an XmlDom and search for the title tag.
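
The second half of that approach could look roughly like the sketch below. It is only an illustration: it assumes the page has already been converted to well-formed XHTML (no undefined entities such as &nbsp;), it does not show the converter itself, and the names XhtmlTitle and GetTitle are hypothetical.

```csharp
using System;
using System.Xml;

public static class XhtmlTitle
{
    // Returns the title text, or null when the document has no <title> element.
    public static string GetTitle(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed

        // local-name() ignores the XHTML namespace, so the query works whether or
        // not the converter emitted xmlns="http://www.w3.org/1999/xhtml".
        XmlNode node = doc.SelectSingleNode("//*[local-name()='title']");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

A null result marks an untitled page, which covers the case in the original question where some pages have no title.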

regards,
Zeeshan Anwar.

Nov 15 '05 #6
