Parsing HTML files

Philip Townsend

Does anybody know of a way to parse HTML files when it is unknown what
the file will look like? I need to extract the <title> element from a
group of pages, where some pages may not be titled. There is no .NET
object available that I can see. Does anybody know of any controls
available for purchase? Thanks...

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Nov 15 '05 #1

Itai Raz

How about just parsing the raw HTML and looking for the <title> tag?
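
A minimal sketch of that plain string search, in case it helps. The class and method names (TitleFinder, FindTitle) are made up for illustration, and it assumes .NET 2.0 or later for the StringComparison overloads. It returns null when a page has no title, and it does not try to cope with a <title> that sits inside a comment or script.

```csharp
using System;

static class TitleFinder
{
    // Hypothetical helper: returns the text between <title...> and </title>,
    // or null when the page has no complete title element.
    public static string FindTitle(string html)
    {
        // Case-insensitive search, since pages may use <TITLE> or <Title>.
        int open = html.IndexOf("<title", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;

        // Skip to the end of the opening tag (it may carry attributes).
        int start = html.IndexOf('>', open);
        if (start < 0) return null;

        int end = html.IndexOf("</title", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start + 1, end - start - 1).Trim();
    }
}
```

Calling TitleFinder.FindTitle(html) on each page and checking the result for null covers the untitled pages.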

--
--itai

Nov 15 '05 #2

Nicholas Paldino [.NET/C# MVP]

Philip,

If all you are looking for is the title, then I would recommend using
Regular Expressions. That will be faster than building a full document
object. If you need more information from the object model, then I would
use COM interop and create an instance of MSHTML.HTMLDocument. This will
allow you to load a document into the object and access the DOM.

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Nov 15 '05 #3

Philipp Sumi

Hi Philip

If it's only the title, I would just search for the <title> element as
Itai suggested. However, if you need more flexibility, there is a
library on gotdotnet that converts HTML to an XML DOM. With that, you
can easily use XPath (I don't have the link).

Regards, Philipp

Nov 15 '05 #4

mikeb

Philipp Sumi wrote:

[...] there is a library on gotdotnet that converts HTML to an XML DOM.
With that, you can easily use XPath (I don't have the link).

The project is SgmlReader by Chris Lovett (clovett). You can find it at:
http://www.gotdotnet.com/Community/U...4-C3BD760564BC

--
mikeb

Nov 15 '05 #5

Zeeshan

You have two options:
1- Use a regular expression.
2- Convert the HTML into XHTML, load that document into an XML DOM, and
check for the title tag.

Regular expression:

// Allow attributes on the tag and any characters (including line breaks)
// in the title; Singleline lets "." match newlines.
Match match = Regex.Match(html, @"<title\b[^>]*>(.*?)</title>",
RegexOptions.IgnoreCase | RegexOptions.Singleline);
// Pages without a title produce no match, so check Success first.
string strTitle = match.Success ? match.Groups[1].Value.Trim() : null;

For a converter that turns HTML into XHTML, see the link below:
http://www.eggheadcafe.com/articles/20030317.asp

Use that library to convert your document into XHTML, then load the
converted document into an XML DOM and search for the title tag.
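
For the second option, here is a rough sketch of the XML DOM step only. It assumes the page has already been converted to well-formed XHTML by some other tool; the class and method names (XhtmlTitle, FindTitle) are hypothetical. The XPath uses local-name() so it works whether or not the converter adds the XHTML namespace. LoadXml throws an XmlException on input that is not well formed, and HTML entities such as &nbsp; will also fail unless the converter has already replaced them.

```csharp
using System.Xml;

static class XhtmlTitle
{
    // Hypothetical helper: returns the title text of an XHTML document,
    // or null when the document has no <title> under <head>.
    public static string FindTitle(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.XmlResolver = null;   // do not try to download an external DTD
        doc.LoadXml(xhtml);       // throws XmlException if not well formed

        // local-name() ignores the namespace the elements may be in.
        XmlNode node = doc.SelectSingleNode(
            "/*[local-name()='html']/*[local-name()='head']/*[local-name()='title']");

        return node == null ? null : node.InnerText.Trim();
    }
}
```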

regards,
Zeeshan Anwar.

Nov 15 '05 #6
