html parsing / regular expressions

yonido

hello,

my goal is to get patterns out of email files - say "message
forwarding" patterns (message forwarded from: xx to: yy subject: zz).
now let's say there are tons of these patterns (by gmail, outlook, etc)
- and i want to create some rules for how to get them out of the mail's
html body.

so at first i tried using regular expressions: for example - "any
pattern that starts with a given tag and contains "from:"..." etc.
then i understood that it's not that simple, because different engines
change the content of the html - and i can't expect specific tags (what
if an extra tag is added?)

then i was advised to use an html parser, and heard of GOLD and ANTLR,
but i have no clue how they can help.

html parsing sounds better - because i really care about what the final
SEEN result is, and not the STRUCTURE of it.

any light shed on this problem would be appreciated.

May 21 '06 #1


Kevin Spencer

You really don't want to get into the whole HTML-parsing mess. HTML itself
is a mess, and parsing it is quite difficult.

I think you were on the right track with looking for patterns. The HTML tags
enclosing the data are unimportant. But the data is. So, the first thing you
probably want to do is locate email addresses. There are a number of
patterns for identifying and even parsing email addresses. Just look for
them.
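
As a rough illustration of "just look for them", here is a minimal Python sketch. The pattern is deliberately simple and is an assumption for illustration only; it is not a complete address grammar, and the names `EMAIL_RE` and `find_emails` are made up for this example.

```python
import re

# Simplified pattern (an assumption, not a full RFC 5322 matcher):
# a local part, "@", then two or more dot-separated domain labels.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return every substring of text that looks like an email address, in order."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    sample = "From: Bob <bob@example.com> To: alice@mail.example.org"
    print(find_emails(sample))
```

A looser or stricter pattern may suit real mail better; the point is only that the addresses can be found without caring which tags surround them.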

Next, you need to get the context in which these addresses appear. For that,
you'll need to figure out the rules, which means that you may need to
separate content from HTML tags. And for that, what you really need to do is
to remove all HTML tags, not parse them. An email address may have "<" and
">" characters around it, depending on the format (to set the address apart
from a display name that is not part of it). But those characters, if they
are in the HTML, will not appear literally; they will be HTML-encoded, i.e.
"&lt;" and "&gt;". In the pure HTML, anything between an actual "<" and ">"
will be an HTML tag. So, you may want to remove all of the tags first, and
then look for the data you're seeking, by figuring out the rules for the
patterns that a regular expression can recognize.
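
The order described above (remove the tags, then decode the entities, then apply a pattern) could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: `TAG_RE` is a naive tag stripper that does not handle comments, script/style contents, or a ">" inside an attribute value, and `FORWARD_RE` is a hypothetical rule for one "from: / to: / subject:" layout, not something any particular mail client is known to emit.

```python
import html
import re

# Naive tag remover: anything between a literal "<" and the next ">".
TAG_RE = re.compile(r"<[^>]*>")

# Hypothetical rule for a single forwarding layout; the labels are assumptions.
FORWARD_RE = re.compile(
    r"from:\s*(?P<sender>.+?)\s+to:\s*(?P<to>.+?)\s+subject:\s*(?P<subject>.+)",
    re.IGNORECASE | re.DOTALL,
)


def visible_text(html_body):
    """Strip tags first, then decode entities, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", html_body)
    # Decoding after stripping keeps "&lt;bob@example.com&gt;" from being
    # mistaken for a tag and deleted.
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split())


def parse_forward(html_body):
    """Return the sender/to/subject fields as a dict, or None if no match."""
    match = FORWARD_RE.search(visible_text(html_body))
    return match.groupdict() if match else None


if __name__ == "__main__":
    body = (
        "<div><b>From:</b> Bob &lt;bob@example.com&gt;<br>"
        "<span>To:</span> alice@example.org<br>"
        "Subject: <i>Lunch</i> plans</div>"
    )
    print(parse_forward(body))
```

Because the tags are gone before the pattern runs, it does not matter whether a client wraps "From:" in a bold tag, a span, or nothing at all. Each additional forwarding layout would need its own rule alongside `FORWARD_RE`.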

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

The man who questions opinions is wise.
The man who quarrels with facts is a fool.


May 21 '06 #2
