html parsing / regular expressions

hello,

my goal is to get patterns out of email files - say "message
forwarding" patterns (message forwarded from: xx to: yy subject: zz)
now let's say there are tons of these patterns (by gmail, outlook, etc)
- and i want to create some rules of how to get them out of the mail's
html body.

so at first i tried using regular expressions: for example - "any
pattern that starts with a <p> and contains "from:"..." etc.
then i understood that it's not that simple, because different engines
change the content of the html - and i can't expect specific tags (what
if a <p> is added? or a <span>)

then i was advised to use an html parser, and heard of GOLD and ANTLR,
but i have no clue how they can help.

html parsing sounds better - because i really care for what the final
SEEN result is, and not the STRUCTURE of it.

any light on how to approach this problem would be appreciated.

May 21 '06 #1
You really don't want to get into the whole HTML-parsing mess. HTML itself
is a mess, and parsing it is quite difficult.

I think you were on the right track with looking for patterns. The HTML tags
enclosing the data are unimportant. But the data is. So, the first thing you
probably want to do is locate email addresses. There are a number of
patterns for identifying and even parsing email addresses. Just look for
them.
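
As a rough illustration of "just look for them", here is a minimal Python sketch. The pattern and the name find_emails are assumptions made for this example, not something from the original post, and the pattern is deliberately loose: it is meant to locate likely addresses in a blob of text, not to validate them against the full address grammar.

```python
import re

# Assumption: a deliberately simple pattern that is good enough for
# locating addresses in text. It is not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


sample = "From: <b>alice@example.com</b> To: bob@mail.example.org"
print(find_emails(sample))
```

Because the surrounding tag characters are not part of the pattern's character classes, the match stops at the tag boundaries no matter which tags the mail client wrapped around the address.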

Next, you need to get the context in which these messages appear. For that,
you'll need to figure out the rules, which means that you may need to
separate content from HTML tags. And for that, what you really need to do is
to remove all HTML tags, not parse them. An address field may contain
"<" and ">" characters around the address itself, depending on the format
(for example, to set it apart from a display name that is not part of the
address). But those characters, if they are in the HTML content, will not
appear literally; they will be HTML-encoded, i.e. "&lt;" and "&gt;". In the pure
HTML, anything between an actual "<" and ">" will be an HTML tag. So, you may
want to remove all of the tags first, and then look for the data you're seeking,
by figuring out the rules for the patterns that a regular expression can
recognize.
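
The approach described above (remove the tags, decode the entities, then match on the visible text) could be sketched in Python roughly as follows. The names visible_text, find_forward and FORWARD_RE are hypothetical, and the rule shown covers only one assumed "From: ... To: ... Subject: ..." layout; real mail clients would each need their own rule. The tag stripping is intentionally naive and will misbehave on things like script blocks or comments.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")   # anything between a literal "<" and ">"
WS_RE = re.compile(r"\s+")

# Hypothetical rule for a single forwarding layout; other layouts
# (other clients, other languages) would need their own patterns.
FORWARD_RE = re.compile(
    r"from:\s*(?P<sender>.+?)\s+to:\s*(?P<recipient>.+?)\s+subject:\s*(?P<subject>.+)",
    re.IGNORECASE,
)


def visible_text(markup):
    """Strip tags first, then decode entities, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", markup)
    decoded = html.unescape(no_tags)   # "&lt;" becomes "<" only after tags are gone
    return WS_RE.sub(" ", decoded).strip()


def find_forward(markup):
    """Return the sender/recipient/subject fields, or None if no match."""
    match = FORWARD_RE.search(visible_text(markup))
    return match.groupdict() if match else None


sample = (
    "<p><b>From:</b> Ann &lt;ann@example.com&gt;</p>"
    "<div><span>To:</span> bob@example.com</div>"
    "<p>Subject: Lunch</p>"
)
print(find_forward(sample))
```

The order matters: decoding the entities before stripping would turn "&lt;ann@example.com&gt;" into something that looks like a tag and it would be deleted along with the real tags.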

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

The man who questions opinions is wise.
The man who quarrels with facts is a fool.

May 21 '06 #2
