Bytes | Developer Community

html parsing / regular expressions

hello,

my goal is to extract patterns from email files, say "message
forwarding" patterns (message forwarded from: xx to: yy subject: zz).
now let's say there are tons of these patterns (from gmail, outlook, etc.)
and i want to create some rules for pulling them out of the mail's
html body.

so at first i tried using regular expressions, for example: "any
pattern that starts with a <p> and contains "from:"..." etc.
then i realized it's not that simple, because different engines
change the content of the html, and i can't expect specific tags (what
if a <p> is added? or a <span>?)

then i was advised to use an html parser, and heard of GOLD and ANTLR,
but i have no clue how they can help.

html parsing sounds better, because what i really care about is the final
SEEN result, and not the STRUCTURE of it.

any light on how to approach this problem would be appreciated.

May 21 '06 #1
You really don't want to get into the whole HTML-parsing mess. HTML itself
is a mess, and parsing it is quite difficult.

I think you were on the right track with looking for patterns. The HTML tags
enclosing the data are unimportant. But the data is. So, the first thing you
probably want to do is locate email addresses. There are a number of
patterns for identifying and even parsing email addresses. Just look for
them.
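
As a rough illustration of "just look for them", here is a minimal Python sketch. The pattern is a deliberately simplified assumption, not a full RFC 5322 matcher, and the names EMAIL_RE, find_emails and the sample string are made up for this example. Note that it runs on the raw HTML without caring which tags surround the addresses.

```python
import re

# Simplified address pattern (an assumption for illustration only;
# real-world address syntax is considerably more permissive).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)

# Hypothetical HTML fragment; the surrounding tags do not matter.
sample = ("<p>From: Alice &lt;alice@example.com&gt;</p>"
          "<span>To: bob@example.org</span>")
print(find_emails(sample))
```

A stricter or looser pattern may suit your mail better; the point is only that the addresses can be located regardless of the markup around them.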

Next, you need to get the context in which these addresses appear. For that,
you'll need to figure out the rules, which means that you may need to
separate the content from the HTML tags. And for that, what you really need
to do is remove all the HTML tags, not parse them. An email address may have
"<" and ">" characters around part of it, depending on the format (for
example "Name <address>", where the name is not part of the address). But
if those characters are in the HTML content, they will not appear literally;
they will be the HTML encoding for those characters, i.e. "&lt;" and "&gt;".
In the pure HTML, anything between an actual "<" and ">" will be an HTML tag.
So, you may want to remove all the tags first, and then look for the data
you're seeking by figuring out rules for the patterns that a regular
expression can recognize.
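
The approach above could be sketched in Python roughly like this. It is only a sketch under assumptions: the tag pattern is naive (it will trip over a ">" inside an attribute value, comments, or script/style content), the "from / to / subject" rule is a guess at one forwarding format, and all names (TAG_RE, FORWARD_RE, visible_text, find_forward_header, the sample body) are hypothetical. The order matters: tags are removed first and the entities decoded afterwards, so an encoded "&lt;alice@example.com&gt;" is never mistaken for a tag.

```python
import html
import re

# Naive tag matcher: anything between a literal "<" and the next ">".
TAG_RE = re.compile(r"<[^>]*>")

# One assumed forwarding rule; other mail clients will need other rules.
FORWARD_RE = re.compile(
    r"from:\s*(?P<sender>.+?)\s+to:\s*(?P<recipient>.+?)\s+"
    r"subject:\s*(?P<subject>.+)",
    re.IGNORECASE,
)

def visible_text(markup):
    """Remove tags, then decode entities, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", markup)
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split())

def find_forward_header(markup):
    """Return the sender/recipient/subject fields as a dict, or None."""
    match = FORWARD_RE.search(visible_text(markup))
    return match.groupdict() if match else None

# Hypothetical forwarded-message fragment with arbitrary tags mixed in.
body = ("<div><p><b>From:</b> Alice &lt;alice@example.com&gt;</p>"
        "<span>To: bob@example.org</span><br>Subject: <i>Lunch</i></div>")
print(find_forward_header(body))
```

Because the rule runs on the visible text only, an extra <p> or <span> added by a mail client does not change the result.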

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

The man who questions opinions is wise.
The man who quarrels with facts is a fool.

May 21 '06 #2
