Bytes | Software Development & Data Engineering Community

html parsing / regular expressions

hello,

my goal is to get patterns out of email files - say "message
forwarding" patterns (message forwarded from: xx to: yy subject: zz).
now let's say there are tons of these patterns (by gmail, outlook, etc)
- and i want to create some rules for getting them out of the mail's
html body.

so at first i tried using regular expressions: for example - "any
pattern that starts with a <p> and contains "from:"..." etc.
then i understood that it's not that simple, because different engines
change the content of the html - and i can't expect specific tags (what
if a <p> is added? or a <span>?)

then i've been advised to use an html parser, and heard of GOLD and ANTLR,
but i have no clue how that can help.

html parsing sounds better - because i really care for what the final
SEEN result is, and not the STRUCTURE of it.

any light on how to approach this problem would be appreciated.

May 21 '06 #1
You really don't want to get into the whole HTML-parsing mess. HTML itself
is a mess, and parsing it is quite difficult.

I think you were on the right track with looking for patterns. The HTML tags
enclosing the data are unimportant. But the data is. So, the first thing you
probably want to do is locate email addresses. There are a number of
patterns for identifying and even parsing email addresses. Just look for
them.
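
As a rough illustration of the "just look for them" step, here is a minimal Python sketch. The pattern is a deliberately simple assumption of mine, not a complete address grammar, and the names EMAIL_RE and find_emails are made up for the example.

```python
import re

# A deliberately simple pattern (an assumption for this sketch);
# real-world address syntax is broader than what this accepts.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str) -> list[str]:
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    sample = "From: Alice <alice@example.com> To: bob.smith@mail.example.org"
    print(find_emails(sample))
```

The surrounding "<" and ">" are not part of the match, so the same function works whether or not a display name precedes the address.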

Next, you need to get the context in which these messages appear. For that,
you'll need to figure out the rules, which means that you may need to
separate content from HTML tags. And for that, what you really need to do is
to remove all HTML tags, not parse them. An email address may have "<" and
">" characters around it, depending on the format (for example, to set the
address apart from a display name that is not part of the address). But
those characters, if they are part of the content, will not appear literally
in the HTML; they will appear HTML-encoded, i.e. as "&lt;" and "&gt;". In the
raw HTML, anything between an actual "<" and ">" will be an HTML tag. So, you
may want to remove all of the tags first, and then look for the data you're
seeking, by figuring out the rules for the patterns that a regular expression
can recognize.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

The man who questions opinions is wise.
The man who quarrels with facts is a fool.

May 21 '06 #2
