"Brett" <no@spam.com> wrote:
> I'd like to find URLs inside of an email message. If there is anything
> between the <a></a>, I'd like to also get that and associate it with the
> URL in the <a> tag.
> Content between the <a></a> might be text or an image. The <img> tag will
> also have a URL, which I need to get. It will be associated with the <a>
> tag's URL. When I say associate, I'll just store the two with some
> relational ID into a database.
> Besides brute regular expression parsing of the text, is there a better
> way to extract this content?
For parsing the HTML file:
MSHTML Reference
<URL:http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp>
- or -
.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>
Download:
<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>
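To illustrate the parser-based approach, here is a minimal sketch. It is written in Python with the standard library's tolerant 'html.parser' module rather than MSHTML or the Html Agility Pack, so treat it as an outline of the idea, not as either library's API. The names 'LinkCollector' and 'extract_links' are made up for this example. Each <a> that has an href becomes one record holding the link text and the src of any <img> nested inside it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects each <a href> with the text and <img src> values nested inside it."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished records, in document order
        self._current = None  # record for the <a> that is currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._current = {"href": attrs["href"], "text": "", "images": []}
        elif tag == "img" and self._current is not None and attrs.get("src"):
            self._current["images"].append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Text is only recorded while an <a> element is open.
        if self._current is not None:
            self._current["text"] += data


def extract_links(html):
    """Hypothetical helper: returns a list of {'href', 'text', 'images'} dicts."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Each record could then be stored as one row for the link plus one row per image, sharing an ID. Note that this sketch drops an <a> that is never closed; a real implementation would have to decide how to handle that.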
If the file is well-formed XHTML, you can use the classes contained in
the 'System.Xml' namespace to read the information from it.
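As a rough illustration of the XML route, the sketch below uses Python's 'xml.etree.ElementTree' as a stand-in for 'System.Xml'. It assumes the input is well-formed and declares the standard XHTML namespace; the function name 'extract_links_xhtml' is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace that XHTML documents declare on the root element.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_links_xhtml(xhtml):
    """Hypothetical helper: returns one record per <a href> in an XHTML string."""
    root = ET.fromstring(xhtml)
    records = []
    for a in root.iter(XHTML_NS + "a"):
        href = a.get("href")
        if href is None:
            continue  # named anchors without a target are skipped
        # All text inside the <a>, including text of nested elements.
        text = "".join(a.itertext()).strip()
        # The src of every <img> nested inside this <a>.
        images = [img.get("src") for img in a.iter(XHTML_NS + "img") if img.get("src")]
        records.append({"href": href, "text": text, "images": images})
    return records
```

An XML parser rejects input that is not well-formed, and named entities such as &nbsp; are not defined unless a DTD is available, so this only works for mail bodies that really are XHTML.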
As you already said, regular expressions can be used to do what you are
trying to achieve:
.NET Framework Developer's Guide -- Example: Scanning for 'HREF's
<URL:http://msdn.microsoft.com/library/en-us/cpguide/html/cpconexamplescanningforhrefs.asp>
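For comparison, a regular-expression scan might look like the sketch below. It is written in Python, and the pattern is my own simplification, not the one from the linked example. It accepts double-quoted, single-quoted, and unquoted href values; 'find_hrefs' is a made-up name.

```python
import re

# Matches href="...", href='...' or an unquoted href=value, case-insensitively.
HREF_RE = re.compile(
    r"""href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def find_hrefs(text):
    """Hypothetical helper: returns every href value found in text, in order."""
    # Exactly one of the three groups is set for each match.
    return [
        next(g for g in m.groups() if g is not None)
        for m in HREF_RE.finditer(text)
    ]
```

This finds the URLs but knows nothing about nesting, so it cannot reliably tell which text or <img> belongs to which <a>. For that association, a parser is the safer choice.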
--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://dotnet.mvps.org/dotnet/faqs/>