Can this be RegEx, or do I have to go DOM?

Hi there

I have a series of HTML tables (well-formed, with elements ID'd quite
nicely) and I need to extract the contents from certain TDs.

For example, I'd like to get "Hi Mom!" from the example below:
<td class="RSCWeb MainMsg">Hi Mom!</td>

My RegEx skills leave much to be desired; I don't know how to capture data
*between* two things (i.e. the <td blah blah></td>)... can it be done? If
so, can someone point me to how it can be done, or give me a big tip?

If it can't be done, do I have to load the <table>s as XML and go through
the nodes searching for my content? That seems like a long-winded way to
go, and though the tables are well-formed, they are quite large and deep.

There must be an easy RegEx solution if I always want to capture data
between <x attributes="y"> and </x>?

Tips, guidance appreciated!
Sep 29 '07 #1
2 Replies


On Sep 29, 9:43 am, Good Man <he...@letsgo.com> wrote:
> There must be an easy RegEx solution if I always want to capture data
> between <x attributes="y"> and </x>?

There is. In fact, it's the example used at the PHP preg_match_all
page:

http://www.php.net/manual/en/functio...-match-all.php

To learn more about regex, see the PHP pattern syntax docs:

http://www.php.net/manual/en/referen...ern.syntax.php

There are some helpful references in the user comments.
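
As a rough sketch of what that looks like for the cell in the question: the pattern and the sample table below are my own illustration, not copied from the manual, and they assume the class attribute is written exactly as "RSCWeb MainMsg" with double quotes.

```php
<?php
// Sample markup taken from the question; the surrounding table is made up.
$html = '<table><tr>'
      . '<td class="RSCWeb Other">Ignore me</td>'
      . '<td class="RSCWeb MainMsg">Hi Mom!</td>'
      . '</tr></table>';

// Match a <td> whose class attribute is exactly "RSCWeb MainMsg" and capture
// everything up to the nearest closing </td>.
//   [^>]*  allows other attributes before or after the class
//   (.*?)  is a lazy capture, so it stops at the first </td>
//   i = case-insensitive, s = let "." match newlines too
$pattern = '~<td\b[^>]*\bclass="RSCWeb MainMsg"[^>]*>(.*?)</td>~is';

$count = preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured cell contents, one entry per matching cell.
$cells = $count ? $matches[1] : [];
print_r($cells);
```

Because the capture stops at the first </td> it meets, this only works if the target cell does not itself contain a nested table. For deeply nested tables a parser is the safer route.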

One additional bit of advice that might help: when trying to parse
data from a particular section of a large mass of tags like a web
page, I find it easier, if possible, to first isolate the section I'll
be focusing on by clipping at some consistent "landmarks". The <body>
tag would be one example. This doesn't even require regex per se; it
can be done with other PHP string functions like strpos and substr.

For example, say you want to parse the last result from a page of
Google search results (http://lastgoogle.com/). You could look for a
unique constant marker at the bottom of the page like '<div
id=navbar', clip there, then use strrpos to backtrack from there to
another landmark to isolate the section you'll be parsing by regex.
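
Something along these lines, as a minimal sketch. The page content is made up, and '<div class=g' is only a hypothetical start landmark chosen for illustration; use whatever marker is actually stable in your page.

```php
<?php
// Made-up page: three "result" blocks followed by a navigation bar.
$page = '<body>'
      . '<div class=g>first result</div>'
      . '<div class=g>second result</div>'
      . '<div class=g>last result</div>'
      . '<div id=navbar>1 2 3 Next</div>'
      . '</body>';

$endMarker   = '<div id=navbar';  // landmark named in the post
$startMarker = '<div class=g';    // hypothetical landmark, for illustration only

$section = null;

// 1. Clip everything from the end landmark onward.
$end = strpos($page, $endMarker);
if ($end !== false) {
    $head = substr($page, 0, $end);

    // 2. Backtrack from the clip point to the last start landmark.
    $start = strrpos($head, $startMarker);
    if ($start !== false) {
        $section = substr($head, $start);
    }
}

echo $section, "\n";
```

Note the strict !== false comparisons: strpos and strrpos return 0 when the marker sits at the very start of the string, which a loose check would mistake for "not found". The isolated $section is then small enough to hand to a simple regex.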

Good defensive programming also helps here. Stuff like this usually
requires some trial and error, and defensive checks can alert you if
any of the patterns you're expecting to be there unexpectedly
change.
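
One possible shape for that kind of check, assuming the same td pattern as in the question. The function name extract_main_msg and the error messages are hypothetical, and logging via error_log is just one way to raise the alert.

```php
<?php
// Hypothetical helper: returns the text of the first matching cell, or throws
// so that a changed page layout is noticed instead of silently yielding nothing.
function extract_main_msg($html)
{
    $pattern = '~<td\b[^>]*\bclass="RSCWeb MainMsg"[^>]*>(.*?)</td>~is';
    $result  = preg_match($pattern, $html, $m);

    if ($result === false) {
        // The regex engine itself failed (e.g. backtrack limit reached).
        throw new RuntimeException('Regex error, code ' . preg_last_error());
    }
    if ($result === 0) {
        // The pattern is fine, but the expected markup is no longer there.
        throw new RuntimeException('MainMsg cell not found; did the page layout change?');
    }
    return trim($m[1]);
}

try {
    echo extract_main_msg('<td class="RSCWeb MainMsg">Hi Mom!</td>'), "\n";
    // This second call simulates a page whose markup has changed.
    echo extract_main_msg('<td class="Renamed">Hi Mom!</td>'), "\n";
} catch (RuntimeException $e) {
    error_log('Scraper alert: ' . $e->getMessage());
}
```

The point is to distinguish "the regex broke" from "the page changed", since both otherwise look like an empty result.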

Good luck,
Tom

Sep 29 '07 #2

There is of course a way to do it with regex, but if your XHTML is
valid, you can just use an XML parser and get all those items in a
simple function.
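
A minimal sketch of that approach using PHP's DOM extension (DOMDocument and DOMXPath). The nested sample table is made up, and it assumes the markup carries no XML namespace; a full XHTML document with xmlns="http://www.w3.org/1999/xhtml" would need the namespace registered on the DOMXPath object and a prefix in the query.

```php
<?php
// Well-formed fragment modeled on the question; the nesting is made up.
$xml = '<table><tr><td class="RSCWeb Other">'
     . '<table><tr><td class="RSCWeb MainMsg">Hi Mom!</td></tr></table>'
     . '</td></tr></table>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    throw new RuntimeException('Markup is not well-formed XML');
}

// XPath finds the cell at any depth, so nested tables are not a problem.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[@class="RSCWeb MainMsg"]');

// Collect the text content of every matching cell.
$texts = [];
foreach ($nodes as $node) {
    $texts[] = trim($node->textContent);
}
print_r($texts);
```

There is no manual walking through nodes here: the XPath query does the searching, which addresses the "long-winded" worry in the question.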

Sep 29 '07 #3
