Can this be RegEx, or do I have to go DOM?

Hi there

I have a series of HTML tables (well-formed, with elements ID'd quite
nicely) and I need to extract the contents from certain TDs.

For example, I'd like to get "Hi Mom!" from the example below:
<td class="RSCWeb MainMsg">Hi Mom!</td>

My RegEx skills leave much to be desired; I don't know how to capture data
*between* two things (i.e. the <td blah blah></td>)... can it be done? If
so, can someone point me to how it can be done, or give me a big tip?

If it can't be done, do I have to load the <table>s as XML and go through
the nodes searching for my content? That seems like a long-winded way to
go, and though the tables are well-formed, they are quite large and deep.

There must be an easy RegEx solution if I always want to capture data
between <x attributes="y"> and </x>?

Tips, guidance appreciated!
Sep 29 '07 #1
On Sep 29, 9:43 am, Good Man <he...@letsgo.com> wrote:
> There must be an easy RegEx solution if I always want to capture data
> between <x attributes="y"> and </x>?

There is. In fact, it's the example used at the PHP preg_match_all
page:

http://www.php.net/manual/en/functio...-match-all.php

To learn more about regex, see the PHP pattern syntax docs:

http://www.php.net/manual/en/referen...ern.syntax.php

There are some helpful references in the user comments.
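
To make that concrete for the <td> in the question, here is a minimal sketch using preg_match_all. It is not the example from the manual page; the pattern, the variable names and the second sample cell are my own assumptions, and it relies on the class attribute being written exactly as shown.

```php
<?php
// Sample markup based on the question; the second cell is made up
// so there is something the pattern should NOT match.
$html = '<table><tr>'
      . '<td class="RSCWeb MainMsg">Hi Mom!</td>'
      . '<td class="RSCWeb Other">ignore me</td>'
      . '</tr></table>';

// <td ... class="RSCWeb MainMsg" ...> followed by a lazy capture up to
// the first </td>. [^>]* keeps the match inside the opening tag, the
// "s" flag lets . span newlines, and "i" ignores the case of the tag.
$pattern = '~<td\b[^>]*\bclass="RSCWeb MainMsg"[^>]*>(.*?)</td>~is';

$count = preg_match_all($pattern, $html, $matches);
if ($count === false) {
    // preg_match_all returns false when the pattern fails to run.
    throw new RuntimeException('regex error');
}

// $matches[1] holds whatever the first parenthesised group captured.
$cells = array_map('trim', $matches[1]);
print_r($cells);
```

The lazy (.*?) stops at the first </td>, so a cell that contains a nested table would be cut short; for markup like that a parser is the safer route.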

One additional bit of advice that might help. When trying to parse
data from a particular section of a large mass of tags like a web
page, I find it easier, if possible, to first isolate the section I'll
be focusing on by clipping at some consistent "landmarks". The <body>
tag would be one example. This doesn't even require regex per se but
can use other PHP string functions like strpos and substr.

For example, say you want to parse the last result from a page of
Google search results (http://lastgoogle.com/). You could look for a
unique, constant marker at the bottom of the page such as '<div
id=navbar', clip there, then use strrpos to backtrack from there to
another landmark and isolate the section you'll be parsing by regex.
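
A rough sketch of that clipping step, using only strpos, strrpos and substr. The clipSection helper, the sample page and the "result" class are hypothetical; only the '<div id=navbar' marker comes from the description above.

```php
<?php
/**
 * Returns the part of $page that starts at the last $startMark found
 * before $endMark and stops just before $endMark.
 * Returns null if either landmark is missing.
 */
function clipSection(string $page, string $startMark, string $endMark): ?string
{
    $end = strpos($page, $endMark);
    if ($end === false) {
        return null;
    }
    // Throw away everything from the end landmark onwards.
    $head = substr($page, 0, $end);

    // Backtrack: find the last start landmark in what is left.
    $start = strrpos($head, $startMark);
    if ($start === false) {
        return null;
    }
    return substr($head, $start);
}

// Made-up page; the "result" class is an assumption for this example.
$page = '<body><div class="result">first</div>'
      . '<div class="result">last</div>'
      . '<div id=navbar>1 2 3</div></body>';

$section = clipSection($page, '<div class="result"', '<div id=navbar');
echo $section, "\n";   // prints: <div class="result">last</div>
```

The regex then only has to deal with $section, which is much smaller and more predictable than the whole page.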

Good defensive programming also helps here. Stuff like this usually
requires some trial and error, and defensive checks can alert you if
any of the patterns you're expecting to be there unexpectedly change.
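
One possible shape for that kind of check is sketched below. The extractOrFail helper, its label argument and the sample strings are all hypothetical; the point is only that a pattern which stops matching raises an error instead of quietly returning nothing.

```php
<?php
/**
 * Runs $pattern against $html and returns capture group 1.
 * Throws instead of silently returning nothing, so a changed page
 * layout shows up as an error rather than as missing data.
 */
function extractOrFail(string $pattern, string $html, string $label): string
{
    $result = preg_match($pattern, $html, $m);
    if ($result === false) {
        throw new RuntimeException($label . ': regex failed, code ' . preg_last_error());
    }
    if ($result === 0) {
        throw new UnexpectedValueException($label . ': pattern not found, page layout may have changed');
    }
    return $m[1];
}

$alert = null;
try {
    // The sample input deliberately lacks the expected cell.
    $msg = extractOrFail('~<td class="MainMsg">(.*?)</td>~s', '<td class="Other">x</td>', 'main message');
} catch (UnexpectedValueException $e) {
    // A real script might call error_log() or send an email here.
    $msg = null;
    $alert = $e->getMessage();
}
```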

Good luck,
Tom

Sep 29 '07 #2
There is of course a way to do it with regex, but if your XHTML is
valid, you can just use an XML parser and get all those items with a
simple function.
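
As a sketch of what that could look like, assuming PHP's DOM extension is available and the fragment really is well-formed, DOMDocument plus an XPath query pulls out the cell without any regex. The sample markup and variable names are assumptions.

```php
<?php
// Sample fragment based on the question; the second cell is made up.
$xhtml = '<table><tr>'
       . '<td class="RSCWeb MainMsg">Hi Mom!</td>'
       . '<td class="RSCWeb Other">ignore me</td>'
       . '</tr></table>';

$doc = new DOMDocument();
// loadXML returns false (and emits warnings) if the markup is not
// well-formed XML.
if (!$doc->loadXML($xhtml)) {
    throw new RuntimeException('markup is not well-formed XML');
}

// Select every <td> whose class attribute is exactly this string.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[@class="RSCWeb MainMsg"]');

$texts = [];
foreach ($nodes as $node) {
    $texts[] = trim($node->textContent);
}
print_r($texts);
```

A full XHTML page that declares the XHTML namespace would need that namespace registered on the DOMXPath object before the query matches anything, and DOMDocument::loadHTML is the more forgiving option when the markup turns out not to be strictly valid.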

Sep 29 '07 #3
