Using regex in html code

Hi all.

I have an HTML table with multiple rows (one example row is below). I
would like to extract everything within the <td> tags into groups on a
row-by-row basis. The process would be: find the first row, extract the
column data, store the data in a text file, find the next row, extract
the column data, store the data in a text file, and so on until we have
gone through all the rows in the document.

Please help.

Thanks in advance.

<tr>
<td>1</td>
<td>GET UP </td>
<td>CIARA FT CHAMILLIONAIRE</td>
<td>04:25</td>
<td>128.66</td>
<td></td>
<td>Step Up [Soundtrack]</td>
<td></td>
<td>R&B/Rap</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>D:\Ciara feat. Chamillionare - Get Up.mp3</td>
<td>Stripe, (-1.6 dB, -0.7 dB)</td>
<td></td>
<td></td>
<td>2006/01/01</td>
<td>256000</td>
<td></td>
<td>2</td>
<td>2007/03/28</td>
<td>2006/12/04</td>
<td>2007/3/28 20:50:16</td>
<td>00:07</td>
<td>B</td>
</tr>

May 23 '07 #1
* Nightcrawler wrote, On 23-5-2007 6:59:
> Hi all.
>
> I have an HTML table with multiple rows (one example row is below). I
> would like to extract everything within the <td> tags into groups on a
> row-by-row basis. The process would be: find the first row, extract the
> column data, store the data in a text file, find the next row, extract
> the column data, store the data in a text file, and so on until we have
> gone through all the rows in the document.

You're better off using the HTML Agility Pack.
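
For comparison, the parser-based route can be sketched as follows. This is only an illustration in Python using the standard library's html.parser module, not the HTML Agility Pack itself (which is a .NET library). The class name RowCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TR> and <tr> are treated alike.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample input, loosely based on the row in the question.
html = """<table>
<tr><td>1</td><td>GET UP </td><td></td><td>04:25</td></tr>
<tr><td>2</td><td>Step Up [Soundtrack]</td><td>R&amp;B/Rap</td></tr>
</table>"""

collector = RowCollector()
collector.feed(html)
collector.close()
rows = collector.rows
```

Each entry in rows is one table row, so writing a row to a text file could be as simple as joining its cells with a tab character.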

But it can be done using regex:

<tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
ExplicitCapture ON
SingleLine ON
CaseInsensitive ON
(in .NET: RegexOptions.ExplicitCapture | RegexOptions.Singleline | RegexOptions.IgnoreCase)

This will give you one group which will hold all the TDs found. I've
written it to be quite robust, but this isn't the best available
implementation. If the HTML tables are in a well-known format, this
will be no problem. If they come from an external source, you might want
to test more rigorously.

I'll try to explain:
<tr((?!<td).)*
Find each opening TR tag and match everything after it until you
reach a <td

(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*
Snip off the opening TD tag and capture its content until you reach a </td.
Then match the closing </td> tag and any whitespace or newline that might
follow. Repeat until all TDs in this row have been captured.

((?!</tr).)*</tr[^>"*]*>
Match everything that follows the last <td>...</td> pair, up to and
including the closing </tr> tag.

Executing Regex.Matches will give you a MatchCollection, with one Match
per row. Each Match has one Group named "td". That Group has a list of
Captures containing every value captured by the group, one per cell.
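
The Match, Group and Captures hierarchy described above is specific to .NET. As a contrast, here is a small Python sketch (an assumption for illustration, not code from this thread). Python's standard re module keeps only the last repetition of a repeated group, so the per-row list of cells has to be rebuilt with a second scan of each row match. The simplified patterns and the sample row are invented for the example.

```python
import re

# Hypothetical one-row sample.
html = "<tr><td>1</td><td>GET UP</td><td>04:25</td></tr>"

# A repeated named group, loosely modelled on the (?<td>...) group above.
# Python spells named groups (?P<name>...) rather than (?<name>...).
row_pattern = re.compile(
    r"<tr[^>]*>\s*(?:<td[^>]*>(?P<td>.*?)</td>\s*)*</tr>",
    re.IGNORECASE | re.DOTALL,
)

match = row_pattern.search(html)

# Unlike .NET's Group.Captures, only the final repetition is retained here.
last_cell = match.group("td")

# Emulate the Captures list by re-scanning the text of the matched row.
cell_pattern = re.compile(r"<td[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
cells = cell_pattern.findall(match.group(0))
```

In this sketch last_cell holds only the third cell, while cells holds all three, which is the information .NET exposes directly through the Captures collection.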

Kind Regards,

Jesse Houwing
May 23 '07 #2
You will need to split the string in order to do this. It can be done
using two very similar regular expressions:

(?s)<tr[^>]*>(?<content>.*?)</tr>

Splits the table into a match for each row.

Once you have the array of row strings, you can use:

(?s)<td[^>]*>(?<content>.*?)</td>

Splits the row into a match for each column.
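
The two-pass idea can be sketched in Python roughly as follows. The patterns are the two above, except that Python writes named groups as (?P<content>...). The function name extract_rows, the sample markup and the tab-separated output format are assumptions made for this example.

```python
import re

# Pass 1: one match per table row. Pass 2: one match per cell in that row.
ROW_RE = re.compile(r"(?s)<tr[^>]*>(?P<content>.*?)</tr>")
CELL_RE = re.compile(r"(?s)<td[^>]*>(?P<content>.*?)</td>")


def extract_rows(html):
    """Return a list of rows, each row being a list of cell strings."""
    rows = []
    for row_match in ROW_RE.finditer(html):
        row_html = row_match.group("content")
        cells = [m.group("content").strip() for m in CELL_RE.finditer(row_html)]
        rows.append(cells)
    return rows


# Hypothetical sample input with two short rows.
sample = """
<tr>
<td>1</td>
<td>GET UP </td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>04:25</td>
</tr>
"""

rows = extract_rows(sample)

# One tab-separated line per row, ready to be written to a text file.
lines = ["\t".join(cells) for cells in rows]
```

Because both patterns are non-greedy, this sketch assumes the table contains no nested tables; a nested <tr> or <td> would end a match early.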

The reason it can't be done in one pass is that you need to create a match
for each row, and the match cannot contain "sub-matches," only groups, and
unless you know how many columns there are, you can't create a group for
each column. If you DO know how many columns there are, you can, as in:

(?s)<tr[^>]*>.*?(?<row1><td[^>]*>(?<row1content>.*?)</td>).*?(?<row2><td[^>]*>(?<row2content>.*?)</td>).*?</tr>
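
With a known column count, the single-pattern form can be tried out in Python like this. It is only a sketch: the group names are kept from the pattern above (note that row1content and row2content actually hold the first and second column of a row), the (?P<...>) spelling is Python's, and the two-column sample is made up.

```python
import re

# The fixed-column pattern from above, split across lines for readability.
TWO_COLUMN_ROW = re.compile(
    r"(?s)<tr[^>]*>.*?"
    r"(?P<row1><td[^>]*>(?P<row1content>.*?)</td>).*?"
    r"(?P<row2><td[^>]*>(?P<row2content>.*?)</td>).*?"
    r"</tr>"
)

# Hypothetical sample: two rows of exactly two cells each.
sample = "<tr><td>1</td><td>GET UP</td></tr><tr><td>2</td><td>04:25</td></tr>"

# Each match is one row; the named groups are its first two cells.
pairs = [
    (m.group("row1content"), m.group("row2content"))
    for m in TWO_COLUMN_ROW.finditer(sample)
]
```

The pattern needs one named group per column, so it only suits tables whose column count is fixed in advance.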

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net

May 23 '07 #3
<SNIP>
> The reason it can't be done in one pass is that you need to create a match
> for each row, and the match cannot contain "sub-matches," only groups, and
> unless you know how many columns there are, you can't create a group for
> each column. If you DO know how many columns there are, you can, as in:

Kevin,

You actually can get multiple results for the same named group. The
structure is as follows:

MatchCollection 1 ----* Groups 1 ----* Captures

Which - sort of - translates to:

Rows ----* Cells ----* Cell Values

The expression which will capture this info correctly would then be
something like this:

<tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
ExplicitCapture ON
SingleLine ON
CaseInsensitive ON

I tested it and it works like a charm.

Kind regards,

Jesse Houwing
May 23 '07 #4
I've got to hand it to you, Jesse. That is possibly the most creative use
I've ever seen of regular expressions and the System.Text.RegularExpressions
namespace and classes. I tested it too, and while it took me a good while to
get my head around what it was doing, and I will have to mull it over some
more before I fully understand it, it does work beautifully. I'd love to see
some more of your regex work some time.

--
HTH,

Kevin Spencer
Microsoft MVP

May 24 '07 #6
* Kevin Spencer wrote, On 24-5-2007 13:48:
> I've got to hand it to you, Jesse. That is possibly the most creative use
> I've ever seen of regular expressions and the System.Text.RegularExpressions
> namespace and classes. I tested it too, and while it took me a good while to
> get my head around what it was doing, and I will have to mull it over some
> more before I fully understand it, it does work beautifully. I'd love to see
> some more of your regex work some time.

Kevin,

Thank you :).

Jesse
May 24 '07 #7
