Code for returning HTML table data into array?

Before I go building this, I want to know if it already exists.

I need some PHP code that will read a web page and return all text that
comes between <td></td> tags in an array.

So if there were three tables on that page, it would return the first
table's fourth row, second column in a variable such as:

$tableArray[0][3][1]
//          ^  ^  ^ - 2nd <td></td>
//          ^  ^ - 4th <tr></tr>
//          ^ - 1st <table></table>

Does something like this exist somewhere where I can grab it, or do I have
to build it from scratch?
--
[ Sugapablo ]
[ http://www.sugapablo.com <--music ]
[ http://www.sugapablo.net <--personal ]
[ su*******@12jabber.com <--jabber IM ]
Jul 17 '05 #1
Sugapablo wrote:
> Before I go building this, I want to know if it already exists.
>
> I need some PHP code that will read a web page and return all text that
> comes between <td></td> tags in an array.
>
> So if there were three tables on that page, it would return the first
> table's fourth row, second column in a variable such as:
>
> $tableArray[0][3][1]
> //          ^  ^  ^ - 2nd <td></td>
> //          ^  ^ - 4th <tr></tr>
> //          ^ - 1st <table></table>
>
> Does something like this exist somewhere where I can grab it, or do I have
> to build it from scratch?


I just recently posted a routine that gets all <input>s from within
<form>s. This tiny URL fetches it from the Google archive:
http://tinyurl.com/3629k

You just have to change it to fetch all <table>s, and all <tr>s from
each table, then all <td>s (maybe <th>s too?) from each <tr>.
Something like

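// Non-greedy patterns; they assume no nested tables and no missing end tags.
$table_regexp = '@<table\b[^>]*>(.*?)</table>@is';
$tr_regexp    = '@<tr\b[^>]*>(.*?)</tr>@is';
$td_regexp    = '@<t[dh]\b[^>]*>(.*?)</t[dh]>@is';
$tableArray = array();
preg_match_all($table_regexp, $html, $tables);
foreach ($tables[1] as $t => $table_html) {      // [1] holds the captured inner HTML
    preg_match_all($tr_regexp, $table_html, $trs);
    foreach ($trs[1] as $r => $tr_html) {
        preg_match_all($td_regexp, $tr_html, $tds);
        $tableArray[$t][$r] = $tds[1];           // cell contents for this row
    }
}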
Happy Coding :)
--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
Jul 17 '05 #2
Pedro Graca wrote:
> You just have to change it to fetch all <table>s, and all <tr>s from
> each table, then all <td>s (maybe <th>s too?) from each <tr>.


Oops, I just remembered something that turns this into a nasty problem:
you can have <table>s inside <table>s (and, in fact, often do!)
--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
Jul 17 '05 #3
Are there nested tables in the file? If there are, then the HTML will be
rather difficult to parse.

"Sugapablo" <ru********@sugapablo.com> wrote in message
news:sl***********************@dell.sugapablo.net. ..
> Before I go building this, I want to know if it already exists.
>
> I need some PHP code that will read a web page and return all text that
> comes between <td></td> tags in an array.
>
> So if there were three tables on that page, it would return the first
> table's fourth row, second column in a variable such as:
>
> $tableArray[0][3][1]
> //          ^  ^  ^ - 2nd <td></td>
> //          ^  ^ - 4th <tr></tr>
> //          ^ - 1st <table></table>
>
> Does something like this exist somewhere where I can grab it, or do I have
> to build it from scratch?

Jul 17 '05 #4
Yeah, and another problem is missing end tags. Parsing HTML is such a pain.
Almost makes you want to try some hack like outputting the captured HTML in
an invisible inline frame, then using JavaScript to grab the data and post it
back to the server.

"Pedro Graca" <he****@hotpop.com> wrote in message
news:bt************@ID-203069.news.uni-berlin.de...
> Pedro Graca wrote:
> > You just have to change it to fetch all <table>s, and all <tr>s from
> > each table, then all <td>s (maybe <th>s too?) from each <tr>.
>
> Oops, I just remembered something that turns this into a nasty problem:
> you can have <table>s inside <table>s (and, in fact, often do!)

Jul 17 '05 #5
On Sun, 11 Jan 2004 20:01:30 -0000, Sugapablo <ru********@sugapablo.com> wrote:
> Before I go building this, I want to know if it already exists.
>
> I need some PHP code that will read a web page and return all text that
> comes between <td></td> tags in an array.
>
> So if there were three tables on that page, it would return the first
> table's fourth row, second column in a variable such as:
>
> $tableArray[0][3][1]
> //          ^  ^  ^ - 2nd <td></td>
> //          ^  ^ - 4th <tr></tr>
> //          ^ - 1st <table></table>
>
> Does something like this exist somewhere where I can grab it, or do I have
> to build it from scratch?


Parsing HTML is not trivial, and coping with marginal and outright broken HTML
is a real pain. Perl has some excellent HTML parsing modules, and one in
particular is ideal for this: HTML::TableExtract.

You could write a Perl script and pass it the data you want, and have it
return the information in some more convenient form. Not particularly elegant,
since you have to start up a Perl interpreter (although that can be mitigated
using something like PersistentPerl, which keeps the interpreter running for a
while afterwards so it's reusable by the next request, saving startup time),
but it's got to beat trying to write an HTML parser!
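
For readers who would rather stay in PHP, here is a minimal sketch of the same "use a real parser" idea. It swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) in place of HTML::TableExtract, and it assumes a current PHP version with that extension enabled. The function name extract_tables() is hypothetical, not something from this thread.

```php
<?php
// Hypothetical helper (the name is an assumption for illustration).
// Returns $tableArray[table][row][cell] with the trimmed text of each cell.
function extract_tables(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often invalid; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Rows that belong directly to a table, optionally via thead/tbody/tfoot,
    // so that rows of nested tables are not mixed into the outer table.
    $rowQuery = './tr | ./thead/tr | ./tbody/tr | ./tfoot/tr';

    $tableArray = [];
    foreach ($xpath->query('//table') as $table) {
        $rows = [];
        foreach ($xpath->query($rowQuery, $table) as $tr) {
            $cells = [];
            foreach ($xpath->query('./td | ./th', $tr) as $cell) {
                $cells[] = trim($cell->textContent);
            }
            $rows[] = $cells;
        }
        $tableArray[] = $rows;
    }
    return $tableArray;
}
```

With this sketch, every table (including nested ones) gets its own top-level entry in document order, and a cell that contains a nested table reports that table's text as part of its own text. Whether that is the right behaviour depends on what the data is needed for.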

--
Andy Hassall <an**@andyh.co.uk> / Space: disk usage analysis tool
<http://www.andyh.co.uk> / <http://www.andyhsoftware.co.uk/space>
Jul 17 '05 #6
In article <Q4********************@comcast.com>, Chung Leong wrote:
> Are there nested tables in the file? If there are, then the HTML will be
> rather difficult to parse.


There could be. But I actually don't foresee that being too much of a
problem: each time the script came across a new table, it would
recognize it. Whatever was in that table cell would then be another
table variable.

I know it sounds weird but, hey...things are weird.
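
The "nested table becomes another table variable" idea could be sketched roughly as follows. This is only an illustration, assuming PHP's DOM extension is available; the function names nearest_table(), table_to_array() and nested_tables() are made up for the example.

```php
<?php
// All names here are hypothetical helpers for illustration.

// Walk up from $node and return the closest enclosing <table>, if any.
function nearest_table(DOMNode $node): ?DOMNode
{
    for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
        if ($p->nodeName === 'table') {
            return $p;
        }
    }
    return null;
}

// Convert one table to rows of cells. A cell that holds nested tables
// becomes an array of those tables; any other cell becomes its text.
function table_to_array(DOMElement $table, DOMXPath $xpath): array
{
    $rows = [];
    $rowQuery = './tr | ./thead/tr | ./tbody/tr | ./tfoot/tr';
    foreach ($xpath->query($rowQuery, $table) as $tr) {
        $cells = [];
        foreach ($xpath->query('./td | ./th', $tr) as $cell) {
            $nested = [];
            foreach ($xpath->query('.//table', $cell) as $inner) {
                // Only tables one level down; deeper ones are handled
                // by the recursive call.
                if (nearest_table($inner)->isSameNode($table)) {
                    $nested[] = table_to_array($inner, $xpath);
                }
            }
            $cells[] = $nested ? $nested : trim($cell->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

function nested_tables(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    // Start from top-level tables only (no <table> ancestor).
    foreach ($xpath->query('//table[not(ancestor::table)]') as $table) {
        $result[] = table_to_array($table, $xpath);
    }
    return $result;
}
```

One design consequence: a cell is either a string or an array, so calling code has to check with is_array() before using it. Any text sitting next to a nested table in the same cell is dropped in this sketch.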

--
[ Sugapablo ]
[ http://www.sugapablo.com <--music ]
[ http://www.sugapablo.net <--personal ]
[ su*******@12jabber.com <--jabber IM ]
Jul 17 '05 #7
If you write the code that realizes it, then there's no problem. The problem
is writing the code that realizes it :-)

You can't write a single regular expression pattern that would extract the
data once tables are nested; that's why I said it's difficult to do.
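
To make "code that realizes it" concrete, here is one possible sketch. A non-greedy pattern such as '@<table\b[^>]*>(.*?)</table>@is' stops at the first </table>, which belongs to the inner table when tables are nested. Counting depth while scanning the open and close tags avoids that. The function name top_level_tables() is hypothetical, and the sketch assumes every <table> has an end tag and that no "<table" text appears inside comments or scripts.

```php
<?php
// Hypothetical helper: returns the full HTML of each top-level
// <table>...</table>, matching open and close tags by counting depth.
function top_level_tables(string $html): array
{
    // Find every <table ...> and </table> tag together with its byte offset.
    preg_match_all(
        '@<(/?)table\b[^>]*>@i',
        $html,
        $tags,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $tables = [];
    $depth = 0;
    $start = 0;
    foreach ($tags as $tag) {
        [$text, $offset] = $tag[0];
        $isClose = $tag[1][0] === '/';
        if (!$isClose) {
            if ($depth === 0) {
                $start = $offset;          // an outermost table begins here
            }
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
            if ($depth === 0) {            // the outermost table just closed
                $end = $offset + strlen($text);
                $tables[] = substr($html, $start, $end - $start);
            }
        }
    }
    return $tables;
}
```

The same depth counter could be applied again to the inside of each returned table to find its nested tables. A missing end tag would throw the count off, which is the other problem raised earlier in the thread.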

"Sugapablo" <ru********@sugapablo.com> wrote in message
news:sl***********************@dell.sugapablo.net. ..
> In article <Q4********************@comcast.com>, Chung Leong wrote:
> > Are there nested tables in the file? If there are, then the HTML will be
> > rather difficult to parse.
>
> There could be. But I actually don't foresee that being too much of a
> problem: each time the script came across a new table, it would
> recognize it. Whatever was in that table cell would then be another
> table variable.
>
> I know it sounds weird but, hey...things are weird.

--
[ Sugapablo ]
[ http://www.sugapablo.com <--music ]
[ http://www.sugapablo.net <--personal ]
[ su*******@12jabber.com <--jabber IM ]

Jul 17 '05 #8
