PHP4 : Extract text from HTML file

Hi,

I would like to extract the text from an HTML file.
For the moment, I'm trying to get all the text between <td> and </td>. I
used a regular expression because I don't know the format between
<td> and </td>.

It can be:
<td> text1 </td>
or
<td>
text1
</td>
or anything else

eregi("<td(.*)>(.*)(</td>?)",$text,$regtext);

The problem is that, if I have
<td> text</td>
<td>text2</td>

$regtext will contain text</td><td>text2.

How can I change the expression so that it stops at the first occurrence
of </td>?

Thanks

Jul 5 '06 #1
9 Replies


Hi.

Not sure, but I think this is what you want.
http://fi.php.net/manual/en/ref.dom.php
These functions should be able to extract the text from any tag!

Sorry if I'm wrong.

Jul 5 '06 #2

Of course, I was wrong. Didn't notice that you were using PHP4.
Take a look at http://fi.php.net/manual/en/ref.domxml.php instead.

Jul 5 '06 #3

It looks like these functions are meant for XML files. Can they still be
used for HTML files?

Jul 5 '06 #4

tr********@gmail.com wrote:
> It looks like these functions are used for XML files, can it still be
> used for html files?

That should be what they're for... Try it!
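
For anyone trying this on a newer install: a minimal sketch, assuming PHP 5 or later, where the DOM extension can load ordinary HTML and not only well-formed XML. This is not the PHP 4 domxml API, and the sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample input; note the unclosed <br>, which is not valid XML.
$html = "<table><tr><td> text</td>\n<td>text2<br></td></tr></table>";

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// textContent gives the text of a cell with any nested markup removed.
$cells = array();
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

Each entry of $cells is the trimmed text of one table cell, in document order.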

Jul 5 '06 #5

If that's all you want to change, you can add a '?' (minimal match)
after the '.*' within your regexp. By default, the '*' operator is
"greedy" (that is, it matches as much data as possible). If you replace
it with '.*?', it will match the minimum amount of text that satisfies
the pattern. Note that '*?' is Perl-compatible syntax that POSIX
functions such as eregi() do not define, so use preg_match() for this.
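
A minimal sketch of the difference, assuming the preg_* (PCRE) functions are available. The sample input is the two-cell case from the question; the [^>]* in the opening tag is my own simplification so that only the cell body is affected by greediness.

```php
<?php
// Two adjacent cells, as in the question.
$text = "<td> text</td>\n<td>text2</td>";

// Greedy: (.*) runs on to the LAST </td>. The s modifier lets '.' match newlines.
preg_match('/<td[^>]*>(.*)<\/td>/is', $text, $greedy);

// Lazy: (.*?) stops at the FIRST </td>.
preg_match('/<td[^>]*>(.*?)<\/td>/is', $text, $lazy);

var_dump($greedy[1]); // everything from the first cell up to the last </td>
var_dump($lazy[1]);   // the first cell's content only
```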

If you want heavier-duty HTML parsing, you're probably better off looking
for a library rather than trying to do it all by hand, as the other
poster suggested.

Tim
Jul 5 '06 #6

Thanks for your advice :D Well the 'ungreedy' solution worked for the
moment ;)
I will try the library later :)
Jul 5 '06 #7

Unfortunately, the document must be a valid XML file. Given how most
webmasters write their pages, that is an idealistic case.

Jul 6 '06 #8

tr********@gmail.com wrote:
> eregi("<td(.*)>(.*)(</td>?)",$text,$regtext);
>
> The problem is that, if I have
> <td> text</td>
> <td>text2</td>
>
> regtext will return text</td><td>text2.
>
> How can I change the expression so that it stops at the first occurrence
> of </td>?

The cause of the problem is that the regex is greedy (i.e., matches as
much as possible given the constraints of the expression). The simplest
solution, if you are sure that the table cell contents will have no
other markup, is to change the regex to "<td[^>]*>([^<]*)</td>". This
specifies that no open angle bracket can exist between the td and /td.
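
A minimal sketch of that negated-class pattern, assuming preg_match_all and '/' delimiters (the regex above was given without delimiters). The third cell in the sample input is my own addition to show what happens when a cell does contain markup.

```php
<?php
// The third cell contains nested markup, so [^<]* cannot span it.
$text = "<td> text</td>\n<td>text2</td>\n<td><b>bold</b></td>";

// [^<]* matches any run of characters that are not '<',
// so each match has to end at the nearest </td>.
preg_match_all('/<td[^>]*>([^<]*)<\/td>/i', $text, $matches);

// $matches[1] holds the captured cell contents.
print_r($matches[1]);
```

Only the first two cells are captured; the cell with the nested <b> tag is skipped entirely, which is the limitation described above.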

If you can't be sure of that, I'd suggest something like this:

preg_match('/<td[^>]*>(.*)<\/td>/imsU', $text, $regtext);

The modifiers in this regex specify that it should be non-greedy, case
insensitive, and regard newlines as not special. It only returns
information about the first <td></td>; if you want to get them all,
preg_match_all will do the trick with the same regex. (Tested on version
4.1.2.)
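
A minimal sketch of the preg_match_all variant with the same regex. The sample input is made up to exercise the modifiers, and the strip_tags/trim clean-up at the end is my own addition, not part of the suggestion above.

```php
<?php
// Upper-case tag with an attribute, content spread over several lines,
// and a second cell with nested markup.
$text = "<TD class=\"a\">\ntext1\n</TD>\n<td><b>text2</b></td>";

// i: case insensitive, s: '.' also matches newlines, U: quantifiers are
// non-greedy by default, m: only affects ^ and $, which are not used here.
preg_match_all('/<td[^>]*>(.*)<\/td>/imsU', $text, $regtext);

// $regtext[1] holds the raw cell contents. Removing nested tags and
// surrounding whitespace is an extra step added for this example.
$cells = array_map('trim', array_map('strip_tags', $regtext[1]));

print_r($cells);
```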

HTH,
Gertjan.
--
Gertjan Klein <gk****@xs4all.nl>
Jul 6 '06 #9

tr********@gmail.com wrote:
> Hi,
>
> I would like to extract the text in an HTML file
> For the moment, I'm trying to get all text between <td> and </td>. I
> used a regular expression because i don't know the "format between
> <td> and </td>
[snip]

By the way, please don't waste people's time by multi-posting. If you
think this question is appropriate both to here and comp.programming [1]
then please cross-post it so that people in both groups can see the
responses and can avoid spending time answering a question that's
already been answered in the other group.

Tim

[1] In my opinion, it isn't
Jul 6 '06 #10
