about preg_match_all statement

118 100+
Hi,

I wrote the code below to capture the images from a website when I submit its URL. In the same way, I want to capture the text information from the website. Please tell me what the code for that would be.


[php]

<?php

// $url is assumed to hold the URL submitted by the user.
$content = file_get_contents($url);

// One call is enough: $match[0] holds the full <img> tags, $match[3] the src values.
preg_match_all("/<img(.*)src=(\"|')(.*)(\"|')(.*)[\/]?>/siU", $content, $match, PREG_PATTERN_ORDER);

echo "<b>Capture Images :</b><br>";
echo "<br>";
print_r($match[0]);
echo "<br>";
echo "<br>";
echo "<b>Capture Images URLS :</b><br><br>";
print_r($match[3]);
?>
[/php]
Jul 27 '08 #1
pbmods
5,821 Expert 4TB
Heya, Swethak.

What is your code doing now that is different from what you want it to do?
Jul 27 '08 #2
swethak
118 100+
It captures the images from a website. I want to capture the text information from the website instead.
Jul 28 '08 #3
Gulzor
27
If you are working with PHP5, you can use the DOM API for that.

Adapt this to your needs :
[php]
<?php
$htmlString = file_get_contents('url_or_path_to_html_file');

// loadHTML() is an instance method, so create the document first.
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($htmlString);
$xpath = new DOMXPath($htmlDoc);

/* fetch the content of all <p> tags */
$pNodesList = $xpath->query('//p');
for ($i = 0; $i < $pNodesList->length; $i++) {
    $pNode = $pNodesList->item($i);
    echo $pNode->nodeValue, "\n";
}

?>
[/php]

It may not be the best method, but I prefer handling HTML documents with the DOM API instead of knocking my head on the walls with regex :P
Jul 28 '08 #4
swethak
118 100+
I used it that way and got the errors below. Please tell me what the mistake is.

Warning: DOMDocument::loadHTML() [function.DOMDocument-loadHTML]: htmlParseEntityRef: expecting ';' in Entity, line: 34 in C:\wamp\www\test\textdata.php on line 3

Warning: DOMDocument::loadHTML() [function.DOMDocument-loadHTML]: htmlParseEntityRef: expecting ';' in Entity, line: 34 in C:\wamp\www\test\textdata.php on line 3

Warning: DOMDocument::loadHTML() [function.DOMDocument-loadHTML]: htmlParseEntityRef: expecting ';' in Entity, line: 34 in C:\wamp\www\test\textdata.php on line 3

Warning: DOMDocument::loadHTML() [function.DOMDocument-loadHTML]: htmlParseEntityRef: expecting ';' in Entity, line: 34 in C:\wamp\www\test\textdata.php on line 3
Jul 28 '08 #5
Gulzor
27
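These are "just" warnings caused by malformed or unsupported HTML entities. The message "htmlParseEntityRef: expecting ';'" points to an & in the page that starts an entity without a closing ';', typically a bare & that should have been written as &amp;. Real-world HTML rarely parses without such warnings, and the document is still loaded despite them.

If your text is not between <p></p>, you can replace //p with //td. Like I said, you need to adapt it to your needs.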
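
For what it's worth, the warnings can be kept off the page by asking libxml to collect them internally before calling loadHTML(). The following is only a minimal sketch: $htmlString is a made-up sample containing a bare &, not the real page, and the $texts array is just an illustrative name. It also shows the //td variant of the query.

```php
<?php
// Sketch only: made-up sample HTML with a bare "&" in it.
$htmlString = '<html><body><table><tr><td>Fish & chips</td></tr></table>'
            . '<p>Some paragraph</p></body></html>';

// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($htmlString);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($htmlDoc);

// Same idea as the <p> loop, but querying <td> instead.
$texts = array();
foreach ($xpath->query('//td') as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

If the warnings themselves are of interest, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.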
Jul 28 '08 #6
swethak
118 100+
I want a condition so that the data is shown if it is between <p> tags, and no error is given otherwise. How do I write that condition? Please help me.
Jul 28 '08 #7
Gulzor
27
I don't understand what your problem is now... <p> tags are not the only ones that hold text: <li>, <td>, <span> and others do too.
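
To collect text from several kinds of tags in one pass, the XPath union operator (|) can be used. This is just a sketch under assumptions: the sample HTML is made up, and the tag list (p, li, td, span) is only a starting point to adapt to the page in question.

```php
<?php
// Sketch only: made-up sample HTML; the tag list is an assumption to adapt.
$htmlString = '<html><body><p>Intro</p><ul><li>First</li><li>Second</li></ul>'
            . '<table><tr><td>Cell</td></tr></table><div><span>Note</span></div></body></html>';

$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($htmlString);
$xpath = new DOMXPath($htmlDoc);

// "|" is the XPath union operator: the result holds every node
// that matches any of the listed paths.
$nodes = $xpath->query('//p | //li | //td | //span');

$texts = array();
foreach ($nodes as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        // Skip elements that contain no text at all.
        $texts[] = $text;
    }
}

print_r($texts);
```

Note that nested elements (for example a <span> inside a <p>) would have their text reported twice with this query, once for each matching element.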
Jul 28 '08 #8
mobs
1
Say that I just wanted to retrieve the number 30735 from the following code, how would you go about doing that?

[code]
<a href="/?item=30735">River Runner</a>
[/code]
Aug 6 '08 #9
pbmods
5,821 Expert 4TB
Heya, Mobs. Welcome to Bytes!

The only part that we really care about is:
[code]
<a href="/?item=30735
[/code]
Now, we have to make a couple of assumptions:
  • The URL might have a path and/or other query variables prepended. E.g.:
    [code]
    <a href="/path/to/some.php?file=test&item=123456"
    [/code]
  • The URL might have some stuff after it. E.g.:
    [code]
    <a href="/?item=654321&amp;visitor=1"
    [/code]
  • The anchor tag might have attributes before the href attribute. E.g.:
    [code]
    <a target="_blank" href="/?item=13579"
    [/code]

We are going to assume that the tag is well-formed (ends with a '>' and the href attribute is properly-quoted with any quotes inside of it percent- or ampersand-escaped).

With that in mind, we need to be able to skip over anything we don't care about and focus only on what we want:

[code]
/<a[^>]*href="[^"]+item=(\d+)/
[/code]
This should be enough to harvest item IDs from anchor tags on the page.
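
As a quick sketch of how the pattern above could be used from PHP: the sample markup below is assembled from the example anchors in this post (the link texts A, B and C are made up), and $itemIds is just an illustrative variable name.

```php
<?php
// Sample markup built from the anchors discussed above.
$html = '<a href="/?item=30735">River Runner</a>'
      . '<a href="/path/to/some.php?file=test&item=123456">A</a>'
      . '<a href="/?item=654321&amp;visitor=1">B</a>'
      . '<a target="_blank" href="/?item=13579">C</a>';

preg_match_all('/<a[^>]*href="[^"]+item=(\d+)/', $html, $matches);

// $matches[1] holds the contents of the first capture group for every match.
$itemIds = $matches[1];

print_r($itemIds);
```

With the default PREG_PATTERN_ORDER, $matches[0] holds the full matched text and $matches[1] only the digits, so the IDs come back as strings.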
Aug 6 '08 #10
swethak
118 100+
Hi,

I wrote code to capture all the information between <p> tags, but there are also some <img> tags between the <p> tags. I want a condition that captures all the information between the <p> tags but leaves out the <img> tags. How do I write that condition? Please help me.

[php]
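<?php
$content = file_get_contents('http://www.website.com');

// [^>]* also matches a bare <p>, and the lazy (.*?) stops at the first </p>
// instead of running on to the last one in the page.
// $match[0] holds the whole paragraphs, $match[1] only their inner HTML.
preg_match_all('/<p\b[^>]*>(.*?)<\/p>/si', $content, $match, PREG_PATTERN_ORDER);

echo "<b>Captured Paragraphs :</b><br>";
echo "<br>";
print_r($match[0]);
?>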
[/php]


In that preg_match_all() call, how do I add a condition so that the image tags are not captured? Could anybody please reply?
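
One possible way to leave the image tags out, offered only as a rough sketch: capture the inner HTML of each paragraph first, then remove the <img> tags from it with preg_replace(). The sample markup, the pattern for <p> and the names $paragraphs and $inner are all assumptions made for illustration, not something taken from the real page.

```php
<?php
// Sketch only: made-up sample markup standing in for the downloaded page.
$content = '<p class="intro">Hello <img src="a.png" alt="x"> world</p>'
         . '<p>Second <b>paragraph</b><img src="b.jpg" /></p>';

// Capture the inner HTML of every <p>...</p> pair
// (lazy, so each pair is matched separately).
preg_match_all('/<p\b[^>]*>(.*?)<\/p>/si', $content, $match, PREG_PATTERN_ORDER);

$paragraphs = array();
foreach ($match[1] as $inner) {
    // Drop the <img> tags but keep the rest of the paragraph.
    $paragraphs[] = preg_replace('/<img\b[^>]*>/i', '', $inner);
}

print_r($paragraphs);
```

If only plain text is wanted, strip_tags($inner) would remove every tag in the paragraph rather than just the images.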
Aug 7 '08 #11
