geturl.php
Too much code to paste here, but have a look at http://www.liarsscourge.com/
So far, I have not found a string that can break this...
Any built-in functions or suggestions for improvement?
Thanks in advance.
On Feb 10, 8:41 pm, "deko" <d...@nospam.com> wrote:
> geturl.php
> Too much code to paste here, but have a look at http://www.liarsscourge.com/
> So far, I have not found a string that can break this...
> Any built-in functions or suggestions for improvement?
> Thanks in advance.
I don't want to sound negative here, but what exactly is the point?
I ask because I see no reason why you can't extract the URL with a
single regex and then use another one to validate it. Validation is
simple enough; the question is how strict you want it to be. Do you
need to check against each TLD, or do you just need to match a pattern?
In either case, two or three regexes at most should be able to do what
you are after. With a little extra crafting, you should be able to
extract multiple URLs in one go.
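To make the "one regex to extract, another to validate" idea concrete, here is a minimal sketch. It is not the code from geturl.php: the function name extractUrls, both patterns, and the sample string are assumptions made for illustration, and it assumes PHP 7 or later and only handles http(s).

```php
<?php
// Hypothetical helper, not the code from geturl.php.
function extractUrls(string $text): array
{
    // Step 1 (extract): grab every candidate that starts with http:// or
    // https:// and runs up to whitespace, a quote, or an angle bracket.
    preg_match_all('~https?://[^\s"\'<>]+~i', $text, $matches);

    // Step 2 (validate): keep candidates whose host is a run of dotted labels
    // ending in an alphabetic label of two or more letters (no TLD list).
    $valid = [];
    foreach ($matches[0] as $candidate) {
        if (preg_match('~^https?://(?:[a-z0-9-]+\.)+[a-z]{2,}(?:[/?#:]|$)~i', $candidate)) {
            $valid[] = $candidate;
        }
    }
    return $valid;
}

// Sample input (assumed): an anchor tag, a bare email address, a second URL.
$sample = 'registered NYSE 943 <a href="http://netforex.net">Forex Trading Network '
        . 'Organization</a> info@netforex.org and https://example.com/page';

print_r(extractUrls($sample));
```

Because the extraction pattern requires a scheme, the email address is never picked up as a candidate, and preg_match_all returns every URL in the string in one pass.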
deko wrote:
> geturl.php
> Too much code to paste here, but have a look at http://www.liarsscourge.com/
> So far, I have not found a string that can break this...
> Any built-in functions or suggestions for improvement?
1. Increase the error_reporting level and you will find some sloppy notices
2. Have a look at parse_url(), which might be useful
3. Use preg_* functions instead of POSIX ereg* function (performance)
4. Strings like the following cause infinite loops:
getURL('fofo http://discovery.co.uk/../foo');
Probable fix:
= Replace:
if (!eregi("^(com|net|org...)$", $urlString_a[$i])) {
    ...
}
= With:
if (preg_match("!^(com|net|org...)[^$]!", $urlString_a[$i], $m)) {
    $urlString_a[$i] = $m[1];
}
JW
Outstanding. Thanks for the constructive feedback.
geturl 2.0...
> 1. Increase the error_reporting level and you will find some sloppy notices
So I have a few undefined variables... I thought this did not matter with PHP
(script vs. compiled code).
> 2. Have a look at parse_url(), which might be useful
Well, if urlString = http://netforex.subdomain.net.foo'>, parse_url() returns:
netforex.subdomain.net.foo'- which is not much help. Nevertheless, I'm using
it on line 79 for validation, but I think it would be better validation if I
could do this:
if ($urlLink = parse_url($urlLink))
{
    return $urlLink['host'];
}
??
> 3. Use preg_* functions instead of POSIX ereg* function (performance)
I was able to replace an eregi with preg_match on lines 18 and 29, but I am
unsure of the syntax for line 44... suggestions?
> 4. Strings like the following cause infinite loops:
> getURL('fofo http://discovery.co.uk/../foo');
Fixed! See line 38.
- - - - -
I'm wondering how much of a hack that for loop is (lines 26 - 53). Seems to
work okay... http://www.liarsscourge.com <<== the script is here
deko wrote:
> so I have a few undefined variables... I thought this did not matter
> with PHP (script vs. compiled code)
It doesn't, but it's just good practice to do so (and increasing the
error_reporting level saves you a lot of time debugging when making typos in
variable names).
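As a rough illustration of the typo point, the sketch below raises the reporting level and reads a misspelled variable. The variable names and the collecting error handler are assumptions for the demo (the handler is only there so the message can be inspected instead of printed); PHP 7 or later is assumed.

```php
<?php
// Report every class of diagnostic, including those for undefined variables.
error_reporting(E_ALL);

// Illustration only: collect messages in an array instead of printing them.
$messages = [];
set_error_handler(function (int $level, string $message) use (&$messages): bool {
    if (!(error_reporting() & $level)) {
        return false; // this level is switched off; leave it to PHP
    }
    $messages[] = $message;
    return true; // handled, so PHP's own output is skipped
});

$urlString = 'http://www.example.com/';
$copy = $urlStrnig; // typo: $urlStrnig was never defined

restore_error_handler();

print_r($messages); // one "Undefined variable" message naming urlStrnig
var_dump($copy);    // NULL
```

Without the raised level, the script would carry on silently with $copy set to null, which is the kind of bug that is tedious to track down later.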
> well, if urlString = http://netforex.subdomain.net.foo'>,
> parse_url() returns: netforex.subdomain.net.foo'- which is not much
The parse_url function can indeed easily be fooled, so you will have to add
some additional validation. But it's quite useful for what you are doing
(and pretty fast).
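One way to read "parse_url plus additional validation" is sketched below. The helper name hostFromUrl and the hostname pattern are assumptions, not part of geturl.php, and the pattern is deliberately strict (letters, digits, hyphens and dots only). PHP 7.1 or later is assumed for the nullable return type.

```php
<?php
// Hypothetical helper: returns the host if it looks like a plain dotted
// hostname, or null otherwise.
function hostFromUrl(string $url): ?string
{
    // parse_url() only splits the string; it does not validate anything.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host component, or the string could not be parsed
    }

    // Additional validation: labels of letters, digits and hyphens,
    // separated by dots, ending in an alphabetic label of 2+ letters.
    if (!preg_match('/^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$/iD', $host)) {
        return null;
    }
    return strtolower($host);
}

var_dump(hostFromUrl('http://www.example.com/index.php?q=1'));
var_dump(hostFromUrl("http://netforex.subdomain.net.foo'>"));
```

The second call is rejected by the pattern because of the trailing quote and angle bracket, so the stray characters never reach the caller.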
> I was able to replace an eregi with preg_match on lines 18 and 29,
> but unsure of the syntax for line 44 ... suggestions?
You mean the following line?
if (eregi("^(com|net|org...)$", $urlString_a[$i]))
You can simply replace this with:
if (preg_match("!^(com|net|org...)$!i", $urlString_a[$i]))
JW
>> so I have a few undefined variables... I thought this did not matter
>> with PHP (script vs. compiled code)
> It doesn't, but it's just good practice to do so (and increasing the
> error_reporting level saves you a lot of time debugging when making typos in
> variable names).
Understood. Adjusting error_reporting is helpful for debugging and
optimization.
>> well, if urlString = http://netforex.subdomain.net.foo'>, parse_url()
>> returns: netforex.subdomain.net.foo'- which is not much
> The parse_url function can indeed easily be fooled, so you will have to add
> some additional validation. But it's quite useful for what you are doing (and
> pretty fast).
I'm not sure where parse_url is useful, precisely because it is so easily
fooled. The strings I'm working with are entirely unpredictable. That's why I
explode the candidate URL into domains and inspect each one in a for loop.
>> I was able to replace an eregi with preg_match on lines 18 and 29, but
>> unsure of the syntax for line 44 ... suggestions?
> You mean the following line?
> if (eregi("^(com|net|org...)$", $urlString_a[$i]))
> You can simply replace this with:
> if (preg_match("!^(com|net|org...)$!i", $urlString_a[$i]))
Thanks, but I'm still baffled by that syntax: why the second negation '!' after
the first '$'? I understand that '$' means end of line, and 'i' means case
insensitive... is '$!' saying "not ending with"?
I still want to replace the eregi code (lines 44, 51, 55, 85) with preg_match,
but the preg_match syntax seems counterintuitive... I just have not figured it
out yet. What's up with delimiters?
http://www.liarsscourge.com <<== latest code here
Thanks again for the help!
http://www.liarsscourge.com/ <<== this is better
Known bug: if an email address appears in the test string before a valid URL,
the script will not find the URL.
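On the delimiter question: with the preg_* functions the pattern is wrapped in a pair of identical delimiter characters, and any modifiers follow the closing one. So in "!^(com|net|org...)$!i" the two '!' characters are delimiters, not negation, and the trailing 'i' is the case-insensitive modifier. The sketch below shows the same pattern with three different delimiters. The three-entry TLD list is an illustrative stand-in, since the thread elides the full list with "...".

```php
<?php
// Illustrative three-entry list; the thread elides the full one with "...".
$tldPattern = '(com|net|org)';

// The same regex wrapped in three different delimiters. The "i" after the
// closing delimiter is the case-insensitive modifier.
$withBang  = '!^' . $tldPattern . '$!i';
$withSlash = '/^' . $tldPattern . '$/i';
$withHash  = '#^' . $tldPattern . '$#i';

foreach (['com', 'ORG', 'common', 'foo'] as $label) {
    printf(
        "%-7s bang=%d slash=%d hash=%d\n",
        $label,
        preg_match($withBang, $label),
        preg_match($withSlash, $label),
        preg_match($withHash, $label)
    );
}
```

All three forms behave identically; '!' is simply convenient when the pattern itself contains slashes, as URL patterns usually do.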
deko wrote:
> http://www.liarsscourge.com/ <<== this is better
> known bug: if an email address appears in the test string before a valid
> URL, the script will not find the URL
Hard coding TLDs is generally not useful, as you never know when
unexpected ones may be put in use. Plus, you do not allow for schemes
other than http(s).
You do not support valid URLs that include credentials in the authority
part:
<http://bob1234z_.sss:li*****@example.com/foo/bar baz/index.bak.php?q=my&q2=query#frag2>
This is indeed a valid URL, but your algorithm fails. It's far more
useful to use a single regex, anticipate any scheme (look at Wikipedia
or search some RFCs for valid URI format), and any TLD.
parse_url is not meant to be used for validation, as stated in the PHP
docs themselves.
Here is an example using a regex I wrote recently:
<?php
$re = '%
    ( [\w.+-]+ : (?://)? )      # scheme name
    ( [^/]+ )                   # authority, domain
    ( / [^?]+ )?                # path, if exists
    # query and fragment, which may or may not exist
    (?:
        \\?                     # query initializer
        ( [^#]+ )               # grab query
        (?: \\# ([\w-]+) )?     # fragment, if exists
    )?
%x';
$s = 'Welcome.to{"http://user:pa**@example.com/foo bar".-/index.bak.php?q=query&r=arrr#frag2_2borky borked!';
if (preg_match($re, $s, $m)) {
    // Escape everything that is echoed into the page.
    echo '<p>Original: <code>' . htmlentities($s, ENT_QUOTES) . '</code></p>';
    echo '<p>Extrapolation: <a href="'
        . htmlentities($m[0], ENT_QUOTES) . '">'
        . htmlentities($m[1] . $m[2], ENT_QUOTES)
        . '</a> (full URI in link, see status bar).</p>';
}
else {
    echo 'Not a valid URI.';
}
?>
Curtis