URLs and preg_match (again)

OK, I found a thread that helped out from a while back (Oct 9, 2002) to
give me this pattern:

`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i

OK, all is well and good with this until the URL is used at the end of a
sentence. I am assuming that I will need a negative lookahead somehow,
but I just can't wrap my mind around this one...

Want to match only the URL...

$string="... http://www.example.com.";
$string="... http://www.example.com/.";
$string="... http://www.example.com/page1.html?";
$string="... http://www.example.com/info.php?id=4!";
etc...

The pattern above is pulling the last character from each string when I
don't want it. Unfortunately, the URL can be _anything_ valid, and I
don't have control of how it will be input. Can anyone help with this?
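
For anyone wanting to reproduce this, here is a minimal sketch. It runs the pattern above over the four sample strings. `first_match` is a made-up helper name, and the first sample is assumed to end in "example.com." with no space before the period.

```php
<?php
// The pattern from the post, unchanged. Backticks are the delimiters.
$pattern = '`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i';

// Hypothetical helper: return the first match of $pattern in $text, or null.
function first_match(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$samples = [
    '... http://www.example.com.',
    '... http://www.example.com/.',
    '... http://www.example.com/page1.html?',
    '... http://www.example.com/info.php?id=4!',
];

foreach ($samples as $s) {
    // Each match runs to the end of the sample, trailing punctuation included.
    var_dump(first_match($pattern, $s));
}
```

Nothing in the pattern knows where the sentence ends, so the final ".", "?" or "!" is treated as part of the host or path.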

TIA

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #1
Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>...
> OK, I found a thread that helped out from a while back (Oct 9, 2002) to
> give me this pattern:
>
> `(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i
>
> OK, all is well and good with this until the URL is used at the end of a
> sentence. I am assuming that I will need a negative lookahead somehow,
> but I just can't wrap my mind around this one...
>
> Want to match only the URL...
>
> $string="... http://www.example.com.";
> $string="... http://www.example.com/.";
> $string="... http://www.example.com/page1.html?";
> $string="... http://www.example.com/info.php?id=4!";
> etc...
>
> The pattern above is pulling the last character from each string when I
> don't want it. Unfortunately, the URL can be _anything_ valid, and I
> don't have control of how it will be input. Can anyone help with this?

It is somewhat shocking to see that even experts like Justin get lost in
regular expressions.

I must admit that I'm still poor at regular expressions even
though I use two good tools: <http://www.weitz.de/regex-coach/> and
<http://laurent.riesterer.free.fr/regexp/>

I think this will work fine for your requirement: (stolen from
<http://groups.google.com/groups?selm=4d19834f.0303070418.506bb9a5%40posting.google.com>
;-) )

(((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #2
Justin Koivisto wrote:
> OK, I found a thread that helped out from a while back (Oct 9, 2002)

If it was http://groups.google.com/groups?th=a48518c6e18574d9, I
caution you against blindly following advice from that thread. It's
clear from Jeff Donnici's original article he wanted to *match* URIs,
not *parse* them. Why on earth this curiosity was put forward I don't
know (note, in particular, the delimiter appears unescaped in the
pattern proper -- a sure sign of insufficient testing):

| /((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?/

You evidently noticed and mended the delimiter:

> `(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i
>
> OK, all is well and good with this

Consider the string: "http:// Lorem ipsum ... est laborum".

Negative character classes match every character not in the class. In
this case, that includes whitespace characters, which aren't allowed
in URIs.
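
A quick way to check that claim, as a sketch: the pattern is the one quoted above and the subject is the example string.

```php
<?php
// The pattern under discussion, with backtick delimiters.
$pattern = '`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i';
$text    = 'http:// Lorem ipsum ... est laborum';

preg_match($pattern, $text, $m);

// [^:\/] excludes only ":" and "/", so spaces and full stops are consumed
// as well and the "URI" swallows the whole sentence.
var_dump($m[0] === $text);
```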
> until the URL is used at the end of a sentence.

It's not just sentence terminators that rattle that regular
expression: intra-sentence spacing, words and punctuation all wreak
havoc too. It might be OK for *parsing* URIs [1], I don't know, I
didn't examine it and I'm not conversant with FTP URI syntax; but its
URI *matching* rates very poorly.

> I am assuming that I will need a negative lookahead somehow,
> but I just can't wrap my mind around this one...
>
> Want to match only the URL...

That depends on the URL, obviously; regular expressions, although
powerful in some senses, afford no mind-reading capabilities.

> $string="... http://www.example.com.";

Any URI parser would recognise <http://www.example.com.> as a URI: an
HTTP URI with a complete, or absolute, domain name of
"www.example.com." (including the final period; the root label).

> $string="... http://www.example.com/.";

Again, an HTTP URI, this time with a path segment of ".". The final
period does not have any special meaning here; it's simply a path
segment.

> $string="... http://www.example.com/page1.html?";

Another HTTP URI, but with a path segment of "page1.html" and an empty
query component.

> $string="... http://www.example.com/info.php?id=4!";

Yet another HTTP URI, this time with a path segment of "info.php" and
a query component of "id=4!".

> The pattern above is pulling the last character from each string when I
> don't want it.

So you know what the URIs are beforehand, right?

> Unfortunately, the URL can be _anything_ valid, and I don't have control
> of how it will be input.

I'm not sure what you mean.

> Can anyone help with this?


We'd need more information to offer any help. You might be interested
in Appendix E of RFC2396, which discusses recommended ways to delimit
URIs. I'd like to say, in passing, that it makes no mention of the
increasing use of parentheses ("(" and ")") to delimit URIs; the right
parenthesis is allowed in a path segment, so the URI

http://www.php.net/manual/en/)

results in a 404 (at least it did at the time I wrote this,
20040414T0703Z).

As an example of how involved URI *matching* can be, here's a regular
expression (PCRE) to match HTTP URIs (please excuse any line wrap):

`(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?`i

Splitting it into more manageable chunks:

$scheme = '(?:(?i)http)';
$domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address = '(?:\d+\.\d+\.\d+\.\d+)';
$host = "(?:$hostname|$ipv4address)";
$port = '(?::\d*)';
$pchar = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param = "$pchar*";
$segment = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath = "(?:/$path_segments)";
$query = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
$pattern = "`$http_uri`i";

preg_match_all($pattern, $subject, $matches);

You might care to omit the internal case-insensitivity option, (?i), affecting
the scheme name. Scheme names should be lowercase, but, "[f]or
resiliency, programs interpreting URI should treat upper case letters
as equivalent to lower case" (RFC2396, sec. 3.1). Technically, a URI
with an uppercase scheme name isn't an absolute URI.
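
As a rough check of what the assembled pattern does with the four sample strings from the original post, here is a sketch. It repeats the sub-patterns so that it runs on its own, and it assumes the first sample ends in "example.com." with no space before the period.

```php
<?php
// Sub-patterns as posted, repeated here so the block is self-contained.
$scheme        = '(?:(?i)http)';
$domainlabel   = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel      = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname      = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address   = '(?:\d+\.\d+\.\d+\.\d+)';
$host          = "(?:$hostname|$ipv4address)";
$port          = '(?::\d*)';
$pchar         = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param         = "$pchar*";
$segment       = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath       = "(?:/$path_segments)";
$query         = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
$http_uri      = "$scheme://$host$port?(?:$abspath$query?)?";
$pattern       = "`$http_uri`i";

$samples = [
    '... http://www.example.com.',
    '... http://www.example.com/.',
    '... http://www.example.com/page1.html?',
    '... http://www.example.com/info.php?id=4!',
];

foreach ($samples as $s) {
    // The trailing ".", "?" and "!" are all legal URI characters in these
    // positions, so the pattern keeps them in the match.
    if (preg_match($pattern, $s, $m) === 1) {
        echo $m[0], "\n";
    }
}
```

In other words, a strictly RFC-based matcher treats the trailing punctuation as part of the URI. Deciding that it belongs to the sentence instead is a separate heuristic.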

Refs.:

RFC2396, "Uniform Resource Identifiers (URI): Generic Syntax",
http://www.ietf.org/rfc/rfc2396.txt

RFC2616, "Hypertext Transfer Protocol -- HTTP/1.1", section 3.2,
http://www.ietf.org/rfc/rfc2616.txt

RFC1738, "Uniform Resource Locators (URL)", section 3.2,
http://www.ietf.org/rfc/rfc1738.txt

[1] You'll have read the example POSIX regular expression in RFC2396
for parsing URI references. From Appendix B:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

http://www.ietf.org/rfc/rfc2396.txt
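
To illustrate *parsing* as opposed to *matching*, here is a sketch that feeds the Appendix B expression to preg_match(). The helper name `split_uri_reference` is invented for the example, and the helper reports missing and empty components alike as ''.

```php
<?php
// RFC 2396, Appendix B. Backtick delimiters avoid clashing with "/" and "#".
$uri_ref = '`^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?`';

// Hypothetical helper: split a URI reference into its five components.
function split_uri_reference(string $pattern, string $uri): array
{
    preg_match($pattern, $uri, $m);
    return [
        'scheme'    => $m[2] ?? '',
        'authority' => $m[4] ?? '',
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? '',
        'fragment'  => $m[9] ?? '',
    ];
}

// The query component comes back as "id=4!": the "!" is part of the URI.
print_r(split_uri_reference($uri_ref, 'http://www.example.com/info.php?id=4!'));
```

The expression assumes it is handed a URI reference and nothing else, which is why it is of little use for finding URIs inside running text.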

--
Jock
Jul 17 '05 #3
John Dunlop <us*********@john.dunlop.name> wrote in message news:<MP************************@News.Individual.NET>...
> Justin Koivisto wrote:

<snip>

> As an example of how involved URI *matching* can be, here's a regular
> expression (PCRE) to match HTTP URIs (please excuse any line wrap):
>
> `(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?`i

<snip>

WOW!! I wonder how you could swap whole RFCs into your brain...

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #4
R. Rajesh Jeba Anbiah wrote:
> Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>...
>> OK, I found a thread that helped out from a while back (Oct 9, 2002) to
>> give me this pattern:
>>
>> `(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i
>>
>> OK, all is well and good with this until the URL is used at the end of a
>> sentence. I am assuming that I will need a negative lookahead somehow,
>> but I just can't wrap my mind around this one...
>>
>> Want to match only the URL...
>>
>> $string="... http://www.example.com.";
>> $string="... http://www.example.com/.";
>> $string="... http://www.example.com/page1.html?";
>> $string="... http://www.example.com/info.php?id=4!";
>> etc...
>>
>> The pattern above is pulling the last character from each string when I
>> don't want it. Unfortunately, the URL can be _anything_ valid, and I
>> don't have control of how it will be input. Can anyone help with this?
>
> It is somewhat shocking to see that even experts like Justin get lost in
> regular expressions.

I'm considered an expert? THANKS FOR THE COMPLIMENT! ;) I'm OK with
regex, but I have only started really using perl regex in the last year,
so I have a way to go to learn about it.

> I must admit that I'm still poor at regular expressions even
> though I use two good tools: <http://www.weitz.de/regex-coach/> and
> <http://laurent.riesterer.free.fr/regexp/>

Heh, I think I will have to check those out... thanks for the links.

> I think this will work fine for your requirement: (stolen from
> <http://groups.google.com/groups?selm=4d19834f.0303070418.506bb9a5%40posting.google.com>
> ;-) )
>
> (((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)


hmm... Never even thought of looking in a VB newsgroup...

I've pasted it in, and it worked for the 2 available tests I had in
place. I'll let you know if there are any problems with it, and post
whatever I can come up with for fixes.

--
Justin Koivisto - sp**@koivi.com
Jul 17 '05 #5
John Dunlop wrote:
> We'd need more information to offer any help. You might be interested
> in Appendix E of RFC2396

Basically, what I am doing is making hyperlinks out of urls typed into a
text area. So I am trying to match urls, but need to parse for
punctuation after them.
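
A minimal sketch of that task follows. It is not code from this thread: `linkify` is an invented name, and the URL pattern is a deliberately loose assumption (http or https followed by anything that is not whitespace, "<", ">" or a double quote). A negative lookbehind keeps trailing sentence punctuation out of the link.

```php
<?php
// Hypothetical helper: turn http(s) URLs in plain text into HTML links,
// leaving trailing sentence punctuation outside the link.
function linkify(string $text): string
{
    // The lookbehind rejects a match that ends in ? . , ! : or ; so the
    // engine backs off until the last character is something else.
    $url = '`(https?://[^\s<>"]+(?<![?.,!:;]))`i';

    // With one capturing group and PREG_SPLIT_DELIM_CAPTURE the result
    // alternates: text, URL, text, URL, ...
    $parts = preg_split($url, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $html = '';
    foreach ($parts as $i => $part) {
        // Escape every piece so user input cannot inject markup.
        $safe = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $html .= ($i % 2 === 1) ? "<a href=\"$safe\">$safe</a>" : $safe;
    }
    return $html;
}

echo linkify('See http://www.example.com/info.php?id=4! for details.'), "\n";
```

The trade-off is the one discussed in this thread: a URI that really does end in "?" or "." loses its last character.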

....
> As an example of how involved URI *matching* can be, here's a regular
> expression (PCRE) to match HTTP URIs (please excuse any line wrap):
>
> $scheme = '(?:(?i)http)';
> $domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
> $toplabel = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
> $hostname = "(?:(?:$domainlabel\.)*$toplabel\.?)";
> $ipv4address = '(?:\d+\.\d+\.\d+\.\d+)';
> $host = "(?:$hostname|$ipv4address)";
> $port = '(?::\d*)';
> $pchar = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
> $param = "$pchar*";
> $segment = "(?:$pchar*(?:;$param)*)";
> $path_segments = "(?:$segment(?:/$segment)*)";
> $abspath = "(?:/$path_segments)";
> $query = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
> $http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
> $pattern = "`$http_uri`i";
>
> preg_match_all($pattern, $subject, $matches);


Honestly, I've never actually read an RFC dealing with internet
protocols. (The only one I read was on the RTF file format, and gave up
on that about 1/2 way through.)

Interesting though that the above pattern makes some odd results:
http://waf.rangenet.com/Edit1.php

I think I will have to play some more with the patterns that were posted
in this thread and see what I can make of them.

--
Justin Koivisto - sp**@koivi.com
Jul 17 '05 #6
R. Rajesh Jeba Anbiah wrote:
> (((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)


This was nearly what I was looking for. I ended up editing a bit to come
up with:
(((https?|ftp)://[\w]+(\.[\w]+)([\w.,@?^=%&:/~\+#-]*[^?.,!:;\s])?))

Which seems to be working right now. I'm a little suspicious of the
"[^?.,!:;\s]" part - I assumed that I'd need some kind of look-ahead to
do it. Anyway, here's the results:

http://waf.rangenet.com/Edit2.php

The last 7 urls had a space appended to the end (see source), so the
results are what I was expecting to get.
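
For readers without access to that page, the behaviour can be sketched as below. The backtick delimiters and the `i` flag are assumptions (the post gives only the pattern body), `match_edited` is a made-up helper, and the first sample is assumed to end in "example.com." with no space.

```php
<?php
// The edited pattern from the post, wrapped in assumed delimiters.
$pattern = '`(((https?|ftp)://[\w]+(\.[\w]+)([\w.,@?^=%&:/~\+#-]*[^?.,!:;\s])?))`i';

// Hypothetical helper: first match of $pattern in $text, or null.
function match_edited(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// The final character must not be one of ? . , ! : ; or whitespace, so the
// engine backtracks off the trailing punctuation in each sample.
var_dump(match_edited($pattern, '... http://www.example.com.'));
var_dump(match_edited($pattern, '... http://www.example.com/.'));
var_dump(match_edited($pattern, '... http://www.example.com/page1.html?'));
var_dump(match_edited($pattern, '... http://www.example.com/info.php?id=4!'));

// Caveat: the negated class accepts ANY other character as the last one,
// including ones the main class would refuse, such as ")".
var_dump(match_edited($pattern, '(see http://www.example.com/a)'));
```

The last call is probably the reason for the suspicion about "[^?.,!:;\s]": it trims the listed punctuation, but it also lets one arbitrary character, such as a closing parenthesis or quote, ride along at the end of the match.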

Thanks to those who replied!

--
Justin Koivisto - sp**@koivi.com
Jul 17 '05 #7
Justin Koivisto wrote:
> John Dunlop wrote:
>
> [ ... ]
>
>> $http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
>
> [ ... ]
>
> Interesting though that the above pattern makes some odd results:
> http://waf.rangenet.com/Edit1.php

The pattern on that page isn't identical to my offering. Compare

$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";

and

| $http_uri = $scheme.'://'.$host.$port.'?(?:'.$abspath.$query.'.?)?';
----------------------------------------------------------------^

That final period significantly alters the match. If the URI contains
a path, it must now end with a query component (or, at least, a "?"
followed by an empty query component), for $query is no longer
optional. Therefore, when applied to <http://domain.example/foo>, the
pattern matches <http://domain.example> only.

Fix it by removing the offending period.
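
The effect can be reproduced with much smaller sub-patterns. In the sketch below, `$prefix`, `$abspath` and `$query` are simplified stand-ins invented for the demonstration, not the RFC-derived ones.

```php
<?php
// Simplified stand-ins for the real sub-patterns (assumptions for brevity).
$prefix  = 'http://[a-z\d.-]+';
$abspath = '(?:(?:/[a-z\d_.-]*)+)';
$query   = '(?:\?[a-z\d=&]*)';

// Intended form: the query component is optional.
$intended = "`$prefix(?:$abspath$query?)?`i";
// Form with the stray period: $query is now mandatory, and ".?" optionally
// matches one further character of any kind.
$altered  = "`$prefix(?:$abspath$query.?)?`i";

$uri = 'http://domain.example/foo';
preg_match($intended, $uri, $a);
preg_match($altered, $uri, $b);

echo $a[0], "\n"; // http://domain.example/foo
echo $b[0], "\n"; // http://domain.example
```

With the stray period the path can only be kept when a "?" follows it, so the whole optional group is dropped and the match stops after the host.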

--
Jock
Jul 17 '05 #8
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
> Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>...
>
> It is somewhat shocking to see that even experts like Justin get lost in
> regular expressions.
>
> I must admit that I'm still poor at regular expressions even
> though I use two good tools: <http://www.weitz.de/regex-coach/> and
> <http://laurent.riesterer.free.fr/regexp/>

The camel book sits on my desk even though I'm programming in PHP.

What is needed here is the magical \b meta character:

$re = '/(((https?|ftp):\/\/[\w]+(\.[\w]+)([^\s]*)?))(\/|\b)/i';
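
A sketch of how this behaves on the four sample strings from the original post. The first sample is assumed to end in "example.com." with no space, and the last test string is made up to show a limitation.

```php
<?php
// The pattern as posted.
$re = '/(((https?|ftp):\/\/[\w]+(\.[\w]+)([^\s]*)?))(\/|\b)/i';

$samples = [
    '... http://www.example.com.',
    '... http://www.example.com/.',
    '... http://www.example.com/page1.html?',
    '... http://www.example.com/info.php?id=4!',
    // Invented example: a URL that legitimately ends in ")".
    '... http://www.example.com/wiki/Foo_(bar)',
];

foreach ($samples as $s) {
    preg_match($re, $s, $m);
    // $m[0] is the whole match. The final group forces the match to end
    // either on a "/" or at a word boundary, so the engine backs off any
    // trailing run of non-word characters.
    echo $m[0], "\n";
}
```

The word boundary trims ".", "?" and "!" as wanted, but it trims every other trailing non-word character too, so the last sample loses its closing parenthesis.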


Jul 17 '05 #9
Justin Koivisto <sp**@koivi.com> wrote in message news:<XS****************@news7.onvoy.net>...
> John Dunlop wrote:

<snip>

> Interesting though that the above pattern makes some odd results:
> http://waf.rangenet.com/Edit1.php
>
> I think I will have to play some more with the patterns that were posted
> in this thread and see what I can make of them.

IMHO, it is better to stay with RFC standards, as John says. As John
said, I think the problem might lie in how the pattern was split up. I
have tried all your URLs with "The Regex Coach"
<http://www.weitz.de/regex-coach/> with the following pattern:

(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?

It works fine for all your URLs *except* for domains with "_"
(underscores), e.g. http://www.ex_ample.com/. If I'm right, that is also
correct. Hope, John will comment on it.
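
The underscore behaviour follows from the hostname sub-patterns alone, which allow only letters, digits and "-" in a label, as the RFC2396 hostname grammar does. A small sketch, with `is_hostname` as an invented helper:

```php
<?php
// Hostname sub-patterns as posted earlier in the thread.
$domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel    = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname    = "(?:(?:$domainlabel\.)*$toplabel\.?)";

// Hypothetical helper: does the WHOLE candidate fit the hostname grammar?
function is_hostname(string $hostname_re, string $candidate): bool
{
    return preg_match("`^$hostname_re\$`i", $candidate) === 1;
}

var_dump(is_hostname($hostname, 'www.example.com'));   // bool(true)
var_dump(is_hostname($hostname, 'www.ex_ample.com'));  // bool(false)

// Unanchored, the sub-pattern simply stops in front of the underscore.
preg_match("`$hostname`i", 'www.ex_ample.com', $m);
echo $m[0], "\n"; // www.ex
```

So the full pattern does not reject such a URL outright. It matches the part up to the underscore and stops there.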

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #10
