OK, I found a thread that helped out from a while back (Oct 9, 2002) to
give me this pattern:
`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i
OK, all is well and good with this until the URL is used at the end of a
sentence. I am assuming that I will need a negative lookahead somehow,
but I just can't wrap my mind around this one...
Want to match only the URL...
$string="... http://www.example.com.";
$string="... http://www.example.com/.";
$string="... http://www.example.com/page1.html?";
$string="... http://www.example.com/info.php?id=4!";
etc...
The pattern above is pulling the last character from each string when I
don't want it. Unfortunately, the URL can be _anything_ valid, and I
don't have control of how it will be input. Can anyone help with this?
TIA
--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>...
OK, I found a thread that helped out from a while back (Oct 9, 2002) to give me this pattern:
`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i
OK, all is well and good with this until the URL is used at the end of a sentence. I am assuming that I will need a negative lookahead somehow, but I just can't wrap my mind around this one...
Want to match only the URL...
$string="... http://www.example.com.";
$string="... http://www.example.com/.";
$string="... http://www.example.com/page1.html?";
$string="... http://www.example.com/info.php?id=4!";
etc...
The pattern above is pulling the last character from each string when I don't want it. Unfortunately, the URL can be _anything_ valid, and I don't have control of how it will be input. Can anyone help with this?
It is somewhat shocking to see that even experts like Justin get lost in
regular expressions.
I must admit that I'm still poor at regular expressions, even
though I use two good tools: <http://www.weitz.de/regex-coach/> and
<http://laurent.riesterer.free.fr/regexp/>
I think this will work fine for your requirement (stolen from
<http://groups.google.com/groups?selm=4d19834f.0303070418.506bb9a5%40posting.google.com>
;-) ):
(((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)
-- http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Justin Koivisto wrote:
OK, I found a thread that helped out from a while back (Oct 9, 2002)
If it was http://groups.google.com/groups?th=a48518c6e18574d9 , I
caution you against blindly following advice from that thread. It's
clear from Jeff Donnici's original article he wanted to *match* URIs,
not *parse* them. Why on earth this curiosity was put forward I don't
know (note, in particular, the delimiter appears unescaped in the
pattern proper -- a sure sign of insufficient testing):
| /((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?/
You evidently noticed and mended the delimiter:
`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i
OK, all is well and good with this
Consider the string: "http:// Lorem ipsum ... est laborum".
Negated character classes match every character not in the class. In
this case, that includes whitespace characters, which aren't allowed
in URIs.
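A minimal sketch of that point, assuming the backtick-delimited pattern quoted above. The subject string is made up for illustration; it is "http://" followed by ordinary prose, not a URI:

```php
<?php
// Pattern as quoted above, with backticks as delimiters and the i flag.
$pattern = '`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i';

// Hypothetical subject: prose that merely starts with "http://".
$subject = 'http:// Lorem ipsum dolor sit amet';

// [^:\/] is a negated class, so it accepts spaces as readily as letters.
preg_match($pattern, $subject, $m);

// The whole sentence is taken as the "URL".
var_dump($m[0] === $subject);
```

Since nothing in the host part excludes whitespace, the match runs on until it meets a colon, a slash or the end of the subject.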
until the URL is used at the end of a sentence.
It's not just sentence terminators that rattle that regular
expression: intra-sentence spacing, words and punctuation all wreak
havoc too. It might be OK for *parsing* URIs [1], I don't know, I
didn't examine it and I'm not conversant with FTP URI syntax; but its
URI *matching* rates very poorly.
I am assuming that I will need a negative lookahead somehow, but I just can't wrap my mind around this one...
Want to match only the URL...
That depends on the URL, obviously; regular expressions, although
powerful in some senses, afford no mind-reading capabilities.
$string="... http://www.example.com.";
Any URI parser would recognise <http://www.example.com.> as a URI: an
HTTP URI with a complete, or absolute, domain name of
"www.example.com." (including the final period; the root label).
$string="... http://www.example.com/.";
Again, an HTTP URI, this time with a path segment of ".". The final
period does not have any special meaning here; it's simply a path
segment.
$string="... http://www.example.com/page1.html?";
Another HTTP URI, but with a path segment of "page1.html" and an empty
query component.
$string="... http://www.example.com/info.php?id=4!";
Yet another HTTP URI, this time with a path segment of "info.php" and
a query component of "id=4!".
The pattern above is pulling the last character from each string when I don't want it.
So you know what the URIs are beforehand, right?
Unfortunately, the URL can be _anything_ valid, and I don't have control of how it will be input.
I'm not sure what you mean.
Can anyone help with this?
We'd need more information to offer any help. You might be interested
in Appendix E of RFC2396, which discusses recommended ways to delimit
URIs. I'd like to say, in passing, that it makes no mention of the
increasing use of parentheses ("(" and ")") to delimit URIs; the right
parenthesis is allowed in a path segment, so the URI http://www.php.net/manual/en/)
results in a 404 (at least it did at the time I wrote this,
20040414T0703Z).
As an example of how involved URI *matching* can be, here's a regular
expression (PCRE) to match HTTP URIs (please excuse any line wrap):
`(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?`i
Splitting it into more manageable chunks:
$scheme = '(?:(?i)http)';
$domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address = '(?:\d+\.\d+\.\d+\.\d+)';
$host = "(?:$hostname|$ipv4address)";
$port = '(?::\d*)';
$pchar = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param = "$pchar*";
$segment = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath = "(?:/$path_segments)";
$query = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
$pattern = "`$http_uri`i";
preg_match_all($pattern, $subject, $matches);
You might care to omit the case-insensitive internal option, (?i),
affecting the scheme name. Scheme names should be lowercase, but, "[f]or
resiliency, programs interpreting URI should treat upper case letters
as equivalent to lower case" (RFC2396, sec. 3.1). Technically, a URI
with an uppercase scheme name isn't an absolute URI.
Refs.:
RFC2396, "Uniform Resource Identifiers (URI): Generic Syntax", http://www.ietf.org/rfc/rfc2396.txt
RFC2616, "Hypertext Transfer Protocol -- HTTP/1.1", section 3.2, http://www.ietf.org/rfc/rfc2616.txt
RFC1738, "Uniform Resource Locators (URL)", section 3.2, http://www.ietf.org/rfc/rfc1738.txt
[1] You'll have read the example POSIX regular expression in RFC2396
for parsing URI references. From Appendix B:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
http://www.ietf.org/rfc/rfc2396.txt
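For *parsing* rather than matching, the Appendix B expression can be dropped into preg_match more or less as-is. A small sketch, assuming "~" as the delimiter (chosen only because it does not occur in the pattern) and one of the test URIs from this thread; the $uriRef and $parts names are illustrative:

```php
<?php
// RFC 2396 Appendix B expression for splitting a URI reference.
$uriRef = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

preg_match($uriRef, 'http://www.example.com/info.php?id=4!', $m);

// Group numbers follow Appendix B: 2 scheme, 4 authority, 5 path,
// 7 query, 9 fragment. PHP leaves out trailing groups that did not
// take part in the match, hence the ?? fallback.
$parts = [
    'scheme'    => $m[2] ?? '',
    'authority' => $m[4] ?? '',
    'path'      => $m[5] ?? '',
    'query'     => $m[7] ?? '',
    'fragment'  => $m[9] ?? '',
];
print_r($parts);
```

The expression splits a string that is already known to be a URI reference; it accepts almost any input, so it is of no use for finding URIs inside running text.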
--
Jock
John Dunlop <us*********@john.dunlop.name> wrote in message news:<MP************************@News.Individual.NET>...
Justin Koivisto wrote:
<snip>
As an example of how involved URI *matching* can be, here's a regular expression (PCRE) to match HTTP URIs (please excuse any line wrap):
`(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?`i
<snip>
WOW!! I wonder how you managed to swap all of those RFCs into your brain...
-- http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
R. Rajesh Jeba Anbiah wrote:
Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>...
OK, I found a thread that helped out from a while back (Oct 9, 2002) to give me this pattern:
`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i
OK, all is well and good with this until the URL is used at the end of a sentence. I am assuming that I will need a negative lookahead somehow, but I just can't wrap my mind around this one...
Want to match only the URL...
$string="... http://www.example.com.";
$string="... http://www.example.com/.";
$string="... http://www.example.com/page1.html?";
$string="... http://www.example.com/info.php?id=4!";
etc...
The pattern above is pulling the last character from each string when I don't want it. Unfortunately, the URL can be _anything_ valid, and I don't have control of how it will be input. Can anyone help with this?
It is somewhat shocking to see even experts like Justin are lost in regular expressions.
I'm considered an expert? THANKS FOR THE COMPLIMENT! ;) I'm OK with
regex, but I have only started really using perl regex in the last year,
so I have a way to go to learn about it.
I must admit the fact that I'm still poor in regular expression even though I use two good tools: <http://www.weitz.de/regex-coach/> and <http://laurent.riesterer.free.fr/regexp/>
Heh, I think I will have to check those out... thanks for the links.
I think, for your requirement this will work fine: (stolen from <http://groups.google.com/groups?selm=4d19834f.0303070418.506bb9a5%40posting.google.com> ;-) )
(((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)
hmm... Never even thought of looking in a VB newsgroup...
I've pasted it in, and it worked for the 2 available tests I had in
place. I'll let you know if there are any problems with it, and post
whatever I can come up with for fixes.
--
Justin Koivisto - sp**@koivi.com
John Dunlop wrote:
We'd need more information to offer any help. You might be interested in Appendix E of RFC2396
Basically, what I am doing is making hyperlinks out of urls typed into a
text area. So I am trying to match urls, but need to parse for
punctuation after them.
....
As an example of how involved URI *matching* can be, here's a regular expression (PCRE) to match HTTP URIs (please excuse any line wrap):
$scheme = '(?:(?i)http)';
$domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address = '(?:\d+\.\d+\.\d+\.\d+)';
$host = "(?:$hostname|$ipv4address)";
$port = '(?::\d*)';
$pchar = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param = "$pchar*";
$segment = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath = "(?:/$path_segments)";
$query = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
$pattern = "`$http_uri`i";
preg_match_all($pattern,$subject,$matches)
Honestly, I've never actually read an RFC dealing with internet
protocols. (The only one I read was on the RTF file format, and gave up
on that about 1/2 way through.)
Interesting though that the above pattern makes some odd results: http://waf.rangenet.com/Edit1.php
I think I will have to play some more with the patterns that were posted
in this thread and see what I can make of them.
--
Justin Koivisto - sp**@koivi.com
R. Rajesh Jeba Anbiah wrote:
(((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)
This was nearly what I was looking for. I ended up editing a bit to come
up with:
(((https?|ftp)://[\w]+(\.[\w]+)([\w.,@?^=%&:/~\+#-]*[^?.,!:;\s])?))
Which seems to be working right now. I'm a little suspicious of the
"[^?.,!:;\s]" part - I assumed that I'd need some kind of look-ahead to
do it. Anyway, here's the results: http://waf.rangenet.com/Edit2.php
The last 7 urls had a space appended to the end (see source), so the
results are what I was expecting to get.
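On the suspicion about "[^?.,!:;\s]": no look-ahead appears to be needed, because that class has to consume one real character, so the greedy class before it backtracks and gives the trailing punctuation back. A rough sketch of the hyperlinking use described earlier in the thread, assuming backtick delimiters and a made-up $text; the callback is just one way to build the anchor:

```php
<?php
// Pattern from the post above, wrapped in backtick delimiters.
$pattern = '`(((https?|ftp)://[\w]+(\.[\w]+)([\w.,@?^=%&:/~\+#-]*[^?.,!:;\s])?))`i';

// Hypothetical text-area input with URLs at the end of sentences.
$text = 'Docs are at http://www.example.com/page1.html? '
      . 'Try http://www.example.com/info.php?id=4!';

// Wrap each matched URL in an anchor; escape it before it goes into HTML.
$html = preg_replace_callback(
    $pattern,
    function (array $m): string {
        $url = htmlspecialchars($m[0], ENT_QUOTES);
        return '<a href="' . $url . '">' . $url . '</a>';
    },
    $text
);

echo $html, "\n";
```

The trade-off is that any other closing character, such as ")" or a quote mark, still counts as part of the URL, since the final class only excludes ? . , ! : ; and whitespace.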
Thanks to those who replied!
--
Justin Koivisto - sp**@koivi.com
Justin Koivisto wrote:
John Dunlop wrote:
[ ... ]
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
[ ... ]
Interesting though that the above pattern makes some odd results: http://waf.rangenet.com/Edit1.php
The pattern on that page isn't identical to my offering. Compare
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
and
| $http_uri = $scheme.'://'.$host.$port.'?(?:'.$abspath.$query.'.?)?';
----------------------------------------------------------------^
That final period significantly alters the match. If the URI contains
a path, it must now end with a query component (or, at least, a "?"
followed by an empty query component), for $query is no longer
optional. Therefore, when applied to <http://domain.example/foo>, the
pattern matches <http://domain.example> only.
Fix it by removing the offending period.
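The difference can be checked side by side. A sketch that rebuilds both variants from the chunks posted earlier in the thread; the $head, $asPosted and $withDot names are only for this illustration:

```php
<?php
// Building blocks as posted earlier in the thread.
$scheme        = '(?:(?i)http)';
$domainlabel   = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel      = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname      = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address   = '(?:\d+\.\d+\.\d+\.\d+)';
$host          = "(?:$hostname|$ipv4address)";
$port          = '(?::\d*)';
$pchar         = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param         = "$pchar*";
$segment       = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath       = "(?:/$path_segments)";
$query         = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';

$head     = "$scheme://$host$port?";
$asPosted = "`$head(?:$abspath$query?)?`i"; // "?" applies to the query
$withDot  = "`$head(?:$abspath$query.?)?`i"; // "?" now applies to the "."

$subject = 'http://domain.example/foo';
preg_match($asPosted, $subject, $a);
preg_match($withDot, $subject, $b);

echo $a[0], "\n"; // path kept
echo $b[0], "\n"; // path dropped, since no query follows it
```

With the stray period the optional group can only succeed when a "?" follows the path, so for a plain path the engine falls back to matching the host alone.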
--
Jock
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
Justin Koivisto <sp**@koivi.com> wrote in message
news:<eo****************@news7.onvoy.net>...
It is somewhat shocking to see even experts like Justin are lost in regular expressions.
I must admit the fact that I'm still poor in regular expression even though I use two good tools: <http://www.weitz.de/regex-coach/> and <http://laurent.riesterer.free.fr/regexp/>
The camel book sits on my desk even though I'm programming in PHP.
What is needed here is the magical \b meta character:
$re = '/(((https?|ftp):\/\/[\w]+(\.[\w]+)([^\s]*)?))(\/|\b)/i';
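A quick check of how \b behaves on the four strings from the original question; a sketch only, with $subjects and $found as illustrative names:

```php
<?php
// Pattern from the post above: greedy run of non-space characters,
// then either a literal "/" or a word boundary.
$re = '/(((https?|ftp):\/\/[\w]+(\.[\w]+)([^\s]*)?))(\/|\b)/i';

$subjects = [
    '... http://www.example.com.',
    '... http://www.example.com/.',
    '... http://www.example.com/page1.html?',
    '... http://www.example.com/info.php?id=4!',
];

$found = [];
foreach ($subjects as $subject) {
    preg_match($re, $subject, $m);
    $found[] = $m[0]; // whole match, including a "/" taken by (\/|\b)
    echo $m[0], "\n";
}
```

[^\s]* first swallows everything up to the next space, then backtracks until the match can end on a "/" or between a word character and a non-word character, which drops the trailing ".", "?" and "!". The same rule would also cut off a URL that legitimately ends in some other non-word character, so this is a heuristic rather than a parser.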
Justin Koivisto <sp**@koivi.com> wrote in message news:<XS****************@news7.onvoy.net>...
John Dunlop wrote:
<snip>
Interesting though that the above pattern makes some odd results: http://waf.rangenet.com/Edit1.php
I think I will have to play some more with the patterns that were posted in this thread and see what I can make of them.
IMHO, it is better to stay with the RFC standards, as John says. As John
said, I think the problem might lie in how the pattern was split up. I
have tried all your URLs with "The Regex Coach"
<http://www.weitz.de/regex-coach/> using the following pattern:
(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?
It works fine for all your URLs *except* for domains with "_"
(underscores), e.g. http://www.ex_ample.com/ . If I'm right, that is also
the correct behaviour. I hope John will comment on it.
-- http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com