
URLs and preg_match (again)

OK, I found a thread that helped out from a while back (Oct 9, 2002) to
give me this pattern:

`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i

OK, all is well and good with this until the URL is used at the end of a
sentence. I am assuming that I will need a negative lookahead somehow,
but I just can't wrap my mind around this one...

Want to match only the URL...

$string="... http://www.example.com.";
$string="... http://www.example.com/.";
$string="... http://www.example.com/page1.html?";
$string="... http://www.example.com/info.php?id=4!";
etc...

The pattern above is pulling the last character from each string when I
don't want it. Unfortunately, the URL can be _anything_ valid, and I
don't have control of how it will be input. Can anyone help with this?

TIA

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #1
22 Replies


Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>...
OK, I found a thread that helped out from a while back (Oct 9, 2002) to
give me this pattern:

`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i

OK, all is well and good with this until the URL is used at the end of a
sentence. I am assuming that I will need a negative lookahead somehow,
but I just can't wrap my mind around this one...

Want to match only the URL...

$string="... http://www.example.com.";
$string="... http://www.example.com/.";
$string="... http://www.example.com/page1.html?";
$string="... http://www.example.com/info.php?id=4!";
etc...

The pattern above is pulling the last character from each string when I
don't want it. Unfortunately, the URL can be _anything_ valid, and I
don't have control of how it will be input. Can anyone help with this?

It is somewhat shocking to see that even experts like Justin get lost in
regular expressions.

I must admit that I'm still poor at regular expressions, even
though I use two good tools: <http://www.weitz.de/regex-coach/> and
<http://laurent.riesterer.free.fr/regexp/>

I think this will work fine for your requirement (stolen from
<http://groups.google.com/groups?selm=4d19834f.0303070418.506bb9a5%40posting.google.com>
;-) ):

(((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #2

Justin Koivisto wrote:
OK, I found a thread that helped out from a while back (Oct 9, 2002)
If it was http://groups.google.com/groups?th=a48518c6e18574d9 , I
caution you against blindly following advice from that thread. It's
clear from Jeff Donnici's original article he wanted to *match* URIs,
not *parse* them. Why on earth this curiosity was put forward I don't
know (note, in particular, the delimiter appears unescaped in the
pattern proper -- a sure sign of insufficient testing):

| /((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?/

You evidently noticed and mended the delimiter:
`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i

OK, all is well and good with this
Consider the string: "http:// Lorem ipsum ... est laborum".

Negative character classes match every character not in the class. In
this case, that includes whitespace characters, which aren't allowed
in URIs.
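
A minimal sketch of that point, using the pattern quoted above. The helper name `match_original()` is an assumption made for this illustration; only `preg_match()` itself is a real PHP function.

```php
<?php
// The pattern under discussion, with backtick delimiters and the "i" flag.
$pattern = '`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i';

// match_original() is a hypothetical helper for this sketch.
function match_original(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) ? $m[0] : null;
}

$subject = 'http:// Lorem ipsum ... est laborum';

// [^:\/] excludes only ":" and "/", so spaces and periods are consumed too
// and the whole sentence is reported as the "URL".
var_dump(match_original($pattern, $subject) === $subject); // bool(true)
```

Restricting the host part to characters that may actually appear in a hostname avoids this kind of run-on match.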
until the URL is used at the end of a sentence.
It's not just sentence terminators that rattle that regular
expression: intra-sentence spacing, words and punctuation all wreak
havoc too. It might be OK for *parsing* URIs [1], I don't know, I
didn't examine it and I'm not conversant with FTP URI syntax; but its
URI *matching* rates very poorly.
I am assuming that I will need a negative lookahead somehow,
but I just can't wrap my mind around this one...

Want to match only the URL...
That depends on the URL, obviously; regular expressions, although
powerful in some senses, afford no mind-reading capabilities.
$string="... http://www.example.com.";
Any URI parser would recognise <http://www.example.com.> as a URI: an
HTTP URI with a complete, or absolute, domain name of
"www.example.com." (including the final period; the root label).
$string="... http://www.example.com/.";
Again, an HTTP URI, this time with a path segment of ".". The final
period does not have any special meaning here; it's simply a path
segment.
$string="... http://www.example.com/page1.html?";
Another HTTP URI, but with a path segment of "page1.html" and an empty
query component.
$string="... http://www.example.com/info.php?id=4!";
Yet another HTTP URI, this time with a path segment of "info.php" and
a query component of "id=4!".
The pattern above is pulling the last character from each string when I
don't want it.
So you know what the URIs are beforehand, right?
Unfortunately, the URL can be _anything_ valid, and I don't have control
of how it will be input.
I'm not sure what you mean.
Can anyone help with this?


We'd need more information to offer any help. You might be interested
in Appendix E of RFC2396, which discusses recommended ways to delimit
URIs. I'd like to say, in passing, that it makes no mention of the
increasing use of parentheses ("(" and ")") to delimit URIs; the right
parenthesis is allowed in a path segment, so the URI

http://www.php.net/manual/en/)

results in a 404 (at least it did at the time I wrote this,
20040414T0703Z).

As an example of how involved URI *matching* can be, here's a regular
expression (PCRE) to match HTTP URIs:

`(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?`i

Splitting it into more manageable chunks:

$scheme = '(?:(?i)http)';
$domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address = '(?:\d+\.\d+\.\d+\.\d+)';
$host = "(?:$hostname|$ipv4address)";
$port = '(?::\d*)';
$pchar = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param = "$pchar*";
$segment = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath = "(?:/$path_segments)";
$query = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
$pattern = "`$http_uri`i";

preg_match_all($pattern, $subject, $matches);

You might care to omit the case-insensitive internal option affecting
the scheme name. Scheme names should be lowercase, but, "[f]or
resiliency, programs interpreting URI should treat upper case letters
as equivalent to lower case" (RFC2396, sec. 3.1). Technically, a URI
with an uppercase scheme name isn't an absolute URI.

Refs.:

RFC2396, "Uniform Resource Identifiers (URI): Generic Syntax",
http://www.ietf.org/rfc/rfc2396.txt

RFC2616, "Hypertext Transfer Protocol -- HTTP/1.1", section 3.2,
http://www.ietf.org/rfc/rfc2616.txt

RFC1738, "Uniform Resource Locators (URL)", section 3.2,
http://www.ietf.org/rfc/rfc1738.txt
[1] You'll have read the example POSIX regular expression in RFC2396
for parsing URI references. From Appendix B:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

http://www.ietf.org/rfc/rfc2396.txt
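
To make the parsing side concrete, here is a small sketch that applies the Appendix B expression with `preg_match()`. The function name `split_uri_reference()` and the array keys are assumptions for this illustration; the group numbers (2, 4, 5, 7, 9) follow the RFC's own description of the expression.

```php
<?php
// split_uri_reference() is a hypothetical helper, not a PHP built-in.
function split_uri_reference(string $uri): array
{
    // RFC 2396, Appendix B, wrapped in "~" delimiters.
    $re = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    preg_match($re, $uri, $m);

    // Trailing groups that took no part in the match are absent from $m,
    // hence the null fallbacks.
    return [
        'scheme'    => $m[2] ?? null,
        'authority' => $m[4] ?? null,
        'path'      => $m[5] ?? null,
        'query'     => $m[7] ?? null,
        'fragment'  => $m[9] ?? null,
    ];
}

print_r(split_uri_reference('http://www.example.com/info.php?id=4!'));
print_r(split_uri_reference('http://www.example.com.'));
```

For the first string the query component comes out as `id=4!`, and for the second the authority is `www.example.com.` with the final period included, which is the reading given in the post above.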

--
Jock
Jul 17 '05 #3

John Dunlop <us*********@john.dunlop.name> wrote in message news:<MP************************@News.Individual.N ET>...
Justin Koivisto wrote: <snip>
As an example of how involved URI *matching* can be, here's a regular
expression (PCRE) to match HTTP URIs:

`(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?`i

<snip>

WOW!! I wonder how you manage to keep whole RFCs in your head...

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #4

R. Rajesh Jeba Anbiah wrote:
Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>...
OK, I found a thread that helped out from a while back (Oct 9, 2002) to
give me this pattern:

`(((f|ht)tp://)(([^@:]+)([^@]+)?@)?([^:\/])+(:\d+)?(\/[^\s]+)?)`i

OK, all is well and good with this until the URL is used at the end of a
sentence. I am assuming that I will need a negative lookahead somehow,
but I just can't wrap my mind around this one...

Want to match only the URL...

$string="... http://www.example.com.";
$string="... http://www.example.com/.";
$string="... http://www.example.com/page1.html?";
$string="... http://www.example.com/info.php?id=4!";
etc...

The pattern above is pulling the last character from each string when I
don't want it. Unfortunately, the URL can be _anything_ valid, and I
don't have control of how it will be input. Can anyone help with this?
It is somewhat shocking to see even experts like Justin are lost in
regular expressions.


I'm considered an expert? THANKS FOR THE COMPLIMENT! ;) I'm OK with
regex, but I have only started really using perl regex in the last year,
so I have a way to go to learn about it.
I must admit the fact that I'm still poor in regular expression even
though I use two good tools: <http://www.weitz.de/regex-coach/> and
<http://laurent.riesterer.free.fr/regexp/>
Heh, I think I will have to check those out... thanks for the links.
I think, for your requirement this will work fine: (stolen from
<http://groups.google.com/groups?selm=4d19834f.0303070418.506bb9a5%40posting.google.com>
;-) )

(((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)


hmm... Never even thought of looking in a VB newsgroup...

I've pasted it in, and it worked for the 2 available tests I had in
place. I'll let you know if there are any problems with it, and post
whatever I can come up with for fixes.

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #5

John Dunlop wrote:

We'd need more information to offer any help. You might be interested
in Appendix E of RFC2396
Basically, what I am doing is making hyperlinks out of urls typed into a
text area. So I am trying to match urls, but need to parse for
punctuation after them.

....
As an example of how involved URI *matching* can be, here's a regular
expression (PCRE) to match HTTP URIs (please excuse any line wrap):

$scheme = '(?:(?i)http)';
$domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address = '(?:\d+\.\d+\.\d+\.\d+)';
$host = "(?:$hostname|$ipv4address)";
$port = '(?::\d*)';
$pchar = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param = "$pchar*";
$segment = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath = "(?:/$path_segments)";
$query = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
$pattern = "`$http_uri`i";

preg_match_all($pattern, $subject, $matches);


Honestly, I've never actually read an RFC dealing with internet
protocols. (The only one I read was on the RTF file format, and I gave up
on that about halfway through.)

Interestingly, though, the above pattern produces some odd results:
http://waf.rangenet.com/Edit1.php

I think I will have to play some more with the patterns that were posted
in this thread and see what I can make of them.

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #6

R. Rajesh Jeba Anbiah wrote:
(((http|ftp|https):\/\/[\w]+(.[\w]+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))|(www.)|(ftp.)


This was nearly what I was looking for. I ended up editing it a bit to come
up with:
(((https?|ftp)://[\w]+(\.[\w]+)([\w.,@?^=%&:/~\+#-]*[^?.,!:;\s])?))

This seems to be working right now. I'm a little suspicious of the
"[^?.,!:;\s]" part - I assumed that I'd need some kind of look-ahead to
do it. Anyway, here are the results:

http://waf.rangenet.com/Edit2.php

The last 7 URLs had a space appended to the end (see source), so the
results are what I was expecting to get.
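
A quick way to probe the "[^?.,!:;\s]" part is to run the edited pattern against a few hand-made strings. The sketch below assumes backtick delimiters and the `i` flag (the post gives the pattern without either), and `first_url()` is a hypothetical helper name.

```php
<?php
// The edited pattern from this post; delimiters and flag are assumptions.
$pattern = '`(((https?|ftp)://[\w]+(\.[\w]+)([\w.,@?^=%&:/~\+#-]*[^?.,!:;\s])?))`i';

// first_url() is a hypothetical helper for this sketch.
function first_url(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

// Trailing "!" is left out, because the final class refuses it.
echo first_url($pattern, 'Go to http://www.example.com/info.php?id=4!'), "\n";
// prints: http://www.example.com/info.php?id=4

// A closing parenthesis is not in the refused set, so it is swallowed.
echo first_url($pattern, '(see http://www.example.com)'), "\n";
// prints: http://www.example.com)
```

So no look-ahead is needed for the listed punctuation, but any trailing character outside that list (a parenthesis, a quote mark) still ends up in the match.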

Thanks to those who replied!

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #7

Justin Koivisto wrote:
John Dunlop wrote:
[ ... ]
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";


[ ... ]
Interesting though that the above pattern makes some odd results:
http://waf.rangenet.com/Edit1.php


The pattern on that page isn't identical to my offering. Compare

$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";

and

| $http_uri = $scheme.'://'.$host.$port.'?(?:'.$abspath.$query.'.?)?';
----------------------------------------------------------------^

That final period significantly alters the match. If the URI contains
a path, it must now end with a query component (or, at least, a "?"
followed by an empty query component), for $query is no longer
optional. Therefore, when applied to <http://domain.example/foo>, the
pattern matches <http://domain.example> only.

Fix it by removing the offending period.
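
The effect is easy to reproduce with cut-down sub-patterns. In this sketch `$host`, `$abspath` and `$query` are deliberately simplified stand-ins (assumptions, not the definitions posted earlier), and `first_match()` is a hypothetical helper; only the placement of the stray period is taken from the thread.

```php
<?php
// Simplified stand-ins for the sub-patterns (assumptions for this sketch).
$host    = '(?:[a-z\d.-]+)';
$abspath = '(?:/[a-z\d./]*)';
$query   = '(?:\?[a-z\d=&]*)';

// Intended form: the query is optional.
$intended = "~http://{$host}(?:{$abspath}{$query}?)?~i";
// With the stray period: the "?" now applies to ".", so the query is required.
$stray    = "~http://{$host}(?:{$abspath}{$query}.?)?~i";

// first_match() is a hypothetical helper for this sketch.
function first_match(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) ? $m[0] : null;
}

echo first_match($intended, 'http://domain.example/foo'), "\n";
// prints: http://domain.example/foo
echo first_match($stray, 'http://domain.example/foo'), "\n";
// prints: http://domain.example
```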

--
Jock
Jul 17 '05 #8

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.c om...
Justin Koivisto <sp**@koivi.com> wrote in message news:<eo****************@news7.onvoy.net>... It is somewhat shocking to see even experts like Justin are lost in
regular expressions.

I must admit the fact that I'm still poor in regular expression even
though I use two good tools: <http://www.weitz.de/regex-coach/> and
<http://laurent.riesterer.free.fr/regexp/>


The camel book sits on my desk even though I'm programming in PHP.

What is needed here is the magical \b meta character:

$re = '/(((https?|ftp):\/\/[\w]+(\.[\w]+)([^\s]*)?))(\/|\b)/i';
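
As a rough check of what the trailing `(\/|\b)` does, the sketch below runs this pattern over the four strings from the original question. `match_url()` is a hypothetical helper name; the pattern is copied unchanged.

```php
<?php
$re = '/(((https?|ftp):\/\/[\w]+(\.[\w]+)([^\s]*)?))(\/|\b)/i';

// match_url() is a hypothetical helper for this sketch.
function match_url(string $re, string $text): ?string
{
    // $m[0] is the whole match, including a trailing "/" taken by (\/|\b).
    return preg_match($re, $text, $m) ? $m[0] : null;
}

$samples = [
    '... http://www.example.com.',
    '... http://www.example.com/.',
    '... http://www.example.com/page1.html?',
    '... http://www.example.com/info.php?id=4!',
];

foreach ($samples as $s) {
    echo match_url($re, $s), "\n";
}
// prints:
// http://www.example.com
// http://www.example.com/
// http://www.example.com/page1.html
// http://www.example.com/info.php?id=4
```

The greedy `[^\s]*` first grabs everything up to the next whitespace, then backtracks until the match can end on a slash or at a word boundary, which is what drops the trailing punctuation.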


Jul 17 '05 #9

Justin Koivisto <sp**@koivi.com> wrote in message news:<XS****************@news7.onvoy.net>...
John Dunlop wrote: <snip>
Interesting though that the above pattern makes some odd results:
http://waf.rangenet.com/Edit1.php

I think I will have to play some more with the patterns that were posted
in this thread and see what I can make of them.


IMHO, it is better to stick to the RFC standards, as John says. I think
the problem might lie in how the pattern was split up. I
have tried all your URLs with "The Regex Coach"
<http://www.weitz.de/regex-coach/> with the following pattern:

(?:(?i)http)://(?:(?:(?:(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])\.)*(?:[a-z][a-z\d-]*[a-z\d]|[a-z])\.?)|(?:\d+\.\d+\.\d+\.\d+))(?::\d*)?(?:(?:/(?:(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*)(?:/(?:(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*(?:;(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})*)*))*))(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)?)?

It works fine for all your URLs *except* for domains with "_"
(underscores), e.g. http://www.ex_ample.com/ . If I'm right, that is also
the correct behaviour. I hope John will comment on it.

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #10

John Dunlop wrote:
The pattern on that page isn't identical to my offering. Compare

$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";

and

| $http_uri = $scheme.'://'.$host.$port.'?(?:'.$abspath.$query.'.?)?';
----------------------------------------------------------------^

That final period significantly alters the match. If the URI contains
a path, it must now end with a query component (or, at least, a "?"
followed by an empty query component), for $query is no longer
optional. Therefore, when applied to <http://domain.example/foo>, the
pattern matches <http://domain.example> only.

Fix it by removing the offending period.


My mistake... I have now removed the extra period. However, it still picks up
the ending punctuation like "!" and "?" for strings like:
http://www.example.com/?
http://www.example.com/!
http://www.example.com/dirname/file.html!
http://www.example.com/dirname/file.html?query=My+Test+String&q2=search!

I can see that the last one is a bit misleading since that does look
like part of the query string.

Another one that I noticed is:
http://www.ex_ample.com

Are underscores not allowed in a domain name? Is this part of the RFC,
or just an oversight in the pattern?

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #11

Chung Leong wrote:
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.c om...
Justin Koivisto <sp**@koivi.com> wrote in message


news:<eo****************@news7.onvoy.net>...
It is somewhat shocking to see even experts like Justin are lost in
regular expressions.

I must admit the fact that I'm still poor in regular expression even
though I use two good tools: <http://www.weitz.de/regex-coach/> and
<http://laurent.riesterer.free.fr/regexp/>


The camel book sits on my desk even though I'm programming in PHP.

What is needed here is the magical \b meta character:

$re = '/(((https?|ftp):\/\/[\w]+(\.[\w]+)([^\s]*)?))(\/|\b)/i';


I had tried using the \b metacharacter, but I see that I didn't use ( /
| \b ), and this is likely why it didn't work for me. This shortened
pattern seems to be doing the trick so far.

Thanks to all who have contributed!

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #12

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<X6********************@comcast.com>...
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.c om...
<snip> The camel book sits on my desk even though I'm programming in PHP.


Yes, the camel book is unbeatable. But most of the time I just
refer to this quick table for my simpler tasks:
<http://www.phpedit.net/products/PHPEdit/manual/en/module.FindRegExp.php#Lv1_7>

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #13

R. Rajesh Jeba Anbiah wrote:

[ ... ]
It is working fine for all your urls *except* for domains with "_"
(underscores) eg. http://www.ex_ample.com/ .


Hostnames cannot contain underscores; they may only contain letters
([a-zA-Z]), numbers ([0-9]), hyphens ([-]) and periods ([.]). Further
restrictions apply too. RFC952, as updated by RFC1123, specifies the
syntax of hostnames; the definition is: one or more names (a letter or
digit followed by any number of letters, digits or hyphens and ending
with a letter or digit) separated by periods.

http://www.ietf.org/rfc/rfc952.txt
http://www.ietf.org/rfc/rfc1123.txt
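
The definition above translates fairly directly into a pattern. This is only a sketch: `is_rfc1123_hostname()` is a hypothetical helper name, it does not accept a trailing root period, and it enforces none of the length limits.

```php
<?php
// is_rfc1123_hostname() is a hypothetical helper, not a PHP built-in.
function is_rfc1123_hostname(string $host): bool
{
    // One name: starts and ends with a letter or digit, hyphens allowed inside.
    $label = '[a-z\d](?:[a-z\d-]*[a-z\d])?';

    // One or more names separated by periods; "D" stops "$" from matching
    // before a trailing newline.
    $re = '~^' . $label . '(?:\.' . $label . ')*$~Di';

    return preg_match($re, $host) === 1;
}

var_dump(is_rfc1123_hostname('www.example.com'));  // bool(true)
var_dump(is_rfc1123_hostname('www.ex_ample.com')); // bool(false)
var_dump(is_rfc1123_hostname('-bad.example'));     // bool(false)
```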

--
Jock
Jul 17 '05 #14

Justin Koivisto wrote:
However, still picks up the ending punctuation like "!" and "?" for
strings like:
http://www.example.com/?
http://www.example.com/!
http://www.example.com/dirname/file.html!
http://www.example.com/dirname/file.html?query=My+Test+String&q2=search!


How do you know the "ending punctuation" isn't part of the URI?

--
Jock
Jul 17 '05 #15

John Dunlop wrote:
$scheme = '(?:(?i)http)';
$domainlabel = '(?:[a-z\d][a-z\d-]*[a-z\d]|[a-z\d])';
$toplabel = '(?:[a-z][a-z\d-]*[a-z\d]|[a-z])';
$hostname = "(?:(?:$domainlabel\.)*$toplabel\.?)";
$ipv4address = '(?:\d+\.\d+\.\d+\.\d+)';
$host = "(?:$hostname|$ipv4address)";
$port = '(?::\d*)';
$pchar = '(?:[a-z\d_.!~*\'()\-:@&=+$,]|%[\da-f]{2})';
$param = "$pchar*";
$segment = "(?:$pchar*(?:;$param)*)";
$path_segments = "(?:$segment(?:/$segment)*)";
$abspath = "(?:/$path_segments)";
$query = '(?:\?(?:[;/?:@&=+$,a-z\d\-_.!~*\'()]|%[\da-f])*)';
$http_uri = "$scheme://$host$port?(?:$abspath$query?)?";
$pattern = "`$http_uri`i";


Having reread RFC2396, I see this pattern doesn't treat reserved
characters properly. For example, <http://domain.example/?foo?bar>
will be treated as an HTTP URI. This is incorrect: a URI may only
contain a single question mark. That particular problem may be fixed
by removing the question mark from the query component's set of
allowable characters.

Also, no length limit is imposed on the hostname.

There may be other problems. It's not straightforward. :-(

--
Jock
Jul 17 '05 #16

John Dunlop wrote:
Justin Koivisto wrote:

However, still picks up the ending punctuation like "!" and "?" for
strings like:
http://www.example.com/?
http://www.example.com/!
http://www.example.com/dirname/file.html!
http://www.example.com/dirname/file.html?query=My+Test+String&q2=search!

How do you know the "ending punctuation" isn't part of the URI?


That is the problem, isn't it... In the situation I am using this for,
it's better to err on the side of not including it. It's for a CMS
where the user types in their information. In 98% of the cases, the
users will use HTML to create a link if the URI is longer than just the
domain name. I see a lot of entries like this:

" In the event that you cannot make it to your scheduled meeting, please
visit www.example.com, or call 789-456-1230 to reschedule. "

It's always a catch-22, isn't it?

--
Justin Koivisto - sp**@koivi.com
PHP POSTERS: Please use comp.lang.php for PHP related questions,
alt.php* groups are not recommended.
SEO Competition League: http://seo.koivi.com/
Jul 17 '05 #17

John Dunlop <us*********@john.dunlop.name> wrote in message news:<MP***********************@News.Individual.NE T>...
R. Rajesh Jeba Anbiah wrote:

[ ... ]
It is working fine for all your urls *except* for domains with "_"
(underscores) eg. http://www.ex_ample.com/ .


Hostnames cannot contain underscores; they may only contain letters
([a-zA-Z]), numbers ([0-9]), hyphens ([-]) and periods ([.]). Further
restrictions apply too. RFC952, as updated by RFC1123, specifies the
syntax of hostnames; the definition is: one or more names (a letter or
digit followed by any number of letters, digits or hyphens and ending
with a letter or digit) separated by periods.

http://www.ietf.org/rfc/rfc952.txt
http://www.ietf.org/rfc/rfc1123.txt


Thanks a lot for your wonderful contributions. Nice to see an RFC
expert here.

FWIW, I have noticed Google Groups' URLing sucks as per RFC. See
the links at <http://groups.google.com/groups?selm=bRvfc.579%24m3.24284%40news7.onvoy.net>

--
http://www.sendmetoindia.com - Send Me to India!
Email: rrjanbiah-at-Y!com
Jul 17 '05 #18

Justin Koivisto wrote:
John Dunlop wrote:
How do you know the "ending punctuation" isn't part of the URI?
That is the problem, isn't it... In the situation I am using this for,
it's better to err on the side of not including it.


Sorry, Justin, I can't think of how to do that in one regular
expression. Perhaps some real regexp-ert will chime in and show how
easy it is.

You could match the URI then check the last character isn't a sentence
terminator or other "problem" character.
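
A minimal sketch of that two-step idea follows. The loose candidate pattern, the set of characters trimmed and the name `extract_urls()` are all assumptions chosen for illustration; as discussed above, the trimmed characters could legitimately belong to the URI, so this deliberately errs on the side of leaving them out.

```php
<?php
// extract_urls() is a hypothetical helper for this sketch.
function extract_urls(string $text): array
{
    // Step 1: grab candidates loosely, up to the next whitespace,
    // angle bracket or double quote.
    preg_match_all('~\b(?:https?|ftp)://[^\s<>"]+~i', $text, $m);

    // Step 2: strip sentence punctuation from the end of each candidate.
    $urls = [];
    foreach ($m[0] as $candidate) {
        $urls[] = rtrim($candidate, '.,;:!?');
    }
    return $urls;
}

print_r(extract_urls(
    'See http://www.example.com/info.php?id=4! Or try http://www.example.com/.'
));
// prints the two URLs without the trailing "!" and "."
```

Keeping the trimming outside the regular expression makes the list of "problem" characters easy to adjust without touching the pattern.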
It's always a catch-22, isn't it?


Yup! :-(

--
Jock
Jul 17 '05 #19

R. Rajesh Jeba Anbiah wrote:
FWIW, I have noticed Google Groups' URLing sucks as per RFC.


That's unfortunate. I wonder what rules Google follows.

FWIW, my newsreader (Gravity v2.50) has problems too. I've had to fix
URLs by hand on the odd occasion. It's not a major pain though.

People usually delimit URLs, knowingly or not. Whitespace is a fine
delimiter as it can't appear in URLs.

But this is drifting off-topic now.

--
Jock
Jul 17 '05 #20

"John Dunlop" <us*********@john.dunlop.name> wrote in message
news:MP***********************@News.Individual.NET ...
R. Rajesh Jeba Anbiah wrote:

Hostnames cannot contain underscores; they may only contain letters
([a-zA-Z]), numbers ([0-9]), hyphens ([-]) and periods ([.]). Further
restrictions apply too. RFC952, as updated by RFC1123, specifies the
syntax of hostnames; the definition is: one or more names (a letter or
digit followed by any number of letters, digits or hyphens and ending
with a letter or digit) separated by periods.

http://www.ietf.org/rfc/rfc952.txt
http://www.ietf.org/rfc/rfc1123.txt


International Domain Name changes everything. My homepage is at www.chüng.de
for instance.
http://www.ietf.org/rfc/rfc3490.txt

Jul 17 '05 #21

Chung Leong wrote:
"John Dunlop" <us*********@john.dunlop.name> wrote in message
news:MP***********************@News.Individual.NET ...
Hostnames cannot contain underscores; they may only contain letters
([a-zA-Z]), numbers ([0-9]), hyphens ([-]) and periods

[ ... ]
International Domain Name changes everything.
(Well, many people use the terms "hostname" and "domain name"
interchangeably too. Anyone interested in the difference can search
the <news:comp.protocols.tcp-ip.domains> archives.)

Hopefully, along with I18N Resource Identifiers (IRIs), IDNs will
make a big difference. Domain names restricted to a subset of US-
ASCII are sorely inadequate, IMO.

But it doesn't change RFC2396 and RFC2616, which together specify the
syntax of HTTP URLs.
http://www.ietf.org/rfc/rfc3490.txt


That's a proposed standard. I haven't read it yet. Thanks for
pointing it out. Interested parties might also like

http://www.ietf.org/html.charters/idn-charter.html

--
Jock
Jul 17 '05 #22

"John Dunlop" <us*********@john.dunlop.name> wrote in message
news:MP************************@News.Individual.NE T...
(Well, many people use the terms "hostname" and "domain name"
interchangeably too. Anyone interested in the difference can search
the <news:comp.protocols.tcp-ip.domains> archives.)
Well, yes. It is usually referred to as the <host> part of the URL after
all.
Hopefully, along with I18N Resource Identifiers (IRIs), IDNs will
make a big difference. Domain names restricted to a subset of US-
ASCII are sorely inadequate, IMO.


On the other hand, it's going to lead to all kinds of fraud. Just think of
the number of ways you can spell something that looks like "microsoft" using
letters from other scripts. I guess that's probably why the folks at Redmond
haven't implemented IDN yet.

I bet on the very day IDN becomes available for Chinese some wiseguy will go
and register www.[Swastika Symbol].cn.
Jul 17 '05 #23
