Help with regular expression?

I'm hopeless at regular expressions (I just don't use them often
enough to gain/maintain knowledge), but I need one now and am looking
for help. I need to parse through a document to find a URL, and then
reconstruct another URL based on it. For example, I need to scan a
web page looking for something like <a
href="some_dir/list_20050815100225.csv">. I don't know in advance
what the date/time in the file name will be. I need to take the
result of that and construct a URL out of it so that I can automate
the download of this file on a regular basis. The replace can be done
by replacing "<token>" in
"http://www.whatever.com/some_dir/list_<token>" with the result from
above. However, I would like the directory information included in
the search result so that I don't have to hard-code it (i.e. I'd
rather look for a URL with "list_<datetime>.csv" in it).

I have a regular expression that comes close:
"href=""some_dir/list_(?:(?<1>[^""]*)""|(?<1>\S+))". I got that by
tweaking the example at
http://msdn.microsoft.com/library/de...ateformats.asp.
If I can't find a cleaner sample, that will have to do. However,
there are two minor problems with this expression: 1) I would rather
be returning the complete URL in the href (to make it easier to
capture variable subdirectories, for example), and 2) it would require
a two-step process (the match followed by the replace). Is it
possible to have a single regular expression to do both? That would
simplify configuration of my program, since the intent is that none of
this be hard-coded.

Any help would be appreciated.

Thanks!
Brad.

P.S. If there's a better place to post this kind of question, I'd
love to hear about it. I was tempted to cross-post, but.... :-)
Aug 15 '05 #1
This regex string will work for identifying URLs:

(http|https|mailto):([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])+#*([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])
Aug 15 '05 #2
The example that I cited is actually closer to what I need, but
thanks!

Brad.

On Mon, 15 Aug 2005 13:23:03 -0700, Paul O
<Pa***@discussions.microsoft.com> wrote:
> This regex string will work for identifying URLs:
>
> (http|https|mailto):([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])+#*([a-zA-Z0-9$_.+!*(),;/?:@&~=%-])
Aug 15 '05 #3
I'll put this in C#.

// Requires: using System; using System.Text.RegularExpressions;
// The named group 'url' captures the href value without its quotes.
Regex regex = new Regex("href=\\\"(?'url'some_dir\\/list_[^\\\"]*)\\\"",
    RegexOptions.IgnoreCase | RegexOptions.Singleline |
    RegexOptions.ExplicitCapture);
string form = "<a href=\"some_dir/list_20050815100225.csv\">";
Match match = regex.Match(form);

if (match.Success)
{
    // Prepend the site root to the captured relative URL.
    Console.WriteLine("success: " + "http://www.whatever.com/" +
        match.Groups["url"].Value);
}
else
{
    Console.WriteLine("failed.");
}

and it gets this result:

success: http://www.whatever.com/some_dir/lis...0815100225.csv
Bruce Dunwiddie
www.csvreader.com
Aug 15 '05 #4
Yes, if I tweak the regular expression you provided just slightly (by
replacing "'url'some_dir" with "'url'[^\\\"]*"), that works well and
includes the directory information even if it changes. Now it would
be nice if I could include the ["http://www.whatever.com/" +
match.Groups["url"].Value] in the same regular expression, but that
may be asking too much! :-)

Thanks!
Brad.

On 15 Aug 2005 14:16:20 -0700, "shriop" <sh****@hotmail.com> wrote:
> I'll put this c#.
>
> Regex regex = new Regex("href=\\\"(?'url'some_dir\\/list_[^\\\"]*)\\\""
> , RegexOptions.IgnoreCase | RegexOptions.Singleline |
> RegexOptions.ExplicitCapture);
Aug 15 '05 #5
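
The tweak described in the post above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: the class name HrefFinder, the method FindRelativeUrl, and the second sample link are made up here, and the pattern is written as a verbatim string (quotes doubled) instead of with backslash escapes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefFinder
{
    // The pattern from the earlier reply with "some_dir" replaced by [^"]*,
    // so any directory prefix in front of "/list_" is captured as well.
    private static readonly Regex ListLink = new Regex(
        @"href=""(?'url'[^""]*/list_[^""]*)""",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns the captured href value, or null when no matching link is found.
    public static string FindRelativeUrl(string html)
    {
        Match match = ListLink.Match(html);
        return match.Success ? match.Groups["url"].Value : null;
    }

    public static void Main()
    {
        // Hypothetical inputs: the directory differs, the pattern stays the same.
        Console.WriteLine(FindRelativeUrl("<a href=\"some_dir/list_20050815100225.csv\">"));
        Console.WriteLine(FindRelativeUrl("<a href=\"other_dir/2005/list_20050816100225.csv\">"));
    }
}
```

Note that this form still expects at least one "/" before "list_", so a link to a file in the page's own directory would not match.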
Hi Bradley,

As far as I know, a regular expression can only match text within a string.
It cannot concatenate strings. So I think you have to do the string
operations in the C# code. HTH.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."

Aug 16 '05 #6
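
On the single-expression question: the match pattern itself cannot prepend text, but .NET also supports replacement patterns, and Match.Result expands one against a single match. That may be close enough to the "nothing hard-coded" goal, since both strings can live in configuration. A minimal sketch under that assumption; the class name DownloadUrlBuilder, the method Build, and the template string are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DownloadUrlBuilder
{
    // pattern and template are assumed to come from configuration.
    // Returns the expanded template, or null when the pattern does not match.
    public static string Build(string html, string pattern, string template)
    {
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        // Match.Result expands ${name} substitutions against this one match;
        // it must not be called on a failed match, hence the Success check.
        return match.Success ? match.Result(template) : null;
    }

    public static void Main()
    {
        string pattern = @"href=""(?<url>[^""]*list_[^""]*)""";
        string template = "http://www.whatever.com/${url}";
        string html = "<a href=\"some_dir/list_20050815100225.csv\">";

        // Prints: http://www.whatever.com/some_dir/list_20050815100225.csv
        Console.WriteLine(Build(html, pattern, template));
    }
}
```

It is still two strings (a pattern and a template), but the C# code no longer needs to know the site root or the group name.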
