
C SHARP - Parsing URL for Variable

Hi everyone,

I am building a web crawler and one of the features which I need to
include is exclusion of specified 'variable + value' from the url.

For example, suppose the user wants to exclude the variable "s".

Take this URL:
"http://www.goldenretrieverforum.com/search.php?s=5817617a59fb630a7f40846e4a29efc1&do=getdaily"

It has a variable 's' with a value, plus some other variables.

I need code that shortens that URL to this:
"http://www.goldenretrieverforum.com/search.php?do=getdaily"
removing variable 's' completely.

But it needs to be smart enough that, if variable 's' is the
last variable in the link, like this:

"http://www.goldenretrieverforum.com/search.php?s=5817617a59fb630a7f40846e4a29efc1"

it would correctly reduce it to:
"http://www.goldenretrieverforum.com/search.php"
Can someone help me write a regex, or point me to a site that already
has such a regex?

Or is there any other way to do this?

Thanks a lot for your time and help.

Joe

Nov 17 '05 #1
gs
Try a regex.
I am not good at this, but here is a hint and an example I saw: a
sample pattern that splits a URL into 5 parts: scheme,
authority, path, query, fragment
"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
"^(([^:/?#]+):)? // scheme (group 2)
(//([^/?#]*))? // authority (group 4)
([^?#]*) // path (group 5)
(\?([^#]*))? // query (group 7)
(#(.*))?" // fragment (group 9)
Use Match.Groups to read the parts.

For detailed instructions, search for regex on MSDN.
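
The five-part split described above might look like this in C#. This is only a sketch: it uses the pattern from RFC 3986, Appendix B, written in standard .NET regex syntax, and the class and method names (UrlSplitter, Split) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlSplitter
{
    // Pattern from RFC 3986, Appendix B. Groups 2, 4, 5, 7 and 9 hold
    // the scheme, authority, path, query and fragment respectively.
    private static readonly Regex UrlParts = new Regex(
        @"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?");

    // Returns { scheme, authority, path, query, fragment }.
    // A part that is absent from the URL comes back as an empty string.
    public static string[] Split(string url)
    {
        Match m = UrlParts.Match(url);
        return new[]
        {
            m.Groups[2].Value, m.Groups[4].Value, m.Groups[5].Value,
            m.Groups[7].Value, m.Groups[9].Value
        };
    }
}
```

Once the query part is isolated this way, the variable can be removed from it without touching the rest of the URL.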

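A pattern for your s variable and its value could be something like
"s=[a-z0-9]*&"

You could even use Regex.Replace to remove whatever matches "s=[a-z0-9]*&"
from the URL. Note that this only matches when another variable follows s,
so the case where s is last needs separate handling.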
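
A rough sketch of that removal with Regex.Replace, extended so that it also copes with the variable being last. The names QueryRegexCleaner and RemoveVariable are hypothetical, and the sketch assumes every variable is written as name=value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QueryRegexCleaner
{
    public static string RemoveVariable(string url, string name)
    {
        // Match "name=value" only directly after '?' or '&', so that
        // removing "s" does not touch a variable such as "cats".
        // The trailing '&' is optional so the last variable matches too.
        string pattern = "(?<=[?&])" + Regex.Escape(name) + "=[^&#]*&?";
        string result = Regex.Replace(url, pattern, "");

        // Drop a '?' or '&' left dangling at the end of the query string
        // (that is, right before a '#' fragment or the end of the URL).
        return Regex.Replace(result, "[?&](?=#|$)", "");
    }
}
```

Calling RemoveVariable(url, "s") on the two URLs from the question should give ".../search.php?do=getdaily" and ".../search.php". One side effect of the second Replace is that a URL which already ended in a bare '?' or '&' loses that character as well.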

Nov 17 '05 #2
Hi, thanks. When it comes to regex I am completely lost.
Is there anyone who could write this (supposedly) simple regex for
removing variable 's' from the URL?
Joe

Nov 17 '05 #3
Given:

http://nowhere.com/folder/file.txt?v1=hello%20world

the expression could be:

(?<variable> (?<name> .*?) = (?<value> .*?) (?: & | $ ) )

Use a "Uri" object to grab only the query string, and to set it again later if required.

The above expression should parse all name=value pairs from a query string once its leading "?" is stripped.
(Must use RegexOptions.IgnorePatternWhitespace and RegexOptions.Singleline)

You can use match.Groups["name"].Value, etc. to obtain the strings to compare.
You can use each group's Index and Length to manipulate the query string.
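
Put together, the approach might look roughly like the sketch below. It is an illustration, not tested production code: QueryPairFilter and RemoveVariable are invented names, the pattern is a stricter variant of the one above (names may not contain '&' or '=', and a pair must start at the beginning of the string or after '&'), and it uses UriBuilder to take the query string out and put it back.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QueryPairFilter
{
    // One name=value pair, starting at the beginning of the query string
    // or right after '&', and ending at '&' or the end of the string.
    private static readonly Regex Pair = new Regex(
        @"(?<= ^ | & ) (?<name> [^&=]+ ) = (?<value> [^&]* ) (?: & | $ )",
        RegexOptions.IgnorePatternWhitespace);

    public static string RemoveVariable(string url, string name)
    {
        var builder = new UriBuilder(url);

        // The Query property includes the leading '?'; strip it before parsing.
        string query = builder.Query.TrimStart('?');

        // Delete every pair whose name matches; keep all other text as it is.
        string cleaned = Pair.Replace(query,
            m => m.Groups["name"].Value == name ? "" : m.Value);

        // Removing the last pair can leave a trailing '&' behind.
        builder.Query = cleaned.TrimEnd('&');
        return builder.Uri.AbsoluteUri;
    }
}
```

Because the URL goes through UriBuilder and Uri, it may come back normalized (for example with a lower-cased host name), which is usually acceptable for a crawler but worth knowing about.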

--
Dave Sexton
dave@www..jwaonline..com

Nov 17 '05 #4
Hello,

I wrote this tutorial a while ago on parsing using the URI class:
http://www.geekpedia.com/tutorial68_...URI-Class.html
Hopefully it will get you on the right path, at least.
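
As a small illustration of what the Uri class offers here (this is not code from the tutorial, and the helper names UriParts, WithoutQuery and QueryOf are invented), the query string can be read or dropped without any regex:

```csharp
using System;

public static class UriParts
{
    // Returns scheme, authority and path, without the query string or fragment.
    public static string WithoutQuery(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Path);
    }

    // Returns the query string including its leading '?',
    // or an empty string when the URL has no query.
    public static string QueryOf(string url)
    {
        return new Uri(url).Query;
    }
}
```

WithoutQuery covers the case where 's' is the only variable; when other variables must be kept, the string returned by QueryOf still has to be filtered before it is appended again.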

Andrei


Nov 17 '05 #5

