Splitting URLs

Hi,
I'm looking for a function to split URLs into their component parts, i.e.
protocol, host, path, filename, extension. I'm really only looking for
path and hostname (so I can download a webpage over sockets using C++).
Something equivalent to PHP's 'explode' function would be fine, or
even better PHP's 'spliturl' function :).

Alternatively, if someone could recommend a better way to download data
(i.e. ASCII) into an array of some type (preferably char) or even a string,
given a URL, that would be even better.

I have trawled google and google groups for something to this effect, but
have come up dry...surely someone would have wanted to do this before - or
is there a standard library for this sort of thing that I'm missing?

Thanks in advance,
Patrick
Jul 19 '05 #1
Patrick Coleman wrote:
> Hi,
> I'm looking for a function to split URLs into their component parts, i.e.
> protocol, host, path, filename, extension. I'm really only looking for
> path and hostname (so I can download a webpage over sockets using C++).
> Something equivalent to PHP's 'explode' function would be fine, or
> even better PHP's 'spliturl' function :).
>
> Alternatively, if someone could recommend a better way to download data
> (i.e. ASCII) into an array of some type (preferably char) or even a string,
> given a URL, that would be even better.
>
> I have trawled google and google groups for something to this effect, but
> have come up dry...surely someone would have wanted to do this before - or
> is there a standard library for this sort of thing that I'm missing?
>
> Thanks in advance,
> Patrick

I have attached a part of a C++ util library that I am working on.

It is licensed under the GPL.

Usage:

#include <iostream>
#include "at_url_parse.h"

int main()
{
    AT_UrlParser x( "http://foo/path" );

    std::cout << "Path is " << x.m_path << "\n";
}
It fixes "../.." stuff and decodes "%xx" from the URL as well.

It will also combine two URLs.
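
To give a feel for what "fixing ../.." and "combining" mean before wading through the listing, here is a cut-down, self-contained sketch of the same idea. It is not the attached code: collapse_dots and combine_paths are names made up for this illustration, and, like the attached code, it only handles "/./" and "/../" that are followed by a slash.

```cpp
#include <string>

// Collapse "/./" and "xxx/../" sequences in a path.
// (collapse_dots is a hypothetical name used only in this sketch.)
std::string collapse_dots( const std::string & path )
{
    std::string out;
    std::string::size_type i = 0;

    while ( i < path.size() ) {
        if ( path.compare( i, 3, "/./" ) == 0 ) {
            i += 2;                 // skip "/." and rescan from the next '/'
            continue;
        }
        if ( path.compare( i, 4, "/../" ) == 0 ) {
            i += 3;                 // rescan from the trailing '/'
            const std::string::size_type slash = out.rfind( '/' );
            // drop the last segment written so far, including its '/'
            out.erase( slash == std::string::npos ? 0 : slash );
            continue;
        }
        out += path[ i ];
        ++ i;
    }
    return out;
}

// Resolve a reference found in a page against the path of that page.
// (combine_paths is also a hypothetical name.)
std::string combine_paths( const std::string & base_path,
                           const std::string & ref )
{
    if ( ! ref.empty() && ref[ 0 ] == '/' ) {
        return collapse_dots( ref );    // already absolute
    }

    // keep the base path up to and including its last '/'
    const std::string::size_type slash = base_path.rfind( '/' );
    const std::string dir = ( slash == std::string::npos )
        ? std::string()
        : base_path.substr( 0, slash + 1 );

    return collapse_dots( dir + ref );
}
```

So a link "../img/x.png" found in a page at "/a/b/page.html" comes out as "/a/img/x.png".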

I wrote this thing a very long time ago and I don't like some things
about it.

Anyhow, I hope it helps !

G

---------------- at_url_parse.h --------------------

//
//
// at_url_parse.h
//
//

#ifndef __at_url_parse_h__
#define __at_url_parse_h__

#include <string>
// ======== AT_UrlString ==============================================
/**
* AT_UrlString is a std::string but also contains a flag to indicate
* if it is set or not. Hence a null string vs an empty string.
* It contains a couple of methods to automatically set the
* m_is_set flag but is by no means complete. It is used as
* a way for the AT_UrlParser to indicate if a field is set
* or not.
*/

class AT_UrlString
: public std::string
{
public:

bool m_is_set;

AT_UrlString( const std::string & i_string )
: std::string( i_string ),
m_is_set( true )
{
}

AT_UrlString()
: m_is_set( false )
{
}

AT_UrlString & assign( AT_UrlString::const_iterator i1,
AT_UrlString::const_iterator i2 )
{
m_is_set = true;
std::string::assign( i1, i2 );

return * this;
}

AT_UrlString & operator=( const std::string & i_str )
{
m_is_set = true;
std::string::assign( i_str );

return * this;
}

/**
* UrlDecode
* This will decode this string replacing %xx and + characters
* with the pre-encoded equivalents.
*/

void UrlDecode();

AT_UrlString & assign_encoded( const std::string & i_str )
{
* this = i_str;
UrlDecode();
return * this;
}

AT_UrlString & assign_encoded( AT_UrlString::const_iterator i1,
AT_UrlString::const_iterator i2 )
{
assign( i1, i2 );
UrlDecode();
return * this;
}

};
// ======== AT_UrlParser ============================================
/**
* Parsing class for the basic elements of a network URL
*
* See RFC 1738:
*
*/

class AT_UrlParser
{
public:

AT_UrlParser();

AT_UrlParser(
const AT_UrlString & i_url,
std::string * o_error_message = 0
);

AT_UrlParser(
const char * i_url,
std::string * o_error_message = 0
);

/**
* Parse the passed in URL.
*
* @param url is the url string to be parsed
* @return true if the url parsing was successful
*/

bool Parse(
const AT_UrlString & i_url,
std::string * o_error_message = 0
);

bool Parse(
const char * url,
std::string * o_error_message = 0
);
/**
* CombineHostURL
* Combine this URL with the URL of the hosturl. This merges
* 2 urls as though they are the normalized or host (hosturl) of a
* web page and an embedded (this) url in a web page.
*
* @param hosturl the normalized url to fill in the blanks of this.
*/

void CombineHostURL( const AT_UrlParser & hosturl );
/**
* WriteURL
* Write a URL given the data in this URL object
*
* @return a string of the generated url.
*/

std::string WriteURL();

enum {
InitialDefaultPortNo = 80
};

AT_UrlString m_scheme;
AT_UrlString m_host;
AT_UrlString m_port;
AT_UrlString m_user;
AT_UrlString m_pass;
AT_UrlString m_path;

AT_UrlString m_parse_error;
};
#endif // __at_url_parse_h__
---------------- at_url_parse.cpp --------------------
//
// at_url_parse.cpp
//
//

#include "at_url_parse.h"

#include <sstream>

static bool UrlIsXDigit( AT_UrlString::value_type i_char )
{
switch ( i_char )
{
case '0' :
case '1' :
case '2' :
case '3' :
case '4' :
case '5' :
case '6' :
case '7' :
case '8' :
case '9' :
case 'A' :
case 'B' :
case 'C' :
case 'D' :
case 'E' :
case 'F' :
case 'a' :
case 'b' :
case 'c' :
case 'd' :
case 'e' :
case 'f' :
return true;
}
return false;
}
// ======== AT_UrlString::UrlDecode ===================================
// PURPOSE:
// decode the string in place.
//

void AT_UrlString::UrlDecode()
{

AT_UrlString::const_iterator p_read;
AT_UrlString::const_iterator p_end;
AT_UrlString::iterator p_write;

value_type l_char;
size_type l_num_reduce = 0;

p_write = begin();
p_read = begin();
p_end = end();

while ( p_read != p_end ) {

l_char = * p_read;

if (
( l_char == '%' ) &&
( p_read+1 != p_end ) && UrlIsXDigit( p_read[1] ) &&
( p_read+2 != p_end ) && UrlIsXDigit( p_read[2] )
) {

// Quickly convert from two hex digits to one character.
* p_write =
( ( (p_read[1] & 0xf) + ((p_read[1] >= 'A') ? 9 : 0) )
<< 4 )
| ( (p_read[2] & 0xf) + ((p_read[2] >= 'A') ? 9 : 0) )
;

p_read += 2;
l_num_reduce += 2;

} else if ( l_char == '+' ) {
// Undo the encoding that replaces spaces with plus signs.
* p_write = ' ';
} else {
* p_write = l_char;
}

p_write ++;
p_read ++;
}

if ( l_num_reduce ) {
resize( size() - l_num_reduce );
}
}

AT_UrlParser::AT_UrlParser(
const AT_UrlString & i_url,
std::string * o_error_message
) {
Parse( i_url, o_error_message );
}

AT_UrlParser::AT_UrlParser(
const char * i_url,
std::string * o_error_message
) {
Parse( i_url, o_error_message );
}
// ======== RemoveDotDot ======================================
//
// Remove ".." - xxx/../ and /./ parts of the path. Some servers don't
// allow '/..' sequences since it's a potential security threat. This
// mimics what netscape does. The string passed is modified in place.
//

static void RemoveDotDot( AT_UrlString & path )
{
    if ( path.length() == 0 ) {
        return;
    }

    // Build the cleaned path in a separate buffer so that no iterator
    // ever moves before begin() or reads past end().
    std::string l_out;
    l_out.reserve( path.length() );

    std::string::size_type l_pos = 0;

    while ( l_pos < path.length() ) {

        if ( path.compare( l_pos, 3, "/./" ) == 0 ) {
            // "/./" - skip "/." and rescan from the second '/'
            l_pos += 2;
            continue;
        }

        if ( path.compare( l_pos, 4, "/../" ) == 0 ) {
            // "/../" - drop the last segment written so far, including
            // its leading '/', then rescan from the trailing '/'
            l_pos += 3;
            const std::string::size_type l_slash = l_out.rfind( '/' );
            l_out.erase( ( l_slash == std::string::npos ) ? 0 : l_slash );
            continue;
        }

        l_out += path[ l_pos ];
        ++ l_pos;
    }

    // assign through the base class so that m_is_set is left untouched
    static_cast< std::string & >( path ) = l_out;

} // end RemoveDotDot

// ======== AT_UrlParser::Parse =======================================
// PURPOSE:
// Parses a URL into its parts
//

bool AT_UrlParser::Parse(
const char * url,
std::string * o_error_message
) {
AT_UrlString url_str( url );

return Parse( url_str, o_error_message );
}

bool AT_UrlParser::Parse(
const AT_UrlString & url_str,
std::string * o_error_message
) {

if ( url_str == "" )
{

m_parse_error = "Empty url string";

if ( o_error_message )
{
* o_error_message = m_parse_error;
}
return false;
}

AT_UrlString::const_iterator url = url_str.begin();

// initialize all the parts.
m_parse_error =
m_scheme =
m_host =
m_port =
m_user =
m_pass =
m_path = AT_UrlString();

url_str.c_str(); // null terminate the string
AT_UrlString::const_iterator str = url;
AT_UrlString::const_iterator ostr = url;
int state = 0;

AT_UrlString user_or_host;
bool passwd_or_port = false;

char ch = 1;

// looking for scheme:
while ( 1 ) {

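// treat the end of the string as a NUL sentinel instead of
// dereferencing end()
ch = ( str != url_str.end() ) ? * str : '\0';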

//
// The following state machine scans URL's - the following is
// an extended BNF of the syntax
//
// user_opt_pass = user [ ':' password ] .
//
// host_opt_port = host [ ':' port ] .
//
// net_spec =
// ( "//" user_opt_pass '@' host_opt_port )
// | ( "//" host_opt_port )
// .
//
// url = ( scheme ":" net_spec '/' url_path )
// | ( net_spec '/' url_path )
// | ( '/' url_path )
// | ( scheme ":" '/' url_path )
// | ( scheme ":" url_path )
// | ( url_path )
// .
//

#define grab( part ) part.assign_encoded( ostr, str )

switch ( state ) {
case 21 : {
// scanning port
switch ( ch ) {
case '/' : {
grab( m_port );
ostr = str; // include '/' in path
goto grab_rest_as_path;
}
case '\0' : {
grab( m_port );
goto done;
}
}
break;
}
case 13 : {
// scanning host
switch ( ch ) {
case '/' : {
goto grab_host_grab_rest_as_path;
}
case ':' : {
state = 21;
grab( m_host );
ostr = str + 1; // discard ':'
break;
}
case '\0' : {
grab( m_host );
goto done;
}
}
break;
}
case 12 : {
// scanning password or port
switch ( ch ) {
case '/' : {
m_host = user_or_host;
grab( m_port );
m_port.m_is_set = passwd_or_port;
ostr = str;
goto grab_rest_as_path;
}
case '@' : {
state = 13;
// user or host is really user
m_user = user_or_host;
m_pass.m_is_set = passwd_or_port;
// collect the password
grab( m_pass );
ostr = str + 1; // discard the '@'
break;
}
case '\0' : {
// no path was set !
m_host = user_or_host;
m_port.m_is_set = passwd_or_port;
grab( m_port );
goto done;
}
}
break;
}
case 9 : {
// scanning user or host
switch ( ch ) {
case '/' : {
grab_host_grab_rest_as_path:
grab( m_host );
ostr = str;
goto grab_rest_as_path;
}
case ':' : {
state = 12;
grab( user_or_host );
passwd_or_port = true;
ostr = str + 1; // skip over the ':'
break;
}
case '@' : {
state = 13;
grab( m_user );
ostr = str + 1; // skip over the '@'
break;
}
case '\0' : {
grab( m_host );
goto done;
}
}
break;
}
case 1 : {
// scanning a '//' or '/path'
switch ( ch ) {
case '/' : {
// this is the second '/' in '//'
state = 9;
// the '//' is not significant - need to
// move the output pointer
ostr = str + 1;
break;
}
default : {
goto grab_rest_as_path;
}
}
break;
}
case 0 : {
// start state - possibly a '//' or '/' or 'scheme:' or path
switch ( ch ) {
case '/' : {
// a url beginning with '/'
state = 1;
break;
}
case ':' : {
// Strings that start with ':' are paths - weird
// but that's what happens
goto grab_rest_as_path;
}
case '\0' : {
// the empty string is significant as an empty path
goto grab_rest_as_path;
}
default : {
state = 3;
}
}
break;
}
case 3 : {
// scanning a path or scheme
switch ( ch ) {
case ':' : {
state = 2;
grab( m_scheme );
ostr = str + 1; // skip over the ':'
break;
}
case '\0' : {
// no ':' in url and it does not start with '/'
goto grab_rest_as_path;
}
}
break;
}
case 2 : {
switch ( ch ) {
case '/' : {
// this is the first '/' in '://'
state = 1;
break;
}
default : {
// the rest is url_path
grab_rest_as_path:
m_path.assign_encoded( ostr, url_str.end() );
goto done;
}
}
break;
}

} // switch ( state )

str ++;
}

done:
// parsing is finished

RemoveDotDot( m_path );

return true;

} // end AT_UrlParser::Parse
// ======== AT_UrlParser::CombineHostURL ========================
// PURPOSE:
// Complete the bits of a url.
//
// RETURNS:
//
//

void AT_UrlParser::CombineHostURL( const AT_UrlParser & host )
{
// use the host scheme if one is not defined
if ( ( ! m_scheme.m_is_set ) && host.m_scheme.m_is_set ) {
m_scheme = host.m_scheme;
}

// use the host network specifier if one is not defined
if ( ( ! m_host.m_is_set && ! m_user.m_is_set ) &&
host.m_host.m_is_set ) {

m_host = host.m_host;

if ( host.m_port.m_is_set ) {
m_port = host.m_port;
} else {
m_port = AT_UrlString();
}

m_user = host.m_user;

// use the same password as the host.
if ( host.m_pass.m_is_set ) {
m_pass = host.m_pass;
} else {
m_pass = AT_UrlString();
}
}

// Path is special since we need to combine it by using
// file path rules.

if ( ! m_path.m_is_set ) {
m_path = host.m_path;
} else if ( host.m_path.m_is_set ) {
if ( m_path.empty() || m_path[ 0 ] != '/' ) {

// we have a relative path - need to combine it with the
// host path.

// keep the host path up to and including its last '/'
// (nothing at all if the host path contains no '/')
std::string newpath;
const std::string::size_type l_slash = host.m_path.rfind( '/' );

if ( l_slash != std::string::npos ) {
newpath.assign( host.m_path, 0, l_slash + 1 );
}

newpath.append( m_path );
m_path = newpath;

RemoveDotDot( m_path );
}
}

return;

} // end AT_UrlParser::CombineHostURL
// ======== AT_UrlParser::WriteURL ==============================
// PURPOSE:
// Create a string that reflects this URL. The string is
// returned by value, so the caller has nothing to free.
//
// RETURNS:
// std::string that contains url
//

std::string AT_UrlParser::WriteURL()
{

std::ostringstream l_ostrm;

// Need to construct a url string

std::string l_slashes;
std::string l_atsign;

if ( m_scheme.m_is_set ) {
l_ostrm << m_scheme << ":";
l_slashes = "//";
}

if ( m_user.m_is_set ) {
l_ostrm << l_slashes;
l_slashes = "";
l_atsign = "@";
l_ostrm << m_user;
}

if ( m_pass.m_is_set ) {
l_ostrm << l_slashes;
l_slashes = "";
l_atsign = "@";
l_ostrm << ":" << m_pass;
}

if ( m_host.m_is_set ) {
l_ostrm << l_slashes;
l_slashes = "";
l_ostrm << l_atsign << m_host;
l_atsign = "";
}

if ( m_port.m_is_set ) {
l_ostrm << l_slashes;
l_slashes = "";
l_ostrm << l_atsign << ":" << m_port;
}

if ( m_path.m_is_set ) {
l_ostrm << m_path;
}

return l_ostrm.str();

} // end AT_UrlParser::WriteURL

----------------- end -----------------------------------
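
For the original question (just the host name and the path, to feed to a socket), something far smaller may be enough. The following is only a sketch under stated assumptions: UrlParts and split_url are hypothetical names, the URL must contain "://", and any "user:pass@" or ":port" part is left inside the host string rather than split out. Nothing is percent-decoded and no dot segments are removed.

```cpp
#include <string>

// Hypothetical helper type for this sketch.
struct UrlParts {
    std::string host;   // everything between "://" and the first '/'
    std::string path;   // the first '/' and everything after it
};

// Split "scheme://host/path" into host and path.
// Returns false if the input has no "://" separator.
bool split_url( const std::string & url, UrlParts & out )
{
    const std::string::size_type scheme_end = url.find( "://" );
    if ( scheme_end == std::string::npos ) {
        return false;
    }

    const std::string::size_type host_begin = scheme_end + 3;
    const std::string::size_type path_begin = url.find( '/', host_begin );

    if ( path_begin == std::string::npos ) {
        // no path given - HTTP requests use "/" in that case
        out.host = url.substr( host_begin );
        out.path = "/";
    } else {
        out.host = url.substr( host_begin, path_begin - host_begin );
        out.path = url.substr( path_begin );
    }
    return true;
}
```

With "http://www.example.com/dir/page.html" this yields host "www.example.com" and path "/dir/page.html". The state machine in the listing above is what you want once user names, ports and relative URLs come into play.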

Jul 19 '05 #2
Gianni Mariani wrote:
> I wrote this thing a very long time ago and I don't like some things
> about it.

Such as the illegal use of reserved identifiers for include guards?

-Kevin
--
My email address is valid, but changes periodically.
To contact me please use the address from a recent posting.

Jul 19 '05 #3
Kevin Goodsell wrote:
> Gianni Mariani wrote:
>> I wrote this thing a very long time ago and I don't like some things
>> about it.
>
> Such as the illegal use of reserved identifiers for include guards?

I'm sure that these are reserved for *my* use. Yep, it's in a copy of
the standard sitting on my shelf.

Right there in Section 3, Paragraph 61.GM :

"All reserved identifiers are for use by Gianni Mariani. He has sole
discretion for their use."

So, I think I'm all OK there. Phew, you had me worried for a sec.

.....
In particular I don't like a few things like:

a) How the "m_path" variable is used. I think there needs to be another
class for managing m_path that inherits AT_UrlString.

b) I think that it should work for wchar_t as well as char, so moving
a few things into templates should do it.

c) It does not work correctly for non-ASCII encodings.

d) Nasty "grab" #define ... gotta go.
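
As a rough idea of where (b) could go, here is a sketch of the "%xx" decoding written as a template over the character type, so the same code serves std::string and std::wstring. This is not part of the library above: hex_value and url_decode are hypothetical names, it returns a new string instead of decoding in place, and it still assumes an ASCII-compatible character set, so it does nothing for (c).

```cpp
#include <string>

// Value of a hex digit, or -1 if c is not one.
// Assumes 'a'..'f' and 'A'..'F' are contiguous, as they are in ASCII.
template < typename CharT >
int hex_value( CharT c )
{
    if ( c >= CharT( '0' ) && c <= CharT( '9' ) ) return c - CharT( '0' );
    if ( c >= CharT( 'a' ) && c <= CharT( 'f' ) ) return c - CharT( 'a' ) + 10;
    if ( c >= CharT( 'A' ) && c <= CharT( 'F' ) ) return c - CharT( 'A' ) + 10;
    return -1;
}

// Decode "%xx" escapes and '+' signs, for any character type.
template < typename CharT >
std::basic_string< CharT > url_decode( const std::basic_string< CharT > & in )
{
    typedef typename std::basic_string< CharT >::size_type size_type;

    std::basic_string< CharT > out;
    out.reserve( in.size() );

    for ( size_type i = 0; i < in.size(); ++ i ) {

        // a '%' needs two more characters after it to be an escape
        if ( in[ i ] == CharT( '%' ) && i + 2 < in.size() ) {
            const int hi = hex_value( in[ i + 1 ] );
            const int lo = hex_value( in[ i + 2 ] );

            if ( hi >= 0 && lo >= 0 ) {
                out += static_cast< CharT >( hi * 16 + lo );
                i += 2;             // skip the two hex digits
                continue;
            }
        }

        // undo the encoding that replaces spaces with plus signs;
        // anything else (including a stray '%') is copied unchanged
        out += ( in[ i ] == CharT( '+' ) ) ? CharT( ' ' ) : in[ i ];
    }
    return out;
}
```

The same treatment applied to the "grab" step, as a small member function taking the two iterators, would also take care of (d).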

Jul 19 '05 #4
