473,395 Members | 1,766 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,395 software developers and data experts.

Efficient URL-decoding.

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

3. Code use iostreams, so it's locale specific
//------------- code begins ------------------------------
#include <iostream>
#include <string>
#include <sstream>
std::string URLdecode(const std::string& l)
{
std::ostringstream L;
for(std::string::size_type x=0;x<l.size();++x)
switch(l[x])
{
case('+'):
{
L<<' ';
break;
}
case('%'): // Convert all %xy hex codes into ASCII characters.
{
const std::string hexstr(l.substr(x+1,2)); // xy part of %xy.
x+=2; // Skip over hex.
if(hexstr=="26" || hexstr=="3D")
// Do not alter URL delimeters.
L<<'%'<<hexstr;
else
{
std::istringstream hexstream(hexstr);
int hexint;
hexstream>>std::hex>>hexint;
L<<static_cast<char>(hexint);
}
break;
}
default: // Copy anything else.
{
L<<l[x];
break;
}
}
return L.str();
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------

Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?
Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/
Jun 17 '06 #1
5 6332
Peter Jansson wrote:
Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?


Where are its unit tests? And note that refactors which clean the code up
will often squeeze out flab, too...

--
Phlip
http://c2.com/cgi/wiki?ZeekLand <-- NOT a blog!!!
Jun 17 '06 #2
On Sat, 17 Jun 2006 11:53:09 GMT, Peter Jansson
<we*******@jansson.net> wrote:
Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior. You need to check that there is enough space left after the % sign and
that l[x + 1] and l[x + 2] are both hex digits. Use isxdigit(char) to
test them.

'l' is a *horrible* name for a variable, use URLin or something. 'L'
is almost as bad, use URLout. Only use single letter names for loops
or for some mathematical formulae where the single letters are
understood: E = Mc^2.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers', Converting two hex digits to an integer is easy enough to do yourself.
You don't need all the overhead of a stringstream. You need a way to
convert each hex digit character to a value in the range 0 ... 15 and
then multiply the first by 16 and add the second.

Since all you are doing with the outer stringstream is adding
characters to the end of a string, you can replace it with
string.append(), string.push_back() or +=.

3. Code use iostreams, so it's locale specific Remove the stringstreams.
As a complete alternative you could use a single replace() method with
three parameters: the string to look in, the string to find and the
character to replace it with. Your code would then look something
like:

void replace(std::string& baseStr, const char* target,
char newChar) { ... }

std::string URLdecode(const std::string& URLin)
std::string URLout(URLin);
replace(URLout, "+", ' ');
replace(URLout, "%20", ' ');
replace(URLout, "%2f", '/');
replace(URLout, "%2F", '/');
replace(URLout, "%3a", ':');
replace(URLout, "%3A", ':');
replace(URLout, "%3f", '?');
replace(URLout, "%3F", '?');

// %25 = '%' last to avoid problems like %252f
replace(URLout, "%25", '%');

return URLout;
}

The second version is probably only useful if there is a limited range
of values that will have to be translated from %XX to a character. I
used the second version in my own URL converter (C# so the Replace()
function came for free).

rossum


//------------- code begins ------------------------------
#include <iostream>
#include <string>
#include <sstream>
std::string URLdecode(const std::string& l)
{
std::ostringstream L;
for(std::string::size_type x=0;x<l.size();++x)
switch(l[x])
{
case('+'):
{
L<<' ';
break;
}
case('%'): // Convert all %xy hex codes into ASCII characters.
{
const std::string hexstr(l.substr(x+1,2)); // xy part of %xy.
x+=2; // Skip over hex.
if(hexstr=="26" || hexstr=="3D")
// Do not alter URL delimeters.
L<<'%'<<hexstr;
else
{
std::istringstream hexstream(hexstr);
int hexint;
hexstream>>std::hex>>hexint;
L<<static_cast<char>(hexint);
}
break;
}
default: // Copy anything else.
{
L<<l[x];
break;
}
}
return L.str();
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------

Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?
Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/


Jun 17 '06 #3
rossum wrote:
On Sat, 17 Jun 2006 11:53:09 GMT, Peter Jansson
<we*******@jansson.net> wrote:

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.


You need to check that there is enough space left after the % sign and
that l[x + 1] and l[x + 2] are both hex digits. Use isxdigit(char) to
test them.

'l' is a *horrible* name for a variable, use URLin or something. 'L'
is almost as bad, use URLout. Only use single letter names for loops
or for some mathematical formulae where the single letters are
understood: E = Mc^2.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',


Converting two hex digits to an integer is easy enough to do yourself.
You don't need all the overhead of a stringstream. You need a way to
convert each hex digit character to a value in the range 0 ... 15 and
then multiply the first by 16 and add the second.

Since all you are doing with the outer stringstream is adding
characters to the end of a string, you can replace it with
string.append(), string.push_back() or +=.

3. Code use iostreams, so it's locale specific


Remove the stringstreams.
As a complete alternative you could use a single replace() method with
three parameters: the string to look in, the string to find and the
character to replace it with. Your code would then look something
like:

void replace(std::string& baseStr, const char* target,
char newChar) { ... }

std::string URLdecode(const std::string& URLin)
std::string URLout(URLin);
replace(URLout, "+", ' ');
replace(URLout, "%20", ' ');
replace(URLout, "%2f", '/');
replace(URLout, "%2F", '/');
replace(URLout, "%3a", ':');
replace(URLout, "%3A", ':');
replace(URLout, "%3f", '?');
replace(URLout, "%3F", '?');

// %25 = '%' last to avoid problems like %252f
replace(URLout, "%25", '%');

return URLout;
}

The second version is probably only useful if there is a limited range
of values that will have to be translated from %XX to a character. I
used the second version in my own URL converter (C# so the Replace()
function came for free).

rossum


Thank you for your input. Well I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?

Any more ideas/comments?

Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/
//------------- code begins ------------------------------
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sstream>
// hex2dec convert from base 16 to base 10, strtol could be used...
inline int hex2dec(const char& hex)
{
return ((hex>='0'&&hex<='9')?(hex-'0'):(std::toupper(hex)-'A'+10));
}
std::string URLdecode(const std::string& URLin)
{
std::string URLout;
const char* in(URLin.c_str());
for(;*in;in++)
{
if(*in!='%'||!std::isxdigit(in[1])||!std::isxdigit(in[2]))
{
if(*in=='+')
URLout+=' ';
else
URLout+=*in;
}
else // Convert all %xy hex codes into ASCII characters.
{
if( (in[1]=='2' && in[2]=='6')
|| (in[1]=='3' && (in[2]=='d'||in[2]=='D')))
{ // Do not alter URL delimeters.
URLout+='%';
URLout+=in[1];
URLout+=in[2];
}
else
URLout+=static_cast<char>(hex2dec(in[1])*16+hex2dec(in[2]));
in+=2;
}
}
return URLout;
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------
Jun 17 '06 #4
Peter Jansson wrote:
Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

3. Code use iostreams, so it's locale specific


URI encoding and decoding is a really tricky issue, quite apart from
the issues in the code that you have (that others have already helped
with). I think maybe I can shed some light on your third question
though. It's all a little off topic for this forum, but important. I
hope everybody indulges me :-)

The URI is split into several parts, but for HTTP(S) the only parts
that will be encoded are the file specification and the query string
(if present). Note that they are encoded _differently_. They are
seperated by a single question mark (?).

If you are decoding somebody else's URI then stop right now. You don't
know enough about it to be able to work it out. By all means give it a
go, but don't expect that it will make any sense and _don't_ manipulate
it before using it. You will break something.

The file specfication (which is what it looks like you're decoding) can
be in any locale. These days people are tending to drift towards UTF-8,
but it's by no means universal. If it's a URI you've encoded yourself
then you should know which locale you used.

For the query string, the format is described in the HTML specs, _but_
only browsers have to use that format when they create a query string
from a form submission using GET. Any other URI creation can be
different and in fact the W3C recommends a slightly different format
that doesn't cause common entity problems in HTML. The biggest
difference between the file specification and the query string encoding
is the space. In the file spec they are '%20' and in a query string
they are '+' - many people get this wrong (most clearly Technorati who
have wrecked the C++ blogging community through this error). If you're
trying to decode somebody else's query string you can't rely on the
format at all.

Now, the final thing you need to think about is security. You need to
check and double check the URI for all sorts of things. What checks you
need depends on where the URI came from and what you're doing with it.

I have a number of aticles on my web site about these issues:

Problems with IIS due to HTTP.SYS decoding the file specification:
http://www.kirit.com/Getting%20the%2...ISAPI%20filter

Another problem caused in the custom 404 handling:
http://www.kirit.com/Errors%20in%20I...ror%20handling

Another encoding fault leading to problems with redirections:
http://www.kirit.com/Response.Redire...encoded%20URIs

And even the W3C doesn't get it right:
http://www.kirit.com/W3C%27s%20CSS%2...tion%20service

These articles should give you a feel for the sorts of things that can
go wrong. They link in many places to the relevant RFCs and other specs
and notes. There are some other articles on Unicode that also touch on
other aspecs of these issues. There are also a load of tricky URIs that
you should be able to deal with properly.
So, again, I know this is strictly off-topic for this group. I hope
nobody minds too much though as the issues are pretty important.
K

Jun 18 '06 #5
Peter Jansson wrote:
I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?
Tricky code, but it accidentily works ;)

The reason it works is that you use c_str(). That tacks a \0 on the end
of the char[] returned, even though it's not present in the string
itself.
Now, \0 is not a hexdigit, which implies you won't check after that \0.
Also, you only increment in if you know you saw three non-\0
charachters
which means you haven't hit the end yet. Therefore, it is portable.
Some
comments would be nice, though ;)

BTW: call URLout.reserve(URLin.size()). The input size is a very good
estimate
of the output size. And you seem to have some old headers lying around.

HTH,
Michiel Salters
// hex2dec convert from base 16 to base 10, strtol could be used...
inline int hex2dec(const char& hex)
{
return ((hex>='0'&&hex<='9')?(hex-'0'):(std::toupper(hex)-'A'+10));
}
std::string URLdecode(const std::string& URLin)
{
std::string URLout;
const char* in(URLin.c_str());
for(;*in;in++)
{
if(*in!='%'||!std::isxdigit(in[1])||!std::isxdigit(in[2]))
{
if(*in=='+')
URLout+=' ';
else
URLout+=*in;
}
else // Convert all %xy hex codes into ASCII characters.
{
if( (in[1]=='2' && in[2]=='6')
|| (in[1]=='3' && (in[2]=='d'||in[2]=='D')))
{ // Do not alter URL delimeters.
URLout+='%';
URLout+=in[1];
URLout+=in[2];
}
else
URLout+=static_cast<char>(hex2dec(in[1])*16+hex2dec(in[2]));
in+=2;
}
}
return URLout;
}


Jun 19 '06 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

16
by: Daniel Tonks | last post by:
First, please excuse the fact that I'm a complete MySQL newbie. My site used forum software that I wrote myself (in Perl) which, up until now, has used flat files. This worked fine, however...
12
by: Programmer | last post by:
Is this: int x; for (int count=0; count<200; count++) { x = someFunc(); anotherFunc(x); }
3
by: sandeep | last post by:
Hi i am new to this group and to c++ also though i have the knowledge of "c" and now want to learn c++ and data structure using c/c++ . so could nebody please suggest me some...
2
by: Belmin | last post by:
Hi all, Wanted to know what is the most efficient way of doing a select query for mysql that only returns one value. For example: $mysqli->query('select count(*) from log'); $temprec =...
22
by: Curious | last post by:
Hi, I am searching for a data structure that stores key-value pairs in it. This data structure is to hold large amounts of key-value pairs, and so needs to be efficient both in insertion and...
5
by: Alan Little | last post by:
I have affiliates submitting batches of anywhere from 10 to several hundred orders. Each order in the batch must include an order ID, originated by the affiliate, which must be unique across all...
1
by: =?Utf-8?B?UVNJRGV2ZWxvcGVy?= | last post by:
Using .NET 2.0 is it more efficient to copy files to a single folder versus spreading them across multiple folders. For instance if we have 100,000 files to be copied, Do we copy all of them to...
28
by: Mahesh | last post by:
Hi, I am looking for efficient string cancatination c code. I did it using ptr but my code i think is not efficient. Please help. Thanks a lot
3
by: Ken Fine | last post by:
This is a question that someone familiar with ASP.NET and ADO.NET DataSets and DataTables should be able to answer fairly easily. The basic question is how I can efficiently match data from one...
82
by: Bill David | last post by:
SUBJECT: How to make this program more efficient? In my program, a thread will check update from server periodically and generate a stl::map for other part of this program to read data from....
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.