Efficient URL-decoding.

Peter Jansson

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

3. Code use iostreams, so it's locale specific
//------------- code begins ------------------------------
#include <iostream>
#include <string>
#include <sstream>
std::string URLdecode(const std::string& l)
{
std::ostringstream L;
for(std::string::size_type x=0;x<l.size();++x)
switch(l[x])
{
case('+'):
{
L<<' ';
break;
}
case('%'): // Convert all %xy hex codes into ASCII characters.
{
const std::string hexstr(l.substr(x+1,2)); // xy part of %xy.
x+=2; // Skip over hex.
if(hexstr=="26" || hexstr=="3D")
// Do not alter URL delimeters.
L<<'%'<<hexstr;
else
{
std::istringstream hexstream(hexstr);
int hexint;
hexstream>>std::hex>>hexint;
L<<static_cast<char>(hexint);
}
break;
}
default: // Copy anything else.
{
L<<l[x];
break;
}
}
return L.str();
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------

Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?
Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/

Jun 17 '06 #1

Subscribe Post Reply

6332

Phlip

Peter Jansson wrote:

Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?

Where are its unit tests? And note that refactors which clean the code up
will often squeeze out flab, too...

--
Phlip
http://c2.com/cgi/wiki?ZeekLand <-- NOT a blog!!!

Jun 17 '06 #2

rossum

On Sat, 17 Jun 2006 11:53:09 GMT, Peter Jansson
<we*******@jansson.net> wrote:

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior. You need to check that there is enough space left after the % sign and
that l[x + 1] and l[x + 2] are both hex digits. Use isxdigit(char) to
test them.

'l' is a *horrible* name for a variable, use URLin or something. 'L'
is almost as bad, use URLout. Only use single letter names for loops
or for some mathematical formulae where the single letters are
understood: E = Mc^2.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers', Converting two hex digits to an integer is easy enough to do yourself.
You don't need all the overhead of a stringstream. You need a way to
convert each hex digit character to a value in the range 0 ... 15 and
then multiply the first by 16 and add the second.

Since all you are doing with the outer stringstream is adding
characters to the end of a string, you can replace it with
string.append(), string.push_back() or +=.

3. Code use iostreams, so it's locale specific Remove the stringstreams.
As a complete alternative you could use a single replace() method with
three parameters: the string to look in, the string to find and the
character to replace it with. Your code would then look something
like:

void replace(std::string& baseStr, const char* target,
char newChar) { ... }

std::string URLdecode(const std::string& URLin)
std::string URLout(URLin);
replace(URLout, "+", ' ');
replace(URLout, "%20", ' ');
replace(URLout, "%2f", '/');
replace(URLout, "%2F", '/');
replace(URLout, "%3a", ':');
replace(URLout, "%3A", ':');
replace(URLout, "%3f", '?');
replace(URLout, "%3F", '?');

// %25 = '%' last to avoid problems like %252f
replace(URLout, "%25", '%');

return URLout;
}

The second version is probably only useful if there is a limited range
of values that will have to be translated from %XX to a character. I
used the second version in my own URL converter (C# so the Replace()
function came for free).

rossum

//------------- code begins ------------------------------
#include <iostream>
#include <string>
#include <sstream>
std::string URLdecode(const std::string& l)
{
std::ostringstream L;
for(std::string::size_type x=0;x<l.size();++x)
switch(l[x])
{
case('+'):
{
L<<' ';
break;
}
case('%'): // Convert all %xy hex codes into ASCII characters.
{
const std::string hexstr(l.substr(x+1,2)); // xy part of %xy.
x+=2; // Skip over hex.
if(hexstr=="26" || hexstr=="3D")
// Do not alter URL delimeters.
L<<'%'<<hexstr;
else
{
std::istringstream hexstream(hexstr);
int hexint;
hexstream>>std::hex>>hexint;
L<<static_cast<char>(hexint);
}
break;
}
default: // Copy anything else.
{
L<<l[x];
break;
}
}
return L.str();
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------

Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?
Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/

Jun 17 '06 #3

Peter Jansson

rossum wrote:

On Sat, 17 Jun 2006 11:53:09 GMT, Peter Jansson
<we*******@jansson.net> wrote:

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

You need to check that there is enough space left after the % sign and
that l[x + 1] and l[x + 2] are both hex digits. Use isxdigit(char) to
test them.

'l' is a *horrible* name for a variable, use URLin or something. 'L'
is almost as bad, use URLout. Only use single letter names for loops
or for some mathematical formulae where the single letters are
understood: E = Mc^2.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

Converting two hex digits to an integer is easy enough to do yourself.
You don't need all the overhead of a stringstream. You need a way to
convert each hex digit character to a value in the range 0 ... 15 and
then multiply the first by 16 and add the second.

Since all you are doing with the outer stringstream is adding
characters to the end of a string, you can replace it with
string.append(), string.push_back() or +=.

3. Code use iostreams, so it's locale specific

Remove the stringstreams.
As a complete alternative you could use a single replace() method with
three parameters: the string to look in, the string to find and the
character to replace it with. Your code would then look something
like:

void replace(std::string& baseStr, const char* target,
char newChar) { ... }

std::string URLdecode(const std::string& URLin)
std::string URLout(URLin);
replace(URLout, "+", ' ');
replace(URLout, "%20", ' ');
replace(URLout, "%2f", '/');
replace(URLout, "%2F", '/');
replace(URLout, "%3a", ':');
replace(URLout, "%3A", ':');
replace(URLout, "%3f", '?');
replace(URLout, "%3F", '?');

// %25 = '%' last to avoid problems like %252f
replace(URLout, "%25", '%');

return URLout;
}

The second version is probably only useful if there is a limited range
of values that will have to be translated from %XX to a character. I
used the second version in my own URL converter (C# so the Replace()
function came for free).

rossum

Thank you for your input. Well I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?

Any more ideas/comments?

Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/
//------------- code begins ------------------------------
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sstream>
// hex2dec convert from base 16 to base 10, strtol could be used...
inline int hex2dec(const char& hex)
{
return ((hex>='0'&&hex<='9')?(hex-'0'):(std::toupper(hex)-'A'+10));
}
std::string URLdecode(const std::string& URLin)
{
std::string URLout;
const char* in(URLin.c_str());
for(;*in;in++)
{
if(*in!='%'||!std::isxdigit(in[1])||!std::isxdigit(in[2]))
{
if(*in=='+')
URLout+=' ';
else
URLout+=*in;
}
else // Convert all %xy hex codes into ASCII characters.
{
if( (in[1]=='2' && in[2]=='6')
|| (in[1]=='3' && (in[2]=='d'||in[2]=='D')))
{ // Do not alter URL delimeters.
URLout+='%';
URLout+=in[1];
URLout+=in[2];
}
else
URLout+=static_cast<char>(hex2dec(in[1])*16+hex2dec(in[2]));
in+=2;
}
}
return URLout;
}
int main()
{
for(std::string s;std::getline(std::cin,s);)
{
std::cout<<URLdecode(s)<<'\n';
}
return 0;
}
//--------------------- end of code ----------------------

Jun 17 '06 #4

Kirit Sælensminde

Peter Jansson wrote:

Hello group,

The following code is an attempt to perform URL-decoding of URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code;

1. If characters after '%' do not represent hexademical number, then
uninitialized value variable 'hexint' used - this is undefined behavior.

2. This code is very inefficient - to many mallocs/string
copyings/text-streams processing for such simple operation as 'convert
to hex chars to integers',

3. Code use iostreams, so it's locale specific

URI encoding and decoding is a really tricky issue, quite apart from
the issues in the code that you have (that others have already helped
with). I think maybe I can shed some light on your third question
though. It's all a little off topic for this forum, but important. I
hope everybody indulges me :-)

The URI is split into several parts, but for HTTP(S) the only parts
that will be encoded are the file specification and the query string
(if present). Note that they are encoded _differently_. They are
seperated by a single question mark (?).

If you are decoding somebody else's URI then stop right now. You don't
know enough about it to be able to work it out. By all means give it a
go, but don't expect that it will make any sense and _don't_ manipulate
it before using it. You will break something.

The file specfication (which is what it looks like you're decoding) can
be in any locale. These days people are tending to drift towards UTF-8,
but it's by no means universal. If it's a URI you've encoded yourself
then you should know which locale you used.

For the query string, the format is described in the HTML specs, _but_
only browsers have to use that format when they create a query string
from a form submission using GET. Any other URI creation can be
different and in fact the W3C recommends a slightly different format
that doesn't cause common entity problems in HTML. The biggest
difference between the file specification and the query string encoding
is the space. In the file spec they are '%20' and in a query string
they are '+' - many people get this wrong (most clearly Technorati who
have wrecked the C++ blogging community through this error). If you're
trying to decode somebody else's query string you can't rely on the
format at all.

Now, the final thing you need to think about is security. You need to
check and double check the URI for all sorts of things. What checks you
need depends on where the URI came from and what you're doing with it.

I have a number of aticles on my web site about these issues:

Problems with IIS due to HTTP.SYS decoding the file specification:
http://www.kirit.com/Getting%20the%2...ISAPI%20filter

Another problem caused in the custom 404 handling:
http://www.kirit.com/Errors%20in%20I...ror%20handling

Another encoding fault leading to problems with redirections:
http://www.kirit.com/Response.Redire...encoded%20URIs

And even the W3C doesn't get it right:
http://www.kirit.com/W3C%27s%20CSS%2...tion%20service

These articles should give you a feel for the sorts of things that can
go wrong. They link in many places to the relevant RFCs and other specs
and notes. There are some other articles on Unicode that also touch on
other aspecs of these issues. There are also a load of tricky URIs that
you should be able to deal with properly.
So, again, I know this is strictly off-topic for this group. I hope
nobody minds too much though as the issues are pretty important.
K

Jun 18 '06 #5

Michiel.Salters

Peter Jansson wrote:

I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?
Tricky code, but it accidentily works ;)

The reason it works is that you use c_str(). That tacks a \0 on the end
of the char[] returned, even though it's not present in the string
itself.
Now, \0 is not a hexdigit, which implies you won't check after that \0.
Also, you only increment in if you know you saw three non-\0
charachters
which means you haven't hit the end yet. Therefore, it is portable.
Some
comments would be nice, though ;)

BTW: call URLout.reserve(URLin.size()). The input size is a very good
estimate
of the output size. And you seem to have some old headers lying around.

HTH,
Michiel Salters
// hex2dec convert from base 16 to base 10, strtol could be used...
inline int hex2dec(const char& hex)
{
return ((hex>='0'&&hex<='9')?(hex-'0'):(std::toupper(hex)-'A'+10));
}
std::string URLdecode(const std::string& URLin)
{
std::string URLout;
const char* in(URLin.c_str());
for(;*in;in++)
{
if(*in!='%'||!std::isxdigit(in[1])||!std::isxdigit(in[2]))
{
if(*in=='+')
URLout+=' ';
else
URLout+=*in;
}
else // Convert all %xy hex codes into ASCII characters.
{
if( (in[1]=='2' && in[2]=='6')
|| (in[1]=='3' && (in[2]=='d'||in[2]=='D')))
{ // Do not alter URL delimeters.
URLout+='%';
URLout+=in[1];
URLout+=in[2];
}
else
URLout+=static_cast<char>(hex2dec(in[1])*16+hex2dec(in[2]));
in+=2;
}
}
return URLout;
}

Jun 19 '06 #6

by: Daniel Tonks | last post by:

First, please excuse the fact that I'm a complete MySQL newbie. My site used forum software that I wrote myself (in Perl) which, up until now, has used flat files. This worked fine, however...

MySQL Database

efficient assignment ???

by: Programmer | last post by:

Is this: int x; for (int count=0; count<200; count++) { x = someFunc(); anotherFunc(x); }

C / C++

About Efficient c++

by: sandeep | last post by:

Hi i am new to this group and to c++ also though i have the knowledge of "c" and now want to learn c++ and data structure using c/c++ . so could nebody please suggest me some...

C / C++

MySQL: Most efficient method retrieving single result.

by: Belmin | last post by:

Hi all, Wanted to know what is the most efficient way of doing a select query for mysql that only returns one value. For example: $mysqli->query('select count(*) from log'); $temprec =...

MySQL Database

Efficient Data Structures

by: Curious | last post by:

Hi, I am searching for a data structure that stores key-value pairs in it. This data structure is to hold large amounts of key-value pairs, and so needs to be efficient both in insertion and...

C# / C Sharp

Efficient Uniqueness Check

by: Alan Little | last post by:

I have affiliates submitting batches of anywhere from 10 to several hundred orders. Each order in the batch must include an order ID, originated by the affiliate, which must be unique across all...

PHP

Most efficient way to copy large volume of files

by: =?Utf-8?B?UVNJRGV2ZWxvcGVy?= | last post by:

Using .NET 2.0 is it more efficient to copy files to a single folder versus spreading them across multiple folders. For instance if we have 100,000 files to be copied, Do we copy all of them to...

.NET Framework

Efficient strcat implementation.

by: Mahesh | last post by:

Hi, I am looking for efficient string cancatination c code. I did it using ptr but my code i think is not efficient. Please help. Thanks a lot

C / C++

How can I make this more efficient? (combining DataSet results with the results of a DB lookup.)

by: Ken Fine | last post by:

This is a question that someone familiar with ASP.NET and ADO.NET DataSets and DataTables should be able to answer fairly easily. The basic question is how I can efficiently match data from one...

ASP.NET

How to make this program more efficient?

by: Bill David | last post by:

SUBJECT: How to make this program more efficient? In my program, a thread will check update from server periodically and generate a stl::map for other part of this program to read data from....

C / C++

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Batch import of multiple excel files into the database

by: ryjfgjl | last post by:

If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...

Data Management

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

Efficient URL-decoding.

Similar topics