Efficient URL-decoding.

Hello group,

The following code is an attempt to perform URL-decoding of a URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code:

1. If the characters after '%' do not represent a hexadecimal number, then
the uninitialized variable 'hexint' is used - this is undefined behavior.

2. This code is very inefficient - too many mallocs/string
copies/text-stream processing for such a simple operation as 'convert
two hex chars to an integer'.

3. The code uses iostreams, so it's locale specific.
//------------- code begins ------------------------------
#include <iostream>
#include <string>
#include <sstream>
std::string URLdecode(const std::string& l)
{
    std::ostringstream L;
    for(std::string::size_type x=0;x<l.size();++x)
        switch(l[x])
        {
        case('+'):
            {
                L<<' ';
                break;
            }
        case('%'): // Convert all %xy hex codes into ASCII characters.
            {
                const std::string hexstr(l.substr(x+1,2)); // xy part of %xy.
                x+=2; // Skip over hex.
                if(hexstr=="26" || hexstr=="3D")
                    // Do not alter URL delimiters.
                    L<<'%'<<hexstr;
                else
                {
                    std::istringstream hexstream(hexstr);
                    int hexint;
                    hexstream>>std::hex>>hexint;
                    L<<static_cast<char>(hexint);
                }
                break;
            }
        default: // Copy anything else.
            {
                L<<l[x];
                break;
            }
        }
    return L.str();
}
int main()
{
    for(std::string s;std::getline(std::cin,s);)
    {
        std::cout<<URLdecode(s)<<'\n';
    }
    return 0;
}
//--------------------- end of code ----------------------

Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?
Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/
Jun 17 '06 #1
Peter Jansson wrote:
Do any of you have any suggestion on how this code may be made more
efficient and robust with regards to the three issues above?


Where are its unit tests? And note that refactors which clean the code up
will often squeeze out flab, too...

--
Phlip
http://c2.com/cgi/wiki?ZeekLand <-- NOT a blog!!!
Jun 17 '06 #2
On Sat, 17 Jun 2006 11:53:09 GMT, Peter Jansson
<we*******@jansson.net> wrote:
1. If the characters after '%' do not represent a hexadecimal number, then
the uninitialized variable 'hexint' is used - this is undefined behavior.

You need to check that there are at least two characters left after the
% sign and that l[x + 1] and l[x + 2] are both hex digits. Use
isxdigit(char) to test them.
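
The check described above could be sketched roughly as follows. This is only an illustration: the helper name hasHexPair is made up for this example and does not appear in the posted code. The cast to unsigned char is an addition, there because passing a negative char value to std::isxdigit is undefined.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from the thread): true if s[pos] is '%'
// and it is followed by two hex digits that lie inside the string.
inline bool hasHexPair(const std::string& s, std::string::size_type pos)
{
    // The '%' plus its two digits need three characters starting at pos.
    if (pos >= s.size() || s[pos] != '%' || s.size() - pos < 3)
        return false;
    // Cast to unsigned char: a negative char passed to isxdigit is undefined.
    return std::isxdigit(static_cast<unsigned char>(s[pos + 1])) != 0
        && std::isxdigit(static_cast<unsigned char>(s[pos + 2])) != 0;
}
```

With a check like this in front of the conversion, a trailing "%" or "%4" is never treated as an escape, so no uninitialized value can reach the output.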

'l' is a *horrible* name for a variable, use URLin or something. 'L'
is almost as bad, use URLout. Only use single letter names for loops
or for some mathematical formulae where the single letters are
understood: E = mc^2.

2. This code is very inefficient - too many mallocs/string
copies/text-stream processing for such a simple operation as 'convert
two hex chars to an integer'.

Converting two hex digits to an integer is easy enough to do yourself.
You don't need all the overhead of a stringstream. You need a way to
convert each hex digit character to a value in the range 0 ... 15 and
then multiply the first by 16 and add the second.

Since all you are doing with the outer stringstream is adding
characters to the end of a string, you can replace it with
string.append(), string.push_back() or +=.
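
A minimal sketch of that conversion, assuming an ASCII-compatible character set in which 'a'..'f' and 'A'..'F' are contiguous. The names hexValue and appendDecodedByte are hypothetical and chosen only for this example. No streams and no locale-dependent calls are involved.

```cpp
#include <string>

// Hypothetical helper: value (0..15) of one hex digit, or -1 if c is not one.
// Assumes 'a'..'f' and 'A'..'F' are contiguous, as they are in ASCII.
inline int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: append the byte encoded by the digits hi and lo.
// Returns false and leaves out untouched if either digit is invalid.
inline bool appendDecodedByte(std::string& out, char hi, char lo)
{
    const int h = hexValue(hi);
    const int l = hexValue(lo);
    if (h < 0 || l < 0)
        return false;
    out += static_cast<char>(h * 16 + l); // first digit * 16 + second digit
    return true;
}
```

Returning -1 for a non-digit lets the same function double as the validity test, so a separate isxdigit call becomes optional.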

3. The code uses iostreams, so it's locale specific.

Remove the stringstreams.

As a complete alternative you could use a single replace() function with
three parameters: the string to look in, the string to find and the
character to replace it with. Your code would then look something
like:

void replace(std::string& baseStr, const char* target,
             char newChar) { ... }

std::string URLdecode(const std::string& URLin)
{
    std::string URLout(URLin);
    replace(URLout, "+", ' ');
    replace(URLout, "%20", ' ');
    replace(URLout, "%2f", '/');
    replace(URLout, "%2F", '/');
    replace(URLout, "%3a", ':');
    replace(URLout, "%3A", ':');
    replace(URLout, "%3f", '?');
    replace(URLout, "%3F", '?');

    // %25 = '%' last to avoid problems like %252f
    replace(URLout, "%25", '%');

    return URLout;
}

The second version is probably only useful if there is a limited range
of values that will have to be translated from %XX to a character. I
used the second version in my own URL converter (C# so the Replace()
function came for free).
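
The body of the replace() helper is elided in the post. One possible way to write it in C++ is sketched below. This is a guess at the intent, not the author's implementation, and it is named replaceAll here to make clear that it is a stand-in.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for the elided helper: replace every occurrence
// of target in baseStr with the single character newChar.
void replaceAll(std::string& baseStr, const char* target, char newChar)
{
    const std::string::size_type len = std::strlen(target);
    if (len == 0)
        return; // an empty target would match at every position
    std::string::size_type pos = 0;
    while ((pos = baseStr.find(target, pos)) != std::string::npos)
    {
        baseStr.replace(pos, len, 1, newChar); // len chars become one char
        ++pos; // resume after the new character so it is never rescanned
    }
}
```

Each call scans the whole string and every replacement shifts the tail, so the cost grows with the number of patterns. That fits the remark above that this version only suits a limited set of escapes.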

rossum


Jun 17 '06 #3
Thank you for your input. Well I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?

Any more ideas/comments?

Sincerely,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/
//------------- code begins ------------------------------
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sstream>
// hex2dec convert from base 16 to base 10, strtol could be used...
inline int hex2dec(const char& hex)
{
    return ((hex>='0'&&hex<='9')?(hex-'0'):(std::toupper(hex)-'A'+10));
}
std::string URLdecode(const std::string& URLin)
{
    std::string URLout;
    const char* in(URLin.c_str());
    for(;*in;in++)
    {
        if(*in!='%'||!std::isxdigit(in[1])||!std::isxdigit(in[2]))
        {
            if(*in=='+')
                URLout+=' ';
            else
                URLout+=*in;
        }
        else // Convert all %xy hex codes into ASCII characters.
        {
            if( (in[1]=='2' && in[2]=='6')
                || (in[1]=='3' && (in[2]=='d'||in[2]=='D')))
            { // Do not alter URL delimiters.
                URLout+='%';
                URLout+=in[1];
                URLout+=in[2];
            }
            else
                URLout+=static_cast<char>(hex2dec(in[1])*16+hex2dec(in[2]));
            in+=2;
        }
    }
    return URLout;
}
int main()
{
    for(std::string s;std::getline(std::cin,s);)
    {
        std::cout<<URLdecode(s)<<'\n';
    }
    return 0;
}
//--------------------- end of code ----------------------
Jun 17 '06 #4
Peter Jansson wrote:
Hello group,

The following code is an attempt to perform URL-decoding of a URL-encoded
string. Note that std::istringstream is used within the switch, within
the loop. Three main issues have been raised about the code:

1. If the characters after '%' do not represent a hexadecimal number, then
the uninitialized variable 'hexint' is used - this is undefined behavior.

2. This code is very inefficient - too many mallocs/string
copies/text-stream processing for such a simple operation as 'convert
two hex chars to an integer'.

3. The code uses iostreams, so it's locale specific.


URI encoding and decoding is a really tricky issue, quite apart from
the issues in the code that you have (that others have already helped
with). I think maybe I can shed some light on your third question
though. It's all a little off topic for this forum, but important. I
hope everybody indulges me :-)

The URI is split into several parts, but for HTTP(S) the only parts
that will be encoded are the file specification and the query string
(if present). Note that they are encoded _differently_. They are
separated by a single question mark (?).

If you are decoding somebody else's URI then stop right now. You don't
know enough about it to be able to work it out. By all means give it a
go, but don't expect that it will make any sense and _don't_ manipulate
it before using it. You will break something.

The file specification (which is what it looks like you're decoding) can
be in any locale. These days people are tending to drift towards UTF-8,
but it's by no means universal. If it's a URI you've encoded yourself
then you should know which locale you used.

For the query string, the format is described in the HTML specs, _but_
only browsers have to use that format when they create a query string
from a form submission using GET. Any other URI creation can be
different and in fact the W3C recommends a slightly different format
that doesn't cause common entity problems in HTML. The biggest
difference between the file specification and the query string encoding
is the space. In the file spec they are '%20' and in a query string
they are '+' - many people get this wrong (most clearly Technorati who
have wrecked the C++ blogging community through this error). If you're
trying to decode somebody else's query string you can't rely on the
format at all.
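
To make the '+' versus '%20' difference concrete, here is a rough sketch that splits a URI at the first question mark and decodes the two halves under different rules. The names hexDigit, percentDecode and splitAtQuery, and the plusIsSpace flag, are all assumptions invented for this example. It only covers a URI whose encoding you already know, and it deliberately ignores the character set and security questions discussed in this post.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: value of one hex digit, or -1 if c is not one.
inline int hexDigit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical decoder. plusIsSpace = true applies query-string rules
// ('+' means space); false applies file-specification rules ('+' is kept).
// Malformed escapes such as "%zz" or a trailing '%' are copied unchanged.
std::string percentDecode(const std::string& in, bool plusIsSpace)
{
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        if (in[i] == '%' && in.size() - i >= 3)
        {
            const int hi = hexDigit(in[i + 1]);
            const int lo = hexDigit(in[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                out += static_cast<char>(hi * 16 + lo);
                i += 2; // skip the two digits just consumed
                continue;
            }
        }
        out += (plusIsSpace && in[i] == '+') ? ' ' : in[i];
    }
    return out;
}

// Split at the first '?': first = file specification, second = query string.
std::pair<std::string, std::string> splitAtQuery(const std::string& uri)
{
    const std::string::size_type q = uri.find('?');
    if (q == std::string::npos)
        return std::make_pair(uri, std::string());
    return std::make_pair(uri.substr(0, q), uri.substr(q + 1));
}
```

Decoding a whole query string in one pass is itself a simplification. Splitting on '&' and '=' first and decoding each name and value separately would remove the need to special-case %26 and %3D.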

Now, the final thing you need to think about is security. You need to
check and double check the URI for all sorts of things. What checks you
need depends on where the URI came from and what you're doing with it.

I have a number of articles on my web site about these issues:

Problems with IIS due to HTTP.SYS decoding the file specification:
http://www.kirit.com/Getting%20the%2...ISAPI%20filter

Another problem caused in the custom 404 handling:
http://www.kirit.com/Errors%20in%20I...ror%20handling

Another encoding fault leading to problems with redirections:
http://www.kirit.com/Response.Redire...encoded%20URIs

And even the W3C doesn't get it right:
http://www.kirit.com/W3C%27s%20CSS%2...tion%20service

These articles should give you a feel for the sorts of things that can
go wrong. They link in many places to the relevant RFCs and other specs
and notes. There are some other articles on Unicode that also touch on
other aspects of these issues. There are also a load of tricky URIs that
you should be able to deal with properly.

So, again, I know this is strictly off-topic for this group. I hope
nobody minds too much though as the issues are pretty important.
K

Jun 18 '06 #5
Peter Jansson wrote:
I did some research and came up with the
following. Now, however, I wonder if things still are portable with the
pointer arithmetic (in+=2)? And what happens with isxdigit if we go out
of bounds on the in-array?

Tricky code, but it accidentally works ;)

The reason it works is that you use c_str(). That tacks a \0 on the end
of the char[] returned, even though it's not present in the string
itself.
Now, \0 is not a hex digit, which implies you won't check after that \0:
in[2] is only examined when in[1] is a hex digit, because || short-circuits.
Also, you only increment in by 2 if you know you saw three non-\0
characters, which means you haven't hit the end yet. Therefore, it is
portable. Some comments would be nice, though ;)

BTW: call URLout.reserve(URLin.size()). The input size is a very good
estimate of the output size. And you seem to have some old headers lying
around.
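
For comparison, here is a sketch of the same decoder written with an index and an explicit length check, so that it does not lean on the terminating \0 at all. It is an illustrative variant, not a drop-in replacement: the names URLdecodeIndexed and hexToInt are made up for this example, and the unsigned char casts are an addition (a negative char passed to std::isxdigit is undefined). It keeps the posted behaviour of leaving %26 and %3D encoded.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: value of a character already known to be a hex digit.
inline int hexToInt(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

// Hypothetical index-based variant of the posted URLdecode.
std::string URLdecodeIndexed(const std::string& URLin)
{
    std::string URLout;
    URLout.reserve(URLin.size()); // the output is never longer than the input
    const std::string::size_type n = URLin.size();
    for (std::string::size_type i = 0; i < n; ++i)
    {
        const char c = URLin[i];
        // An escape needs '%' plus two hex digits inside the string.
        const bool escape = c == '%' && n - i >= 3
            && std::isxdigit(static_cast<unsigned char>(URLin[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(URLin[i + 2]));
        if (!escape)
        {
            URLout += (c == '+') ? ' ' : c;
            continue;
        }
        const int value = hexToInt(URLin[i + 1]) * 16 + hexToInt(URLin[i + 2]);
        if (value == 0x26 || value == 0x3D)
            URLout.append(URLin, i, 3); // keep %26 ('&') and %3D ('=') encoded
        else
            URLout += static_cast<char>(value);
        i += 2; // skip the two hex digits
    }
    return URLout;
}
```

One behavioural difference from the pointer walk: a std::string may contain embedded \0 characters, and the c_str() loop stops at the first one, whereas the index loop processes the whole string.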

HTH,
Michiel Salters
Jun 19 '06 #6
