By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
443,918 Members | 1,852 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 443,918 IT Pros & Developers. It's quick & easy.

std::string and case insensitive comparison

P: n/a
Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the
case insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR tstring;
#else

#endif
#endif
class String : public Object
{
private:
tstring m_str;

public:
String(){}

String(LPCTSTR lpsz)
{
m_str = lpsz;
}

String(tstring str)
{
m_str = str;
}
// Comparison
int Compare( LPCTSTR psz ) const
{
return m_str.compare(psz);
}

int CompareNoCase( LPCTSTR psz ) const
{
???
}

// Convert the string to lowercase
String& MakeLower( LPCTSTR psz )
{
std::transform(m_str.begin(), m_str.end(), m_str.begin(), tolower);
return *this;
}

}
Jul 19 '07 #1
Share this Question
Share on Google+
14 Replies


P: n/a
On 2007-07-20 00:07, Mosfet wrote:
Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the
case insensitive part :
Usually it involves modifying both strings to either upper- or lower-
case and then compare. Checkout the toupper()/tolower() functions from
<cctype>/ctype.h>.

--
Erik Wikström
Jul 19 '07 #2

P: n/a
"Mosfet" <an*******@free.frwrote in message
news:46***********************@news.free.fr...
Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the case
insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR tstring;
#else

#endif
#endif
class String : public Object
{
private:
tstring m_str;

public:
String(){}

String(LPCTSTR lpsz)
{
m_str = lpsz;
}

String(tstring str)
{
m_str = str;
}
// Comparison
int Compare( LPCTSTR psz ) const
{
return m_str.compare(psz);
}

int CompareNoCase( LPCTSTR psz ) const
{
???
}

// Convert the string to lowercase
String& MakeLower( LPCTSTR psz )
{
std::transform(m_str.begin(), m_str.end(), m_str.begin(), tolower);
return *this;
}

}
This is what I use. I'm not sure if it's optimal, but it works.

bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}
Jul 19 '07 #3

P: n/a
Jim Langston a écrit :
"Mosfet" <an*******@free.frwrote in message
news:46***********************@news.free.fr...
>Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the case
insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR tstring;
#else

#endif
#endif
class String : public Object
{
private:
tstring m_str;

public:
String(){}

String(LPCTSTR lpsz)
{
m_str = lpsz;
}

String(tstring str)
{
m_str = str;
}
// Comparison
int Compare( LPCTSTR psz ) const
{
return m_str.compare(psz);
}

int CompareNoCase( LPCTSTR psz ) const
{
???
}

// Convert the string to lowercase
String& MakeLower( LPCTSTR psz )
{
std::transform(m_str.begin(), m_str.end(), m_str.begin(), tolower);
return *this;
}

}

This is what I use. I'm not sure if it's optimal, but it works.

bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}

Ir doesn't seem very efficient...


Jul 19 '07 #4

P: n/a
"Mosfet" <an*******@free.frwrote in message
news:46***********************@news.free.fr...
Jim Langston a écrit :
>"Mosfet" <an*******@free.frwrote in message
news:46***********************@news.free.fr...
>>Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the
case insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR tstring;
#else

#endif
#endif
class String : public Object
{
private:
tstring m_str;

public:
String(){}

String(LPCTSTR lpsz)
{
m_str = lpsz;
}

String(tstring str)
{
m_str = str;
}
// Comparison
int Compare( LPCTSTR psz ) const
{
return m_str.compare(psz);
}

int CompareNoCase( LPCTSTR psz ) const
{
???
}

// Convert the string to lowercase
String& MakeLower( LPCTSTR psz )
{
std::transform(m_str.begin(), m_str.end(), m_str.begin(), tolower);
return *this;
}

}

This is what I use. I'm not sure if it's optimal, but it works.

bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}

Ir doesn't seem very efficient...
No, it doesn't. But then, there is no way to do a case insensitive compare
without converting both to either upper or lower. Or to determine if one is
uppercase before converting to lower, but it probably takes about the same
amount of time for the if statement.

Basically, that's the way case insensitve works. You convert both to upper
or lower, then compare, or compare character by character converting.

It may be faster to compare character by character and see if you can return
early without having to go through the whole string, but the you're doing a
bunch of if statments anyway. I.E something like: (untested code)

bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1[i] ) != tolower( String2[i] )
return false;
}
return true;
}
Jul 19 '07 #5

P: n/a
>
bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1[i] ) != tolower( String2[i] )
return false;
}
return true;
}
If I had a pound for everytime this mistake is made I would be as rich
as Bill Gates.
tolower( String1[i] )

is undefined since char may be signed and therefore you may pass a
negative number to tolower. tolower is only defined on integer values in
the range of unsigned char and the value of EOF.

tolower( (unsigned char) String1[i] )

is correct.

This also means that

std::transform(str.begin(), str.end(), tolower)

is undefined for the same reason.

john
Jul 20 '07 #6

P: n/a
John Harrison wrote:
>>
bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1[i] ) != tolower( String2[i] )
return false;
}
return true;
}

If I had a pound for everytime this mistake is made I would be as rich
as Bill Gates.
tolower( String1[i] )

is undefined since char may be signed and therefore you may pass a
negative number to tolower. tolower is only defined on integer values in
the range of unsigned char and the value of EOF.

tolower( (unsigned char) String1[i] )

is correct.

This also means that

std::transform(str.begin(), str.end(), tolower)

is undefined for the same reason.
That wording is a little too harsh. The above code has perfectly
well-defined behavior for quite a lot of input values. To dismiss it as
undefined is like saying *p is undefined since p might be null. I agree,
however, that one can and should do better.

For the use in std::transform(), I would suggest a function object like
this:

#include <locale>
#include <string>
#include <iostream>
#include <algorithm>

class to_lower {

std::locale const & loc;

public:

to_lower ( std::locale const & r_loc = std::locale() )
: loc ( r_loc )
{}

template < typename CharT >
CharT operator() ( CharT chr ) const {
return( std::tolower( chr, this->loc ) );
}

}; // class to_lower;

int main ( void ) {
std::string str ( "Hello World!" );
std::transform ( str.begin(), str.end(), str.begin(), to_lower() );
std::cout << str << '\n';
}
Best

Kai-Uwe Bux
Jul 20 '07 #7

P: n/a
On Jul 20, 12:56 am, "Jim Langston" <tazmas...@rocketmail.comwrote:
"Mosfet" <anonym...@free.frwrote in message
news:46***********************@news.free.fr...
[...]
This is what I use. I'm not sure if it's optimal, but it works.
bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}
Using which headers? It shouldn't compile if you happen to
include <locale>, and <stringmay (or may not) include
<locale>. (With g++, <stringdoesn't include <locale>, but
<iostreamdoes, so if you happen to also include <iostream>, it
doesn't compile.) If it does compile, it has undefined
behavior if char is signed. (Of course, g++ and VC++ have
options to force char to be unsigned, so if you're using these,
you may be OK.)

If you want to use transform, the correct invocation is:

std::transform(
source.begin(), source.end(),
std::back_inserter( dest ),
boost::bind(
&Cvt::tolower,
&std::use_facet< Cvt >( std::locale() ),
_1 ) ) ;

You need boost::bind, or else write you're own functional
object.

Note that using std::equal with boost::transform_iterator (and
the functional object used with transform---either your own, or
the results of boost::bind) will allow the comparison without
making a copy.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 20 '07 #8

P: n/a
On Jul 20, 1:27 am, "Jim Langston" <tazmas...@rocketmail.comwrote:
"Mosfet" <anonym...@free.frwrote in message
[...]
This is what I use. I'm not sure if it's optimal, but it works.
bool StrLowCompare( std::string String1, std::string String2 )
{
std::transform(String1.begin(), String1.end(), String1.begin(),
tolower);
std::transform(String2.begin(), String2.end(), String2.begin(),
tolower);
return String1 == String2;
}
Ir doesn't seem very efficient...
No, it doesn't. But then, there is no way to do a case
insensitive compare without converting both to either upper or
lower. Or to determine if one is uppercase before converting
to lower, but it probably takes about the same amount of time
for the if statement.
If you do it on a character by character basis, you can avoid
the copy of the string. If the string is long, this could be
significant.

On the other hand, if you're doing a lot of comparisons, it's
probably better to convert the strings once, and then just use
== on them.
Basically, that's the way case insensitve works. You convert
both to upper or lower, then compare, or compare character by
character converting.
Actually, it's more complicated than that. In German, for
example, in a case insensitive comparison, the single character
'ß' must compare equal to the two character sequence "SS"; in
Swiss German (and according to DIN, I think), 'ä' compares equal
to "AE", etc., where as in Turkish, 'i' shouldn't compare equal
to 'I'. Just defining what you actually mean by "case
insensitive compare" is a nightmare, even before you start
trying to implement it.
It may be faster to compare character by character and see if
you can return early without having to go through the whole
string, but the you're doing a bunch of if statments anyway.
There's never the slightest need for an if; tolower is generally
implemented as a simple table mapping. By comparing character
by character, however, you save copying the string.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 20 '07 #9

P: n/a
On Jul 19, 5:07 pm, Mosfet <anonym...@free.frwrote:
Hi,

what is the most efficient way of doing a case insensitive comparison ?
I am trying to write a universal String class and I am stuck with the
case insensitive part :

TCHAR is a char in MultiByte String env (MBCS)
and wchar_t if UNICODE

#if defined(WIN32) || defined(UNDER_CE)
typedef std::basic_string<TCHAR, std::char_traits<TCHAR>,
std::allocator<TCHAR tstring;
#else

#endif
#endif

class String : public Object
{
....
}
I have no idea why you have an Object class unless you are trying to
mimic Java. In "Exceptional C++" by Herb Sutter, there is an
excellent solution: provide your own char_traits<TCHAR(pp. 5-6).
You will have to overload operators <<, >>, =, +, +=, -, -=, etc. (or
always call c_str()).

template<typename CHAR>
class ci_char_traits: public char_traits<CHAR>
{
public: static bool eq( CHAR c1, CHAR c2 )
{
return( toupper( static_cast<unsigned char>( c1 ))
== toupper( static_cast<unsigned char>( c2 )));
}
public: static bool lt( CHAR c1, CHAR c2 )
{
return( toupper( static_cast<unsigned char>( c1 ))
< toupper( static_cast<unsigned char>( c2 )));
}
public: static int compare( CHAR *s1, CHAR *s2, size_t n )
{
return( memicmp( s1, s2, n ));
}
public: static const CHAR *find( const CHAR *s, int n, CHAR a )
{
while((n-- 0) && toupper( static_cast<unsigned char>( *s ))
!= toupper( static_cast<unsigned char>( a )))
{
++s;
}
return((n 0)?(s):(NULL));
}
};

static_cast<>'s would've been forgotten if it hadn't been for John
Harrison. Keep in mind James Kanze's comments about equality. For
more information, search the Internet for Java collator classes.

Milburn Young

Jul 20 '07 #10

P: n/a
"John Harrison" <jo*************@hotmail.comwrote in message
news:kC*************@newsfe7-win.ntli.net...
>>
bool StrLowCompare( std::string& String1, std::string& String2 )
{
if ( String1.size() != String2.size() )
return false;

for ( std::string::size_type i = 0; i < String1.size(); ++i )
{
if ( tolower( String1[i] ) != tolower( String2[i] )
return false;
}
return true;
}

If I had a pound for everytime this mistake is made I would be as rich as
Bill Gates.
tolower( String1[i] )

is undefined since char may be signed and therefore you may pass a
negative number to tolower. tolower is only defined on integer values in
the range of unsigned char and the value of EOF.

tolower( (unsigned char) String1[i] )

is correct.

This also means that

std::transform(str.begin(), str.end(), tolower)

is undefined for the same reason.
I'm using Microsoft VC++ .net 2003. I'm using the tolower defined in
ctype.h. I just did a test, put in values from 0 to 255 (-1), displayed
their values, ran them through tolower and displayed their values. Yes,
some were negative going in, but they were also negative coming out. I
would be interested in seeing any results other than this in any compiler.
Jul 21 '07 #11

P: n/a
Jim Langston wrote:
:: "John Harrison" <jo*************@hotmail.comwrote in message
:: news:kC*************@newsfe7-win.ntli.net...
::::
:::: bool StrLowCompare( std::string& String1, std::string& String2 )
:::: {
:::: if ( String1.size() != String2.size() )
:::: return false;
::::
:::: for ( std::string::size_type i = 0; i < String1.size(); ++i )
:::: {
:::: if ( tolower( String1[i] ) != tolower( String2[i] )
:::: return false;
:::: }
:::: return true;
:::: }
:::
::: If I had a pound for everytime this mistake is made I would be as
::: rich as Bill Gates.
:::
:::
::: tolower( String1[i] )
:::
::: is undefined since char may be signed and therefore you may pass a
::: negative number to tolower. tolower is only defined on integer
::: values in the range of unsigned char and the value of EOF.
:::
::: tolower( (unsigned char) String1[i] )
:::
::: is correct.
:::
::: This also means that
:::
::: std::transform(str.begin(), str.end(), tolower)
:::
::: is undefined for the same reason.
::
:: I'm using Microsoft VC++ .net 2003. I'm using the tolower defined
:: in ctype.h. I just did a test, put in values from 0 to 255 (-1),
:: displayed their values, ran them through tolower and displayed
:: their values. Yes, some were negative going in, but they were
:: also negative coming out. I would be interested in seeing any
:: results other than this in any compiler.

You can't test for undefined behavior, because there is no requirement
that it should be consistent. In fact, there are no requirements at
all. :-)

It so happens that the MS implementation handles this by casting to
unsigned internally, and checking the range of the parameter. Possibly
because there is a compiler option to make the char type unsigned, in
which case your test must work.

It's non-portable anyway.
Bo Persson
Jul 21 '07 #12

P: n/a
"Bo Persson" <bo*@gmb.dkwrote in message
news:5g*************@mid.individual.net...
Jim Langston wrote:
:: "John Harrison" <jo*************@hotmail.comwrote in message
:: news:kC*************@newsfe7-win.ntli.net...
::::
:::: bool StrLowCompare( std::string& String1, std::string& String2 )
:::: {
:::: if ( String1.size() != String2.size() )
:::: return false;
::::
:::: for ( std::string::size_type i = 0; i < String1.size(); ++i )
:::: {
:::: if ( tolower( String1[i] ) != tolower( String2[i] )
:::: return false;
:::: }
:::: return true;
:::: }
:::
::: If I had a pound for everytime this mistake is made I would be as
::: rich as Bill Gates.
:::
:::
::: tolower( String1[i] )
:::
::: is undefined since char may be signed and therefore you may pass a
::: negative number to tolower. tolower is only defined on integer
::: values in the range of unsigned char and the value of EOF.
:::
::: tolower( (unsigned char) String1[i] )
:::
::: is correct.
:::
::: This also means that
:::
::: std::transform(str.begin(), str.end(), tolower)
:::
::: is undefined for the same reason.
::
:: I'm using Microsoft VC++ .net 2003. I'm using the tolower defined
:: in ctype.h. I just did a test, put in values from 0 to 255 (-1),
:: displayed their values, ran them through tolower and displayed
:: their values. Yes, some were negative going in, but they were
:: also negative coming out. I would be interested in seeing any
:: results other than this in any compiler.

You can't test for undefined behavior, because there is no requirement
that it should be consistent. In fact, there are no requirements at all.
:-)

It so happens that the MS implementation handles this by casting to
unsigned internally, and checking the range of the parameter. Possibly
because there is a compiler option to make the char type unsigned, in
which case your test must work.

It's non-portable anyway.
Assigning a signed value a non signed value of the same size is well
defined, isn't it?
unsigned int = -1;
is well defined, is it not?

So what if tolower expects an unsigned char and I pass it a signed char, it
is well defined, right?

Please quote from the standard where it is undefined.
Jul 22 '07 #13

P: n/a
On Jul 21, 12:47 pm, "Bo Persson" <b...@gmb.dkwrote:

[...]
It so happens that the MS implementation handles this by
casting to unsigned internally, and checking the range of the
parameter.
I would be very surprised if it does this, because that would
not be conform. It probably does like Sun, and simply ensures
that the bytes in from of the mapping array contain the right
thing. And it will probably fail for 0xFF, if the function
requires returning something other than 0xFF, because
tolower(EOF) is defined, and EOF is usually -1 (although I don't
think I've ever looked for Microsoft).

(BTW: I tend to think that Sun invented this trick, since I
first saw it under Solaris, at a time when the versions of HP/UX
and (I thing) AIX didn't use it. But to tell the truth, I have
no idea who did it first.)
Possibly because there is a compiler option to make the char
type unsigned, in which case your test must work.
G++ and VC++ both have this option. IMHO, it would have made
sense for the C committee to require that char be unsigned, and
it certainly makes sense for the implementation to make it
unsigned whenever there is no additional overhead. But
historically, most early C users were on PDP-11, where making
char unsigned had a very significant run-time overhead. And
they wrote a lot of code assuming that char was signed.
Wrong---even K&R I stress the fact that its signedness is
implementation dependant. But vendors don't like breaking
existing code, even if it's wrong, and to this day, most
implementations default to plain char being signed.

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 22 '07 #14

P: n/a
On Jul 22, 11:48 am, "Jim Langston" <tazmas...@rocketmail.comwrote:

[...]
Assigning a signed value a non signed value of the same size is well
defined, isn't it?
Assigning a signed value to an unsigned variable is well
defined, regardless of the sizes. The signed to unsigned
conversion is defined to be modulo 2^n, where N is the number of
value bits in the target unsigned type.
unsigned int = -1;
is well defined, is it not?
Yes, for a given implementation. (The exact value will, of
course, depend on the size of a int.)
So what if tolower expects an unsigned char and I pass it a signed char, it
is well defined, right?
tolower doesn't expect an unsigned char. It expects an int,
whose value is either EOF or in the range of [0...UCAR_MAX].
Converting a negative signed char to int will preserve its
value, and the results will be negative.
Please quote from the standard where it is undefined.
>From the C standard (obviously), §7.4/1:
The header <ctype.hdeclares several functions useful for
classifying and mapping characters.166) In all cases the
argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value
of the macro EOF. If the argument has any other value, the
behavior is undefined.

(This is from C99, the only version I can copy/paste from. But
the text is unchanged from C90, the reference version for C++.)

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jul 22 '07 #15

This discussion thread is closed

Replies have been disabled for this discussion.