Bytes | Software Development & Data Engineering Community

char_traits<char>::compare

On VC++.NET it is implemented like this

static int __cdecl compare(const _Elem *_First1,
    const _Elem *_First2, size_t _Count)
{   // compare [_First1, _First1 + _Count) with [_First2, ...)
    return (::memcmp(_First1, _First2, _Count));
}

i.e. using memcmp. But memcmp performs an unsigned comparison, whereas char
here is a signed type.

Therefore if I declare a std::string as "\x80" and another std::string
as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
although if I compared their first characters then the first character
of the "\x80" string is "lower".

Is this behaviour standard? Is it correct? Is there a formal definition
of what a std::string comparison should return if one or more of the
characters in either of the strings is "negative"?

Aug 10 '05 #1
Earl Purple wrote:
On VC++.NET it is implemented like this

static int __cdecl compare(const _Elem *_First1,
    const _Elem *_First2, size_t _Count)
{   // compare [_First1, _First1 + _Count) with [_First2, ...)
    return (::memcmp(_First1, _First2, _Count));
}

i.e. using memcmp. But memcmp performs an unsigned comparison, whereas char
here is a signed type.
Whether 'char' is signed is implementation-defined. You can usually
change it with a compiler command-line switch.
Therefore if I declare a std::string as "\x80" and another std::string
as "\x7f" and do a comparison, the one that is "\x7f" is "lower",
although if I compared their first characters then the first character
of the "\x80" string is "lower".

Is this behaviour standard?
Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First[i], _Second[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First[j], _Second[j])' is true
and 'eq' is true for all preceding characters, and yield 1 otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".
Is it correct? Is there a formal definition
of what the result of a std::string comparison should return if one or
more of the characters in one or other of the strings is "negative".


There is no "negative" or "positive" in there. Those are just characters
for which there are traits, which in turn say how the strings compare.

V
Aug 10 '05 #2

Victor Bazarov wrote:
Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First[i], _Second[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First[j], _Second[j])' is true
and 'eq' is true for all preceding characters, and yield 1 otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".


from char_traits<char> (on VC .NET)

static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
{   // test if _Left precedes _Right
    return (_Left < _Right);
}

but 0x80 < 0x7f because char is signed. Thus when I have my strings

std::string s128( "\x80" );
std::string s127( "\x7f" );

s127 < s128 but s128[0] < s127[0]

As basic_string (correctly) uses char_traits to do the comparison
(that's what it's there for isn't it?) the inconsistency is in
char_traits.

VC .NET provides no specialisation for char_traits<unsigned char> and I
have actually implemented my own traits class for unsigned char (but
not char_traits because I'm not supposed to extend namespace std),
which for me guarantees I will get consistent behaviour.

I just wanted to know if this inconsistency is part of the standard,
and by your quoting of the standard it is not - it is against the
standard rule for char_traits::compare.

> Is it correct? Is there a formal definition
of what the result of a std::string comparison should return if one or
more of the characters in one or other of the strings is "negative".


There is no "negative" or "positive" in there. Those are just characters
for which there are traits, which in turn say how the strings compare.

V


Aug 10 '05 #3
Earl Purple wrote:
Victor Bazarov wrote:
Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First[i], _Second[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First[j], _Second[j])' is true
and 'eq' is true for all preceding characters, and yield 1 otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".

from char_traits<char> (on VC .NET)

static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
{   // test if _Left precedes _Right
    return (_Left < _Right);
}

[...]
I just wanted to know if this inconsistency is part of the standard,
and by your quoting of the standard it is not - it is against the
standard rule for char_traits::compare.


Yes, it certainly seems so. You should perhaps contact Dinkumware (the
implementors of the standard library Microsoft ships along with VC++
compilers) and let them know...

V
Aug 10 '05 #4
"Earl Purple" <ea*********@yahoo.com> wrote in message
news:11**********************@o13g2000cwo.googlegr oups.com...
Victor Bazarov wrote:
Reading the requirements for char_traits, 'compare' should yield 0 if
'eq(_First[i], _Second[i])' is true for every i in the range [0, _Count),
yield -1 if there exists a j for which 'lt(_First[j], _Second[j])' is true
and 'eq' is true for all preceding characters, and yield 1 otherwise.

There is no requirement in the Standard as to how to implement those.
The traits essentially govern the sorting, not operator< or operator==,
which you were probably using when you "compared their first characters".
from char_traits<char> (on VC .NET)

static bool __cdecl lt(const _Elem& _Left, const _Elem& _Right)
{   // test if _Left precedes _Right
    return (_Left < _Right);
}

but 0x80 < 0x7f because char is signed. Thus when I have my strings

std::string s128( "\x80" );
std::string s127( "\x7f" );

s127 < s128 but s128[0] < s127[0]

As basic_string (correctly) uses char_traits to do the comparison
(that's what it's there for isn't it?) the inconsistency is in
char_traits.

VC .NET provides no specialisation for char_traits<unsigned char> and I
have actually implemented my own traits class for unsigned char (but
not char_traits because I'm not supposed to extend namespace std),
which for me guarantees I will get consistent behaviour.


The template definition works fine for unsigned char. You don't
need to explicitly specialize it.
I just wanted to know if this inconsistency is part of the standard,
and by your quoting of the standard it is not - it is against the
standard rule for char_traits::compare.


Once upon a time, the draft C++ Standard spelled out that memcmp
should be used for char_traits<char>::compare. That got lost
along the way. Most (or possibly all) implementations still use
memcmp as a result. I know there has been discussion on the
C++ library committee reflector about this. IIRC, the consensus
is that memcmp is the right way to go. Whether there's a Defect
Report on this topic I don't recall.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Aug 10 '05 #5

P.J. Plauger wrote:

The template definition works fine for unsigned char. You don't
need to explicitly specialize it.
Actually it does not work fine when using it for basic_ofstream to
write binary, but this is caused by another issue. If the character at
position 0 or any multiple of 8192 happens to be 0xff it rips it out as
an EOF.

The templated version for compare "works" but does not take advantage
of the nature of unsigned char such that memcmp and memcpy can be
safely used for comparison/copying and are probably more efficient than
the byte-by-byte versions.
Once upon a time, the draft C++ Standard spelled out that memcmp
should be used for char_traits<char>::compare. That got lost
along the way. Most (or possibly all) implementations still use
memcmp as a result. I know there has been discussion on the
C++ library committee reflector about this. IIRC, the consensus
is that memcmp is the right way to go. Whether there's a Defect
Report on this topic I don't recall.


Thank you for clearing that up. So effectively, if you want consistent
results across all compilers, it's better not to rely on it when your
strings may contain characters with the sign bit set.

Aug 10 '05 #6
"Earl Purple" <ea*********@yahoo.com> wrote in message
news:11**********************@o13g2000cwo.googlegr oups.com...
P.J. Plauger wrote:

The template definition works fine for unsigned char. You don't
need to explicitly specialize it.
Actually it does not work fine when using it for basic_ofstream to
write binary, but this is caused by another issue. If the character at
position 0 or any multiple of 8192 happens to be 0xff it rips it out as
an EOF.


I'm assuming that's a lower-level C issue. No reason why it should
happen in the C++ buffering.
The templated version for compare "works" but does not take advantage
of the nature of unsigned char such that memcmp and memcpy can be
safely used for comparison/copying and are probably more efficient than
the byte-by-byte versions.


Until you can demonstrate that your program runs too slow because
this optimization is missing, it's safe to say that the templated
version works, period.
Once upon a time, the draft C++ Standard spelled out that memcmp
should be used for char_traits<char>::compare. That got lost
along the way. Most (or possibly all) implementations still use
memcmp as a result. I know there has been discussion on the
C++ library committee reflector about this. IIRC, the consensus
is that memcmp is the right way to go. Whether there's a Defect
Report on this topic I don't recall.


Thank you for clearing that up. So effectively, if you want consistent
results across all compilers, it's better not to rely on it when your
strings may contain characters with the sign bit set.


The only real issue is the ordering rule used for comparisons. If
you don't like what you get by default, you can always make your
own.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Aug 10 '05 #7

P.J. Plauger wrote:
Actually it does not work fine when using it for basic_ofstream to
write binary, but this is caused by another issue. If the character at
position 0 or any multiple of 8192 happens to be 0xff it rips it out as
an EOF.
I'm assuming that's a lower-level C issue. No reason why it should
happen in the C++ buffering.



No, the error comes from this function in basic_streambuf: (I have
formatted it to make it a bit easier to read)

virtual streamsize xsputn(const _Elem *_Ptr, streamsize _Count)
{   // put _Count characters to stream
    streamsize _Size, _Copied;

    for (_Copied = 0; 0 < _Count; )
    {
        if (pptr() != 0
            && 0 < (_Size = (streamsize)(epptr() - pptr())))
        {   // copy to write buffer
            if (_Count < _Size)
                _Size = _Count;
            _Traits::copy(pptr(), _Ptr, _Size);
            _Ptr += _Size;
            _Copied += _Size;
            _Count -= _Size;
            pbump((int)_Size);
        }
        else if (_Traits::eq_int_type(_Traits::eof(),
            overflow(_Traits::to_int_type(*_Ptr))))  // ** ERROR IN THIS SECTION **
        {
            break;  // single character put failed, quit
        }
        else
        {   // count character successfully put
            ++_Ptr;
            ++_Copied;
            --_Count;
        }
    }
    return (_Copied);
}

Thus you have assumed that if the first character in our buffer happens
to be 0xff, it is an end of file. (For a binary file this is not the
case.) to_int_type for 0xff (unsigned) produces 0x000000ff, which is not
equal to 0xffffffff.

My "fix" in my own version was to make int_type an int so eq always
fails.

Here is a test to reproduce the bug.

#include <fstream>
#include <string>

int main()
{
    std::basic_ofstream< unsigned char > outFile(
        "test.dat",
        std::ios_base::binary | std::ios_base::trunc );

    std::basic_string<unsigned char> data( 16, '\xff' );
    for ( int iters = 0; iters < 8192; ++iters )
    {
        outFile.write( data.c_str(), 17 );
    }
}

So we are writing 17 characters (16 bytes of 0xff followed by the 0
terminator) 8192 times. That should give us a file length of 139264,
or 0x22000 in hex. On mine (VC7.1.3088) it is 49 bytes short.

Aug 11 '05 #8
"Earl Purple" <ea*********@yahoo.com> wrote in message
news:11**********************@f14g2000cwb.googlegr oups.com...
P.J. Plauger wrote:
> Actually it does not work fine when using it for basic_ofstream to
> write binary, but this is caused by another issue. If the character at
> position 0 or any multiple of 8192 happens to be 0xff it rips it out as
> an EOF.


I'm assuming that's a lower-level C issue. No reason why it should
happen in the C++ buffering.


No, the error comes from this function in basic_streambuf:

[...]

Thus you have assumed that if the first character in our buffer happens
to be 0xff, it is an end of file. (For a binary file this is not the
case.) to_int_type for 0xff (unsigned) produces 0x000000ff, which is not
equal to 0xffffffff.

My "fix" in my own version was to make int_type an int so eq always
fails.


Ah, now I see the problem. We've long since changed the default type
for the template version of basic_streambuf to long, which is essentially
the same as your fix. That happened after we delivered the V7.1 library
to Microsoft. The old default, having int_type the same as char_type,
is not binary transparent, as you've observed.

It's fixed in the library we currently license from our web site (thus
my confusion). Should also work fine in Whidbey (VC++ V8).

Thanks for the clarification.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
Aug 11 '05 #9
