473,382 Members | 1,353 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,382 software developers and data experts.

first different character in string

Is there a quick way to find the index of the first character different in
two strings?

For instance, if I had the strings
"abcdefghijkl"
and
"abcdefxyz"
I would want the return value to be 6.

Either in C# or C++. (I obviously know how to do it by looping through the
characters in the character array, but I was wondering if there was some
smart-alec memory-manipulation function, asm even, or regexes I could use
that would do it.)

Thanks
Nov 16 '05 #1
15 2252
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a mismatch
is found, you'll dig in and compare them at byte/word level (depending on if
your encoding is 8-bit or 16-bit)

In x86 asm, you should use the string optimized lodsb, cmpsb instructions.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
Is there a quick way to find the index of the first character different in
two strings?

For instance, if I had the strings
"abcdefghijkl"
and
"abcdefxyz"
I would want the return value to be 6.

Either in C# or C++. (I obviously know how to do it by looping through the
characters in the character array, but I was wondering if there was some
smart-alec memory-manipulation function, asm even, or regexes I could use
that would do it.)

Thanks

Nov 16 '05 #2

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a mismatch is found, you'll dig in and compare them at byte/word level (depending on if your encoding is 8-bit or 16-bit)
That's what I've resorted to.

In x86 asm, you should use the string optimized lodsb, cmpsb instructions.
Know any examples?

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
Is there a quick way to find the index of the first character different in two strings?

For instance, if I had the strings
"abcdefghijkl"
and
"abcdefxyz"
I would want the return value to be 6.

Either in C# or C++. (I obviously know how to do it by looping through the characters in the character array, but I was wondering if there was some
smart-alec memory-manipulation function, asm even, or regexes I could use that would do it.)

Thanks


Nov 16 '05 #3
Come on, you know you want to. I'd love to get a bit of assembly language
into this program ;-)

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a mismatch is found, you'll dig in and compare them at byte/word level (depending on if your encoding is 8-bit or 16-bit)

In x86 asm, you should use the string optimized lodsb, cmpsb instructions.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
Is there a quick way to find the index of the first character different in two strings?

For instance, if I had the strings
"abcdefghijkl"
and
"abcdefxyz"
I would want the return value to be 6.

Either in C# or C++. (I obviously know how to do it by looping through the characters in the character array, but I was wondering if there was some
smart-alec memory-manipulation function, asm even, or regexes I could use that would do it.)

Thanks


Nov 16 '05 #4
I've written a couple of mean functions to do this, they seem to be working
pretty well (0.6 seconds to execute each of them on strings of 749KB, where
something had been deleted out of the middle, on a 1.3MHz Duron. That was
with the DLL in release mode.)

Anyone able to beat this with some assembly language?

#define SCALE (long)(sizeof(long) / sizeof(_TCHAR))
long __cdecl FirstDifferentIndex(_TCHAR* str1, _TCHAR* str2)

{

long minlen = min((long)_tcslen(str1), (long)_tcslen(str2)),

minlongs = minlen / SCALE,

i = 0, t = 0;

long* skipper[] = {(long*)str1, (long*)str2};

for(;(i < minlongs) && (skipper[0][i] == skipper[1][i]); i++);

for(t = i * SCALE; (t < minlen) && (str1[t] == str2[t]); t++);

return t == minlen ? -1 : t;

}

long _cdecl AmountSameAtEnd(_TCHAR* str1, _TCHAR* str2)

{

long len[] = {(long)_tcslen(str1), (long)_tcslen(str2)},

off[] = {len[0] % SCALE, len[1] % SCALE},

pos[] = {(len[0] - off[0])/SCALE, (len[1] - off[1])/SCALE},

t = 0,

* skipper[] = {

(long*)(str1 + SCALE - off[0]),

(long*)(str2 + SCALE - off[1])};

for(;(pos[0] >= 0) && (pos[1] >= 0) &&

(skipper[0][pos[0]] == skipper[1][pos[1]]);

pos[0]--, pos[1]--);

long fin[] = {

min((pos[0] + 1) * SCALE - 1, len[0] - 1),

min((pos[1] + 1) * SCALE - 1, len[1] - 1)};

for(;(fin[0] >= 0) && (fin[1] >= 0) && (str1[fin[0]] == str2[fin[1]]);

fin[0]--, fin[1]--);

return len[0] - fin[0] - 1; //should be the same as len[1] - fin[1] - 1

}

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a mismatch is found, you'll dig in and compare them at byte/word level (depending on if your encoding is 8-bit or 16-bit)

In x86 asm, you should use the string optimized lodsb, cmpsb instructions.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
Is there a quick way to find the index of the first character different in two strings?

For instance, if I had the strings
"abcdefghijkl"
and
"abcdefxyz"
I would want the return value to be 6.

Either in C# or C++. (I obviously know how to do it by looping through the characters in the character array, but I was wondering if there was some
smart-alec memory-manipulation function, asm even, or regexes I could use that would do it.)

Thanks


Nov 16 '05 #5
I don't have my dev machine in front of me, but I could spot a few problems
in this code.

1. tcslen - You are first finding the string lengths of the two strings.
Now, that is going to walk thru the string a character at a time looking for
zero termination. That is an overkill. As long as you're doing this,
there's no point in optimizing. You could very well compare the strings
right then. Try a loop method with character comparisons and time it.
You'll be surprised.

2. Your algorithm does the 4-byte boundary alignment. However, if the
string is already 4-byte aligned, your skipper will still go ahead and skip
4 bytes.

-vJ
"Beeeeeves" <1234512345789123456789> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
I've written a couple of mean functions to do this, they seem to be
working
pretty well (0.6 seconds to execute each of them on strings of 749KB,
where
something had been deleted out of the middle, on a 1.3MHz Duron. That was
with the DLL in release mode.)

Anyone able to beat this with some assembly language?

#define SCALE (long)(sizeof(long) / sizeof(_TCHAR))
long __cdecl FirstDifferentIndex(_TCHAR* str1, _TCHAR* str2)

{

long minlen = min((long)_tcslen(str1), (long)_tcslen(str2)),

minlongs = minlen / SCALE,

i = 0, t = 0;

long* skipper[] = {(long*)str1, (long*)str2};

for(;(i < minlongs) && (skipper[0][i] == skipper[1][i]); i++);

for(t = i * SCALE; (t < minlen) && (str1[t] == str2[t]); t++);

return t == minlen ? -1 : t;

}

long _cdecl AmountSameAtEnd(_TCHAR* str1, _TCHAR* str2)

{

long len[] = {(long)_tcslen(str1), (long)_tcslen(str2)},

off[] = {len[0] % SCALE, len[1] % SCALE},

pos[] = {(len[0] - off[0])/SCALE, (len[1] - off[1])/SCALE},

t = 0,

* skipper[] = {

(long*)(str1 + SCALE - off[0]),

(long*)(str2 + SCALE - off[1])};

for(;(pos[0] >= 0) && (pos[1] >= 0) &&

(skipper[0][pos[0]] == skipper[1][pos[1]]);

pos[0]--, pos[1]--);

long fin[] = {

min((pos[0] + 1) * SCALE - 1, len[0] - 1),

min((pos[1] + 1) * SCALE - 1, len[1] - 1)};

for(;(fin[0] >= 0) && (fin[1] >= 0) && (str1[fin[0]] == str2[fin[1]]);

fin[0]--, fin[1]--);

return len[0] - fin[0] - 1; //should be the same as len[1] - fin[1] - 1

}

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a

mismatch
is found, you'll dig in and compare them at byte/word level (depending on

if
your encoding is 8-bit or 16-bit)

In x86 asm, you should use the string optimized lodsb, cmpsb
instructions.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
> Is there a quick way to find the index of the first character different in > two strings?
>
> For instance, if I had the strings
> "abcdefghijkl"
> and
> "abcdefxyz"
> I would want the return value to be 6.
>
> Either in C# or C++. (I obviously know how to do it by looping through the > characters in the character array, but I was wondering if there was
> some
> smart-alec memory-manipulation function, asm even, or regexes I could use > that would do it.)
>
> Thanks
>
>



Nov 16 '05 #6
For the masm guys, which I hope will be able to make mincemeat out of this,
please read from the bottom up. It would make my day.
But this is where I'm up to with 'the pecker languages'.
1. tcslen - You are first finding the string lengths of the two strings.
Now, that is going to walk thru the string a character at a time looking for zero termination. That is an overkill. As long as you're doing this,
there's no point in optimizing. You could very well compare the strings
right then. Try a loop method with character comparisons and time it.
You'll be surprised.
ok, that's something to consider... thanks

2. Your algorithm does the 4-byte boundary alignment. However, if the
string is already 4-byte aligned, your skipper will still go ahead and skip 4 bytes.
eh?
a) the string won't be 4-byte aligned, and
b) it should still work if it is, shouldn't it?

Give me an example of how to call it with some 4-byte aligned strings
(including how to create the 4-byte aligned strings in the first place)
that shows an incorrectness.
Or better still how to write it in assembly language...

"Beeeeeves" <1234512345789123456789> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl... I've written a couple of mean functions to do this, they seem to be
working
pretty well (0.6 seconds to execute each of them on strings of 749KB,
where
something had been deleted out of the middle, on a 1.3MHz Duron. That was
with the DLL in release mode.)

Anyone able to beat this with some assembly language?

#define SCALE (long)(sizeof(long) / sizeof(_TCHAR))
long __cdecl FirstDifferentIndex(_TCHAR* str1, _TCHAR* str2)

{

long minlen = min((long)_tcslen(str1), (long)_tcslen(str2)),

minlongs = minlen / SCALE,

i = 0, t = 0;

long* skipper[] = {(long*)str1, (long*)str2};

for(;(i < minlongs) && (skipper[0][i] == skipper[1][i]); i++);

for(t = i * SCALE; (t < minlen) && (str1[t] == str2[t]); t++);

return t == minlen ? -1 : t;

}

long _cdecl AmountSameAtEnd(_TCHAR* str1, _TCHAR* str2)

{

long len[] = {(long)_tcslen(str1), (long)_tcslen(str2)},

off[] = {len[0] % SCALE, len[1] % SCALE},

pos[] = {(len[0] - off[0])/SCALE, (len[1] - off[1])/SCALE},

t = 0,

* skipper[] = {

(long*)(str1 + SCALE - off[0]),

(long*)(str2 + SCALE - off[1])};

for(;(pos[0] >= 0) && (pos[1] >= 0) &&

(skipper[0][pos[0]] == skipper[1][pos[1]]);

pos[0]--, pos[1]--);

long fin[] = {

min((pos[0] + 1) * SCALE - 1, len[0] - 1),

min((pos[1] + 1) * SCALE - 1, len[1] - 1)};

for(;(fin[0] >= 0) && (fin[1] >= 0) && (str1[fin[0]] == str2[fin[1]]);

fin[0]--, fin[1]--);

return len[0] - fin[0] - 1; //should be the same as len[1] - fin[1] - 1

}

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a

mismatch
is found, you'll dig in and compare them at byte/word level (depending on

if
your encoding is 8-bit or 16-bit)

In x86 asm, you should use the string optimized lodsb, cmpsb
instructions.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
> Is there a quick way to find the index of the first character different in > two strings?
>
> For instance, if I had the strings
> "abcdefghijkl"
> and
> "abcdefxyz"
> I would want the return value to be 6.
>
> Either in C# or C++. (I obviously know how to do it by looping through the > characters in the character array, but I was wondering if there was
> some
> smart-alec memory-manipulation function, asm even, or regexes I could use > that would do it.)
>
> Thanks
>
>




Nov 16 '05 #7
p.s. sorry about the readability... quite impressed you made it to be honest
;-)

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:e5**************@TK2MSFTNGP11.phx.gbl...
I don't have my dev machine in front of me, but I could spot a few problems in this code.

1. tcslen - You are first finding the string lengths of the two strings.
Now, that is going to walk thru the string a character at a time looking for zero termination. That is an overkill. As long as you're doing this,
there's no point in optimizing. You could very well compare the strings
right then. Try a loop method with character comparisons and time it.
You'll be surprised.

2. Your algorithm does the 4-byte boundary alignment. However, if the
string is already 4-byte aligned, your skipper will still go ahead and skip 4 bytes.

-vJ
"Beeeeeves" <1234512345789123456789> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
I've written a couple of mean functions to do this, they seem to be
working
pretty well (0.6 seconds to execute each of them on strings of 749KB,
where
something had been deleted out of the middle, on a 1.3MHz Duron. That was with the DLL in release mode.)

Anyone able to beat this with some assembly language?

#define SCALE (long)(sizeof(long) / sizeof(_TCHAR))
long __cdecl FirstDifferentIndex(_TCHAR* str1, _TCHAR* str2)

{

long minlen = min((long)_tcslen(str1), (long)_tcslen(str2)),

minlongs = minlen / SCALE,

i = 0, t = 0;

long* skipper[] = {(long*)str1, (long*)str2};

for(;(i < minlongs) && (skipper[0][i] == skipper[1][i]); i++);

for(t = i * SCALE; (t < minlen) && (str1[t] == str2[t]); t++);

return t == minlen ? -1 : t;

}

long _cdecl AmountSameAtEnd(_TCHAR* str1, _TCHAR* str2)

{

long len[] = {(long)_tcslen(str1), (long)_tcslen(str2)},

off[] = {len[0] % SCALE, len[1] % SCALE},

pos[] = {(len[0] - off[0])/SCALE, (len[1] - off[1])/SCALE},

t = 0,

* skipper[] = {

(long*)(str1 + SCALE - off[0]),

(long*)(str2 + SCALE - off[1])};

for(;(pos[0] >= 0) && (pos[1] >= 0) &&

(skipper[0][pos[0]] == skipper[1][pos[1]]);

pos[0]--, pos[1]--);

long fin[] = {

min((pos[0] + 1) * SCALE - 1, len[0] - 1),

min((pos[1] + 1) * SCALE - 1, len[1] - 1)};

for(;(fin[0] >= 0) && (fin[1] >= 0) && (str1[fin[0]] == str2[fin[1]]);

fin[0]--, fin[1]--);

return len[0] - fin[0] - 1; //should be the same as len[1] - fin[1] - 1

}

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a

mismatch
is found, you'll dig in and compare them at byte/word level (depending
on if
your encoding is 8-bit or 16-bit)

In x86 asm, you should use the string optimized lodsb, cmpsb
instructions.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
> Is there a quick way to find the index of the first character
different in
> two strings?
>
> For instance, if I had the strings
> "abcdefghijkl"
> and
> "abcdefxyz"
> I would want the return value to be 6.
>
> Either in C# or C++. (I obviously know how to do it by looping
through the
> characters in the character array, but I was wondering if there was
> some
> smart-alec memory-manipulation function, asm even, or regexes I could

use
> that would do it.)
>
> Thanks
>
>



Nov 16 '05 #8
This isn't the most efficient ASM method but it'll do your job. This is
MASM format. From here, you can play with it and learn the hows and whys.

; prototype
;int GetCharOffset(void* array1, void* array2);

GetCharOffset proc uses esi edi A1:dword, A2:dword
mov esi, [A1]
mov edi, [A2]
mov ecx, 0

Loop:
cmp byte ptr [esi+ecx], 0
je EndLoop
cmp byte ptr [edi+ecx], 0
je EndLoop
inc ecx
cmp byte ptr [esi+ecx-1], byte ptr [edi+ecx-1]
jne Loop

EndLoop:
mov eax, ecx
ret
GetCharOffset endp

"Beeeeeves" <1234512345789123456789> wrote in message
news:Ou**************@TK2MSFTNGP12.phx.gbl...
For the masm guys, which I hope will be able to make mincemeat out of this, please read from the bottom up. It would make my day.
But this is where I'm up to with 'the pecker languages'.
1. tcslen - You are first finding the string lengths of the two strings.
Now, that is going to walk thru the string a character at a time looking

for
zero termination. That is an overkill. As long as you're doing this,
there's no point in optimizing. You could very well compare the strings
right then. Try a loop method with character comparisons and time it.
You'll be surprised.


ok, that's something to consider... thanks

2. Your algorithm does the 4-byte boundary alignment. However, if the
string is already 4-byte aligned, your skipper will still go ahead and

skip
4 bytes.


eh?
a) the string won't be 4-byte aligned, and
b) it should still work if it is, shouldn't it?

Give me an example of how to call it with some 4-byte aligned strings
(including how to create the 4-byte aligned strings in the first place)
that shows an incorrectness.
Or better still how to write it in assembly language...

"Beeeeeves" <1234512345789123456789> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
I've written a couple of mean functions to do this, they seem to be
working
pretty well (0.6 seconds to execute each of them on strings of 749KB,
where
something had been deleted out of the middle, on a 1.3MHz Duron. That was with the DLL in release mode.)

Anyone able to beat this with some assembly language?

#define SCALE (long)(sizeof(long) / sizeof(_TCHAR))
long __cdecl FirstDifferentIndex(_TCHAR* str1, _TCHAR* str2)

{

long minlen = min((long)_tcslen(str1), (long)_tcslen(str2)),

minlongs = minlen / SCALE,

i = 0, t = 0;

long* skipper[] = {(long*)str1, (long*)str2};

for(;(i < minlongs) && (skipper[0][i] == skipper[1][i]); i++);

for(t = i * SCALE; (t < minlen) && (str1[t] == str2[t]); t++);

return t == minlen ? -1 : t;

}

long _cdecl AmountSameAtEnd(_TCHAR* str1, _TCHAR* str2)

{

long len[] = {(long)_tcslen(str1), (long)_tcslen(str2)},

off[] = {len[0] % SCALE, len[1] % SCALE},

pos[] = {(len[0] - off[0])/SCALE, (len[1] - off[1])/SCALE},

t = 0,

* skipper[] = {

(long*)(str1 + SCALE - off[0]),

(long*)(str2 + SCALE - off[1])};

for(;(pos[0] >= 0) && (pos[1] >= 0) &&

(skipper[0][pos[0]] == skipper[1][pos[1]]);

pos[0]--, pos[1]--);

long fin[] = {

min((pos[0] + 1) * SCALE - 1, len[0] - 1),

min((pos[1] + 1) * SCALE - 1, len[1] - 1)};

for(;(fin[0] >= 0) && (fin[1] >= 0) && (str1[fin[0]] == str2[fin[1]]);

fin[0]--, fin[1]--);

return len[0] - fin[0] - 1; //should be the same as len[1] - fin[1] - 1

}

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
In a 32 bit environment, the fastest approach would be find the 4-byte
boundary and start reading and comparing 4 bytes at a time. When a

mismatch
is found, you'll dig in and compare them at byte/word level (depending
on if
your encoding is 8-bit or 16-bit)

In x86 asm, you should use the string optimized lodsb, cmpsb
instructions.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:uy******************@TK2MSFTNGP10.phx.gbl...
> Is there a quick way to find the index of the first character
different in
> two strings?
>
> For instance, if I had the strings
> "abcdefghijkl"
> and
> "abcdefxyz"
> I would want the return value to be 6.
>
> Either in C# or C++. (I obviously know how to do it by looping
through the
> characters in the character array, but I was wondering if there was
> some
> smart-alec memory-manipulation function, asm even, or regexes I could

use
> that would do it.)
>
> Thanks
>
>



Nov 16 '05 #9
Ooops, the 'jne Loop' should be 'je Loop' instead.
"Relvinian" <m@msn.com> wrote in message
news:Du********************@comcast.com...
This isn't the most efficient ASM method but it'll do your job. This is
MASM format. From here, you can play with it and learn the hows and whys.

; prototype
;int GetCharOffset(void* array1, void* array2);

GetCharOffset proc uses esi edi A1:dword, A2:dword
mov esi, [A1]
mov edi, [A2]
mov ecx, 0

Loop:
cmp byte ptr [esi+ecx], 0
je EndLoop
cmp byte ptr [edi+ecx], 0
je EndLoop
inc ecx
cmp byte ptr [esi+ecx-1], byte ptr [edi+ecx-1]
jne Loop

EndLoop:
mov eax, ecx
ret
GetCharOffset endp

"Beeeeeves" <1234512345789123456789> wrote in message
news:Ou**************@TK2MSFTNGP12.phx.gbl...
For the masm guys, which I hope will be able to make mincemeat out of this,
please read from the bottom up. It would make my day.
But this is where I'm up to with 'the pecker languages'.
1. tcslen - You are first finding the string lengths of the two strings. Now, that is going to walk thru the string a character at a time looking
for
zero termination. That is an overkill. As long as you're doing this,
there's no point in optimizing. You could very well compare the
strings right then. Try a loop method with character comparisons and time it.
You'll be surprised.


ok, that's something to consider... thanks

2. Your algorithm does the 4-byte boundary alignment. However, if the
string is already 4-byte aligned, your skipper will still go ahead and

skip
4 bytes.


eh?
a) the string won't be 4-byte aligned, and
b) it should still work if it is, shouldn't it?

Give me an example of how to call it with some 4-byte aligned strings
(including how to create the 4-byte aligned strings in the first place)
that shows an incorrectness.
Or better still how to write it in assembly language...

"Beeeeeves" <1234512345789123456789> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...
I've written a couple of mean functions to do this, they seem to be
working
pretty well (0.6 seconds to execute each of them on strings of 749KB,
where
something had been deleted out of the middle, on a 1.3MHz Duron. That

was with the DLL in release mode.)

Anyone able to beat this with some assembly language?

#define SCALE (long)(sizeof(long) / sizeof(_TCHAR))
long __cdecl FirstDifferentIndex(_TCHAR* str1, _TCHAR* str2)

{

long minlen = min((long)_tcslen(str1), (long)_tcslen(str2)),

minlongs = minlen / SCALE,

i = 0, t = 0;

long* skipper[] = {(long*)str1, (long*)str2};

for(;(i < minlongs) && (skipper[0][i] == skipper[1][i]); i++);

for(t = i * SCALE; (t < minlen) && (str1[t] == str2[t]); t++);

return t == minlen ? -1 : t;

}

long _cdecl AmountSameAtEnd(_TCHAR* str1, _TCHAR* str2)

{

long len[] = {(long)_tcslen(str1), (long)_tcslen(str2)},

off[] = {len[0] % SCALE, len[1] % SCALE},

pos[] = {(len[0] - off[0])/SCALE, (len[1] - off[1])/SCALE},

t = 0,

* skipper[] = {

(long*)(str1 + SCALE - off[0]),

(long*)(str2 + SCALE - off[1])};

for(;(pos[0] >= 0) && (pos[1] >= 0) &&

(skipper[0][pos[0]] == skipper[1][pos[1]]);

pos[0]--, pos[1]--);

long fin[] = {

min((pos[0] + 1) * SCALE - 1, len[0] - 1),

min((pos[1] + 1) * SCALE - 1, len[1] - 1)};

for(;(fin[0] >= 0) && (fin[1] >= 0) && (str1[fin[0]] == str2[fin[1]]);

fin[0]--, fin[1]--);

return len[0] - fin[0] - 1; //should be the same as len[1] - fin[1] - 1
}

"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:O3*************@TK2MSFTNGP10.phx.gbl...
> In a 32 bit environment, the fastest approach would be find the 4-byte> boundary and start reading and comparing 4 bytes at a time. When a
mismatch
> is found, you'll dig in and compare them at byte/word level (depending on
if
> your encoding is 8-bit or 16-bit)
>
> In x86 asm, you should use the string optimized lodsb, cmpsb
> instructions.
>
> -vJ
>
> "Beeeeeves" <1234512345789123456789> wrote in message
> news:uy******************@TK2MSFTNGP10.phx.gbl...
> > Is there a quick way to find the index of the first character different in
> > two strings?
> >
> > For instance, if I had the strings
> > "abcdefghijkl"
> > and
> > "abcdefxyz"
> > I would want the return value to be 6.
> >
> > Either in C# or C++. (I obviously know how to do it by looping through the
> > characters in the character array, but I was wondering if there was
> > some
> > smart-alec memory-manipulation function, asm even, or regexes I

could use
> > that would do it.)
> >
> > Thanks
> >
> >
>
>



Nov 16 '05 #10
> eh?
a) the string won't be 4-byte aligned, and
b) it should still work if it is, shouldn't it?

Give me an example of how to call it with some 4-byte aligned strings
(including how to create the 4-byte aligned strings in the first place)
that shows an incorrectness.
Or better still how to write it in assembly language...


What I meant was, if your string was already 4-byte aligned, your skipper is
going to skip 4 bytes and read them byte by byte. A small optimization
here, but your code will still work.

Again, I'm sorry I'm unable to give you ASM at this time, but I'd rather you
worry about the algorithm first, before jumping to ASM.
[Not Tested]
int CompareStrX (TCHAR *str1, TCHAR *str2)
{
int ret = 0;
while (str1[ret] == str2[ret])
{
if (str1[ret] == NULL)
return -1;
ret++;
}

return ret;
}

This will throw out a real tight loop. Compare your previous code to this.
Your C++ compiler will probably generate the most efficient code for this
already and you wouldn't need to write a line of assembly.

-vJ

"Beeeeeves" <1234512345789123456789> wrote in message
news:Ou**************@TK2MSFTNGP12.phx.gbl...
[Stripped out]
Nov 16 '05 #11
"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:%2***************@TK2MSFTNGP11.phx.gbl...
if (str1[ret] == NULL)


This is a pet peeve of mine. "NULL" refers to an *address* of zero.
Here, what you actually want is a *character* of zero. The official ASCII
name for that is "NUL" (one L). In ANSI C, that line would probably give
you a warning (as NULL as #defined to "(void*)0"). Since C++'s strict
typing manke that definition non-viable and requires a compromised #define
for NULL(of just "0"), you won't get the warning, but it's still just wrong.

--
Truth,
James Curran
Home: www.noveltheory.com Work: www.njtheater.com
Blog: www.honestillusion.com Day Job: www.partsearch.com
(note new day job!)
Nov 16 '05 #12
True.

Thanks for the correction.

-vJ

"James Curran" <Ja*********@mvps.org> wrote in message
news:eJ**************@TK2MSFTNGP09.phx.gbl...
"Vijaye Raji" <no*************@hotmail.com> wrote in message
news:%2***************@TK2MSFTNGP11.phx.gbl...
if (str1[ret] == NULL)


This is a pet peeve of mine. "NULL" refers to an *address* of zero.
Here, what you actually want is a *character* of zero. The official ASCII
name for that is "NUL" (one L). In ANSI C, that line would probably give
you a warning (as NULL as #defined to "(void*)0"). Since C++'s strict
typing manke that definition non-viable and requires a compromised #define
for NULL(of just "0"), you won't get the warning, but it's still just
wrong.

--
Truth,
James Curran
Home: www.noveltheory.com Work: www.njtheater.com
Blog: www.honestillusion.com Day Job: www.partsearch.com
(note new day job!)

Nov 16 '05 #13
Thanks a lot
Any idea how to put it into a C++ function

e.g
int GetCharOffset(void* array1, void* array2

_as

....???????..

}
Nov 16 '05 #14
If you'd tell me how to put it into a C++ function then I could debug it and would be able to understand it

thanks!!!
Nov 16 '05 #15
They are comparing byte by byte. That is what the 'byte ptr' is referencing
in the two 'cmp' statements and the two 'mov' with the cmp al,dl statement.

As for comparing DWORDs instead of BYTEs, there is a more a lot more work
involved. First, you would have to compare the strings byte-by-byte until
you are aligned on a dword boundary then you could check dword-by-dword.
Once you found a non-match, then you would have to take that dword and do a
byte-by-byte check on it to find out which byte isn't the same. In all
this process, you would also have to correctly find a NULL-terminator for
each of the strings while comparing dwords.

There are tricks by anding/oring/etc the dword values to find a
NULL-terminator which can be done to help make the algorythm quicker in this
process.

It can be done and it could be well worth it if the strings you are trying
to compare are long in length. If they are short, (not really sure of
length without testing -- a rough guess would be less then 256 bytes), I
would stay with just a byte comparison.

Just a quick note: if you are running this on a P4 machine, change the 'inc
ecx' to a 'add ecx, 1'. This will speed up the loop a little too. Also, if
this is the full function, you don't need the 'long s = SCALE;'. You can
remove this.

Let me know if you have any more questions.

Relvinian

"Beeeeeeves" <an*******@discussions.microsoft.com> wrote in message
news:CA**********************************@microsof t.com...
That's great Relvinian it's working.
I changed it to:

long _cdecl GetCharOffset(void* str1, void* str2)
{
long s = SCALE;
_asm
{
mov esi, [str1]
mov edi, [str2]
mov ecx, 0
mov eax, ecx
LoopStart:
cmp byte ptr [esi+ecx], 0
je EndLoop
cmp byte ptr [edi+ecx], 0
je EndLoop
inc ecx
mov al, byte ptr [esi+ecx-1]
mov dl, byte ptr [edi+ecx-1]
cmp al, dl
je LoopStart
EndLoop:
mov eax, ecx
}
}

However I'm unsure about:
How it knows what the size of the individual elements it's comparing are. They are presumably bytes? If they are bytes how can I make them DWORDs until it finds a different one, then make them bytes again? (i'm a beginner so please explain it...)
----- Relvinian wrote: -----

This isn't the most efficient ASM method but it'll do your job. This is MASM format. From here, you can play with it and learn the hows and whys.
; prototype
;int GetCharOffset(void* array1, void* array2);

GetCharOffset proc uses esi edi A1:dword, A2:dword
mov esi, [A1]
mov edi, [A2]
mov ecx, 0

Loop:
cmp byte ptr [esi+ecx], 0
je EndLoop
cmp byte ptr [edi+ecx], 0
je EndLoop
inc ecx
cmp byte ptr [esi+ecx-1], byte ptr [edi+ecx-1]
jne Loop

EndLoop:
mov eax, ecx
ret
GetCharOffset endp

"Beeeeeves" <1234512345789123456789> wrote in message
news:Ou**************@TK2MSFTNGP12.phx.gbl...
> For the masm guys, which I hope will be able to make mincemeat out of
this,
> please read from the bottom up. It would make my day.
> But this is where I'm up to with 'the pecker languages'.
>>> 1. tcslen - You are first finding the string lengths of the two strings. >> Now, that is going to walk thru the string a character at a time looking
> for
>> zero termination. That is an overkill. As long as you're doing
this, >> there's no point in optimizing. You could very well compare the strings >> right then. Try a loop method with character comparisons and time it. >> You'll be surprised.
>> ok, that's something to consider... thanks
>>>>> 2. Your algorithm does the 4-byte boundary alignment. However, if the >> string is already 4-byte aligned, your skipper will still go ahead and > skip
>> 4 bytes.
>> eh? > a) the string won't be 4-byte aligned, and
> b) it should still work if it is, shouldn't it?
>> Give me an example of how to call it with some 4-byte aligned
strings > (including how to create the 4-byte aligned strings in the first place) > that shows an incorrectness.
> Or better still how to write it in assembly language...
>> "Beeeeeves" <1234512345789123456789> wrote in message

> news:%2****************@TK2MSFTNGP10.phx.gbl...
>> I've written a couple of mean functions to do this, they seem to

be >> working
>> pretty well (0.6 seconds to execute each of them on strings of 749KB, >> where
>> something had been deleted out of the middle, on a 1.3MHz Duron. That was >> with the DLL in release mode.)
>>>> Anyone able to beat this with some assembly language?
>>>> #define SCALE (long)(sizeof(long) / sizeof(_TCHAR))
>> long __cdecl FirstDifferentIndex(_TCHAR* str1, _TCHAR* str2)
>>>> {
>>>> long minlen = min((long)_tcslen(str1), (long)_tcslen(str2)),
>>>> minlongs = minlen / SCALE,
>>>> i = 0, t = 0;
>>>> long* skipper[] = {(long*)str1, (long*)str2};
>>>> for(;(i < minlongs) && (skipper[0][i] == skipper[1][i]); i++);
>>>> for(t = i * SCALE; (t < minlen) && (str1[t] == str2[t]); t++);
>>>> return t == minlen ? -1 : t;
>>>> }
>>>> long _cdecl AmountSameAtEnd(_TCHAR* str1, _TCHAR* str2)
>>>> {
>>>> long len[] = {(long)_tcslen(str1), (long)_tcslen(str2)},
>>>> off[] = {len[0] % SCALE, len[1] % SCALE},
>>>> pos[] = {(len[0] - off[0])/SCALE, (len[1] - off[1])/SCALE},
>>>> t = 0,
>>>> * skipper[] = {
>>>> (long*)(str1 + SCALE - off[0]),
>>>> (long*)(str2 + SCALE - off[1])};
>>>> for(;(pos[0] >= 0) && (pos[1] >= 0) &&>>>> (skipper[0][pos[0]]
== skipper[1][pos[1]]); >>>> pos[0]--, pos[1]--);
>>>> long fin[] = {
>>>> min((pos[0] + 1) * SCALE - 1, len[0] - 1),
>>>> min((pos[1] + 1) * SCALE - 1, len[1] - 1)};
>>>> for(;(fin[0] >= 0) && (fin[1] >= 0) && (str1[fin[0]] == str2[fin[1]]); >>>> fin[0]--, fin[1]--);
>>>> return len[0] - fin[0] - 1; //should be the same as len[1] - fin[1] - 1 >>>> }
>>>> "Vijaye Raji" <no*************@hotmail.com> wrote in message
>> news:O3*************@TK2MSFTNGP10.phx.gbl...
>>> In a 32 bit environment, the fastest approach would be find the 4-byte >>> boundary and start reading and comparing 4 bytes at a time. When a >> mismatch
>>> is found, you'll dig in and compare them at byte/word level (depending
on >> if
>>> your encoding is 8-bit or 16-bit)
>>>>>> In x86 asm, you should use the string optimized lodsb, cmpsb
>>> instructions.
>>>>>> -vJ
>>>>>> "Beeeeeves" <1234512345789123456789> wrote in message
>>> news:uy******************@TK2MSFTNGP10.phx.gbl...
>>>> Is there a quick way to find the index of the first character different >> in
>>>> two strings?
>>>>>>>> For instance, if I had the strings
>>>> "abcdefghijkl"
>>>> and
>>>> "abcdefxyz"
>>>> I would want the return value to be 6.
>>>>>>>> Either in C# or C++. (I obviously know how to do it by
looping
through >> the
>>>> characters in the character array, but I was wondering if there

was >>>> some
>>>> smart-alec memory-manipulation function, asm even, or regexes I could >> use
>>>> that would do it.)
>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>

Nov 16 '05 #16

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Ren? M?hle | last post by:
I have a psp script with a procedure just to run an update on one table. The Problem occurs when I try to compile this script with pspload: ORA-20006: "frsys_updatereport.psp": compilation...
1
by: rusttree | last post by:
I'm working on a program that manipulates bmp files. I know the offset location of each piece of relevent data within the bmp file. For example, I know the 18th through 21st byte is an integer...
0
by: MLH | last post by:
Is an apostrophe a character of special significance to MySQL in a way that would cause "Bob's dog" to become translated into a 12-character string when typed into a MySQL memo field? If I type...
7
by: Justin | last post by:
i need to build the unsigned character string: "PR0N\0Spam\0G1RLS\0Other\0Items\0\0\0" from the signed character string: "PR0N Spam G1RLS Other Items" Tokeninzing the character string is not...
5
by: Karthik | last post by:
Hello! I am not a wizard in this area! Just need some help out in this. I am trying to convert bstr string to new character string. Here is the snippet of my code. **** Code Start**** ...
25
by: lovecreatesbeauty | last post by:
Hello experts, I write a function named palindrome to determine if a character string is palindromic, and test it with some example strings. Is it suitable to add it to a company/project library...
8
by: Brand Bogard | last post by:
Does the C standard include a library function to convert an 8 bit character string to a 16 bit character string?
1
by: iwongu | last post by:
Hi, I have a question about std::wcout and its underlying C functions. According to C99, a stream have an orientation type that is one of byte-oriented or wide-one, and it is determined by the...
3
by: Gmuchan | last post by:
Hi, Help needed please. Basically I am wanting to update a 16 digit character string which is in a table called say accounts. If the account no. field is say 0502562389568745 I want to...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.