Bytes | Software Development & Data Engineering Community

String to byte[] reloaded

Hi
I need an efficient method to convert a string object to its byte[]
equivalent.
I know there are LOTS of methods, but they lack in efficiency. All
methods allocate new memory to create the byte[] array. Of course,
when memory allocation occurs, then naturally extra processing power
is needed.
To be more explicit, MFC introduced a super-efficient method of dealing
with this situation. As far as I remember (I switched from MFC to .NET
a few years ago), MFC's CString class has a method with the following
signature:

byte[] GetBuffer()

This method "blocks" the CString instance until ReleaseBuffer() method
is called. Again, maybe the method names are not quite as I remember,
but the important thing is the principle.
The marvelous result is that you may freely iterate through the byte[]
array returned by GetBuffer() method and even modify it (with respect
to some limits, of course), and all this, without allocating new
memory.
My question is: will using the MemoryStream class do the job for me? I
mean, there is a method called GetBuffer(), but will it allocate new
memory or not? This is not stated in the MS documentation.

Thanks

Feb 10 '07 #1
nano2k <ad***********@ikonsoft.ro> wrote:
> I need an efficient method to convert a string object to its byte[]
> equivalent.

*Which* byte[] equivalent? It depends on the encoding.

> I know there are LOTS of methods, but they lack in efficiency. All
> methods allocate new memory to create the byte[] array.

No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
copy the bytes into an existing byte array. Of course, you'll have to
allocate the array at some point first... I'm currently working on a
BufferManager class which allows buffers to be reused etc, but I'm not
sure it's really worth it here.

Have you actually proven (with profiling etc) that a normal
Encoding.GetBytes call is causing you a bottleneck?
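
For anyone following along, a minimal sketch of the overload mentioned above: Encoding.GetBytes(string, int, int, byte[], int) writes into an array the caller already owns. The class name ReusableEncoding, the EncodeInto helper, the choice of UTF-8 and the 4096-byte size are assumptions made for illustration; none of them come from this thread.

```csharp
using System;
using System.Text;

public static class ReusableEncoding
{
    // Allocated once and reused for every call. Not thread-safe:
    // concurrent callers would overwrite each other's bytes.
    private static readonly byte[] Shared = new byte[4096];

    // Encodes 'text' as UTF-8 into the shared buffer and returns how many
    // bytes were written. Throws ArgumentException if the text does not fit.
    public static int EncodeInto(string text, out byte[] buffer)
    {
        buffer = Shared;
        return Encoding.UTF8.GetBytes(text, 0, text.Length, Shared, 0);
    }
}
```

Only the first N bytes of the buffer are valid after each call, where N is the return value, so whatever consumes the array has to accept a length as well.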

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 10 '07 #2


"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...
> nano2k <ad***********@ikonsoft.ro> wrote:
>> I need an efficient method to convert a string object to it's byte[]
>> equivalent.
>
> *Which* byte[] equivalent? It depends on the encoding.
>
>> I know there are LOTS of methods, but they lack in efficiency. All
>> methods allocate new memory to create the byte[] array.
>
> No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
> copy the bytes into an existing byte array. Of course, you'll have to
> allocate the array at some point first... I'm currently working on a
> BufferManager class which allows buffers to be reused etc, but I'm not
> sure it's really worth it here.

Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

David

Feb 10 '07 #3
Do you really need a byte array? A string can be indexed by its
characters, and you can cast each char to an int, so effectively you
have an int array already. If you need it as bytes, just split each int
into two bytes.
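
To make the "split each int into two bytes" idea concrete, here is a small sketch. The names CharBytes and Split are made up for this example; it only shows the arithmetic on a single UTF-16 code unit and is not meant as a replacement for the Encoding classes.

```csharp
using System;

public static class CharBytes
{
    // Splits one UTF-16 code unit into its low and high byte.
    public static void Split(char c, out byte low, out byte high)
    {
        int value = c;               // a char converts implicitly to int
        low = (byte)(value & 0xFF);  // least significant 8 bits
        high = (byte)(value >> 8);   // most significant 8 bits
    }
}
```

For plain ASCII characters the high byte is always zero, which is why the raw in-memory form of mostly-ASCII text is half zeros.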

If you want to access the string as bytes to modify it, that is a really
bad idea. Strings are immutable, and every method that uses strings relies
on that.

--
Göran Andersson
_____
http://www.guffa.com
Feb 10 '07 #4
nano2k wrote:
> I need an efficient method to convert a string object to its byte[]
> equivalent.

There are many byte[] equivalents for a string, one for each encoding
and its options.

> I know there are LOTS of methods, but they lack in efficiency. All
> methods allocate new memory to create the byte[] array.

I disagree on this point! From my perspective, most of them *don't*
allocate a byte array - you've got to do it yourself.

> Of course,
> when memory allocation occurs, then naturally extra processing power
> is needed.
> To be more explicit, MFC introduced a super-efficient method of dealing
> with this situation. As far as I remember (I switched from MFC to .NET
> a few years ago), MFC's CString class has a method with the following
> signature:

Create an Encoding descendant instance and pre-allocate the byte[] you
pass to it. Most of the Encoding.GetBytes() overloads don't allocate byte
arrays; they require the caller to allocate the array, and that way you
control the allocation strategy.

-- Barry

--
http://barrkel.blogspot.com/
Feb 10 '07 #5
"David Browne" <davidbaxterbrowne no potted me**@hotmail.com> wrote:
>> No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
>> copy the bytes into an existing byte array. Of course, you'll have to
>> allocate the array at some point first... I'm currently working on a
>> BufferManager class which allows buffers to be reused etc, but I'm not
>> sure it's really worth it here.
>
> Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?

I hadn't before, to be honest. Can't say I like the idea of having to
explicitly call ReturnBuffer - my buffers allow you access to the byte
array, but implement IDisposable so you can just do:

using (IBuffer buffer = manager.GetBuffer(...))
{
    byte[] bytes = buffer.Bytes;
    ...
}

I'm not entirely surprised that others have thought it would be useful
though :)
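
For readers wondering what such a disposable buffer could look like, here is one possible sketch. It is not Jon's actual library, which is not shown in this thread: SimpleBufferManager, PooledBuffer and the fixed-size pooling policy are all assumptions, and only the IBuffer.Bytes and GetBuffer names are borrowed from the snippet above.

```csharp
using System;
using System.Collections.Generic;

public interface IBuffer : IDisposable
{
    byte[] Bytes { get; }
}

// Hypothetical pool of fixed-size byte arrays.
public sealed class SimpleBufferManager
{
    private readonly int size;
    private readonly Stack<byte[]> free = new Stack<byte[]>();
    private readonly object sync = new object();

    public SimpleBufferManager(int size)
    {
        this.size = size;
    }

    // Hands out a pooled array, allocating a new one only when the pool is empty.
    public IBuffer GetBuffer()
    {
        byte[] bytes;
        lock (sync)
        {
            bytes = free.Count > 0 ? free.Pop() : new byte[size];
        }
        return new PooledBuffer(this, bytes);
    }

    private void Return(byte[] bytes)
    {
        lock (sync)
        {
            free.Push(bytes);
        }
    }

    private sealed class PooledBuffer : IBuffer
    {
        private readonly SimpleBufferManager owner;
        private byte[] bytes;

        public PooledBuffer(SimpleBufferManager owner, byte[] bytes)
        {
            this.owner = owner;
            this.bytes = bytes;
        }

        public byte[] Bytes
        {
            get
            {
                if (bytes == null) throw new ObjectDisposedException("PooledBuffer");
                return bytes;
            }
        }

        // Returns the array to the pool; calling Dispose twice is harmless.
        public void Dispose()
        {
            if (bytes == null) return;
            owner.Return(bytes);
            bytes = null;
        }
    }
}
```

Nothing in this sketch stops a caller from keeping a reference to the array after the using block ends, so a stale reference can end up sharing a buffer with its next user.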

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 10 '07 #6


"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.com...
> "David Browne" <davidbaxterbrowne no potted me**@hotmail.com> wrote:
>>> No they don't. Use Encoding.GetBytes(string, int, int, byte[], int) to
>>> copy the bytes into an existing byte array. Of course, you'll have to
>>> allocate the array at some point first... I'm currently working on a
>>> BufferManager class which allows buffers to be reused etc, but I'm not
>>> sure it's really worth it here.
>>
>> Have you seen System.ServiceModel.Channels.BufferManager in .NET 3.0?
>
> I hadn't before, to be honest. Can't say I like the idea of having to
> explicitly call ReturnBuffer - my buffers allow you access to the byte
> array, but implement IDisposable so you can just do:
>
> using (IBuffer buffer = manager.GetBuffer(...))
> {
>     byte[] bytes = buffer.Bytes;
>     ...
> }

That's handy. Though if it's a public library I would worry that it could
lead to inadvertent sharing of buffers.

David

Feb 11 '07 #7
"David Browne" <davidbaxterbrowne no potted me**@hotmail.com> wrote:
>> using (IBuffer buffer = manager.GetBuffer(...))
>> {
>>     byte[] bytes = buffer.Bytes;
>>     ...
>> }
>
> That's handy. Though if it's a public library I would worry that it could
> lead to inadvertent sharing of buffers.

It would depend on the scope of the manager. The BufferManager
*classes* are public, but how you share instances of them is up to you
:)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 11 '07 #8
You have not stated what your performance requirements are or how you
measured that all other methods are not efficient enough, but to answer
the second part of your question: no, MemoryStream.GetBuffer does not
allocate new memory.

From Reflector:

public virtual byte[] GetBuffer()
{
    if (!this._exposable)
    {
        throw new UnauthorizedAccessException(
            Environment.GetResourceString("UnauthorizedAccess_MemStreamBuffer"));
    }
    return this._buffer;
}
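
A quick way to check the no-allocation behaviour of GetBuffer() yourself is to compare references. This is only a sketch: the class and method names are invented for the example, and the five-argument MemoryStream constructor is used so that the stream is created with publiclyVisible set to true (without that, GetBuffer() throws, as the listing shows).

```csharp
using System;
using System.IO;

public static class MemoryStreamBuffers
{
    // True if GetBuffer() hands back the very array the stream wraps.
    public static bool GetBufferReturnsSameArray()
    {
        byte[] backing = new byte[16];
        using (MemoryStream ms = new MemoryStream(backing, 0, backing.Length, true, true))
        {
            return object.ReferenceEquals(backing, ms.GetBuffer());
        }
    }

    // True if ToArray() returns a different array, that is, a copy.
    public static bool ToArrayReturnsCopy()
    {
        byte[] backing = new byte[16];
        using (MemoryStream ms = new MemoryStream(backing, 0, backing.Length, true, true))
        {
            return !object.ReferenceEquals(backing, ms.ToArray());
        }
    }
}
```

Note that this says nothing about strings: a MemoryStream only ever holds bytes that were already encoded, so it does not remove the string-to-bytes step.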

Feb 11 '07 #9
Hi, thanks all for your replies.
I will answer some of the ideas here in one place.
Indeed, most of them don't allocate, but rely on you to allocate.
So, from my perspective, it's the same.
I am using .NET Framework v1.1 and I need to compress my string.
Unfortunately, the compression library (no sources available to me)
takes a byte[] as its input parameter and I have a string to compress.
It is frustrating that I have to allocate new memory to perform this
operation. This sometimes leads to a webservice crash, as many requests
simultaneously require this operation => not enough memory.
I intend to use the buffer strictly for read-only operations.
I am aware that any "unmanaged" changes in such an intimate buffer
could cause future unexpected behavior.
Feb 12 '07 #10
nano2k <ad***********@ikonsoft.ro> wrote:
> Hi, thanks all for your replies.
> I will answer some of the ideas here in one place.
> Indeed, most of them don't allocate, but rely on you to allocate.
> So, from my perspective, it's the same.

No, they're completely different - if you get to do the allocation, you
can reuse the same buffer several times.

> I am using .NET Framework v1.1 and I need to compress my string.
> Unfortunately, the compression library (no sources available to me)
> takes a byte[] as its input parameter and I have a string to compress.
> It is frustrating that I have to allocate new memory to perform this
> operation. This sometimes leads to a webservice crash, as many requests
> simultaneously require this operation => not enough memory.

Just how large are the strings you're compressing? Can you not
serialize (or pseudo-serialize) the operation so you're only
compressing a few strings at a time?

> I intend to use the buffer strictly for read-only operations.
> I am aware that any "unmanaged" changes in such an intimate buffer
> could cause future unexpected behavior.

You *may* be able to get at the UTF-16 encoded (internal) version with
unsafe code, but I'd strongly recommend against it.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 12 '07 #11

If you get the string data directly from memory, you will get it as
UTF-16, so almost every other byte will be zero. I think that you get
better compression if you encode the string first.

If you have huge strings to compress, you should not keep them in
memory. Save the string to a file, and read a chunk at a time from the
file and compress. This of course adds a little overhead, but it scales
far better, and from what you said, it's the scalability that is the
problem.
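
The chunk-at-a-time idea can also be applied to the encoding step itself. The sketch below is an assumption-laden illustration, not code from anyone's project: ChunkedEncoder and WriteEncoded are invented names, UTF-8 is an arbitrary choice, and the output is any Stream (in practice it could be a compressing stream or a file). It uses a stateful Encoder so that characters split across chunk boundaries are still encoded correctly.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedEncoder
{
    // Encodes 'text' as UTF-8 and writes it to 'output' in chunks of at most
    // 'chunkChars' characters, using two small buffers that are reused.
    public static void WriteEncoded(string text, Stream output, int chunkChars)
    {
        if (chunkChars <= 0) throw new ArgumentOutOfRangeException("chunkChars");

        Encoding encoding = Encoding.UTF8;
        Encoder encoder = encoding.GetEncoder();
        char[] chars = new char[chunkChars];
        byte[] bytes = new byte[encoding.GetMaxByteCount(chunkChars)];

        for (int offset = 0; offset < text.Length; offset += chunkChars)
        {
            int count = Math.Min(chunkChars, text.Length - offset);
            text.CopyTo(offset, chars, 0, count);

            // Flush the encoder's internal state only on the final chunk.
            bool last = offset + count >= text.Length;
            int written = encoder.GetBytes(chars, 0, count, bytes, 0, last);
            output.Write(bytes, 0, written);
        }
    }
}
```

Peak extra memory is then bounded by the chunk size rather than by the size of the string, although the string itself still has to exist in memory.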

--
Göran Andersson
_____
http://www.guffa.com
Feb 12 '07 #12
Thanks, Jon

> No, they're completely different - if you get to do the allocation, you
> can reuse the same buffer several times.

That's right. But in my particular case, I don't need to use the
buffer several times. I'm just forced to create an extra buffer just
to pass data to the compression method. That's my only problem.
Unfortunately, I cannot change the compress method's signature (e.g.
to accept a MemoryStream object as input data, etc), so I thought I
would be able to adapt my code, because this is where I'm 100% in
control.

> Just how large are the strings you're compressing? Can you not
> serialize (or pseudo-serialize) the operation so you're only
> compressing a few strings at a time?

I cannot control the size of the request because the request
is based on an SQL statement that could potentially return tens of megs
of data (imagine for example a report on annual activity for a
company). Now, because everything runs inside a webservice, there can be
multiple requests of this type.

> You *may* be able to get at the UTF-16 encoded (internal) version with
> unsafe code, but I'd strongly recommend against it.

I don't know what's worse: having to handle such a buffer very
carefully, or risking the reliability of my webservice...

Feb 12 '07 #13
nano2k <ad***********@ikonsoft.ro> wrote:
>> No, they're completely different - if you get to do the allocation, you
>> can reuse the same buffer several times.
>
> That's right. But in my particular case, I don't need to use the
> buffer several times. I'm just forced to create an extra buffer just
> to pass data to the compression method. That's my only problem.
> Unfortunately, I cannot change the compress method's signature (e.g.
> to accept a MemoryStream object as input data, etc), so I thought I
> would be able to adapt my code, because this is where I'm 100% in
> control.

You wanted to reduce the overall memory consumption, right? So create a
buffer and reuse it, encoding different strings into the same byte
array.

>> Just how large are the strings you're compressing? Can you not
>> serialize (or pseudo-serialize) the operation so you're only
>> compressing a few strings at a time?
>
> I cannot control the size of the request because the request
> is based on an SQL statement that could potentially return tens of megs
> of data (imagine for example a report on annual activity for a
> company). Now, because everything runs inside a webservice, there can be
> multiple requests of this type.

If there are no bounds to the amount of data you need to compress, you
really, really should be streaming it. Grabbing it in one lump is never
going to be a good option.

>> You *may* be able to get at the UTF-16 encoded (internal) version with
>> unsafe code, but I'd strongly recommend against it.
>
> I don't know what's worse: having to handle such a buffer very
> carefully, or risking the reliability of my webservice...

It sounds to me like you should look very carefully at the overall
architecture and your compression API.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 12 '07 #14
> You wanted to reduce the overall memory consumption, right? So create a
> buffer and reuse it, encoding different strings into the same byte
> array.

In theory that's OK, but should I keep a static buffer of several tens
of megabytes locked between requests?
This means I would have to size the buffer according to the largest
request. Not only that, but I would need to protect the buffer against
other requests that may need to compress at the same time.

> If there are no bounds to the amount of data you need to compress, you
> really, really should be streaming it. Grabbing it in one lump is never
> going to be a good option.

Yes, I also think this is the way to go.
Thanks.

Feb 12 '07 #15
Maybe I missed something in an earlier post... but why do you want a
buffer that big? There are far better (more efficient) ways of reading
data using small buffers and chunking. I wasn't sure if your example
was dealing with database BLOBs - but if so you can do this using
small buffers too (even for big BLOBs).

Small (ish) good. Big bad.

Marc
Feb 12 '07 #16
nano2k <ad***********@ikonsoft.ro> wrote:
>> You wanted to reduce the overall memory consumption, right? So create a
>> buffer and reuse it, encoding different strings into the same byte
>> array.
>
> In theory that's OK, but should I keep a static buffer of several tens
> of megabytes locked between requests?

Tens of megabytes isn't too big if you've only got one of them. Better
than tens of megabytes being allocated and deallocated on a regular
basis.

> This means I would have to size the buffer according to the largest
> request. Not only that, but I would need to protect the buffer against
> other requests that may need to compress at the same time.

Yes, you need to serialize the compression. That will help keep the
memory usage down a lot, even if it slows down the app overall.

>> If there are no bounds to the amount of data you need to compress, you
>> really, really should be streaming it. Grabbing it in one lump is never
>> going to be a good option.
>
> Yes, I also think this is the way to go.

Goodo.
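
As a rough illustration of "one shared buffer plus serialized access", something along these lines would do. Everything here is hypothetical: the SerializedEncoder class, the BufferConsumer delegate and the grow-on-demand policy are invented for the example, and the delegate stands in for a compression call that accepts a buffer and a length, which the library discussed in this thread may not offer.

```csharp
using System;
using System.Text;

// Hypothetical callback that consumes the first 'count' bytes of 'buffer'.
public delegate int BufferConsumer(byte[] buffer, int count);

public sealed class SerializedEncoder
{
    private readonly object gate = new object();
    private byte[] buffer = new byte[1024];

    // Encodes 'text' into the single shared buffer and passes it to 'consumer'.
    // The lock ensures only one request uses the buffer at a time.
    public int Process(string text, BufferConsumer consumer)
    {
        lock (gate)
        {
            int needed = Encoding.UTF8.GetByteCount(text);
            if (buffer.Length < needed)
            {
                buffer = new byte[needed]; // grow once, then keep reusing
            }
            int count = Encoding.UTF8.GetBytes(text, 0, text.Length, buffer, 0);
            return consumer(buffer, count);
        }
    }
}
```

The trade-off is the one described above: requests queue up behind the lock, so throughput drops, but at most one large buffer exists at any time.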

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 12 '07 #17
Here's your loaded gun, plenty of rope to hang yourself (I looked at the
C++/CLI function PtrToStringChars):

string s = "This is a test string";

GCHandle ptr = GCHandle.Alloc(s);
byte* pString = *(byte**)GCHandle.ToIntPtr(ptr).ToPointer() +
    System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData;
char* c = (char*)pString;
ptr.Free();

GCHandle pinptr = GCHandle.Alloc(s, GCHandleType.Pinned);
pString = (byte*)pinptr.AddrOfPinnedObject().ToPointer();
c = (char*)pString;
pinptr.Free();

Note that the Large Object Heap is a heap, and not subject to garbage
collection, in which case you ought not need to pin the object.

BTW, that OffsetToStringData is 12 (at least in .NET 2.0) but it's a real
property, not a constant, so the value is gotten from your actual runtime
library, not when you compile. Of course the JIT will inline that property
access to nothing anyway.

The PtrToStringChars code is the same in both .NET 1.1 and 2.0, so this
might be almost stable.

I don't know why the offset isn't needed when you use a pinning pointer. I
did notice that the pinning action moves the string though... let me try
with a larger string....

With a one million character string, there is no change in the pointer, and
the AddrOfPinnedObject call still includes the correct offset. Probably
small objects get moved to the Large Object Heap in order to pin them (and
do they come back, maybe once pinned, always pinned until all references
disappear?).

So that's how to get a zero-copy pointer to the internal data of a large
string.

Note that everything here is based on my quick tests and reading vcclr.h and
I may just have gotten lucky; pressure on the GC could move things around
and mess things up, or other bad things could happen.
Feb 13 '07 #18
Ben Voigt <rb*@nospam.nospam> wrote:

<snip>

> Note that the Large Object Heap is a heap, and not subject to garbage
> collection, in which case you ought not need to pin the object.

Just to check what you mean - as far as I'm aware, the LOH *is* garbage
collected, but is *not* compacted.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 13 '07 #19
Thank you Ben, your work helped me.
Anyway, I decided to redesign my method and somehow stream the result
directly into an archived buffer.
It's a hell of a lot of work, as the response method is very delicate, but
it's worth the effort.

Thanks to everyone who answered. Anyway, I'm still puzzled why Microsoft
has not (yet) implemented such a mechanism, which was very useful in
MFC.

Thanks.
Feb 13 '07 #20
nano2k <ad***********@ikonsoft.ro> wrote:
> Thank you Ben, your work helped me.
> Anyway, I decided to redesign my method and somehow stream the result
> directly into an archived buffer.
> It's a hell of a lot of work, as the response method is very delicate, but
> it's worth the effort.
>
> Thanks to everyone who answered. Anyway, I'm still puzzled why Microsoft
> has not (yet) implemented such a mechanism, which was very useful in
> MFC.

Because strings are immutable, and because UTF-16 is rarely the
encoding you want when you're converting to a byte array anyway.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 13 '07 #21
"Ben Voigt" <rb*@nospam.nospam> wrote in message
news:%2****************@TK2MSFTNGP02.phx.gbl...

<snip>

The above assumes that the GC will never compact the LOH. IMO no one ever said that future
versions of the CLR would not attempt to compact the LOH, so I think it's dangerous to
assume no pinning is needed.

Anyway, why make it that complicated when you have the "fixed" statement in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
    fixed (char* phugeString = hugeString) {
        Foo(phugeString);
    }
}

Note that it's easy to corrupt the heap when passing native pointers to unmanaged....
Willy.
Feb 13 '07 #22
> Because strings are immutable, and because UTF-16 is rarely the
> encoding you want when you're converting to a byte array anyway.

I like that strings are immutable and I don't want that to be changed.
Anyway, no one can ignore the performance issues due to this
particularity.
I only pointed out the differences between the old (good, IMO) features
of the CString class and the new (not completely good, IMO) String class.
And once again, I only need read-only access; I can't see any harm
here.
Of course, one may argue that CString's feature would make Strings
"less immutable". I can at least partially agree, but this is only a
theoretical point of view. In practice, my application is affected by
this - let me say it - limitation.
For that reason, I need to redesign a part of it just because of the
way Strings are seen as immutable. No problem, I will do it thinking
that perhaps my initial analysis was a bit twisted.
As for the UTF-16, in my case I have 2 possibilities:
1. To accept the extra processing cost of compressing the unneeded
zero bytes.
2. To skip over each zero byte when compressing.
IMO, either method is welcome because in many cases it will prevent
the memory peak that generates the out of memory exception.

Thanks again to you all. I will redesign the processing method in a
way to stream the response.
Any debate on this subject will always capture my attention.

Feb 13 '07 #23
nano2k <ad***********@ikonsoft.rowrote:
Because strings are immutable, and because UTF-16 is rarely the
encoding you want when you're converting to a byte array anyway.

I like that strings are immutable and I don't want that to be changed.
Anyway, no one can ignore the performance issues due to this
particularity.
There are only any issues *if* the format you'd want the string in is
UTF-16. As has been said before, that means that for many strings using
mostly ASCII characters, pretty much every other byte is going to be 0.
I only pointed out the differences between the old (good IMO) features
of CString class and the new (not completely good, IMO) String class.
And once again, I only need readonly access; I can't see any harm
here.
Yes, you only need readonly access. How do you mark a byte array as
being readonly though?
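For what it's worth, later runtimes do offer something close to a read-only view. The sketch below assumes .NET Core 2.1 or newer, where ReadOnlySpan<T> and MemoryMarshal are available (nothing like it existed in .NET 2.0). The class name is hypothetical. It looks at a string's UTF-16 data as bytes without allocating a byte[], and the compiler rejects writes through the span.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper; assumes a runtime that has ReadOnlySpan<T>.
static class StringBytesView
{
    // Sums the raw UTF-16 bytes of the string in place, with no byte[] copy.
    public static long SumRawBytes(string text)
    {
        ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(text.AsSpan());
        long sum = 0;
        foreach (byte b in bytes)
        {
            sum += b;
        }
        // bytes[0] = 1;   // would not compile: the span is read-only
        return sum;
    }

    // Number of raw bytes visible through the view (2 per UTF-16 code unit).
    public static int RawByteLength(string text) =>
        MemoryMarshal.AsBytes(text.AsSpan()).Length;
}
```

The bytes are still UTF-16, so this only helps when UTF-16 really is the format you want.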
Of course, one may argue that CString's feature would make Strings
"less immutable". I can at least partially agree, but this is only a
theoretical point of view.
No it's not. Making strings mutable has a *huge* effect on any number
of things, not least security. Either something is mutable, or it's
not. If I can't rely on strings being immutable, then I need to take a
copy of one every time I receive it etc.
In practice, my application is affected by
this - let me say it - limitation.
By the sounds of it your application needed rearchitecting so as not to
have huge chunks of data in memory at a time anyway.
For that reason, I need to redesign a part of it just because of the
way Strings are seen as immutable. No problem, I will do it thinking
that perhaps my initial analysis was a bit twisted.
As for the UTF-16 in my case I have 2 possibilities:
1. To assume this extra cost in processing by compressing the unneeded
zero bytes.
That is likely to add to the compressed size, as well.
2. To jump over each zero byte when compressing.
You won't know where to fill in the zero bytes when decompressing then.
IMO, either method is welcome because in many cases it will prevent
the memory peak that generates the out of memory exception.
Other solutions that have been offered here are much better, IMO.
Thanks again to you all. I will redesign the processing method in a
way to stream the response.
And by doing that you'll not only avoid this "limitation", you'll
reduce the overall memory usage from what it would have been even if
you *could* have accessed the data in situ.
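A rough sketch of what streaming could look like, assuming the response can be produced as a sequence of smaller string pieces. That assumption, and the names StreamingCompressor, CompressParts and Decompress, are mine and not from the original application. Each piece is encoded as UTF-8 and pushed straight into a GZipStream, so no byte[] covering the whole text is ever created.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, for illustration only.
static class StreamingCompressor
{
    // Encodes each piece as UTF-8 (no BOM) and writes it into a gzip stream
    // layered over 'output'. The output stream is left open for the caller.
    public static void CompressParts(IEnumerable<string> parts, Stream output)
    {
        using (var gzip = new GZipStream(output, CompressionMode.Compress, leaveOpen: true))
        using (var writer = new StreamWriter(gzip, new UTF8Encoding(false), 4096))
        {
            foreach (string part in parts)
            {
                writer.Write(part);   // only a small buffer is held at a time
            }
        }   // disposing flushes the writer and finishes the gzip stream
    }

    // Reads the compressed data back into a string (used here only to check the round trip).
    public static string Decompress(Stream input)
    {
        using (var gzip = new GZipStream(input, CompressionMode.Decompress, leaveOpen: true))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

In a real service the output stream would be the response stream rather than a MemoryStream, and the pieces would be generated on the fly.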

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Feb 13 '07 #24
I like that strings are immutable and I don't want that to be changed.
Anyway, no one can ignore the performance issues due to this
particularity.

There are only any issues *if* the format you'd want the string in is
UTF-16. As has been said before, that means that for many strings using
mostly ASCII characters, pretty much every other byte is going to be 0.
No, I was talking about performance issues regarding the
multiplication of strings that are inherent with immutable strings.

Yes, you only need readonly access. How do you mark a byte array as
being readonly though?
You can't; I was only talking about a hypothetical situation.
By the sounds of it your application needed rearchitecting so as not to
have huge chunks of data in memory at a time anyway.
100% correct; no problem with that. Still, I think that such a feature
could bring benefits. I can imagine it working just the way the 3 APIs
work together: GlobalAlloc(), GlobalLock() and GlobalUnlock().

Feb 13 '07 #25
Of course, if you want a mutable string, you might also consider
char[] (or possibly StringBuilder) as "string-like" place-holders.
Encoding.GetBytes() and Encoding.GetChars() will work with char[]
happily (maybe not StringBuilder) for the purpose of passing byte[]
to/from compression routines, presumably using UTF8 for the encoding
to get single-byte ASCII behavior while still supporting full
unicode.
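A minimal sketch of that idea, with made-up names: the overloads of Encoding.GetBytes and Encoding.GetChars that take existing arrays let you convert between a char[] and a caller-supplied byte[] without allocating a new array per call.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
static class CharBufferCodec
{
    // Encodes chars[0..charCount) as UTF-8 into 'destination'.
    // Returns the number of bytes written.
    public static int Encode(char[] chars, int charCount, byte[] destination)
    {
        return Encoding.UTF8.GetBytes(chars, 0, charCount, destination, 0);
    }

    // Decodes bytes[0..byteCount) from UTF-8 into 'destination'.
    // Returns the number of chars written.
    public static int Decode(byte[] bytes, int byteCount, char[] destination)
    {
        return Encoding.UTF8.GetChars(bytes, 0, byteCount, destination, 0);
    }
}
```

The destination for Encode can be sized once with Encoding.UTF8.GetMaxByteCount(charCount) and then reused across calls.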

Just some thoughts, possibly already stated (it is a long chain to
re-read...)

Marc
Feb 13 '07 #26
By the sounds of it your application needed rearchitecting so as not to
have huge chunks of data in memory at a time anyway.
[snip]
>
That is likely to add to the compressed size, as well.
Not by much. This mainly just doubles the length of every dictionary entry
within the compressed data, but the backreferences which constitute the bulk
of the compressor output shouldn't grow at all.

Also, compressors *need* to keep a large chunk (on the order of megabytes)
of data in memory at a time for good ratios. If you use a streaming
interface, the compressor probably just ends up allocating buffers to
maintain the history.
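If you want to measure this rather than argue about it, something like the following sketch would do. The names are hypothetical, and gzip is just one compressor; the result will depend on the data and on the algorithm used. It compresses the UTF-8 and the UTF-16 bytes of the same text and returns the compressed lengths so they can be compared.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, for illustration only.
static class CompressedSizeComparison
{
    // Gzips 'data' into memory and returns the size of the compressed output.
    public static long CompressedLength(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(data, 0, data.Length);
            }   // the gzip stream must be closed before the length is final
            return buffer.Length;
        }
    }

    public static long Utf8(string text) => CompressedLength(Encoding.UTF8.GetBytes(text));
    public static long Utf16(string text) => CompressedLength(Encoding.Unicode.GetBytes(text));
}
```

Run it on a representative sample of your own data; a toy input says little about real documents.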
Feb 13 '07 #27
Above assumes that the GC would never compact the LOH, IMO no-one ever
said that future versions of the CLR would not attempt to compact the LOH,
so I think it's dangerous to assume no pinning is needed.
No, if you use GCHandle with the pinning option you have zero-copy, at least
in .NET 2.0, and gc safety in any version.
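A minimal sketch of the GCHandle approach, using only safe code. The class and method names are made up. The string is pinned, one UTF-16 code unit is read through the raw address with Marshal.ReadInt16, and the handle is freed again.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
static class PinnedStringReader
{
    // Pins 'text', reads the code unit at 'index' via its raw address, then unpins.
    public static char ReadCharViaPinnedAddress(string text, int index)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (index < 0 || index >= text.Length)
            throw new ArgumentOutOfRangeException(nameof(index));

        GCHandle handle = GCHandle.Alloc(text, GCHandleType.Pinned);
        try
        {
            // For a pinned string this is the address of its first character.
            IntPtr firstChar = handle.AddrOfPinnedObject();
            return unchecked((char)Marshal.ReadInt16(firstChar, index * sizeof(char)));
        }
        finally
        {
            handle.Free();   // always release the handle so the GC can move the string again
        }
    }
}
```

In real interop code the IntPtr would be handed to the native function while the handle is still allocated, and the native side must treat the memory as read-only.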
>
Anyway, why make it that complicated when you have the "fixed" statement
in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
fixed (char* phugeString = hugeString ) {
Foo(phugeString );
}
}
Because there's no documented System.String.explicit operator char*?

In fact, I don't think it's that easy, because Reflector reveals that .NET
doesn't take the address of a string, but of the m_firstChar member which
isn't visible to clients (see System.String.ReplaceCharInPlace,
System.String.SmallCharToUpper).

Ahh, but the C# specification, part 25.6, specifically mentions char* to
string. Looks like it will do the same as the C++ PtrToStringChars and the
C# code I gave, only simpler. However, I'm concerned how the
null-termination guarantee is provided without incurring the overhead of a
copy at least part of the time.
>
Note that it's easy to corrupt the heap when passing native pointers to
unmanaged....
Willy.


Feb 13 '07 #28

"Ben Voigt" <rb*@nospam.nospam> wrote in message
news:u1**************@TK2MSFTNGP06.phx.gbl...
>Above assumes that the GC would never compact the LOH, IMO no-one ever
said that future versions of the CLR would not attempt to compact the
LOH, so I think it's dangerous to assume no pinning is needed.

No, if you use GCHandle with the pinning option you have zero-copy, at
least in .NET 2.0, and gc safety in any version.
>>
Anyway, why make it that complicated when you have the "fixed" statement
in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
fixed (char* phugeString = hugeString ) {
Foo(phugeString );
}
}

Because there's no documented System.String.explicit operator char*?

In fact, I don't think it's that easy, because Reflector reveals that .NET
doesn't take the address of a string, but of the m_firstChar member which
isn't visible to clients (see System.String.ReplaceCharInPlace,
System.String.SmallCharToUpper).

Ahh, but the C# specification, part 25.6, specifically mentions char* to
string. Looks like it will do the same as the C++ PtrToStringChars and
the C# code I gave, only simpler. However, I'm concerned how the
null-termination guarantee is provided without incurring the overhead of a
copy at least part of the time.
Turns out the MSIL uses OffsetToStringData as well, along with some unusual
"string pinned" local variable. Doesn't look like anything that could
produce a copy, but I don't know what magic is invoked by that "string
pinned" type.
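On the null-termination worry: the C# specification states that a char* obtained by fixing a string always points to a null-terminated string. The sketch below, with hypothetical names, leans on that guarantee by walking the pointer until it meets the terminator. It needs unsafe code enabled (/unsafe or AllowUnsafeBlocks), and it assumes the string contains no embedded '\0' characters.

```csharp
using System;

// Hypothetical helper, for illustration only. Compile with unsafe code enabled.
static class FixedStringDemo
{
    // Counts code units up to the terminating '\0' through a fixed pointer.
    public static unsafe int LengthViaTerminator(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        fixed (char* p = text)   // the string stays pinned for the duration of this block
        {
            int n = 0;
            while (p[n] != '\0')
            {
                n++;
            }
            return n;
        }
    }
}
```

For a string without embedded nulls the result equals text.Length.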
>
>>
Note that it's easy to corrupt the heap when passing native pointers to
unmanaged....
Willy.



Feb 13 '07 #29
"Ben Voigt" <rb*@nospam.nospam> wrote in message
news:u9**************@TK2MSFTNGP04.phx.gbl...
"Ben Voigt" <rb*@nospam.nospam> wrote in message
news:u1**************@TK2MSFTNGP06.phx.gbl...
>>Above assumes that the GC would never compact the LOH, IMO no-one ever said that future
versions of the CLR would not attempt to compact the LOH, so I think it's dangerous to
assume no pinning is needed.

No, if you use GCHandle with the pinning option you have zero-copy, at least in .NET 2.0,
and gc safety in any version.
>>>
Anyway, why make it that complicated when you have the "fixed" statement in C#?

[DllImport("somedll")]
unsafe private static extern bool Foo(char* bytes);
...
string hugeString = ............
unsafe {
fixed (char* phugeString = hugeString ) {
Foo(phugeString );
}
}

Because there's no documented System.String.explicit operator char*?

In fact, I don't think it's that easy, because Reflector reveals that .NET doesn't take
the address of a string, but of the m_firstChar member which isn't visible to clients
(see System.String.ReplaceCharInPlace, System.String.SmallCharToUpper).

Ahh, but the C# specification, part 25.6, specifically mentions char* to string. Looks
like it will do the same as the C++ PtrToStringChars and the C# code I gave, only
simpler. However, I'm concerned how the null-termination guarantee is provided without
incurring the overhead of a copy at least part of the time.

Turns out the MSIL uses OffsetToStringData as well, along with some unusual "string
pinned" local variable. Doesn't look like anything that could produce a copy, but I don't
know what magic is invoked by that "string pinned" type.
The "string pinned" is not a local variable; it tells the JIT that it should reserve a
slot in the "global handle" table. This table holds references to objects that can't be
moved until they are removed from it. This is quite handy: the Interop layer then
knows that the parameter's object reference is already pinned, so it doesn't need to take
care of this itself.
If you look at the IL you'll see that the string 'reference' is copied to this location at
the start of the fixed block scope and set to null when the scope ends.
The Global Handle table is part of the LOH.

Willy.


Feb 13 '07 #30
nano2k wrote:
I like that strings are immutable and I don't want that to be changed.
Anyway, no one can ignore the performance issues due to this
particularity.
There are only any issues *if* the format you'd want the string in is
UTF-16. As has been said before, that means that for many strings using
mostly ASCII characters, pretty much every other byte is going to be 0.
No, I was talking about performance issues regarding the
multiplication of strings that are inherent with immutable strings.
Actually, since immutable strings can be freely shared, I believe that
it leads to a reduction in pointless string duplication simply to
establish a private copy.

-- Barry

--
http://barrkel.blogspot.com/
Feb 14 '07 #31
