Bytes IT Community

Decoding strategy

Hello everyone
I've got a little problem choosing the best decoding strategy for a
nasty problem. I have to deal with very large files which contain
text encoded with various encodings. Their length makes loading the
contents of a file into memory in a single run inappropriate. I solved this
problem by implementing memory mapping using P/Invoke, and I load the
contents of the file in chunks. Since the files' contents are in different
encodings, what I really do is map a portion of the file into memory and
then decode that part using System.Text.Encoding. So far, so good,
but it's not difficult to imagine a serious problem with this approach.
Since file processing is not, and also cannot be, sequential, and
furthermore memory mapping limits the offsets at which mapping can take
place, a mapping can "tear" a character apart. How to deal with
this? I thought of implementing a decoder fallback which would check a few
bytes behind the current mapping and would try to substitute unrecognized
chars, but I don't know whether that is feasible. I do not know whether the
decoder might accidentally mistake a broken char for some valid, but
different from expected, character. I guess it depends on the encoding
used. What do you think?
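[Editor's note: to make the "torn character" failure mode concrete, here is a minimal sketch — in Python rather than the C# used in this thread, since the byte-level behaviour is identical. Splitting a UTF-8 byte stream at an arbitrary chunk boundary leaves a partial sequence at the edge, and decoding each chunk independently mangles the torn character.]

```python
# 'é' encodes as two bytes in UTF-8: 0xC3 0xA9
data = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9'

# A chunk boundary at byte offset 4 "tears" the character apart:
chunk_a, chunk_b = data[:4], data[4:]

# Each piece now ends/starts with an incomplete sequence, so decoding
# the chunks independently yields replacement characters.
print(chunk_a.decode("utf-8", errors="replace"))   # caf\ufffd
print(chunk_b.decode("utf-8", errors="replace"))   # \ufffd
```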

Oct 9 '06 #1
25 Replies


I would use a FileStream instance to read the file. The FileStream class
supports random access to files, allowing you to jump around in the file.
You can read as little or as much as you want into memory when you need to.

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Shooter
http://unclechutney.blogspot.com

A man, a plan, a canal, a palindrome that has.. oh, never mind.


Oct 10 '06 #2


Kevin Spencer wrote:
> I would use a FileStream instance to read the file. The FileStream class
> supports random access to files, allowing you to jump around in the file.
> You can read as little or as much as you want into memory when you need to.
Hello Kevin
Thanks for the reply.
I didn't test performance with FileStream, but maybe you can confirm -
does FileStream cache the contents of the file in memory? I think there is
a slight speedup when using memory mapping in that I do not have to hit
the disk all the time. In my solution I simply open a mapping over the whole
file and create views as needed. Anyway, let's say I did it using
FileStream; I could read some bytes from it, but I would still face the same
problem - how to interpret the first bytes I have read, whether they are
the beginning of a character, or maybe the end of the "previous" character?

Oct 10 '06 #3

Hi Marcin,

I need a little clarification. You have multiple files where each file
could use a different encoding OR you have multiple files where WITHIN each
file multiple encodings are used?

I'm also confused by your reference to a character "tear". If you could
explain that reference, I would find it helpful.

Thanks,

Kim Greenlee
--
digipede - Many legs make light work.
Grid computing for the real world.
http://www.digipede.net
http://krgreenlee.blogspot.net

Oct 10 '06 #4

No, the FileStream is the .Net equivalent of a FILE pointer (in a sense). It
is positioned and reads from the file according to your code. You must
create a buffer for it to read into. That buffer can be used to read
portions of the file, and used repeatedly. See
http://msdn2.microsoft.com/en-us/library/ms256203.aspx for more detailed
information.

--
HTH,

Kevin Spencer
Microsoft MVP


Oct 10 '06 #5

<ma**************@gmail.com> wrote in message
news:11**********************@i42g2000cwa.googlegroups.com...
> I didn't test performance with FileStream, but maybe you can confirm -
> does FileStream cache the contents of the file in memory?
FileStream does buffer, which is in a sense a kind of caching. You can
specify the buffer size when you create the FileStream.
> I think there is
> a slight speedup when using memory mapping in that I do not have to hit
> the disk all the time.
IMHO, the two major benefits to memory mapping are 1) convenience (as long
as your file access fits within the addressable space available to you), and
2) minimal and efficient virtual memory usage (the physical memory storage
of the data can be backed by the file itself, rather than using up swap file
space).

Any i/o speed advantage you can get with memory mapping, you can get with
normal file i/o using appropriate techniques.
> In my solution I simply open a mapping over the whole
> file and create views as needed. Anyway, let's say I did it using
> FileStream; I can read some bytes from it, but I still face the same
> problem - how to interpret the first bytes I have read, whether they are
> the beginning of a character, or maybe the end of the "previous" character?
I'm not entirely sure I understand the question. Even using a memory mapped
file, if you jump into a random location in the middle, you can't tell
whether you're at the beginning of a new character or in the middle of one.
You need some point of reference to tell the difference.

If the file is entirely made up of contiguous Unicode characters, and thus
each character always starts on an even offset from the start of the file,
then that's one easy way to tell when you are at the beginning or middle of
a character. If that's the case though, then you could easily preserve that
characteristic even reading the file using FileStream.
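[Editor's note: the even-offset observation can be sketched as follows — a hypothetical illustration in Python, though the check applies to any run of contiguous UTF-16LE data. If the file is nothing but contiguous UTF-16 code units starting at offset 0, every even byte offset is a valid decode boundary.]

```python
data = "hi!".encode("utf-16-le")   # each code unit occupies exactly 2 bytes

# With nothing but contiguous UTF-16LE content, any even byte offset
# is a safe place to start decoding a code unit:
units = [data[i:i + 2].decode("utf-16-le") for i in range(0, len(data), 2)]
print(units)   # ['h', 'i', '!']
```

One caveat: surrogate pairs mean an even offset can still land on the second half of a supplementary character, but a lone trailing surrogate (code unit 0xDC00-0xDFFF) is itself detectable, so resynchronizing costs at most one extra code unit.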

On the other hand, if you are dealing with some other multibyte character
set, or it's all Unicode but there's other data that can cause the Unicode
characters to get shifted to odd offsets, then even using memory mapped
files you need to find a good point of reference before you decide whether
you're dealing with the start of a Unicode character.

Basically, I don't see how using the FileStream class versus using memory
mapping alters the underlying issue of determining what the character
boundaries are. You can read sections of the file using FileStream, and as
long as you keep track of what absolute file position those sections come
from, you can always translate the address of a byte from a partial section
back to an absolute file position, giving you the exact same position
information you'd have when using memory mapping.

It *is* true that reading the file into buffers by sections using the
FileStream class, you could wind up with partial data at the beginning or
end of one of these sections. The question there though is not knowing what
you've got (since, as I point out above, you can just as easily determine
that whether using FileStream or memory mapping), but rather how to get back
the other part. To deal with that, you'd need an additional layer of
processing that can piece together the data that straddles read boundaries.

I agree that this is an area in which memory mapped files are more
convenient, but it shouldn't be that hard for you to maintain a small
"workspace" buffer in which this sort of reconstruction can take place. In
the simplest case, it need only be a single "char" in which you pull out one
byte at a time from the buffer read by FileStream and combine them as pairs
into the "char" buffer (that may or may not be efficient, depending on what
level at which you're processing the data...if you have to look at each and
every character anyway, it may not be all that bad).

Pete
Oct 10 '06 #6


Kim Greenlee wrote:
> Hi Marcin,
Hi Kim
Thanks for the reply.
>
> I need a little clarification. You have multiple files where each file
> could use a different encoding OR you have multiple files where WITHIN each
> file multiple encodings are used?
The former. In other words, I do not know in advance what the encoding
of a file is. It can be an encoding whose properties might lead to
"tearing" :-) (see below)
>
> I'm also confused by your reference to a character "tear". And if you could
> explain that reference, I would find it helpful.
Yes, sure. Let's say the file is encoded using UTF-8. Then a single character
can occupy 1, 2, 3 or 4 bytes. Let's say at offset n-1 lies the
beginning of some 3-byte character. Its byte pattern, according to the UTF-8
spec, is then as follows:
1110xxxx 10yyyyyy 10zzzzzz
I read the contents, this way or that, starting at offset n. I cannot, due
to memory mapping constraints, choose that offset freely; it has to
be aligned to some boundary.
1110xxxx | 10yyyyyy 10zzzzzz (| indicates start of mapping)
That's what I call a torn character.
In this particular case, due to UTF-8's properties, it is easy to fix:
no sane decoder can assume that 10yyyyyy is the beginning of a UTF-8
character, so it suffices to read the bytes behind the offset, thus providing
a fallback to the decoder.
So, having said that, my questions are: do all (multibyte, of course)
encodings have that nice property that one can determine that a given byte
is part of a "torn" character, rather than treating it wrongly as the
beginning of another character? And - what is the best way to solve the
problem, in your opinion? I think that implementing a decoder fallback is
quite sane, but I want to know your opinion.
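[Editor's note: the resynchronization described above can be sketched like so — Python is used for illustration, and the function names are made up, but the logic maps directly onto byte probing in C#. UTF-8 continuation bytes always match 10xxxxxx, so scanning backwards from a mapping boundary to the nearest lead byte is unambiguous.]

```python
def is_continuation(b):
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx
    return b & 0xC0 == 0x80

def align_to_char_start(data, offset):
    # Walk backwards (at most 3 steps for valid UTF-8) to the lead byte.
    while offset > 0 and is_continuation(data[offset]):
        offset -= 1
    return offset

data = "a\u20acb".encode("utf-8")       # '€' is 3 bytes: 0xE2 0x82 0xAC
# A mapping forced to start at offset 2 lands mid-character;
# backing up finds the lead byte at offset 1:
print(align_to_char_start(data, 2))     # 1
```

As to the first question: not every multibyte encoding has this property. In Shift-JIS, for instance, the second byte of a double-byte character can itself fall inside the lead-byte ranges (0x81-0x9F, 0xE0-0xEF), so a purely local backward scan cannot reliably tell a trail byte from the start of a new character - the fear voiced later in the thread is well-founded.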

Oct 10 '06 #7

Peter Duniho wrote:
> <ma**************@gmail.com> wrote in message
> news:11**********************@i42g2000cwa.googlegroups.com...
> > I didn't test performance with FileStream, but maybe you can confirm -
> > does FileStream cache the contents of the file in memory?
>
> FileStream does buffer, which is in a sense a kind of caching. You can
> specify the buffer size when you create the FileStream.
> > I think there is
> > a slight speedup when using memory mapping in that I do not have to hit
> > the disk all the time.
>
> IMHO, the two major benefits to memory mapping are 1) convenience (as long
> as your file access fits within the addressable space available to you), and
> 2) minimal and efficient virtual memory usage (the physical memory storage
> of the data can be backed by the file itself, rather than using up swap file
> space).
I agree with you. Especially the second point is what I struggle to
achieve. I think that there is also another advantage, which lies in
explicit access to the "memory buffer". Since I get a pointer (it is unsafe, I
know :-) ) to contiguous memory, I save one copy operation each time I
need to map a portion of the file into memory. Reason being, FileStream, even
though using buffering, does not give me access to it. So to perform
subsequent decoding I have to copy data from the FileStream into a byte array
and pass it to the decoder; with memory mapping, on the other hand, I pass a
pointer to the memory view of the file directly to the decoder.
>
> Any i/o speed advantage you can get with memory mapping, you can get with
> normal file i/o using appropriate techniques.
Not with FileStream I fear.
>
> > In my solution I simply open a mapping over the whole
> > file and create views as needed. Anyway, let's say I did it using
> > FileStream; I can read some bytes from it, but I still face the same
> > problem - how to interpret the first bytes I have read, whether they are
> > the beginning of a character, or maybe the end of the "previous" character?
>
> I'm not entirely sure I understand the question. Even using a memory mapped
> file, if you jump into a random location in the middle, you can't tell
> whether you're at the beginning of a new character or in the middle of one.
> You need some point of reference to tell the difference.
Obviously true. I build a character index for myself, which tells me
approximately where to seek a given character. When opening the file I decode
each block of the file and ask the decoder to tell me how many chars are found
in each and every block of the file. Then I build a data structure like this:
(100, 200, ..., 5000), which means chars 0-99 are in the first block,
100-199 in the second, and so on. Then, when I have to read a string
starting at, let's say, the 250th character, a simple index lookup tells me
that I should start mapping at the 2nd block. After mapping I decode the
contents and calculate the needed offset.
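[Editor's note: the lookup over that cumulative-count structure is a binary search. A sketch in Python for brevity - the names are made up, and the counts are the example figures from the post above.]

```python
import bisect

# Cumulative character counts per allocation block, as described above:
# chars 0-99 in block 0, 100-199 in block 1, and so on.
cum_chars = [100, 200, 300, 400, 500]

def block_for_char(char_index):
    # Index of the first block whose cumulative count exceeds char_index.
    return bisect.bisect_right(cum_chars, char_index)

print(block_for_char(250))   # 2 -> the block holding chars 200-299
```

Binary search keeps the lookup O(log n), which stays negligible even for the roughly eight-thousand-entry index mentioned later in the thread.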
>
> If the file is entirely made up of contiguous Unicode characters, and thus
> each character always starts on an even offset from the start of the file,
> then that's one easy way to tell when you are at the beginning or middle of
> a character. If that's the case though, then you could easily preserve that
> characteristic even reading the file using FileStream.
Yes, but it's not the case
>
> On the other hand, if you are dealing with some other multibyte character
> set, or it's all Unicode but there's other data that can cause the Unicode
> characters to get shifted to odd offsets, then even using memory mapped
> files you need to find a good point of reference before you decide whether
> you're dealing with the start of a Unicode character.
I am using the index which I described above as that "point of reference".
>
> Basically, I don't see how using the FileStream class versus using memory
> mapping alters the underlying issue of determining what the character
> boundaries are. You can read sections of the file using FileStream, and as
> long as you keep track of what absolute file position those sections come
> from, you can always translate the address of a byte from a partial section
> back to an absolute file position, giving you the exact same position
> information you'd have when using memory mapping.
>
> It *is* true that reading the file into buffers by sections using the
> FileStream class, you could wind up with partial data at the beginning or
> end of one of these sections. The question there though is not knowing what
> you've got (since as I point out above, you can just as easily determine
> that whether using FileStream or memory mapping), but rather how to get back
> the other part. To deal with that, you'd need an additional layer of
> processing that can piece together the data that straddles read boundaries.
Yes, I agree. That's why I asked Kevin whether he sees some magical way
by which FileStream would get things right. So I do not think that
using FileStream, or any other i/o strategy for that matter, will help
me with my problem.
>
> I agree that this is an area in which memory mapped files are more
> convenient, but it shouldn't be that hard for you to maintain a small
> "workspace" buffer in which this sort of reconstruction can take place. In
> the simplest case, it need only be a single "char" in which you pull out one
> byte at a time from the buffer read by FileStream and combine them as pairs
> into the "char" buffer (that may or may not be efficient, depending on what
> level at which you're processing the data...if you have to look at each and
> every character anyway, it may not be all that bad).
Right, so here you come to the point where my doubts are born :-)
First of all - what's the best way to create the small buffer? Will a
decoder fallback do, or maybe some other strategy will do better? Or maybe
I screwed up everything and there is a better solution.
And - is it always possible (keeping in mind that some encodings might
not be as nice as the Unicode encodings) to reconstruct the character? I do not
know much about encodings in general, but while pondering this idea
I decided to check a few encodings and see whether I am right. I came
across the Shift-JIS encoding, which, I fear, can mistake a "torn" character
for a different one.
>
> Pete
Thanks for the helpful reply.

Oct 10 '06 #8


Kevin Spencer wrote:
> No, the FileStream is the .Net equivalent of a FILE pointer (in a sense). It
> is positioned and reads from the file according to your code. You must
> create a buffer for it to read into. That buffer can be used to read
> portions of the file, and used repeatedly. See
> http://msdn2.microsoft.com/en-us/library/ms256203.aspx for more detailed
> information.
Hi Kevin
The link you gave me leads to the XSLT reference section.

Oct 10 '06 #9


ma**************@gmail.com wrote:
> Reason being, FileStream, even
> though using buffering, does not give me access to it.
Should be '(...) its buffer'

Oct 10 '06 #10

Sorry! Wrong browser instance. Here you go:

http://msdn2.microsoft.com/en-us/lib...ilestream.aspx

--
HTH,

Kevin Spencer
Microsoft MVP


Oct 10 '06 #11

<ma**************@gmail.com> wrote in message
news:11**********************@e3g2000cwe.googlegroups.com...
> I agree with you. Especially the second point is what I struggle to
> achieve. I think that there is also another advantage, which lies in
> explicit access to the "memory buffer". Since I get a pointer (it is unsafe, I
> know :-) ) to contiguous memory, I save one copy operation each time I
> need to map a portion of the file into memory.
That's true. But since your use of the file is non-trivial, it is likely
that the copying of data from one memory location to another will not
dominate the performance of your program.

In other words, worry about that bridge when you come to it. First step is
to get something that works. :)
> Reason being, FileStream, even
> though using buffering, does not give me access to it.
It doesn't give you direct access, you're right. But merely by reading from
the file in large chunks at a time, even if it does so in a way opaque to
your own code, performance may well be acceptable.

Keep in mind that if you are not reading from the file in a purely
sequential way, even memory mapping the file may or may not buffer in a way
that optimizes your access to the file.
[...]
> > Any i/o speed advantage you can get with memory mapping, you can get with
> > normal file i/o using appropriate techniques.
>
> Not with FileStream I fear.
But your fears might be unfounded. I can't really say for sure one way or
the other without having a full-blown implementation in my hands to look at.
But getting data from the hard disk is going to be a major bottleneck, as
will sifting through it after it's been safely stored in memory. As long as
that data has been buffered somewhere, it may not really matter that it gets
copied one or two extra times once in memory.
[...]
> > I'm not entirely sure I understand the question. Even using a memory mapped
> > file, if you jump into a random location in the middle, you can't tell
> > whether you're at the beginning of a new character or in the middle of one.
> > You need some point of reference to tell the difference.
>
> Obviously true. I build a character index for myself, which tells me
> approximately where to seek a given character.
How large are these indexes? You might keep in mind that consuming RAM in
the form of an index is likely to interfere with the memory mapped file in
at least a couple of ways: one, by fragmenting your virtual memory space
(thereby limiting the size of the file you can deal with) and two, by
consuming physical RAM to deal with the indexes; you may wind up flushing
file data out of physical RAM sooner than you'd like.

The latter issue is a problem whether you're using memory mapping or not, so
I'm not trying to say this is a significant factor in deciding between the
two. My main point is that the indexes are one thing that may cause more
disk i/o to occur, further reducing the significance of any additional
memory-to-memory data copies.
[...]
> Yes, I agree. That's why I asked Kevin whether he sees some magical way
> by which FileStream would get things right. So I do not think that
> using FileStream, or any other i/o strategy for that matter, will help
> me with my problem
Well, one advantage of using the FileStream class is that since you need to
do more explicit handling of the file i/o, it gives you an opportunity to
address the issue you're asking about.

That said, it seems to me that in terms of the specific question you're
asking, memory mapped file i/o is the best solution. It has its
limitations, as you've already pointed out, but if you can live with those
limitations then it's a good solution.

However, that's not how I interpreted the question you asked. My apologies
if I misunderstood, but the way I read it is that you've stated the
limitations of the memory mapped file i/o and are looking for a means around
it. The only way around it is to use more conventional file i/o, in the
form of the FileStream class or something similar.
[...]
> Right, so here you come to the point where my doubts are born :-)
> First of all - what's the best way to create the small buffer - whether a
> decoder fallback, or maybe some other strategy will do better.
IMHO, the first thing you should do is try just using a FileStream directly.
Give it some reasonably large buffer size to use (at least a handful of file
blocks, which are usually 4K each), and read data from the file as you need
it. Even if this means reading just a small number of bytes at a time,
between one and four depending on where your encoder is and what data is
being processed.

For example, if your decoder would get "cbDecode" bytes from offset
"ibDecode" (I have no idea how you do this in your code...maybe if you could
post a line or two that demonstrates how you actually access the data, that
would be useful), you could do this instead with a FileStream (let's call it
"fsDecode"):

byte[] rgbDecode = new byte[cbDecode];

fsDecode.Seek(ibDecode, SeekOrigin.Begin);
fsDecode.Read(rgbDecode, 0, cbDecode);

Then you've got your bytes in the byte array ready for processing. There's
no tearing issue, and most of the time the read will come from memory,
buffered by the FileStream object. The biggest problem here would be the
high overhead from calling Seek and Read over and over. But it's a nice
simple approach. :)

(A side note: you may actually find the BinaryReader class more suitable, as
the FileStream.Read method can in theory actually return fewer bytes than
you ask for, even if you don't reach the end of the file...I left out the
return value checking for simplicity, but you might need to include that if
you don't use BinaryReader. BinaryReader.ReadBytes will always return as
many bytes as you ask for, unless it reaches the end of the file and can't).
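[Editor's note: the short-read caveat generalizes to any stream API. A hedged sketch of the read-exact loop you would wrap around FileStream.Read if you skip BinaryReader - Python for illustration, and `read_exact` is a made-up helper name.]

```python
import io

def read_exact(stream, count):
    # Loop, because a single read() may legally return fewer bytes than
    # requested -- the same caveat noted above for FileStream.Read.
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:                # end of stream: return what we have
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

stream = io.BytesIO(b"hello world")
stream.seek(6)                       # like fsDecode.Seek(ibDecode, ...)
print(read_exact(stream, 5))         # b'world'
```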

Once you've done that, then you've got your worst-case scenario. That's
likely to be the poorest-performing way to read the file, and if it turns
out to be fast enough, you can just stop right there. :)

If you find that's too slow, then you can accomplish pretty much the same
performance gain you might get from a memory-mapped file (or possibly even
better, depending on what sort of buffering Windows was capable of doing
with your memory-mapped file) by reading the file directly in larger chunks.
If you do that, then yes...you need to worry about the data you're
processing straddling whatever artificial boundary you wind up imposing by
adding the extra layer of buffering in your own code. But that is a
solvable problem (and in fact will be solved in a very similar way to what
the memory-mapped solution has to do behind the scenes for you anyway). If
that last sentence causes you some questions, let me know and I can
elaborate.
> Or maybe
> I screwed up everything and there is a better solution.
> And - is it always possible (keeping in mind that some encodings might
> not be as nice as Unicode encodings) to reconstruct the character?
I don't know. That's a somewhat different question and doesn't have much to
do with the file i/o method you use. I don't have a lot of experience with
multibyte character encodings, but as far as I recall from my limited use of
them, an initial byte always looks different from a subsequent byte within a
given character. So you can always work your way backwards to find an
initial byte and start decoding from there.
> I do not
> know much about encodings in general, but while pondering this idea
> I decided to check a few encodings and see whether I am right. I came
> across the Shift-JIS encoding, which, I fear, can mistake a "torn" character
> for a different one.
I hope that's not the case, but if it is you have that issue whether you use
memory-mapped file i/o or not. Or alternatively, if you think that
memory-mapped file i/o solves that issue, maybe if you explain why it is you
think that, it would help us understand your question better. :)

Pete
Oct 11 '06 #12


Peter Duniho wrote:
> <ma**************@gmail.com> wrote in message
> news:11**********************@e3g2000cwe.googlegroups.com...
> > I agree with you. Especially the second point is what I struggle to
> > achieve. I think that there is also another advantage, which lies in
> > explicit access to the "memory buffer". Since I get a pointer (it is unsafe, I
> > know :-) ) to contiguous memory, I save one copy operation each time I
> > need to map a portion of the file into memory.
>
> That's true. But since your use of the file is non-trivial, it is likely
> that the copying of data from one memory location to another will not
> dominate the performance of your program.
>
> In other words, worry about that bridge when you come to it. First step is
> to get something that works. :)
Well, it works, I mean - the memory mapping solution has been implemented
and works perfectly, at least with one-byte encodings :-)
>
> > Reason being, FileStream, even
> > though using buffering, does not give me access to it.
>
> It doesn't give you direct access, you're right. But merely by reading from
> the file in large chunks at a time, even if it does so in a way opaque to
> your own code, performance may well be acceptable.
That's my mistake - I assumed that memory mapping would be the best way
to do this and implemented it right away. Fortunately the implementation
didn't take much time, 'cause I'd done that stuff before, though not in
managed code. I didn't investigate managed alternatives or measure
their performance. I also haven't abstracted the i/o out very well, so it
may be awkward to replace memory mapping with FileStream and measure
how it performs. I think I suffer from some kind of "premature
optimization" syndrome :-)
>
> Keep in mind that if you are not reading from the file in a purely
> sequential way, even memory mapping the file may or may not buffer in a way
> that optimizes your access to the file.
Well, you can pass hints about your usage of the file to the memory mapping
function, so I think the OS caches it appropriately.
[...]
> > > Any i/o speed advantage you can get with memory mapping, you can get with
> > > normal file i/o using appropriate techniques.
> >
> > Not with FileStream I fear.
>
> But your fears might be unfounded. I can't really say for sure one way or
> the other without having a full-blown implementation in my hands to look at.
> But getting data from the hard disk is going to be a major bottleneck, as
> will sifting through it after it's been safely stored in memory. As long as
> that data has been buffered somewhere, it may not really matter that it gets
> copied one or two extra times once in memory.
A perfect solution would be to solve the potential decoding problems while
keeping memory mapping. I hope it is possible.

> > Obviously true. I build a character index for myself, which tells me
> > approximately where to seek a given character.
>
> How large are these indexes? You might keep in mind that consuming RAM in
> the form of an index is likely to interfere with the memory mapped file in
> at least a couple of ways: one, by fragmenting your virtual memory space
> (thereby limiting the size of the file you can deal with) and two, by
> consuming physical RAM to deal with the indexes; you may wind up flushing
> file data out of physical RAM sooner than you'd like.
The index is an integer array with one entry per allocation block. The size
of a block depends on the machine; on my machine it is 64k. So, assuming an
average file length of about 500 MB, the index contains more or less eight
thousand entries; each entry being an integer, that gives us 32k of memory
occupied by the index. So I do not think it can noticeably degrade
performance.
[...]
> > Yes, I agree. That's why I asked Kevin whether he sees some magical way
> > by which FileStream would get things right. So I do not think that
> > using FileStream, or any other i/o strategy for that matter, will help
> > me with my problem
>
> Well, one advantage of using the FileStream class is that since you need to
> do more explicit handling of the file i/o, it gives you an opportunity to
> address the issue you're asking about.
>
> That said, it seems to me that in terms of the specific question you're
> asking, memory mapped file i/o is the best solution. It has its
> limitations, as you've already pointed out, but if you can live with those
> limitations then it's a good solution.
Yes, the pure advantage of FileStream I see so far is that it enables
file access at any offset, so the tearing problem can be prevented. The
tearing problem arises because you have to map the file at offsets aligned
to allocation block boundaries. But that would not really matter much if I
knew that I could solve the decoding problems reliably.
> However, that's not how I interpreted the question you asked. My apologies
> if I misunderstood, but the way I read it is that you've stated the
> limitations of the memory mapped file i/o and are looking for a means around
> it. The only way around it is to use more conventional file i/o, in the
> form of the FileStream class or something similar.
That's what I wrote, except for the part "looking for a means around".
Well, it depends on what you mean by that, but I'd rather not abandon
memory mapping. So I am not looking for a "means around memory mapping"
but rather: living within memory mapping's walls, how can I solve the
"tearing" problem? As I pointed out, one way is to implement your own
decoder fallback, but then another question arises, which is: is it really
reliable? If it is proven that there is no good solution, then I will
drop memory mapping.

[cut]
>
For example, if your decoder would get "cbDecode" bytes from offset
"ibDecode" (I have no idea how you do this in your code...maybe if you could
post line or two that demonstrates how you actually access the data, that
would be useful), you could do this instead with a FileStream (let's call it
"fsDecode"):

byte[] rgbDecode = new byte[cbDecode];

fsDecode.Seek(ibDecode, SeekOrigin.Begin);
fsDecode.Read(rgbDecode, 0, cbDecode);

Then you've got your bytes in the byte array ready for processing. There's
no tearing issue, and most of the time the read will come from memory,
buffered by the FileStream object. The biggest problem here would be the
high overhead from calling Seek and Read over and over. But it's a nice
simple approach. :)
Certainly it is :-)
That's how I wanted to implement the fallback buffer. Each time I detect a "torn" char I reposition the file pointer, probe bytes backward till I find a valid char, and provide a replacement.
(A side note: you may actually find the BinaryReader class more suitable, as
the FileStream.Read method can in theory actually return fewer bytes than
you ask for, even if you don't reach the end of the file...I left out the
return value checking for simplicity, but you might need to include that if
you don't use BinaryReader. BinaryReader.ReadBytes will always return as
many bytes as you ask for, unless it reaches the end of the file and can't).
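The read loop behind that caveat can be sketched as follows (a minimal sketch; `ReadExactly` is a hypothetical helper name, not a BCL method — it just keeps calling FileStream.Read until the requested count has arrived):

```csharp
using System;
using System.IO;

static class ReadHelper
{
    // FileStream.Read may legally return fewer bytes than requested,
    // so loop until the whole buffer is filled. Hitting EOF before the
    // buffer is full is treated as an error, since the caller asked
    // for exactly cb bytes.
    public static byte[] ReadExactly(FileStream fs, long ib, int cb)
    {
        byte[] rgb = new byte[cb];
        fs.Seek(ib, SeekOrigin.Begin);
        int cbRead = 0;
        while (cbRead < cb)
        {
            int cbThis = fs.Read(rgb, cbRead, cb - cbRead);
            if (cbThis == 0)
                throw new EndOfStreamException("unexpected end of file");
            cbRead += cbThis;
        }
        return rgb;
    }
}
```

This is essentially the loop BinaryReader.ReadBytes runs for you internally.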

Once you've done that, then you've got your worst-case scenario. That's
likely to be the poorest-performing way to read the file, and if it turns
out to be fast enough, you can just stop right there. :)
Yeah, nice solution. Even though the performance hit may be noticeable, if I restrict these operations to fallback times only and extend my index structure to cache "torn" characters, I should not need to execute that code very often. Seems good to me. Yet ... :-( Can I be sure the decoder will not mistake characters?
If you find that's too slow, then you can accomplish pretty much the same
performance gain you might get from a memory-mapped file (or possibly even
better, depending on what sort of buffering Windows was capable of doing
with your memory-mapped file) by reading the file directly in larger chunks.
If you do that, then yes...you need to worry about the data you're
processing straddling whatever artificial boundary you wind up imposing by
adding the extra layer of buffering in your own code. But that is a
solvable problem (and in fact will be solved in a very similar way to what
the memory-mapped solution has to do behind the scenes for you anyway). If
that last sentence causes you some questions, let me know and I can
elaborate.
Well, yes, please. If you are able to show me how to solve that, then I
can mix memory mapping with direct file access at fallback times and be
perfectly happy.
>
Or maybe I screwed everything up and there is a better solution.
And - is it always possible (keeping in mind that some encodings might not be as nice as the Unicode encodings) to reconstruct the character?

I don't know. That's a somewhat different question and doesn't have much to
do with the file i/o method you use. I don't have a lot of experience with
multibyte character encodings, but as far as I recall from my limited use of
them, an initial byte always looks different from a subsequent byte within a
given character. So you can always work your way backwards to find an
initial byte and start decoding from there.
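For what it's worth, that backward walk is trivial in a self-synchronizing encoding such as UTF-8, where every continuation byte has the form 10xxxxxx. The sketch below assumes UTF-8 specifically (`FindCharStart` is an illustrative name); as the discussion below notes, this property does NOT hold for every encoding:

```csharp
using System;

static class Utf8Sync
{
    // UTF-8 only: step backwards over continuation bytes (0x80-0xBF)
    // until reaching a byte that starts a character. Works because no
    // UTF-8 lead byte ever has the 10xxxxxx bit pattern.
    public static int FindCharStart(byte[] data, int offset)
    {
        while (offset > 0 && (data[offset] & 0xC0) == 0x80)
            offset--;
        return offset;
    }
}
```

Encodings like Shift JIS lack this property: a trail byte's value range overlaps the lead byte range, so a backward scan can land on a plausible but wrong boundary.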
After a little afterthought I've found that it is the most significant question. But let me rephrase what you wrote: it is no problem to find characters when reading a byte sequence forward, and every sane encoding must adhere to this in order to be usable. But is it the same case when looking backward?
I do not know much about encodings in general, but while pondering this idea I decided to check a few encodings and see whether I am right. I came across the Shift JIS encoding, which, I fear, can mistake a "torn" character for a different one.

I hope that's not the case, but if it is you have that issue whether you use
memory-mapped file i/o or not. Or alternatively, if you think that
memory-mapped file i/o solves that issue, maybe if you explain why it is you
think that, it would help us understand your question better. :)
It is exactly the opposite :-) I believe that memory mapping causes this problem because of the mapping offset limitations. If you used FileStream then, after some initial playing with the decoder's GetCharCount, you could find exact character boundaries. That's its big advantage over memory mapping, because memory mapping imposes restrictions on mapping offsets.
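One detail worth noting here: for purely *sequential* decoding the framework already handles a torn character, because a single Decoder instance buffers a trailing partial sequence from one GetChars call and completes it on the next. A minimal sketch (UTF-8 chosen only because its byte layout is easy to show; `DecodeInTwoChunks` is an illustrative name, and the same holds for any Encoding.GetDecoder()):

```csharp
using System;
using System.Text;

static class StatefulDecode
{
    // A Decoder carries partial multi-byte sequences across calls, so a
    // character split across two buffers still decodes correctly, as
    // long as the buffers are fed in file order through the same Decoder.
    public static string DecodeInTwoChunks(Encoding enc, byte[] first, byte[] second)
    {
        Decoder dec = enc.GetDecoder();
        char[] buf = new char[enc.GetMaxCharCount(Math.Max(first.Length, second.Length))];
        StringBuilder sb = new StringBuilder();

        int n = dec.GetChars(first, 0, first.Length, buf, 0);   // partial tail held back
        sb.Append(buf, 0, n);
        n = dec.GetChars(second, 0, second.Length, buf, 0);     // ...and completed here
        sb.Append(buf, 0, n);
        return sb.ToString();
    }
}
```

The tearing problem in this thread is precisely that random access breaks the sequential assumption this relies on.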
So, summing up: I think the question reduces to one about encoding characteristics. You showed us a very good solution using FileStream. It can be extended to mix these two approaches, which may be faster, but I still do not know whether it is reliable.
Pete
Oct 11 '06 #13

P: n/a
Just one more thing :-)
For example, if your decoder would get "cbDecode" bytes from offset
"ibDecode" (I have no idea how you do this in your code...maybe if you could
post line or two that demonstrates how you actually access the data, that
would be useful)
Yes, of course, I can post. Unfortunately, I do not have access to the code in question right now; I will post it in a few hours.

Oct 11 '06 #14

P: n/a
<ma**************@gmail.comwrote in message
news:11**********************@e3g2000cwe.googlegro ups.com...
Just one more thing :-)
>For example, if your decoder would get "cbDecode" bytes from offset
"ibDecode" (I have no idea how you do this in your code...maybe if you
could
post line or two that demonstrates how you actually access the data, that
would be useful)

Yes, of course, I can post. Unfortunately, I do not have access to the code in question right now; I will post it in a few hours.
It seems to me that if you need access to the code in order to post the
general idea I'm asking about, you may be answering the question in far more
detail than I was looking for. :)

If .NET supported memory-mapped file i/o, I probably wouldn't even ask the
question. But since it doesn't, and since you're obviously using some kind
of workaround to incorporate memory-mapped file i/o into your program, some
of the specifics are unknowable to us unless you post them. They may not
even be relevant, but it wouldn't hurt to try to clarify that.

Still, I'm not asking for the whole decoder here. Just some general idea of
how you've merged the non-.NET concept of memory-mapped file i/o into a .NET
context.

Pete
Oct 11 '06 #15

P: n/a
<ma**************@gmail.comwrote in message
news:11**********************@c28g2000cwb.googlegr oups.com...
[...]
I didn't investigate managed alternatives nor measure their performance. I also haven't abstracted the i/o out very well, so it may be awkward to replace memory mapping with FileStream and measure how it performs. I think I suffer from some kind of "premature optimization" syndrome :-)
That could be. It's a common enough problem. I find myself occasionally
*paralyzed* by it, when I get stuck trying to decide the most optimal
solution and fail to get any progress toward ANY solution.
>Keep in mind that if you are not reading from the file in a purely
sequential way, even memory mapping the file may or may not buffer in a
way
that optimizes your access to the file.

Well, you can pass hints about your usage of the file to the memory mapping function, so I think the OS caches it appropriately.
It will cache as best as it can. But if you are jumping around the file, the OS simply cannot correctly predict what to buffer for you. This is especially bad when going backwards in the file.

Note that with respect to the hints you can provide, the docs say that best
performance is when you are accessing the file sequentially, and sparsely,
and provide the sequential access hint. That doesn't mean that if you
provide some other hint and are accessing the file differently, you will get
similar performance. :)

The OS doesn't know what you're doing with the file, and it has no way to
predict when you might go backwards in the file. If you are accessing the
file in a sparse manner (as it appears you may be), then you may find that
often when you go backwards, that data hasn't been read yet. Going back
just one byte might incur another disk read.

It's hard to say for sure without all the details...I'm just pointing out
that these caching issues exist whether you're using a memory-mapped file or
just reading normally.
[...]
Yes, the pure advantage of FileStream I see so far is that it enables file access at any offset, so the tearing problem can be prevented. The tearing problem arises because you have to map the file at offsets aligned to an allocation block boundary. But that would not really matter much if I knew that I could solve the decoding problems reliably.
From this, I think that I may still not fully understand the question.

It is true that a file must be mapped to an aligned memory address. But
this should only affect the virtual address used to locate the file in the
virtual address space. That is, the first byte of the file will be on an
aligned address, but the rest of the file is contiguous from there.

Likewise, even if you are mapping sections of the file into different
virtual address locations (why? is this to allow more of the file to be
mapped in spite of virtual address space fragmentation?), resulting in those
sections of the file having to each be aligned, you can still access the
virtual address for the data in a byte-wise fashion.

All that the alignment requirement affects is where the data winds up in
virtual memory. I don't see how it affects your access of the data.

Now, that said, there do seem to be one or two different issues related to
this. I say "one or two" because they are either the exact same problem or
not, depending on how you look at it. :) That is, the inability to map the
entire file to a single contiguous section of your virtual address space at
once. This causes the secondary problem that you may have to jump from one
spot in virtual memory to another as you traverse (forward or backward) the
data in the file. It also may limit how much data you can have mapped at
once.

Judging from this:
That's what I wrote, except for the part "looking for a means around". Well, it depends on what you mean by this, but I'd rather not abandon memory mapping. So I am not looking for "means around memory mapping" but: living within memory mapping's walls, how can I solve the "tearing" problem?
I'm guessing that both of those issues are really just the same problem for
you. That is, that you have to address the file using non-contiguous
pointers.
Certainly it is :-)
That's how I wanted to implement the fallback buffer. Each time I detect a "torn" char I reposition the file pointer, probe bytes backward till I find a valid char, and provide a replacement.
Perhaps you could clarify under what situation you "detect a 'torn' char".
That is, it's unclear to me whether you are referring to simply jumping into
an offset that's not the start of a character, or if this somehow
specifically relates to the sectioning of the file caused by your memory
mapped i/o.

The former would be an issue even if you could map the entire file to a
single contiguous virtual address range. The latter is obviously only an
issue because of the sectioning of the file. I'm confused as to which it
is.
[...]
Yeah, nice solution. Even though the performance hit may be noticeable, if I restrict these operations to fallback times only and extend my index structure to cache "torn" characters, I should not need to execute that code very often. Seems good to me. Yet ... :-( Can I be sure the decoder will not mistake characters?
Well, as I mentioned...I can't help you with that question. :) That
depends on the nature of the data you're decoding, and I don't know enough
to be able to answer that.
>If you find that's too slow, then you can accomplish pretty much the same
performance gain you might get from a memory-mapped file (or possibly
even
better, depending on what sort of buffering Windows was capable of doing
with your memory-mapped file) by reading the file directly in larger
chunks.
If you do that, then yes...you need to worry about the data you're
processing straddling whatever artificial boundary you wind up imposing
by
adding the extra layer of buffering in your own code. But that is a
solvable problem (and in fact will be solved in a very similar way to
what
the memory-mapped solution has to do behind the scenes for you anyway).
If
that last sentence causes you some questions, let me know and I can
elaborate.

Well, yes, please. If you are able to show me how to solve that, then I
can mix memory mapping with direct file access at fallback times and be
perfectly happy.
Okay, let's see if this makes sense. First, keep in mind that my comment
was assuming a general solution to the file i/o problem. I think you should
be able to apply it as a "fallback" solution, but it may or may not be
better than just falling back to reading a few bytes at a time if that's
your approach.

Also, keep in mind that this is just a simple example of what I mean. I
don't mean to imply that this would be the best implementation...just that
it's a sample of the general idea.

Finally, keep in mind that this doesn't remove the issue of sectioning the
file. It just abstracts it out a bit. Since I didn't realize before that
you may be trying to get rid of the whole issue of having to jump your data
reads from one block of memory to another, I proposed the idea not realizing
it may be exactly the opposite of what you're looking for. :(

That said:

What I meant was that you can read from the file a few blocks at a time,
keeping that buffer centered on where you are currently accessing. You'll
need to keep track of:

-- current file offset
-- an array of blocks read from the file
-- the file offsets those blocks came from
-- the current block

The general idea is to maintain the array of blocks such that there is an
odd number of blocks, at least three, and they are centered on the current
offset within the file you're reading. Normally, you'll be reading from the
middle block. If you skip over to another block, you drop one block from
the far end of the array, and read another adding it to the near end of the
array.

Basically, you're windowing the file in a fixed set of buffers. If you read
new data asynchronously to your use of the data in the buffers you currently
have, then when you drop a block at one end and fill it for use at the other
end, the file i/o can happen while you're still processing the data that you
do have.

Obviously if you jump to a completely different point in the file, you'll have to wait for the surrounding data to be read, but that's an issue even with memory mapped files or when just reading directly with a FileStream.
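A synchronous skeleton of that windowing scheme might look like this (names, sizes, and the full re-read on re-center are all illustrative; the asynchronous refill and the reuse of overlapping blocks described above are left out for brevity):

```csharp
using System;
using System.IO;

// Keeps an odd number of fixed-size blocks centered on the current file
// offset; reading a byte outside the window re-centers it. A real
// implementation would shift and reuse overlapping blocks, and refill
// the new edge block asynchronously, instead of re-reading everything.
class BlockWindow
{
    readonly FileStream fs;
    readonly int blockSize;
    readonly byte[][] blocks;
    long firstBlock = -1;   // file block index held in blocks[0]; -1 = empty

    public BlockWindow(FileStream fs, int blockSize = 64 * 1024, int blockCount = 3)
    {
        this.fs = fs;
        this.blockSize = blockSize;
        blocks = new byte[blockCount][];
    }

    public byte ByteAt(long fileOffset)
    {
        long block = fileOffset / blockSize;
        if (firstBlock < 0 || block < firstBlock || block >= firstBlock + blocks.Length)
            Recenter(block);
        return blocks[block - firstBlock][fileOffset % blockSize];
    }

    void Recenter(long centerBlock)
    {
        firstBlock = Math.Max(0, centerBlock - blocks.Length / 2);
        for (int i = 0; i < blocks.Length; i++)
        {
            blocks[i] = new byte[blockSize];
            fs.Seek((firstBlock + i) * blockSize, SeekOrigin.Begin);
            fs.Read(blocks[i], 0, blockSize);   // short read past EOF leaves zeros
        }
    }
}
```

Higher-level code addresses bytes by absolute file offset only; which block happens to hold that byte is hidden inside ByteAt.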
[...]
After a little afterthought I've found that it is the most significant question. But let me rephrase what you wrote: it is no problem to find characters when reading a byte sequence forward, and every sane encoding must adhere to this in order to be usable. But is it the same case when looking backward?
I still don't know. :) I suspect that it is, because it was true with the
basic MBCS I've seen. But I also realize that there are a LOT of different
ways to encode text, and some may be context-sensitive.

Some of this really depends on what you mean by "encoding" and "decoding".
The word "encoding" is applied in a variety of ways. Two that could apply
here are the basic idea of text encoding, which mostly just has to do with
the character set, or some actual conversion of data, which has to do with
compressing the data, or translating it into a more portable format (MIME,
for example). I don't even know which of these meanings you're addressing,
making it even harder for me to know the answer. :)
[...]
So, summing up: I think the question reduces to one about encoding characteristics. You showed us a very good solution using FileStream. It can be extended to mix these two approaches, which may be faster, but I still do not know whether it is reliable.
Indeed, that is a question you should probably figure out. Earlier rather
than later. :) Sorry I can't be of more help on that front.

Pete
Oct 11 '06 #16

P: n/a

Peter Duniho napisal(a):
<ma**************@gmail.comwrote in message
news:11**********************@e3g2000cwe.googlegro ups.com...
Just one more thing :-)
For example, if your decoder would get "cbDecode" bytes from offset
"ibDecode" (I have no idea how you do this in your code...maybe if you
could
post line or two that demonstrates how you actually access the data, that
would be useful)
.Yes, of course, I can post. Unfortunately, I do not have access to
code in question right now, I will post it in few hours.

It seems to me that if you need access to the code in order to post the
general idea I'm asking about, you may be answering the question in far more
detail than I was looking for. :)

If .NET supported memory-mapped file i/o, I probably wouldn't even ask the
question. But since it doesn't, and since you're obviously using some kind
of workaround to incorporate memory-mapped file i/o into your program, some
of the specifics are unknowable to us unless you post them. They may not
even be relevant, but it wouldn't hurt to try to clarify that.

Still, I'm not asking for the whole decoder here. Just some general idea of
how you've merged the non-.NET concept of memory-mapped file i/o into a .NET
context.
Hi,
Here it goes. It is no rocket science; I just used P/Invoke to access the WinAPI functions, and that's all. I'm pasting just the relevant methods; if you feel you need something more, let me know:

//this method builds the index of characters
private void BuildCharIndex()
{
    if ( encoding.IsSingleByte ) //single-byte encoding: I assume a one byte - one char correspondence
    {
        //nothing interesting here
    }
    else
    {
        //....
        while ( fileOffset < fileLength )
        {
            ReadPage( blockIndex );
            unsafe
            {
                for ( int i = 0; i < mappedBlocksCount; ++i )
                {
                    charCount += decoder.GetCharCount(
                        (byte*) page.ToPointer() + i * blockSize, blockSize, false );
                    index.Add( charCount );
                }
            }
            //...
        }
    }
}

The page variable references my own SafeHandle descendant for managing mapping handles; I wrote a simple ToPointer method for convenience. ReadPage is responsible for establishing the file mapping. A page represents a range of blocks, each block being 64k.

private void ReadPage(int startBlock)
{
    //...
    int fileOffset = startBlock * blockSize;
    pageLength = Math.Min( fileLength - fileOffset, PAGE_SIZE * blockSize );
    //I wrote a P/Invoke signature for this and additional enums for convenience
    page = NativeMethods.MapViewOfFile( fileMapping, FileViewAccess.Read,
        0, fileOffset, pageLength );
    if ( page.IsInvalid )
        //ThrowExceptionForHR expects an HRESULT, not a raw Win32 error code
        Marshal.ThrowExceptionForHR( Marshal.GetHRForLastWin32Error() );
    //...
}

If you care for signature of MapViewOfFile:

[DllImport( "kernel32.dll", SetLastError = true )]
internal static extern SafeFileMapViewHandle
MapViewOfFile(SafeGenericHandle hFileMappingObject, FileViewAccess
dwDesiredAccess, int dwFileOffsetHigh, int dwFileOffsetLow, int
dwNumberOfBytesToMap);

So, that's how I read the contents. The code is much simplified compared to the original, but I hope it carries the idea:

if ( !IsBlockInMemory( firstBlockIndex ) )
{
    ReadPage( firstBlockIndex );
    CopyCurrentPageToBuffer();
}
//...
int bufferOffset = //here I calculate the needed offset using the index and additional calculations
Marshal.PtrToStringUni( new IntPtr( memoryBuffer.ToInt64() +
    bufferOffset * CHAR_SIZE ), length );

And the method which causes my problems is CopyCurrentPageToBuffer. It reads the mapped portion and decodes it. What is relevant is the line:

unsafe
{
    encoding.GetDecoder().GetChars( (byte*) page.ToPointer(), pageLength /*...*/ );
}
>
Pete
Oct 11 '06 #17

P: n/a
Hi
[...]
It's hard to say for sure without all the details...I'm just pointing out
that these caching issues exist whether you're using a memory-mapped file or
just reading normally.
In general, from what I observed, the most frequent access pattern is random yet with locality. So I start with a random part of the file, mess around not too far away from the beginning of the mapping, and then jump somewhere else. So a memory mapped view is sure to be usable for a while, and I think it pays off to keep it in memory. That characteristic also ensures that the OS cache can be helpful and performance will not suffer from misses/disk reads very often.
>
[...]
Yes, the pure advantage of FileStream I see so far is that it enables file access at any offset, so the tearing problem can be prevented. The tearing problem arises because you have to map the file at offsets aligned to an allocation block boundary. But that would not really matter much if I knew that I could solve the decoding problems reliably.

From this, I think that I may still not fully understand the question.

It is true that a file must be mapped to an aligned memory address. But
this should only affect the virtual address used to locate the file in the
virtual address space. That is, the first byte of the file will be on an
aligned address, but the rest of the file is contiguous from there.
Yes, I know. I was referring to something else; sorry for being unclear. The docs say:

"(..)must specify an offset within the file that matches the memory
allocation granularity of the system, or the function fails. That is,
the offset must be a multiple of the allocation granularity".

I know that this is going to be aligned to something in VM, but I do not care; that is transparent unless you write kernel-mode or, generally, very low level stuff. What I do care about is that I cannot choose the FILE offset at which the mapping starts. And that leads to "tearing".

[...]
>
Judging from this:
That's what I wrote, except for the part "looking for a means around". Well, it depends on what you mean by this, but I'd rather not abandon memory mapping. So I am not looking for "means around memory mapping" but: living within memory mapping's walls, how can I solve the "tearing" problem?

I'm guessing that both of those issues are really just the same problem for
you. That is, that you have to address the file using non-contiguous
pointers.
Mm, now I do not understand :-) The memory I map is guaranteed to be contiguous; it does not span the whole file, but the mapped contents (the current "page" in my code, if you will) have to start at a specific offset in the file - here I see the root of all evil :-)
Certainly it is :-)
That's how I wanted to implement fallback buffer. Each time I detect
"torn" char I reposition file pointer, probe bytes backward till I find
valid char and provide replacement.

Perhaps you could clarify under what situation you "detect a 'torn' char".
That is, it's unclear to me whether you are referring to simply jumping into
an offset that's not the start of a character, or if this somehow
specifically relates to the sectioning of the file caused by your memory
mapped i/o.
Well, yes, it can be thought of as jumping into a character, which, in turn, is related to the sectioning :-) How do I detect it? Hmm, that's an interesting question. I am not sure; if I were to use a DecoderFallback then that detection would happen by means of the decoder itself.
The former would be an issue even if you could map the entire file to a
single contiguous virtual address range. The latter is obviously only an
issue because of the sectioning of the file. I'm confused as to which it
is.
Hope I clarified :-)
[...]
Yeah, nice solution. Even though performance hit may be noticeable, if
I restrict these operations to fallback times only and extend my index
structure to cache "torn" characters I should not need to execute that
code very often. Seems good to me. Yet ... :-( Can I be sure whether
decoder cannot mistake characters?

Well, as I mentioned...I can't help you with that question. :) That
depends on the nature of the data you're decoding, and I don't know enough
to be able to answer that.
The data is simply a plain text file with some human-readable text.
If you find that's too slow, then you can accomplish pretty much the same
performance gain you might get from a memory-mapped file (or possibly
even
better, depending on what sort of buffering Windows was capable of doing
with your memory-mapped file) by reading the file directly in larger
chunks.
If you do that, then yes...you need to worry about the data you're
processing straddling whatever artificial boundary you wind up imposing
by
adding the extra layer of buffering in your own code. But that is a
solvable problem (and in fact will be solved in a very similar way to
what
the memory-mapped solution has to do behind the scenes for you anyway).
If
that last sentence causes you some questions, let me know and I can
elaborate.
Well, yes, please. If you are able to show me how to solve that, then I
can mix memory mapping with direct file access at fallback times and be
perfectly happy.

Okay, let's see if this makes sense. First, keep in mind that my comment
was assuming a general solution to the file i/o problem. I think you should
be able to apply it as a "fallback" solution, but it may or may not be
better than just falling back to reading a few bytes at a time if that's
your approach.
That's what I planned to do: apply it as a "fallback". But I think we somehow disagree on the meaning of "fallback". I used the word in the sense of "decoder fallback" - speaking C#, it is an instance of the DecoderFallback class; speaking more generally, something which provides replacement chars to the decoder when it cannot, for some reason, decode a sequence. I suspect that you used it to mean "another plan". So, I planned to use your solution as part of a DecoderFallback implementation, which will read a few bytes back and try to concatenate them with the bytes from the beginning of the mapping.
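The skeleton of such a fallback looks roughly like this (a sketch only: the backward file probe is elided, and this version just emits U+FFFD where the real implementation would reassemble the torn bytes; class names are illustrative):

```csharp
using System.Text;

// The decoder invokes the fallback whenever it meets a byte sequence it
// cannot decode, such as the torn tail of a mapped page.
class TornCharFallback : DecoderFallback
{
    public override int MaxCharCount { get { return 1; } }

    public override DecoderFallbackBuffer CreateFallbackBuffer()
    {
        return new TornCharFallbackBuffer();
    }
}

class TornCharFallbackBuffer : DecoderFallbackBuffer
{
    char pending;   // '\0' means nothing queued

    public override bool Fallback(byte[] bytesUnknown, int index)
    {
        // A real implementation would combine bytesUnknown with bytes
        // re-read from the file to reconstruct the split character.
        pending = '\uFFFD';
        return true;
    }

    public override char GetNextChar()
    {
        char ch = pending;
        pending = '\0';
        return ch;   // '\0' signals "no more characters"
    }

    public override bool MovePrevious() { return false; }

    public override int Remaining { get { return pending == '\0' ? 0 : 1; } }
}
```

It is wired up via `Encoding.GetEncoding(name, EncoderFallback.ReplacementFallback, new TornCharFallback())`, after which every undecodable sequence routes through Fallback.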

[...]
That said:

What I meant was that you can read from the file a few blocks at a time,
keeping that buffer centered on where you are currently accessing. You'll
need to keep track of:

-- current file offset
-- an array of blocks read from the file
-- the file offsets those blocks came from
-- the current block

The general idea is to maintain the array of blocks such that there is an
odd number of blocks, at least three, and they are centered on the current
offset within the file you're reading. Normally, you'll be reading from the
middle block. If you skip over to another block, you drop one block from
the far end of the array, and read another adding it to the near end of the
array.

Basically, you're windowing the file in a fixed set of buffers. If you read
new data asynchronously to your use of the data in the buffers you currently
have, then when you drop a block at one end and fill it for use at the other
end, the file i/o can happen while you're still processing the data that you
do have.

Obviously if you jump to a completely different point in the file, you'll
have to wait for the surrounding data to be read, but that's an issue even
if memory mapped files or just reading directly with a FileStream.
Well, that's very close to what I have now. Let me specify the details. I read a few "blocks" at a time, namely 4, which is 256kb of data (a block for me is the memory allocation granularity, as it is the smallest addressable part of the file when it comes to memory mapping). I try to adjust the offset a little, so that I always read the whole data I am asked for, and immediate reads in the neighbourhood will not cause remapping, which is close to your idea. But then, how do you know whether the very first byte of the current "window" is the first byte of a character?
[...]
After little afterthought I've found that it is the most significant
question. But let me rephrase what you wrote: it is no problem to find
characters when reading byte sequence forward and every sane encoding
must adhere to this in order to be usable. But is it the same case when
looking backward?

I still don't know. :) I suspect that it is, because it was true with the
basic MBCS I've seen. But I also realize that there are a LOT of different
ways to encode text, and some may be context-sensitive.
:-( That's a pain in the ass for me. If I knew that I could always look back for the missing parts of a single character, then a mix of your solution with memory mapping would be the best scheme.
Some of this really depends on what you mean by "encoding" and "decoding".
The word "encoding" is applied in a variety of ways. Two that could apply
here are the basic idea of text encoding, which mostly just has to do with
the character set, or some actual conversion of data, which has to do with
compressing the data, or translating it into a more portable format (MIME,
for example). I don't even know which of these meanings you're addressing,
making it even harder for me to know the answer. :)
I meant "basic idea of text encoding" :-)
[...]
So, summing up: I think the question reduces to one about encoding characteristics. You showed us a very good solution using FileStream. It can be extended to mix these two approaches, which may be faster, but I still do not know whether it is reliable.

Indeed, that is a question you should probably figure out. Earlier rather
than later. :) Sorry I can't be of more help on that front.
Pete, first of all, thank you for a wonderful discussion; it was really helpful. And I hope you'll add something more after reading the code :-)
Pete
Oct 11 '06 #18

P: n/a
<ma**************@gmail.comwrote in message
news:11**********************@b28g2000cwb.googlegr oups.com...
[...]
So, memory mapped view is sure too be usable for a while, so I
think it pays off too keep that in memory. That characteristic also
ensures me that OS cache can be helpful and performance will not suffer
from misses/disk reads very often.
That's well and good. However, those characteristics assist in ensuring
that the file data is cached when using other forms of i/o as well,
including using a FileStream. The benefit is not unique to memory mapped
file i/o.
Yes, I know. I was referring to something else; sorry for being unclear. The docs say:

"(..)must specify an offset within the file that matches the memory
allocation granularity of the system, or the function fails. That is,
the offset must be a multiple of the allocation granularity".

I know that this is going to be aligned to something in VM, but I do not care; that is transparent unless you write kernel-mode or, generally, very low level stuff. What I do care about is that I cannot choose the FILE offset at which the mapping starts. And that leads to "tearing".
Okay, I think I understand better what you meant. I'm going to snip a bunch
of stuff here, and hopefully jump to the core of the issue...
[...]
Well, that's very close to what I have now. Let me specify the details. I read a few "blocks" at a time, namely 4, which is 256kb of data (a block for me is the memory allocation granularity, as it is the smallest addressable part of the file when it comes to memory mapping). I try to adjust the offset a little, so that I always read the whole data I am asked for, and immediate reads in the neighbourhood will not cause remapping, which is close to your idea. But then, how do you know whether the very first byte of the current "window" is the first byte of a character?
Thanks. The code you posted helps me understand better what's going on.

In fact, as near as I can tell, you are using memory mapping in practically
the same way as my proposed multiple-buffer solution deals with things.
That is, you're windowing the file with memory mapping the same way I'm
doing it with the buffers.

Here's a dumb question: is there any particular reason you're NOT mapping
the entire file at once? I've mentioned the possibility in previous
messages, making assumptions that you have your reasons for not doing so.
But if you could, all of these issues just go away. Are you genuinely
concerned that you won't have enough contiguous virtual address space to map
the whole file?

Anyway, for the moment let's assume that you can only map a portion of the
file at a time...

Depending on what the actual performance is, it seems to me that either
method would be the correct solution. I suspect that the answer is to
simply do the memory mapping a little differently, but I don't have enough
experience with memory mapped files to know for sure.

Specifically: what if you modified your code that maps the file, so that it
maps a range *around* the starting point, the way I suggested with the
buffers? At certain points (perhaps only when you got right to the very
edge and attempted to read a byte outside your mapped range), you would
remap the file, shifting the window so that the bytes you want to deal with
are within the mapped range.

When you index the data, I would recommend the high-level code using an
index relative to the file beginning. That is, your index is just the file
offset for the data (note that I'm not using the word "index" to relate to
the broader index you calculate for the file data...I just mean a way to
identify which byte you're working on at the moment). Then you translate
that to the actual offset within the mapped range as necessary. That way,
you can be changing the mapped range on the fly without affecting how the
higher-level code that actually processes the data works.

I believe that performance should be fine doing this. When you remap the
file, most of the file should still be in physical RAM and I suspect the OS
will correctly reattach the newly mapped range to the portions of the range
that are already resident in RAM. Only the newly mapped portions of the
file should need to be read.
>[...]
:-( That's a pain in the ass for me. If I knew that I could always look
back for the missing parts of a single character, then a mix of your solution
with memory mapping would be the best scheme
Well, at some point you need to come up with some mechanism for finding the
beginning of a valid character. :) How you access the file might make this
easier or harder, but the problem exists even if you can map the entire file
at once. Sorry I can't be more helpful on that front. I agree, that part
actually seems to be the "hard part" of this problem, in spite of all the
space we've consumed discussing the file i/o part. :)

Pete
Oct 11 '06 #19

P: n/a

Peter Duniho wrote:
<ma**************@gmail.com> wrote in message
news:11**********************@b28g2000cwb.googlegroups.com...
[...]
So, a memory-mapped view is sure to be usable for a while, so I
think it pays off to keep that in memory. That characteristic also
ensures me that OS cache can be helpful and performance will not suffer
from misses/disk reads very often.

That's well and good. However, those characteristics assist in ensuring
that the file data is cached when using other forms of i/o as well,
including using a FileStream. The benefit is not unique to memory mapped
file i/o.
I see. I am on the winning side though, because I eliminate unnecessary
in-memory copying. But, agreed, that may not be much overall.
Yes, I know. I was referring to something else. Sorry for being
unclear. Docs say:

"(..)must specify an offset within the file that matches the memory
allocation granularity of the system, or the function fails. That is,
the offset must be a multiple of the allocation granularity".

I know that this is going to be aligned to something in VM, but I do not
care; this is transparent unless you write kernel-mode stuff or,
generally, very low-level stuff. What I do care about is that I cannot
choose the FILE offset at which the mapping starts. And that leads to "tearing"

Okay, I think I understand better what you meant. I'm going to snip a bunch
of stuff here, and hopefully jump to the core of the issue...
[...]
Well, that's very close to what I have now. Let me specify the details.
I read a few "blocks" at a time, namely 4, which is 256 KB of data (a block
for me is the memory allocation granularity, as it is the smallest
addressable part of the file when it comes to memory mapping). I try to
adjust the offset a little, so that I always read all the data I am
asked for, and immediate reads in the neighbourhood will not cause
remapping, which is close to your idea. But then, how do you know
whether the very first byte of the current "window" is the first byte of a
character?

Thanks. The code you posted helps me understand better what's going on.

In fact, as near as I can tell, you are using memory mapping in practically
the same way as my proposed multiple-buffer solution deals with things.
That is, you're windowing the file with memory mapping the same way I'm
doing it with the buffers.

Here's a dumb question: is there any particular reason you're NOT mapping
the entire file at once? I've mentioned the possibility in previous
messages, making assumptions that you have your reasons for not doing so.
But if you could, all of these issues just go away. Are you genuinely
concerned that you won't have enough contiguous virtual address space to map
the whole file?
Well, there are two issues involved, and I do not know which one you are
referring to. Let me explain. Mapping is actually a two-step process:
first of all you reserve VM for the mapping, and then you commit, which
results in bringing the contents of the file into memory. So, when it
comes to the reservation step, I map the entire file at once; the code I
pasted does not show this step. What is shown is the commitment step, and
I commit only a small portion of the reserved memory at once. This app is
not going to be a server app running on high-end machines with many gigs
of RAM. It is rather intended to be a desktop app. So, I do not want to
reserve like 500 MB of memory for just one file, because it could easily
cause constant swapping and overall performance degradation on the user's machine.
>
Anyway, for the moment let's assume that you can only map a portion of the
file at a time...

Depending on what the actual performance is, it seems to me that either
method would be the correct solution. I suspect that the answer is to
simply do the memory mapping a little differently, but I don't have enough
experience with memory mapped files to know for sure.

Specifically: what if you modified your code that maps the file, so that it
maps a range *around* the starting point, the way I suggested with the
buffers? At certain points (perhaps only when you got right to the very
edge and attempted to read a byte outside your mapped range), you would
remap the file, shifting the window so that the bytes you want to deal with
are within the mapped range.
It does. Well, I am sorry, because I stripped this code of the mapping
logic, but when you see something like firstBufferIndex it is, in almost
all cases, a carefully computed index of a portion which contains the
requested data but also its neighbourhood, so that near "jumps" should not
cause remapping. Actually the user may as well use enumerated access; in
that case I know in advance that data is going to be read forward, and
then I can, if I must, map from where the previous mapping ends.
When you index the data, I would recommend the high-level code using an
index relative to the file beginning. That is, your index is just the file
offset for the data (note that I'm not using the word "index" to relate to
the broader index you calculate for the file data...I just mean a way to
identify which byte you're working on at the moment). Then you translate
that to the actual offset within the mapped range as necessary. That way,
you can be changing the mapped range on the fly without affecting how the
higher-level code that actually processes the data works.
Well, that is not going to work for me, unfortunately. The interfaces I
have to implement imply that data access uses "string coordinates" - so
the client code specifies "I want the 5th char", not the 5th byte, and
given that encoding hell I would not be able to compute that easily, so I
decided to use only "string coordinates".
I believe that performance should be fine doing this. When you remap the
file, most of the file should still be in physical RAM and I suspect the OS
will correctly reattach the newly mapped range to the portions of the range
that are already resident in RAM. Only the newly mapped portions of the
file should need to be read.
Yeah, I think so too
[...]
:-( That's a pain in the ass for me. If I knew that I could always look
back for the missing parts of a single character, then a mix of your solution
with memory mapping would be the best scheme

Well, at some point you need to come up with some mechanism for finding the
beginning of a valid character. :) How you access the file might make this
easier or harder, but the problem exists even if you can map the entire file
at once. Sorry I can't be more helpful on that front. I agree, that part
actually seems to be the "hard part" of this problem, in spite of all the
space we've consumed discussing the file i/o part. :)
Yes, this problem vanishes when you are able to map AND decode the entire
file at once. But that's overkill, I suppose.
So, I'll try to implement a DecoderFallback, with nothing more than HOPE
that it will always be able to do its job :-)
Thank you.
Pete
Oct 11 '06 #20

P: n/a
<ma**************@gmail.com> wrote in message
news:11*********************@k70g2000cwa.googlegroups.com...
[...]
>Here's a dumb question: is there any particular reason you're NOT mapping
the entire file at once? I've mentioned the possibility in previous
messages, making assumptions that you have your reasons for not doing so.
But if you could, all of these issues just go away. Are you genuinely
concerned that you won't have enough contiguous virtual address space to
map
the whole file?

Well, there are two issues involved, and I do not know which one you are
referring to. Let me explain. Mapping is actually a two-step process:
first of all you reserve VM for the mapping, and then you commit, which
results in bringing the contents of the file into memory.
That's not the process of memory-mapped file i/o I'm familiar with. That
is, while I know you can use MapViewOfFileEx() to provide a specific virtual
address at which to map the file, this isn't necessary, nor does it to my
knowledge require an explicit commit of the entire file.

The usual method of memory-mapping that I use is this:

* open the file (CreateFile)
* create the file mapping (CreateFileMapping)
* assign virtual address space to file mapping (MapViewOfFile)

When MapViewOfFile returns, the code now has a virtual address that
represents the beginning of the data of the file. Physical RAM is committed
only as the data is actually accessed, and can be reclaimed through the
usual page aging process (older pages get tossed as needed if something else
needs physical RAM that's not available).
So, when it comes to
reservation step, I map entire file at once, code I pasted does not
show this step. What is shown is the commitment step, and I commit only
small portion of reserved memory at once.
The code you posted calls only MapViewOfFile. This doesn't reserve any
physical RAM for the data. It just reserves room in the virtual address
space for it.
This app is not going to be
server app, running on high end machines with many gigs of ram. It is
rather intended to be desktop app. So, I do not want to reserve like
500 MB of memory for just one file because it could easily cause
constant swapping and overall performance degradation on user machine.
Negative. That's one of the nice benefits of memory mapping: you can map an
entire file, even a large one, and use only the physical RAM required to
process the parts you're looking at. In addition, because the physical RAM
being used is backed by the mapped file, it doesn't get swapped out to the
swap file...the file itself can be used for the backing store (this doesn't
necessarily help the physical RAM side of things, but it does ease the
pressure on the swap file itself).

There is no reason that I can think of that would cause mapping a large file
into virtual address space to cause any more swapping than processing that
file would cause in any case. The OS certainly does not read all 500MB of a
mapped 500MB file into physical RAM just because you've mapped the file.
[...]
>Specifically: what if you modified your code that maps the file, so that
it
maps a range *around* the starting point, the way I suggested with the
buffers? At certain points (perhaps only when you got right to the very
edge and attempted to read a byte outside your mapped range), you would
remap the file, shifting the window so that the bytes you want to deal
with
are within the mapped range.

It does. Well, I am sorry, because I stripped this code of the mapping
logic, but when you see something like firstBufferIndex it is, in almost
all cases, a carefully computed index of a portion which contains the
requested data but also its neighbourhood, so that near "jumps" should not
cause remapping. Actually the user may as well use enumerated access; in
that case I know in advance that data is going to be read forward, and
then I can, if I must, map from where the previous mapping ends.
That's not what I mean. If you were doing what I was suggesting already,
then the only issue remaining for you would be figuring out when you need to
back up in the data. The actual backing up would be trivial...you'd just
decrement your pointer and read the byte you want to read. You would have
moments when the mapped section of the file would have to change, but that
would be a momentary diversion and you'd get right back to just reading the
bytes from the mapped address space.
>[...] Then you translate
that to the actual offset within the mapped range as necessary. That
way,
you can be changing the mapped range on the fly without affecting how the
higher-level code that actually processes the data works.

Well, that is not going to work for me unfortunately. Interfaces I have
to implement imply that data access uses "string coordinates" - so
client code specifies - I want 5th char, not 5th byte, and reckoning
that encoding-hell I would not be able to compute that easily, so I
decided to use only "string coordinates".
I don't think you got my meaning. I don't mean that the highest level of
your code has to use a byte offset within the file. Just that the decoder
part need not concern itself with anything other than the byte offset. As
it read bytes, it would ask the file mapping layer of your code for a byte
offset within the file, and the file mapping layer would then translate that
into an offset within the mapped view you're using.

That said, so far I haven't seen an indication that you actually need to be
mapping sections of the file. You seem to be concerned about committing too
much physical RAM at once to the mapping, but unless you're doing something
really odd that you haven't posted in code, your concern is unfounded.

There are reasons that you might not be able to map an entire file into your
virtual address space, but 500MB ought to be within the usual limitations.
It seems to me that you should look at just mapping the entire file all at
once, and if you run into problems with that, then start worrying about
windowing the file.

The reason you might not be able to map the whole file at once is that you
don't have a contiguous range of virtual address space large enough for the
file. That can happen for two reasons: insufficient virtual address space
left or fragmented virtual address space. How much virtual address space
you might have will vary, but even the theoretical 2GB maximum (and of
course, this never comes close to being available) is smaller than some
files. Fragmentation is harder to predict, and could limit your available
virtual address space to something significantly smaller than the actual
virtual address space left. But IMHO, if 500MB is a typical file size for
you, you ought to be able to map that without problems.

Pete
Oct 12 '06 #21

P: n/a

Peter Duniho wrote:
<ma**************@gmail.com> wrote in message
news:11*********************@k70g2000cwa.googlegroups.com...
[...]
Here's a dumb question: is there any particular reason you're NOT mapping
the entire file at once? I've mentioned the possibility in previous
messages, making assumptions that you have your reasons for not doing so.
But if you could, all of these issues just go away. Are you genuinely
concerned that you won't have enough contiguous virtual address space to
map
the whole file?
Well, there are two issues involved, and I do not know which one you are
referring to. Let me explain. Mapping is actually a two-step process:
first of all you reserve VM for the mapping, and then you commit, which
results in bringing the contents of the file into memory.

That's not the process of memory-mapped file i/o I'm familiar with. That
is, while I know you can use MapViewOfFileEx() to provide a specific virtual
address at which to map the file, this isn't necessary, nor does it to my
knowledge require an explicit commit of the entire file.

The usual method of memory-mapping that I use is this:

* open the file (CreateFile)
* create the file mapping (CreateFileMapping)
* assign virtual address space to file mapping (MapViewOfFile)
That's the same, but under different names. CreateFileMapping reserves
the VM range. It is not yet committed, and you pay almost no
resource-usage/performance price. MapViewOfFile commits some part of the
previously reserved VM and brings in the contents of the file (maybe
lazily, I don't know for sure)
When MapViewOfFile returns, the code now has a virtual address that
represents the beginning of the data of the file. Physical RAM is committed
only as the data is actually accessed, and can be reclaimed through the
usual page aging process (older pages get tossed as needed if something else
needs physical RAM that's not available).
So, when it comes to
reservation step, I map entire file at once, code I pasted does not
show this step. What is shown is the commitment step, and I commit only
small portion of reserved memory at once.

The code you posted calls only MapViewOfFile. This doesn't reserve any
physical RAM for the data. It just reserves room in the virtual address
space for it.
Well, actually, if I understand the docs correctly, CreateFileMapping
reserves a virtual memory address range and establishes an association
between VM addresses and the file. MapViewOfFile brings the contents of the file into RAM
This app is not going to be
server app, running on high end machines with many gigs of ram. It is
rather intended to be desktop app. So, I do not want to reserve like
500 MB of memory for just one file because it could easily cause
constant swapping and overall performance degradation on user machine.

Negative. That's one of the nice benefits of memory mapping: you can map an
entire file, even a large one, and use only the physical RAM required to
process the parts you're looking at. In addition, because the physical RAM
being used is backed by the mapped file, it doesn't get swapped out to the
swap file...the file itself can be used for the backing store (this doesn't
necessarily help the physical RAM side of things, but it does ease the
pressure on the swap file itself).
Positive, with respect to the definition of "swapping" :-) It does not get
swapped to the swap file, true, but it may still be paged out to the mapped
file. So, though you are right that memory pressure is removed from the
page file, you still pay the price of paging if a lot of RAM is
occupied by the file view
There is no reason that I can think of that would cause mapping a large file
into virtual address space to cause any more swapping than processing that
file would cause in any case. The OS certainly does not read all 500MB of a
mapped 500MB file into physical RAM just because you've mapped the file.
I think that when I've established a view, RAM gets occupied. So,
as I said, I map the whole file at once, as the docs assure me that there
is nothing wrong with that, but I restrict myself to moderately sized
views.
[...]
Specifically: what if you modified your code that maps the file, so that
it
maps a range *around* the starting point, the way I suggested with the
buffers? At certain points (perhaps only when you got right to the very
edge and attempted to read a byte outside your mapped range), you would
remap the file, shifting the window so that the bytes you want to deal
with
are within the mapped range.
It does. Well, I am sorry, because I stripped this code of the mapping
logic, but when you see something like firstBufferIndex it is, in almost
all cases, a carefully computed index of a portion which contains the
requested data but also its neighbourhood, so that near "jumps" should not
cause remapping. Actually the user may as well use enumerated access; in
that case I know in advance that data is going to be read forward, and
then I can, if I must, map from where the previous mapping ends.

That's not what I mean. If you were doing what I was suggesting already,
then the only issue remaining for you would be figuring out when you need to
back up in the data. The actual backing up would be trivial...you'd just
decrement your pointer and read the byte you want to read. You would have
moments when the mapped section of the file would have to change, but that
would be a momentary diversion and you'd get right back to just reading the
bytes from the mapped address space.
Sorry Peter, I don't get it then. Could you explain it to me? It seems
to be an interesting idea, but now I feel that I've gotten lost.
[...] Then you translate
that to the actual offset within the mapped range as necessary. That
way,
you can be changing the mapped range on the fly without affecting how the
higher-level code that actually processes the data works.
Well, that is not going to work for me, unfortunately. The interfaces I
have to implement imply that data access uses "string coordinates" - so
the client code specifies "I want the 5th char", not the 5th byte, and
given that encoding hell I would not be able to compute that easily, so I
decided to use only "string coordinates".

I don't think you got my meaning. I don't mean that the highest level of
your code has to use a byte offset within the file. Just that the decoder
part need not concern itself with anything other than the byte offset. As
it read bytes, it would ask the file mapping layer of your code for a byte
offset within the file, and the file mapping layer would then translate that
into an offset within the mapped view you're using.
Isn't that what ReadPage in my code does? It is asked to bring contents
indexed by block offset; it computes the "real" offset and establishes a
view. The decoder part does not even have to think of byte offsets,
because it operates on the current page only, and the pointer to it is
constant while the decoder operates.
That said, so far I haven't seen an indication that you actually need to be
mapping sections of the file. You seem to be concerned about committing too
much physical RAM at once to the mapping, but unless you're doing something
really odd that you haven't posted in code, your concern is unfounded.

There are reasons that you might not be able to map an entire file into your
virtual address space, but 500MB ought to be within the usual limitations.
It seems to me that you should look at just mapping the entire file all at
once, and if you run into problems with that, then start worrying about
windowing the file.

The reason you might not be able to map the whole file at once is that you
don't have a contiguous range of virtual address space large enough for the
file. That can happen for two reasons: insufficient virtual address space
left or fragmented virtual address space. How much virtual address space
you might have will vary, but even the theoretical 2GB maximum (and of
course, this never comes close to being available) is smaller than some
files. Fragmentation is harder to predict, and could limit your available
virtual address space to something significantly smaller than the actual
virtual address space left. But IMHO, if 500MB is a typical file size for
you, you ought to be able to map that without problems.
So, if I understand correctly what you wrote, I am not concerned about
mapping the file at once; I reserve all the VM I will need for one file
(CreateFileMapping). But I am concerned when it comes to the commit
(MapViewOfFile), because that's where memory resources are really
consumed. Am I missing something?
Pete
Oct 12 '06 #22

P: n/a
<ma**************@gmail.com> wrote in message
news:11**********************@b28g2000cwb.googlegroups.com...
That's the same, but under different names. CreateFileMapping reserves
VM range.
That is incorrect. The virtual memory range is not reserved until you call
MapViewOfFile.
[...] It is not yet committed, and you pay almost no
resource-usage/performance price. MapViewOfFile commits some part of the
previously reserved VM and brings in the contents of the file (maybe
lazily, I don't know for sure)
That is also incorrect. MapViewOfFile reserves the virtual address space.
There may be some caching, but otherwise committing the file data to
physical RAM does not occur until a specific portion of the reserved virtual
address space is referenced.

I'm offline right now, otherwise I'd provide a link to the MSDN web site.
However, you can easily look those functions up yourself, and the
documentation explicitly describes the behavior as I do above.

From the documentation for CreateFileMapping:

Creating a file mapping object creates the potential for
mapping a view of the file, but does not map the view. The
MapViewOfFile and MapViewOfFileEx functions map a view of
a file into a process address space

If CreateFileMapping was what allocated virtual address space, it would not
make sense for MapViewOfFileEx to even exist, since the main reason for that
function is to allow the program to provide a specific virtual memory
address at which to map the file.
[...]
Well, actually, if I understand docs correctly, CreateFileMap reserves
virtual memory address range and establishes associacion between VM
addresses and file. MapViewOfFile brings contents of file to RAM
What can I say? You don't understand the docs correctly.
[...]
Positivie with respect to "swapping" definition :-) It does not get
swapped to swap file, true, but still it may be swapped to the mapped
file. So, though you are right that memory pressure is removed from
page file, you still pay the price of swapping if lot of RAM is
occupied by file view
My point is that the amount of data in physical RAM will be related to your
use of that data. The OS will keep the data in physical RAM based on your
access of that data, not based on how much of it there is. This is true
whether you use memory mapping or not.

With either technique, you can limit the *maximum* amount of physical RAM
potentially consumed. Using memory mapping, you do this by mapping only a
small range of the file at a time. Using conventional file i/o, you do this
by limiting your own buffers that are used to store data you've read from
the file.

In either case, the OS has the final say on how much physical RAM is
actually used. Using memory mapping, if there are other demands on physical
RAM, then only a portion of the mapped virtual address space will actually be
resident at any given time. Likewise, using conventional file i/o, only a
portion of your own program buffers will be resident in physical RAM at any
given time.

But memory mapped file i/o will not in and of itself increase memory
swapping. The only way it could do that is if you not only map the entirety
of a very large file in RAM, but you wind up *accessing* the totality of
that file more frequently than you access anything else. In that case, the
OS would be chasing you trying to keep all of the file data you're
referencing resident, at the same time that other stuff needs to be swapped
in and back out.

This is not a typical case, and doesn't seem relevant to your own situation.
In any case, the OS is pretty smart. If your use of a memory mapped file
starts pressuring other users of physical RAM, the OS is not going to bother
trying to keep all of the memory mapped file in RAM. Even better, as long
as you open the file as read-only, you're assured to never have to have the
cost of writing any data back to the disk if a physical page of RAM used by
the file mapping has to get discarded and used for something else.

Your worries about memory mapping the entire file causing some serious
problem with disk swapping are unfounded.
>There is no reason that I can think of that would cause mapping a large
file
into virtual address space to cause any more swapping than processing
that
file would cause in any case. The OS certainly does not read all 500MB
of a
mapped 500MB file into physical RAM just because you've mapped the file.

I think that when I've established a view, RAM gets occupied. So,
as I said, I map the whole file at once, as the docs assure me that there
is nothing wrong with that, but I restrict myself to moderately sized
views.
But it's not true that when you establish a view then RAM gets occupied.
The "view" is an allocation of virtual address space, not physical RAM.
>That's not what I mean. If you were doing what I was suggesting already,
then the only issue remaining for you would be figuring out when you need
to
back up in the data. The actual backing up would be trivial...you'd just
decrement your pointer and read the byte you want to read. You would
have
moments when the mapped section of the file would have to change, but
that
would be a momentary diversion and you'd get right back to just reading
the
bytes from the mapped address space.

Sorry Peter, I don't get it then. Could you explain it to me? It seems
to be an interesting idea, but now I feel that I've gotten lost.
Assume you have some code that attempts to retrieve a byte from a specific
file offset. Assume also that you have some code that translates this into
access from your mapped view of the file. Finally, assume that the
higher-level code is trying to access a byte that is just before the lowest
file offset currently being mapped.

In pseudocode then:

// The desired byte offset from the file
long ibFileOffset;
// This is the mapped range, "Min" inclusive, "Mac" exclusive
long ibMappedMin, ibMappedMac;
// The resulting offset within the mapped range
long ibMappedOffset;

if (ibFileOffset < ibMappedMin || ibFileOffset >= ibMappedMac)
{
// remap file so that ibMappedMin <= ibFileOffset and
// ibFileOffset < ibMappedMac. Don't forget to make sure
// that ibMappedMin and ibMappedMac remain between 0 and
// the total file length.
}

ibMappedOffset = ibFileOffset - ibMappedMin;
return *(pbMappedData + ibMappedOffset);

Basically, in the normal case, all that the code is doing is translating the
file offset to the mapping offset and returning the data at that offset.
When the requested data falls outside the range, you just shift the offset
enough to accommodate the new request for data.

Most likely, you'd try to center the newly-mapped range on the requested file
offset. When you get near the beginning or end of the file, you'll
necessarily wind up at least trimming the mapped range as appropriate
(making it smaller than normal), if not just pinning the range to the
relevant boundary (preserving the total size of the mapping).
Isn't that what ReadPage in my code does? It is asked to bring contents
indexed by block offset; it computes the "real" offset and establishes a
view. The decoder part does not even have to think of byte offsets,
because it operates on the current page only, and the pointer to it is
constant while the decoder operates.
IMHO, there's no reason for the decoder to have to think of pages within the
file. As near as I can tell, that's an arbitrary choice affected by the
implementation of your file i/o. In particular, if I understand correctly
(and maybe I don't), part of the issue of "tearing" that you're worried
about comes about because of the potential for data being read to cross one
of these page boundaries.

The decoder should concern itself only with the entire file; making it think
in pages is why you have the tearing issue in the first place. If you allowed the decoder to simply use an
offset relative to the beginning of the file, then the decoder would never
have to worry about whether the data falls outside the currently mapped
range. The i/o code would take care of that instead, and always return
whatever byte it is the decoder wants to handle.
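Incidentally, there is also a purely decoder-side fix for the "tearing" worry: a stateful decoder (in .NET, System.Text.Decoder obtained from Encoding.GetDecoder(), rather than one-shot Encoding.GetString) buffers the trailing bytes of an incomplete character and finishes it with the next chunk. A small Python sketch of the same idea, using its incremental decoder (the sample text is illustrative):

```python
import codecs

# Polish text whose UTF-8 encoding mixes 1- and 2-byte characters.
text = "Zażółć gęślą jaźń"
data = text.encode("utf-8")

# Decoding fixed-size chunks independently tears characters apart:
# data[:3] ends in the middle of the 2-byte "ż", so
# data[:3].decode("utf-8") raises UnicodeDecodeError.

# A stateful incremental decoder carries the dangling bytes over to
# the next call, so chunk boundaries may fall anywhere:
dec = codecs.getincrementaldecoder("utf-8")()
out = "".join(dec.decode(data[i:i + 3]) for i in range(0, len(data), 3))
out += dec.decode(b"", final=True)   # flush any remaining state
assert out == text
```

The same pattern holds for any multi-byte encoding the files might use, as long as the decoder object is reused across chunks in file order.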

Of course, if you simply map the entire file all at once, the issue becomes
trivial. So this may or may not be a moot point. You don't seem to be
basing your architectural decisions on correct information about how file
mapping works, so once you understand how it actually works, you may find
that all of this "map a subset of the file" stuff becomes irrelevant.
[...]
So, if I understand correctly what you wrote, I am not concerned with
mapping file at once, I reserve all VM I will need for one file
(CreateFileMapping). But I am concerned when it comes to commit
(MapViewOfFile) because that's where memory resources are really
consumed. Am I missing something?
Yes, I think so. See above. :)

Pete
Oct 13 '06 #23


Peter Duniho wrote:
<ma**************@gmail.com> wrote in message
news:11**********************@b28g2000cwb.googlegroups.com...
That's the same, but under different names. CreateFileMapping reserves
VM range.

That is incorrect. The virtual memory range is not reserved until you call
MapViewOfFile.
[...] It is not yet committed, and you pay almost no
resource-usage/performance price. MapViewOfFile commits some part of
previously reserved VM and brings contents of file (maybe lazily, I
don't know for sure)

That is also incorrect. MapViewOfFile reserves the virtual address space.
There may be some caching, but otherwise committing the file data to
physical RAM does not occur until a specific portion of the reserved virtual
address space is referenced.
Methinks that we are giving the same subject different names. Here is a
quotation from MSDN:

the address range is reserved with the function CreateFileMapping until
portions are requested via a call to function MapViewOfFile. This
permits applications to map a large file (it is possible to load a file
1 GB in size in Windows NT) to a specific range of addresses without
having to load the entire file into memory. Instead, portions (views)
of the file can be loaded on demand directly to the reserved address
space.
[...]

Your worries about memory mapping the entire file causing some serious
problem with disk swapping are unfounded.
Yes, it seems that I was wrong. I'll have to rethink the design once again.

[...]
Assume you have some code that attempts to retrieve a byte from a specific
file offset. Assume also that you have some code that translates this into
access from your mapped view of the file. Finally, assume that the
higher-level code is trying to access a byte that is just before the lowest
file offset currently being mapped.

In pseudocode then:

// The desired byte offset from the file
long ibFileOffset;
// This is the mapped range, "Min" inclusive, "Mac" exclusive
long ibMappedMin, ibMappedMac;
// The resulting offset within the mapped range
long ibMappedOffset;

if (ibFileOffset < ibMappedMin || ibFileOffset >= ibMappedMac)
{
// remap file so that ibMappedMin < ibFileOffset and
// ibFileOffset < ibMappedMac. Don't forget to make sure
// that ibMappedMin and ibMappedMac remain between 0 and
// the total file length.
}

ibMappedOffset = ibFileOffset - ibMappedMin;
return *(pbMappedData + ibMappedOffset);

But is there any difference if it turns out that mapping the whole file at
once will do?

[...]
Of course, if you simply map the entire file all at once, the issue becomes
trivial. So this may or may not be a moot point. You don't seem to be
basing your architectural decisions on correct information about how file
mapping works, so maybe understanding correctly how file mapping works you
will find all of this "map a subset of the file" stuff becomes irrelevant.
Yes, and that's the whole point. You are perfectly right about MMF; I
shouldn't have worried about tearing, because I can map the file all at
once and rely on the OS when it comes to swapping.
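The whole-file approach the thread converges on looks like this; the sketch below is Python, where mmap with length 0 maps the entire file (on Windows the equivalent would be CreateFileMapping plus a single MapViewOfFile with dwNumberOfBytesToMap set to 0). The temp file is a stand-in for one of the large text files:

```python
import mmap
import os
import tempfile

# A throwaway temp file standing in for one of the large text files.
fd, path = tempfile.mkstemp()
text = "line one\nline two\n" * 1000
os.write(fd, text.encode("utf-8"))
os.close(fd)

with open(path, "rb") as f:
    # length=0 maps the whole file.  Pages are faulted in lazily by
    # the OS, so this does not read the file into RAM up front.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        # Random access anywhere, with no windowing arithmetic at all.
        assert view[9:17] == b"line two"
        decoded = view[:].decode("utf-8")

assert decoded == text
os.remove(path)
```

The OS's paging machinery does the windowing for you, which is precisely why the "map a subset of the file" code becomes unnecessary.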
[...]
So, if I understand correctly what you wrote, I am not concerned with
mapping file at once, I reserve all VM I will need for one file
(CreateFileMapping). But I am concerned when it comes to commit
(MapViewOfFile) because that's where memory resources are really
consumed. Am I missing something?

Yes, I think so. See above. :)
Thank you very much. You clarified this whole mapping issue for me :-)
Thanks once again

Oct 16 '06 #24

<ma**************@gmail.com> wrote in message
news:11**********************@m73g2000cwd.googlegroups.com...
Methinks that we are giving the same subject different names. Here is a
quotation from MSDN: [...]
I'm not sure of that. You still seem to believe that CreateFileMapping
affects the use of the virtual address space, and you still seem to believe
that calling MapViewOfFile affects the use of physical memory.

As far as this specific misunderstanding goes, IMHO you should be very
careful about believing a statement found in a general article, rather than
specific comments found in the documentation for the functions you're trying
to understand. In particular, the comments found in the documentation for
CreateFileMapping, MapViewOfFile, and MapViewOfFileEx trump any other
documentation you might find, unless you have independent confirmation that
suggests otherwise.

In this case, I am aware of no other independent confirmation. It seems
most likely to me that the article is simply mentioning in passing some
behavior of the functions that may or may not be relevant to your use.

If you look at the documentation for CreateFileMapping, you'll note that
there is a way of calling it for a mapping not backed by a specific disk
file. In this use, it may be true that CreateFileMapping reserves a virtual
address range. However, that doesn't mean that that's what the function
does in all cases.

As I've pointed out, the behavior of CreateFileMapping and MapViewOfFile(Ex)
are specifically documented contrary to your understanding and contrary to
the article you've referenced. In particular, it would make no sense:

1) That CreateFileMapping could reserve any virtual address space, when
the whole point of the MapViewOfFileEx function is to specify a specific
virtual address at which to map the file. If CreateFileMapping had already
reserved virtual address space, then there would be no way to ask for a
specific virtual address later, as CreateFileMapping would have already
determined the mapped virtual address (you can't reserve a range of virtual
address space without knowing where in the virtual address space it is), or

2) That CreateFileMapping can reserve virtual address space before
knowing how much address space to reserve. When you call CreateFileMapping,
you tell it the full extent of the file you wish to map. It is perfectly
legal for this extent to be larger than 2GB. How would CreateFileMapping
reserve virtual address space in this case? Does it pick an arbitrary
length for the range? What happens when you ask to map more than the
arbitrary length it chose? No, I think it much more likely that the
documentation is correct and that virtual address space is not reserved
until you call MapViewOfFile(Ex).

By the way, you should be able to use the VirtualXXX functions or possibly
performance counters to confirm the behavior. I haven't looked closely at
what's available, but I'm sure there's some mechanism for querying the state
of the process's virtual memory. In particular, if you call
CreateFileMapping and the available virtual memory before the call and after
the call is reduced by the size of the mapping you've requested, then that
would support your interpretation that CreateFileMapping is reserving
virtual address space.

I suspect you'll find that a large change in the virtual memory available
happens only after MapViewOfFile. :)
[...]
But is there any difference if it turns out that mapping the whole file at
once will do?
No, I don't think so. If it is suitable for your needs to map the entire
file at once, then any issues related to windowing the file simply go away.
Yes, and that's the whole point. You are perfectly right about MMF; I
shouldn't have worried about tearing, because I can map the file all at
once and rely on the OS when it comes to swapping.
Indeed. :)
Thank you very much. You clarified this whole mapping issue for me :-)
Thanks once again
You're very welcome. I only regret that it seems as though none of this
thread has anything to do with C#. :)

Pete
Oct 17 '06 #25


Peter Duniho wrote:
<ma**************@gmail.com> wrote in message
news:11**********************@m73g2000cwd.googlegroups.com...
You're very welcome. I only regret that it seems as though none of this
thread has anything to do with C#. :)
It has! Did you forget that I implemented this in C#? :-)))
Pete
Oct 17 '06 #26
