"Robert Bevington" <rbevington@freenet.de> wrote:
Quote:
Hi all,
>
I ran into memory problems while trying to search and replace a very
large text file. To solve this I break the file up into chunks and
run the search and replace on each chunk. This works fine and has
solved the OutOfMemory problem.
>
However, on the last loop when the array c is written to CleanTMX, a
number of 0x00 characters are written at the end of the file. This
causes problems in a further XMLTransformation as this character is
not allowed in XML. I looked at the values of the index. The problem
seems to be caused by index values at the end of the array being set
to Nothing.
>
Question: How can I get rid of these characters? Or how can I reduce
the array to only contain index values that are not Nothing?
>
Here's the code that writes the CleanTMX file:
>
' ReadChunkSize is a user-defined setting, normally set to 10000
Dim c(My.Settings.ReadChunkSize) As Char
>
Using sr As StreamReader = New StreamReader(OriginalTMX,
        System.Text.Encoding.UTF8, True)
    Do While sr.Peek() >= 0
        sr.Read(c, 0, c.Length)
        Dim i As Integer
        For i = 0 To arrFind.Length - 1
            c = Regex.Replace(c, arrFind(i), arrReplace(i))
        Next
        Try
            Using sw As StreamWriter = New StreamWriter(CleanTMX, True,
                    System.Text.Encoding.UTF8)
                sw.Write(c)
            End Using
>
        Catch ex As Exception
        End Try
    Loop
>
Would really appreciate any help on this one.
As far as I know, sr.Read returns the number of characters actually
read, which can be smaller than c.Length for the last chunk. Hence, you
have to write only as many characters as have been read.
Dim charCount As Integer
charCount = sr.Read(c, 0, c.Length)
...
sw.Write(c, 0, charCount)
I think this explains the additional characters.
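Put together, the read/write loop might look roughly like the sketch below. It is only an illustration: the module name, the CleanFile helper and its parameters are made up here, and it opens the writer once instead of reopening the file in append mode for every chunk. Since Regex.Replace works on a String, the sketch builds a String from just the characters that were read, so the unused tail of the array never reaches the output.

```vb
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module ChunkedReplace
    ' Hypothetical helper: the name and parameters are assumptions for illustration.
    Sub CleanFile(inputPath As String, outputPath As String,
                  arrFind As String(), arrReplace As String(),
                  chunkSize As Integer)
        Dim c(chunkSize - 1) As Char ' chunkSize elements

        Using sr As New StreamReader(inputPath, Encoding.UTF8, True),
              sw As New StreamWriter(outputPath, False, Encoding.UTF8)
            ' Read returns the number of characters read, 0 at end of file.
            Dim charCount As Integer = sr.Read(c, 0, c.Length)
            Do While charCount > 0
                ' Use only the characters that were actually read.
                Dim chunk As New String(c, 0, charCount)
                For i As Integer = 0 To arrFind.Length - 1
                    chunk = Regex.Replace(chunk, arrFind(i), arrReplace(i))
                Next
                sw.Write(chunk)
                charCount = sr.Read(c, 0, c.Length)
            Loop
        End Using
    End Sub
End Module
```

Looping on the return value of Read also makes the Peek call unnecessary. This sketch still has the chunk boundary problem: a match that straddles two chunks is not found.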
However, you should reposition the file pointer after reading a chunk.
I'm not sure if that's possible using the StreamReader because of the
internal buffer, so you'd have to use a BinaryReader and do the UTF8
decoding on your own, while being able to set the file pointer
backwards. Otherwise, you will not recognize search strings that are
split across chunk boundaries. For example,
chunk #1: "Robert B"
chunk #2: "evington"
You won't find "Bev" in either of the chunks.
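A different way to avoid split matches, without repositioning the file pointer, is to hold back the tail of each chunk and process it together with the next one. The sketch below is only one possible approach and rests on a strong assumption: there is some delimiter character (for example ">" in an XML file, or a line feed) that can never be part of a match. The module, the helper names and the delimiter parameter are all hypothetical.

```vb
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module BoundarySafeReplace
    ' Hypothetical helper. Assumes no match can ever contain the delimiter.
    Sub CleanFile(inputPath As String, outputPath As String,
                  arrFind As String(), arrReplace As String(),
                  chunkSize As Integer, delimiter As Char)
        Dim c(chunkSize - 1) As Char
        Dim carry As String = "" ' text held back from the previous chunk

        Using sr As New StreamReader(inputPath, Encoding.UTF8, True),
              sw As New StreamWriter(outputPath, False, Encoding.UTF8)
            Dim charCount As Integer = sr.Read(c, 0, c.Length)
            Do While charCount > 0
                Dim buffer As String = carry & New String(c, 0, charCount)
                ' Cut just after the last delimiter; the rest waits for the next chunk.
                ' If there is no delimiter, cut is 0 and the whole buffer is held back.
                Dim cut As Integer = buffer.LastIndexOf(delimiter) + 1
                carry = buffer.Substring(cut)
                sw.Write(ReplaceAll(buffer.Substring(0, cut), arrFind, arrReplace))
                charCount = sr.Read(c, 0, c.Length)
            Loop
            ' Whatever follows the final delimiter is processed last.
            sw.Write(ReplaceAll(carry, arrFind, arrReplace))
        End Using
    End Sub

    ' Applies every find/replace pair in order to one piece of text.
    Function ReplaceAll(value As String, arrFind As String(),
                        arrReplace As String()) As String
        For i As Integer = 0 To arrFind.Length - 1
            value = Regex.Replace(value, arrFind(i), arrReplace(i))
        Next
        Return value
    End Function
End Module
```

With a chunk size of 8 and the text "Robert Bevington", the first chunk "Robert B" is cut after the space, so "B" is held back and the second piece processed is "Bevington", where "Bev" is found. If the delimiter is rare, the held-back text keeps growing and the memory problem can come back, and patterns that use anchors or lookarounds may behave differently at the cut points.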
Armin