Convert large XML file to UTF-8

bbb

Hi,
I need to convert XML files from Japanese encoding to UTF-8.
I was using the following code:

using (FileStream fs = File.OpenRead(fromFile))
{
    int fileSize = (int)fs.Length;
    int buffer = fileSize;
    byte[] b = new byte[buffer];

    using (StreamWriter sw = new StreamWriter(toFile, true, toEnc))
    {
        while (fs.Read(b, 0, buffer) > 0)
        {
            byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b);

            // Convert the new byte[] into a char[] and then into a string.
            char[] utf8Chars = new char[toEnc.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];
            toEnc.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);

            string utfString = new string(utf8Chars);

            sw.Write(replaceXmlEncodingHeader(utfString, fromEncHeader, toEncHeader));
        }
    }
}

Everything worked fine until we got a 100 MB file, which threw an
OutOfMemoryException.
I tried reading the file in pieces:
if (fileSize > 30000000)
{
    buffer = 1024;
}
but then the end of the buffer is filled with other bytes (say the last
chunk is 900 bytes: 124 bytes from somewhere else get added), so my
converted XML is not well formed.

Please help.
Thanks in advance.
Regards

Aug 5 '06 #1

Nicholas Paldino [.NET/C# MVP]

Your code to read in chunks isn't correct. You are assuming that the
buffer is filled completely when, in reality, it is not. Chances are you
are adding the 124 bytes yourself, instead of writing only what was read.

You might also want to consider some mechanism other than XML for storing
the data in these files. At 100 MB, it's going to be very difficult to
process this file for updates if the time comes.

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Aug 5 '06 #2

bbb

Nicholas,
Thanks for the reply.
Unfortunately I cannot choose the format (XML); that is given.
100 MB files appear only once in a while, but they need to be processed.
I understand that I'm doing it the wrong way.
Can you please show me how to do it correctly?

Thanks in advance.
Regards

Aug 5 '06 #3

Nicholas Paldino [.NET/C# MVP]

bbb,

You aren't checking the return value of the call to Read. That value
tells you how many bytes were read into the buffer. You should then
convert only that number of bytes, not the whole buffer.
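
The pattern described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the class and method names (ChunkedRead, CopyInChunks) are made up for the example, and it simply copies bytes so that the effect of the return value is easy to see.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ChunkedRead
{
    // Hypothetical helper: copies input to output in fixed-size chunks and
    // returns the number of bytes each Read call actually delivered.
    public static List<int> CopyInChunks(Stream input, Stream output, int bufferSize)
    {
        var chunkSizes = new List<int>();
        byte[] buffer = new byte[bufferSize];
        int bytesRead;

        // Read returns how many bytes it placed in the buffer; 0 means end of stream.
        while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Use only the first bytesRead bytes. Anything after that in the
            // buffer is left over from an earlier iteration.
            output.Write(buffer, 0, bytesRead);
            chunkSizes.Add(bytesRead);
        }
        return chunkSizes;
    }
}
```

With a 2,948-byte MemoryStream and a 1,024-byte buffer, the chunk sizes come back as 1024, 1024 and 900, and the output holds exactly 2,948 bytes. Processing the whole buffer on the last pass would instead reuse 124 stale bytes from the previous chunk.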
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Aug 5 '06 #4

Greg Young

int bytesread = fs.Read(b, 0, buffer);

On the last chunk, bytesread is 900 rather than 1024, but you then process
the entire array:

byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b);

Just make this process only bytesread bytes of the buffer. This overload:
http://msdn.microsoft.com/library/de...vertTopic1.asp
should handle this for you, ending up with:

byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b, 0, bytesread);
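
One caveat that is not raised in the thread: Encoding.Convert keeps no state between calls, so with a multi-byte source encoding a fixed-size byte chunk can end in the middle of a character. A possible alternative, sketched below under that assumption, is to let StreamReader do the decoding, since it carries partial characters over from one read to the next. The names XmlTranscoder and ConvertToUtf8 are hypothetical, and the sketch does not rewrite the encoding named in the XML declaration, which the original code handles with replaceXmlEncodingHeader.

```csharp
using System;
using System.IO;
using System.Text;

public static class XmlTranscoder
{
    // Hypothetical helper: streams text from fromEnc to UTF-8 without
    // loading the whole file into memory.
    public static void ConvertToUtf8(Stream input, Stream output, Encoding fromEnc, int bufferChars = 4096)
    {
        // leaveOpen: true keeps both streams usable by the caller afterwards.
        // UTF8Encoding(false) writes no byte-order mark.
        using (var reader = new StreamReader(input, fromEnc, false, 1024, true))
        using (var writer = new StreamWriter(output, new UTF8Encoding(false), 1024, true))
        {
            char[] buffer = new char[bufferChars];
            int charsRead;

            // Read returns the number of characters decoded; 0 means end of input.
            while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Write only the characters that were actually read.
                writer.Write(buffer, 0, charsRead);
            }
        }
    }
}
```

For files, the two streams could be opened with File.OpenRead(fromFile) and File.Create(toFile). Memory use then stays bounded by the buffer sizes rather than the file size.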

Cheers,

Greg Young
MVP - C#
http://codebetter.com/blogs/gregyoung

Aug 5 '06 #5

bbb

Thank you very much for your help.
It works perfectly.

Regards,


Aug 6 '06 #6
