Convert large XML file to UTF-8

bbb
Hi,
I need to convert XML files from a Japanese encoding to UTF-8.
I was using the following code:

using (FileStream fs = File.OpenRead(fromFile))
{
    int fileSize = (int)fs.Length;
    int buffer = fileSize;
    byte[] b = new byte[buffer];

    using (StreamWriter sw = new StreamWriter(toFile, true, toEnc))
    {
        while (fs.Read(b, 0, buffer) > 0)
        {
            byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b);

            // Convert the new byte[] into a char[] and then into a string.
            char[] utf8Chars = new char[toEnc.GetCharCount(utf8Bytes, 0, utf8Bytes.Length)];
            toEnc.GetChars(utf8Bytes, 0, utf8Bytes.Length, utf8Chars, 0);

            string utfString = new string(utf8Chars);

            sw.Write(replaceXmlEncodingHeader(utfString, fromEncHeader, toEncHeader));
        }
    }
}

Everything worked fine until we got a 100MB file, at which point I got an
OutOfMemoryException.
I've tried to read it in pieces:
if (fileSize > 30000000)
{
    buffer = 1024;
}
but then the end of the buffer is filled with other bytes (let's say the last
chunk is 900 bytes: it adds 124 bytes from somewhere else), so my
converted XML is not well formed.

Please help.
Thanks in advance.
Regards

Aug 5 '06 #1
Your code to read in chunks isn't correct. You are assuming that the
buffer is filled completely on every read, when in reality it is not.
Chances are you are adding the 124 bytes yourself, instead of writing
only what was read.

You might also want to consider some mechanism other than XML for storing
the data in these files. At 100MB, it's going to be very difficult to
process this file for updates if the time comes.

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Aug 5 '06 #2
bbb
Nicholas,
Thanks for the reply.
Unfortunately I cannot choose the format (XML); that's given.
100MB files appear only once in a while, but they need to be processed.
I understand that I'm doing it the wrong way.
Can you please show me how to do it correctly?

Thanks in advance.
Regards
Aug 5 '06 #3
bbb,

You aren't checking the return value of the call to Read. That value
tells you how many bytes were read into the buffer. You should
therefore convert only that number of bytes, not the whole
buffer.
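
To make the point concrete, here is a minimal sketch of where the 124 extra bytes can come from. It is not the poster's code: the class name PartialReadDemo and the method UnfilledTail are hypothetical, and a MemoryStream of 1924 bytes (1024 + 900) stands in for the real file. It only uses Stream.Read, whose return value is the number of bytes actually placed in the buffer.

```csharp
using System;
using System.IO;

static class PartialReadDemo
{
    // Reads the stream in bufferSize chunks and returns how many bytes at the
    // end of the buffer were NOT overwritten by the final non-empty read.
    public static int UnfilledTail(Stream fs, int bufferSize)
    {
        byte[] b = new byte[bufferSize];
        int lastRead = 0;
        int bytesRead;
        while ((bytesRead = fs.Read(b, 0, b.Length)) > 0)
        {
            // Only b[0 .. bytesRead) holds fresh data; anything beyond it is
            // left over from an earlier iteration.
            lastRead = bytesRead;
        }
        return b.Length - lastRead;
    }

    static void Main()
    {
        // 1024 + 900 bytes, mirroring the 900-byte last chunk in the question.
        using (MemoryStream fs = new MemoryStream(new byte[1924]))
        {
            Console.WriteLine(UnfilledTail(fs, 1024)); // prints 124
        }
    }
}
```

Converting the whole buffer after that last read would therefore also convert 124 stale bytes from the previous chunk. Note that Read is allowed to return fewer bytes than requested on any call, not only the last one, so the count should be honoured every time.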
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Aug 5 '06 #4
Capture the return value of Read:

int bytesread = fs.Read(b, 0, buffer);

On the last chunk, bytesread is 900, not 1024, but you then process the
entire array:

byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b);

Just make this process only bytesread bytes of the buffer. The overload
described at
http://msdn.microsoft.com/library/de...vertTopic1.asp
should handle this for you, ending up with:

byte[] utf8Bytes = Encoding.Convert(fromEnc, toEnc, b, 0, bytesread);
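
One thing the byte-chunk approach leaves open: Encoding.Convert keeps no state between calls, so with a multi-byte source encoding a chunk boundary can fall in the middle of a character and that character would be converted incorrectly. A possible alternative, sketched below under stated assumptions, is to let StreamReader do the decoding, since it carries decoder state from one read to the next. The names StreamTranscoder, Transcode and fixHeader are hypothetical; fixHeader stands in for the poster's replaceXmlEncodingHeader and is applied to the first chunk only, on the assumption that the XML declaration fits inside it.

```csharp
using System;
using System.IO;
using System.Text;

static class StreamTranscoder
{
    // Copies text from input (decoded with fromEnc) to output (encoded with
    // toEnc) in fixed-size char chunks, so memory use does not grow with
    // file size. fixHeader may be null; otherwise it rewrites the first chunk.
    public static void Transcode(Stream input, Encoding fromEnc,
                                 Stream output, Encoding toEnc,
                                 Func<string, string> fixHeader)
    {
        char[] chars = new char[4096];
        using (StreamReader reader = new StreamReader(input, fromEnc))
        using (StreamWriter writer = new StreamWriter(output, toEnc))
        {
            bool first = true;
            int charsRead;
            while ((charsRead = reader.Read(chars, 0, chars.Length)) > 0)
            {
                if (first && fixHeader != null)
                {
                    // Assumption: the XML declaration lies within this chunk.
                    writer.Write(fixHeader(new string(chars, 0, charsRead)));
                }
                else
                {
                    // Write only the chars actually read, not the whole array.
                    writer.Write(chars, 0, charsRead);
                }
                first = false;
            }
        }
    }

    static void Main(string[] args)
    {
        // Hypothetical usage: args[0] = source file, args[1] = target file,
        // args[2] = source encoding name. On newer .NET runtimes, legacy code
        // pages may need an encoding provider registered before GetEncoding.
        using (FileStream input = File.OpenRead(args[0]))
        using (FileStream output = File.Create(args[1]))
        {
            Transcode(input, Encoding.GetEncoding(args[2]),
                      output, new UTF8Encoding(false), null);
        }
    }
}
```

Passing new UTF8Encoding(false) writes UTF-8 without a byte order mark; Encoding.UTF8 could be used instead if a BOM is wanted.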

Cheers,

Greg Young
MVP - C#
http://codebetter.com/blogs/gregyoung

Aug 5 '06 #5
bbb
Thank you very much for your help.
It works perfectly.

Regards,

Aug 6 '06 #6
