Bytes | Software Development & Data Engineering Community

Reading a BIG text file (200Mb)

ACC
Hi!

I'm developing an application to analyze some information that is
inside a text file. The size of the text file ranges from 50 MB to
220 MB...

The text file is like a table, with "rows" and "columns", and several
"cells" represent objects in my application.

What I need to do is read the text file line by line, parse the
information, create my objects, and save them somewhere (an ArrayList) so
that I can process them afterwards.

And the question is: how do I do this in an efficient way?

Thank you in advance for your help,

ACC

Feb 16 '06 #1
In .NET 2.0, the System.IO.File class has a ReadAllLines method that
accepts a file path. That method returns a string array of all the
lines in that file, split on the carriage return/line feed pairs. You can
then break each line apart further by doing a Split on some delimiter.
The .NET framework does all of the file reading for you, so presumably it
does it fairly efficiently.
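As a minimal sketch of that approach (the file name `file.txt` and the comma delimiter are assumptions - adjust them to your format; note the follow-ups in this thread warn against this for files as large as yours):

```csharp
using System;
using System.IO;

public class ReadAllExample
{
    public static void Main()
    {
        // Reads the entire file into a string array in one call.
        string[] lines = File.ReadAllLines("file.txt");

        foreach (string line in lines)
        {
            // Hypothetical comma delimiter; use whatever separates
            // your "columns".
            string[] cells = line.Split(',');
            // ... build an object from cells here ...
        }
        Console.WriteLine(lines.Length);
    }
}
```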

I hope this helps.

Feb 16 '06 #2
ACC,

Is there any reason why you need all of the items in your array list at
one time? Can you process it in batches?

What is the format of your file? If it is a delimited text file, you
might want to consider loading it into a table in SQL Server through Bulk
Copy Services, and then perform your operations on the table. Of course,
this is only applicable if you are performing some sort of set operations.
If you are not (there is some super complex calc you are doing), then this
isn't appropriate.

    Ultimately, processing each object as it comes in is the best way,
keeping around only what you need (instead of all of them).
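A hedged sketch of that streaming approach (the file name, the tab delimiter, and the placeholder parsing are assumptions):

```csharp
using System;
using System.IO;

public class StreamingExample
{
    // Reads the file one line at a time; only a single line (plus whatever
    // running results you choose to keep) is ever in memory.
    public static long ProcessFile(string path)
    {
        long rowCount = 0;
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] cells = line.Split('\t');
                // Parse the cells and update running totals here, then
                // let the line and any temporary object be collected.
                rowCount++;
            }
        }
        return rowCount;
    }

    public static void Main()
    {
        Console.WriteLine(ProcessFile("file.txt"));
    }
}
```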

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com


Feb 16 '06 #3
This would be a VERY bad idea. If the file is 50-200MB, reading all of
those lines in the file into a string (or array of strings) is going to be a
performance nightmare.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Feb 16 '06 #4
But of course, you can do all the reading on a different thread.

Feb 16 '06 #5
    It's still going to be a nightmare. You are going to read 200 MB into
your app? Doing this on another thread isn't going to make a difference;
ending up with an array of over 400 MB most likely is NOT a good idea.

    And I say 400 MB because chances are the file is ASCII, and when it is
read into Unicode strings it will essentially double in size.

    Processing this line-by-line is the only way to go.

--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com


Feb 16 '06 #6
I can see your point.

Feb 16 '06 #7
Hi,


Do you need all the data at once? You could read one record at a time and
update the variables where you keep the running results.

If you need ALL the data at once, you would be better off using MSDE; it's
free and built on the SQL Server engine, and you can use T-SQL to handle
the data more easily. You can create a DTS package to import the data into
MSDE, then use ADO.NET to query the data, and you can execute the DTS
package from inside your code, so you could change the source of the data.

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation
Feb 16 '06 #8
Hi,

"Nicholas Paldino [.NET/C# MVP]" <mv*@spam.guard.caspershouse.com> wrote in
message news:%2****************@tk2msftngp13.phx.gbl...
This would be a VERY bad idea. If the file is 50-200MB, reading all of
those lines in the file into a string (or array of strings) is going to be
a performance nightmare.

Not only that, but it's almost useless, as the data is still raw; you need
to split each line into a record.
--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation
Feb 16 '06 #9
ACC
Thank you all for replying!!

Unfortunately, yes, I need all the data...

I'll try to explain a little bit more.
The text file represents some processes that ran on a DSP, and each
line represents something that happened. The users will use my
application to check whether the DSP is working as predicted.

The problem is that each user wants to check something different, so I
never know what they will want to check when I load the file. For
instance, one may want to check whether a specific process is occurring
every 10 ms, and if not, then there is a problem in the DSP. Another
might just want to know the average of some values.

So, as you see, I need to work with all the data.

About SQL Server: I don't have it...

At this moment I can actually read a text file of 203 MB, process it,
and display it in a datagrid in 2 minutes. I'm using a StreamReader,
reading line by line, processing each line, and passing to the next.
Thank you for your help!

Feb 17 '06 #10
I can see what you are doing, but can I challenge the assumption that you
need to hold all the data in memory?

In particular, (line-by-line) readers are *incredibly* well suited to
computing averages etc. - you simply read one row at a time, find the value
you want, and then adjust a tally somewhere - either using a total/count
pair (but the total can get big) or by adjusting a running average on the
fly (less accurate, but less risk of overflow).

Equally, checking whether something happens every 10 ms just involves
keeping track of the last time you found a match; if the row you are
reading is more than 10 ms past it (assuming rows are incremental in time),
then you have a problem.
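Both ideas fit in a single streaming pass. As a sketch (the `trace.txt` name and the column layout - timestamp in ms first, value second, tab-delimited - are assumptions):

```csharp
using System;
using System.IO;

public class TallyExample
{
    // One pass over the file: maintains a running average of the value
    // column and flags any gap of more than 10 ms between consecutive rows.
    public static double Average(string path, out bool gapFound)
    {
        double average = 0;
        long count = 0;
        double lastSeenMs = double.NaN;
        gapFound = false;

        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] cells = line.Split('\t');
                double timeMs = double.Parse(cells[0]);
                double value = double.Parse(cells[1]);

                count++;
                // Incremental mean: avoids an ever-growing total.
                average += (value - average) / count;

                // More than 10 ms since the previous row? Flag it.
                if (!double.IsNaN(lastSeenMs) && timeMs - lastSeenMs > 10.0)
                    gapFound = true;
                lastSeenMs = timeMs;
            }
        }
        return average;
    }
}
```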


If you can isolate the common "what you need to do" bits, I'm pretty sure
you can do most of this without ever loading a *single* line into an object,
let alone all of them. You could also do some tricks with delegates and/or
custom iterators so that you only write the stream code once, and then
just write stubs that work on the formatted data; in this scenario, perhaps
a custom iterator that returns a single instance per call would be in order
(so you only ever have one row in memory) - quite easy to do, and pretty
quick.
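Such an iterator is a natural fit for C# 2.0's yield return (the Record fields and the tab-delimited layout below are assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class Record
{
    public double TimeMs;
    public string Address;
    // ... whatever other fields a row carries ...
}

public class IteratorExample
{
    // Yields one parsed Record per MoveNext(); only a single row is ever
    // held in memory, and the stream code is written exactly once.
    public static IEnumerable<Record> ReadRecords(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] cells = line.Split('\t');
                Record r = new Record();
                r.TimeMs = double.Parse(cells[0]);
                r.Address = cells[1];
                yield return r;
            }
        }
    }
}
```

The analysis "stubs" then just foreach over ReadRecords(path) and never touch the raw stream themselves.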

Otherwise, you are just going to cripple the client unless you have some
serious memory; and even then you'd have to switch most UI elements to
virtual mode, as they are *not* going to like having that many items in a UI
collection.

Just my thoughts,

Marc
Feb 17 '06 #11
Thinking of virtual mode - is there any way you could attach the UI to the
file using virtual mode? i.e. only read lines as requested by the UI?

You might have to parse the file once first and build an index of the
offset of every <x> lines so that you can jump directly to the right place
in the stream... and perhaps add a file-system watcher so that you get
notified if the file changes (so you can re-index and update the UI), but
it might be worth a few minutes' consideration.
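A sketch of building such an index (the stride value and file name are assumptions; each entry is the byte offset at which a stride-th line starts):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class OffsetIndexExample
{
    // Scans the file once, recording the byte offset of the start of every
    // stride-th line, so the UI can later Seek() straight to a page of rows.
    public static List<long> BuildIndex(string path, int stride)
    {
        List<long> offsets = new List<long>();
        offsets.Add(0); // line 0 starts at the beginning of the file
        using (FileStream fs = File.OpenRead(path))
        {
            long lineNumber = 0;
            long position = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                position++;
                if (b == '\n')
                {
                    lineNumber++;
                    if (lineNumber % stride == 0)
                        offsets.Add(position);
                }
            }
        }
        return offsets;
    }
}
```

To show lines around line n, Seek() to offsets[n / stride] and read forward; FileStream buffers internally, so the byte-at-a-time loop is tolerable for a one-off indexing pass.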

Just a thought,

Marc
Feb 17 '06 #12
ACC
Marc,

It's not that simple...

A line of the text file has at least the following:

state number
the address of the memory that is being accessed
the data in the memory
the instant of time when it occurred

When the user just wants to check whether object X (with address = x)
occurs every 10 ms, that's OK...

But he might want to find all object1 (object2 with address X followed by
object3 with address Y and data greater than W, followed by object4...).

The search is more complex... but might be possible!! :)
But I still need to load the text file into a datagrid :(

Thanks
ACC


Feb 17 '06 #13
Hi,

The problem is that each user wants to check something different. So, I
never know what they would like to check when I load the file. For
instance, one may want to check if a specific process is occurring every
10 ms, and if not, then there is a problem in the DSP. Another might just
want to know the average of some values.

so, as you see, I need to work with all the data.

You need all the data in memory only if you are going to make more than
one pass over it and the results are not cumulative.

About SQL Server, I don't have it...

You do not need SQL Server; MSDE is like a "SQL Server lite", and IT'S FREE.

Now, if the file changes on each run (which it probably does), then you
need to evaluate the usefulness of first inserting that data into the SQL
engine. If you only make one pass per file, you do not need to do this and
your current solution is an acceptable compromise.
--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation
Feb 17 '06 #14

This discussion thread is closed.