Help - parsing large text files

Hi,
I have a web application that looks for a particular string in a set
of huge files (the files grow into MBs; the largest I have seen is 30 MB),
searching with regular expressions. The string can occur multiple times in
a file, and whenever the string is found in a line, the whole line must be
printed in the output. What I am doing is (a rough sketch in code follows
the list):
1. Traverse each file in the directory.
2. Read each file line by line.
3. Match the regular expression against each line. If the search string is
there, add the line to a DataTable (which will be used for display).
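Roughly, the code looks like this (the DataTable columns and the Search
method are illustrative, not our exact code):

// Minimal sketch of the current approach: scan every file in the
// directory line by line and collect the matching lines.
using System;
using System.Data;
using System.IO;
using System.Text.RegularExpressions;

class LineScanner
{
    static DataTable Search(string directory, string pattern)
    {
        Regex regex = new Regex(pattern, RegexOptions.Compiled);
        DataTable results = new DataTable();
        results.Columns.Add("File", typeof(string));
        results.Columns.Add("Line", typeof(string));

        foreach (string path in Directory.GetFiles(directory))
        {
            using (StreamReader reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    if (regex.IsMatch(line))      // keep every matching line
                        results.Rows.Add(path, line);
                }
            }
        }
        return results;
    }
}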

The problem I am facing with this approach is that the operation takes too
long, due to both the size of the files and the number of files to search.

The limitations are:
1. I cannot read an entire file into a string and then do the search,
because of the size of the files.
2. Network - the files are scattered over a LAN.
3. I could not find a way to use BufferedStream or something like that
(what if the searched string itself is split across different chunks?).

Can anyone help me with this?

Thanks & Regards,
Arun Prakash. B
Nov 16 '05 #1
Hi,

The only way I see to improve your performance is to use threads to read
several files at the same time, and even so I'm not sure this would
help you much.
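A minimal sketch of the idea (Parallel.ForEach and ConcurrentBag come from
the newer Task Parallel Library; on older frameworks you would use
ThreadPool.QueueUserWorkItem with your own locking; names are illustrative):

// Search several files concurrently; the regex is thread-safe for
// matching, and ConcurrentBag collects results without explicit locks.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class ParallelScanner
{
    static ConcurrentBag<string> Search(string[] files, string pattern)
    {
        Regex regex = new Regex(pattern, RegexOptions.Compiled);
        ConcurrentBag<string> matches = new ConcurrentBag<string>();

        Parallel.ForEach(files, path =>
        {
            foreach (string line in File.ReadLines(path))
            {
                if (regex.IsMatch(line))
                    matches.Add(path + ": " + line);
            }
        });
        return matches;
    }
}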

Also, this does not seem like a good task for a web app; you may get
timeouts too often.
It would be better if the user could "schedule a job"; the search is then
done in the background (MSMQ?) and does not impact the user experience.
Hope this helps,
--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation

"ArunPrakash" <ar**********@yahoo.com> wrote in message
news:61**************************@posting.google.c om...
Hi,
I have a web application that looks for a particular string in a set
of huge files( the files grow into MBs - max i have seen is 30 MB ). (
search using reg expressions ). the string can occur multiple times in
a file. whenver the string is found in a line, the whole line must be
printed in the output. What i am doing is,
1. traversing each file in the directory
2. In each file, read line by line
3. Match the regular expression. If the search string is there, add
the line to a datatable( which will be used for display )

The problems am facing with this approach is the operation takes tooo
long due to the size of the files and number of files to search too.

The limitations are.
1. I cannot read the entire file into a string and then do the search
because of the size of the files.
2. Network - The files are scattered over a LAN.
3. I could not find a way to use BufferedStream or something like that
( what if the searched string itself is split accross different
chunks? )

can anyone help me with this.

Thanks & Regards,
Arun Prakash. B

Nov 16 '05 #2
Actually, we considered a Windows application too, but found that a web
application suits our needs better. The problem with the background
process is how to pass the info back to the end user (the HTTP
protocol!).

Nov 16 '05 #3
I think you should first identify the bottleneck: measure the number of
bytes/second your current application processes, and compare that to the
throughput of reading big chunks of data without any processing.
If the network is the bottleneck, the best thing you can do is move the
processing to where the files are, transferring only the results; or buy a
faster network.
Otherwise, I'd suggest reading the file in big chunks: searching for a RE
should be faster on one long string than on many short ones. You can look
for newlines to the "left" and to the "right" of the matches.
I assume the RE is compiled?
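A sketch of the chunked approach (sizes and names are illustrative; the
incomplete last line of each chunk is carried into the next read, so a
match split across a chunk boundary is still found):

// Read in large chunks, run the compiled regex over all complete lines
// at once, and expand each match to its enclosing line.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class ChunkScanner
{
    static List<string> Search(string path, Regex regex)
    {
        List<string> results = new List<string>();
        char[] buffer = new char[1 << 20];   // 1 MB chunks (illustrative)
        string carry = "";                   // incomplete last line so far

        using (StreamReader reader = new StreamReader(path))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                string text = carry + new string(buffer, 0, read);
                int lastNewline = text.LastIndexOf('\n');
                if (lastNewline < 0) { carry = text; continue; }

                string complete = text.Substring(0, lastNewline + 1);
                carry = text.Substring(lastNewline + 1);
                AddMatchingLines(regex, complete, results);
            }
        }
        if (carry.Length > 0)                // file may not end with a newline
            AddMatchingLines(regex, carry, results);
        return results;
    }

    static void AddMatchingLines(Regex regex, string text, List<string> results)
    {
        int lastEnd = -1;
        foreach (Match m in regex.Matches(text))
        {
            // expand the match to the enclosing line
            int start = text.LastIndexOf('\n', m.Index) + 1;
            int end = text.IndexOf('\n', m.Index);
            if (end < 0) end = text.Length;
            if (end == lastEnd) continue;    // same line matched twice
            lastEnd = end;
            results.Add(text.Substring(start, end - start).TrimEnd('\r'));
        }
    }
}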

Niki

"ArunPrakash" <ar**********@yahoo.com> wrote in
news:61**************************@posting.google.c om...
Hi,
I have a web application that looks for a particular string in a set
of huge files( the files grow into MBs - max i have seen is 30 MB ). (
search using reg expressions ). the string can occur multiple times in
a file. whenver the string is found in a line, the whole line must be
printed in the output. What i am doing is,
1. traversing each file in the directory
2. In each file, read line by line
3. Match the regular expression. If the search string is there, add
the line to a datatable( which will be used for display )

The problems am facing with this approach is the operation takes tooo
long due to the size of the files and number of files to search too.

The limitations are.
1. I cannot read the entire file into a string and then do the search
because of the size of the files.
2. Network - The files are scattered over a LAN.
3. I could not find a way to use BufferedStream or something like that
( what if the searched string itself is split accross different
chunks? )

can anyone help me with this.

Thanks & Regards,
Arun Prakash. B

Nov 16 '05 #4
Yeah, we've considered copying the files to a local folder too. But
again, I guess the bottleneck is the size of the files themselves (based
on some benchmarks we did with files on various machines and locally).
Does anybody know how Unix grep optimizes the search? I found a grep
utility in C# which again does the same thing (reading line by line and
finding a match). If I can find out how Unix grep optimizes the search, I
can implement that and see how it works.

Nov 16 '05 #5
Hi,

If the process takes a long time, as I'm sure it does, it is not
suitable to run in real time. Do as I said: schedule it, and the user can
then be notified by email or when he logs back into the page. Otherwise
you will have to create a "wait" page, like the one you get when using
Expedia, hotels.com, etc.
Even so, I believe this search can take far longer than is advisable
in a web app.

cheers,

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation

"ArunPrakash Balakrishnan" <ar**********@yahoo.com> wrote in message
news:uv*************@tk2msftngp13.phx.gbl...
Actually we considered windows applications also but found that web
application suiting our needs better. The problem with the background
process is how to pass back the info to the end-user( the HTTP protocol
!!! ).

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!

Nov 16 '05 #6
"ArunPrakash Balakrishnan" <ar**********@yahoo.com> wrote in
news:ed**************@TK2MSFTNGP11.phx.gbl...
> Yeah, we've considered copying the files to a local folder too.
I don't think this would do much good. As far as I understood you, the
files you search are distributed over a network; instead of reading
complete files over the network just to find a few lines, you should send
the search pattern to the computer that hosts the file and do the
processing there (a sketch below).
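A minimal sketch of that idea (HttpListener, the port, and the query
parameter names are all illustrative; any remoting or web-service
mechanism would do just as well):

// Run the search where the files live; only the matching lines ever
// cross the network.
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class SearchHost
{
    static void Main()
    {
        HttpListener listener = new HttpListener();
        listener.Prefixes.Add("http://+:8080/search/");   // illustrative
        listener.Start();
        while (true)
        {
            HttpListenerContext ctx = listener.GetContext();
            string dir = ctx.Request.QueryString["dir"];
            string pattern = ctx.Request.QueryString["pattern"];
            Regex regex = new Regex(pattern, RegexOptions.Compiled);
            using (StreamWriter w = new StreamWriter(ctx.Response.OutputStream))
            {
                foreach (string path in Directory.GetFiles(dir))
                    foreach (string line in File.ReadLines(path))
                        if (regex.IsMatch(line))
                            w.WriteLine(path + ": " + line);
            }
            ctx.Response.Close();
        }
    }
}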
> But again, I guess the bottleneck is the size of the files themselves
> (based on some benchmarks we did with files on various machines and
> locally).
No, you need to find the bottleneck that limits your application's
throughput. You have two factors, network bandwidth and processing speed.
One of the two is the limiting factor, the "bottleneck". Tweaking the
other one will have little or no effect on the throughput. (This doesn't
depend on the file sizes.)
> Does anybody know how Unix grep optimizes the search? I found a grep
> utility in C# which again does the same thing (reading line by line and
> finding a match). If I can find out how Unix grep optimizes the search,
> I can implement that and see how it works.


If you only have fixed search strings, google for "Boyer-Moore". (.NET
regexes use this optimization.)
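For reference, a minimal sketch of the Horspool simplification of
Boyer-Moore (illustrative only, not how .NET implements it internally):

// Boyer-Moore-Horspool: precompute, for each character, how far the
// search window can shift when the window's last character mismatches.
// The scan can then skip up to pattern.Length characters at a time.
using System;

class Horspool
{
    static int IndexOf(string text, string pattern)
    {
        int m = pattern.Length;
        if (m == 0) return 0;

        // Shift table: default shift is the full pattern length.
        int[] shift = new int[char.MaxValue + 1];
        for (int i = 0; i <= char.MaxValue; i++) shift[i] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= text.Length - m)
        {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;                  // full match
            pos += shift[text[pos + m - 1]];        // skip ahead
        }
        return -1;                                  // not found
    }

    static void Main()
    {
        Console.WriteLine(IndexOf("needle in a haystack", "haystack")); // 12
    }
}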

Niki
Nov 16 '05 #7
