Regex on whole (large) text file

Hi,

I'm sorry if these questions are trivial, but I've searched the net and
haven't had any luck finding the information I need.

I need to perform some regular expression search and replace on a large
text file. The patterns I need to match are multi-line, so I can't do it
one line at a time. Instead I currently read the entire text file into
a string using the code below.

File fin = new File("input.txt");
FileInputStream fis = new FileInputStream(fin);
BufferedReader in = new BufferedReader(new InputStreamReader(fis));
String aLine = null;
String theText = "";
while ((aLine = in.readLine()) != null) {
    theText = theText + aLine + "\n";
}

The problem with this is that the first couple of thousand lines read in
very fast, but it gets slower and slower, and as we approach line 4000
it gets really slow per line.

Is there a better way to read in an entire text file into a string?

Is storing the entire text file in a string a bad idea? And if so, what
are the alternatives?

Is it possible to perform multiple-line regular expressions on a text
file without loading the whole text file into memory?

Thanks in advance,
Rune
Jul 17 '05 #1
Rune Johansen wrote:
Is there a better way to read in an entire text file into a string?


Absolutely. Instead of using + to concatenate strings, you should use a
StringBuffer and convert to a String at the end:

File fin = new File("input.txt");
FileInputStream fis = new FileInputStream(fin);
BufferedReader in = new BufferedReader(new InputStreamReader(fis));
String aLine = null;
StringBuffer theText = new StringBuffer((int) fin.length());
while ((aLine = in.readLine()) != null)
{
    // Question: Why are you converting all the line breaks
    // to \n?
    theText.append(aLine).append("\n");
}
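
Once the text is in one buffer, the multi-line search and replace itself could be sketched roughly as below. This is only an illustration under assumptions: the class name WholeTextRegex, the method stripBlockComments and the block-comment pattern are made up for the example and are not the patterns from the original question. It relies on Pattern.DOTALL so that "." also matches line terminators, and on the fact that a Matcher accepts any CharSequence, so a StringBuffer can be passed in directly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeTextRegex {
    // Hypothetical pattern: a C-style block comment that may span several lines.
    // DOTALL lets '.' match line terminators; '*?' keeps each match as short as possible.
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);

    // Matcher works on any CharSequence, so a StringBuffer needs no conversion first.
    static String stripBlockComments(CharSequence text) {
        Matcher m = BLOCK_COMMENT.matcher(text);
        return m.replaceAll("");
    }

    public static void main(String[] args) {
        StringBuffer theText = new StringBuffer();
        theText.append("a /* first\n");
        theText.append("second */ b\n");
        // The comment spanning both lines is removed in one replacement.
        System.out.print(stripBlockComments(theText));
    }
}
```

Without DOTALL the same pattern would only match comments that start and end on one line, which is the usual reason a multi-line pattern appears not to work.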

Ray

--
XML is the programmer's duct tape.
Jul 17 '05 #2
Raymond DeCampo wrote:
Absolutely. Instead of using + to concatenate strings, you should
use a StringBuffer and convert to a String at the end

Thanks a lot! This indeed solves the problem.

Raymond DeCampo wrote:
Question: Why are you converting all the line breaks to \n?

How do I preserve the line breaks? Before I added the \n, the whole
string, when written to a file, was in a single line.

Rune
--
3D images and anims, include files, tutorials and more:
rune|vision: http://runevision.com **updated Apr 27**
POV-Ray Ring: http://webring.povray.co.uk
Jul 17 '05 #3

Read characters (or bytes) instead of lines. Reading lines is useless unless
you want to process individual lines.

Silvio Bierman
Jul 17 '05 #4
Silvio Bierman wrote:
Read characters (or bytes) instead of lines. Reading lines
is useless unless you want to process individual lines.


Okay, I now read characters instead, using the following method:

public static String readTextFile(String filename) {
    try {
        File fin = new File(filename);
        FileInputStream fis = new FileInputStream(fin);
        BufferedReader in = new BufferedReader(new InputStreamReader(fis));
        char[] chrArr = new char[(int) fin.length()];
        while (in.ready() == false) {}
        in.read(chrArr);
        in.close();
        return new String(chrArr);
    }
    catch (FileNotFoundException e) { return ""; }
    catch (IOException e) { return ""; }
}

Except for the poor exception handling, is there anything obvious that
could be improved here?

Rune
Jul 17 '05 #5

Rune,

You could drop the BufferedReader and read from the InputStreamReader
directly. This will be somewhat faster. In cases where you would be
processing the file as a character stream you should use the BufferedReader
though.

Silvio Bierman
Jul 17 '05 #6
Silvio Bierman wrote:
You could drop the BufferedReader and read from the InputStreamReader
directly. This will be somewhat faster. In cases where you would be
processing the file as a character stream you should use the BufferedReader
though.

Silvio Bierman


I definitely disagree with that. My understanding is that
FileInputStream, FileOutputStream, FileReader and FileWriter are not
buffered and will go to the file system for every byte/character. So
they should almost always be wrapped with the appropriate buffered stream.

I would, however, ditch the FileInputStream/InputStreamReader combination
in favor of FileReader.

The other potential issue is that in.read(chrArr) is not guaranteed to
fill the array (although in practice I think it usually does). So the
paranoid way to do this would be in a loop that ensures that all the
desired characters are read.

Finally, I don't think the while loop you have adds any value and will
just eat CPU cycles if it does anything.

And speaking of finally, you should use a finally clause to close your
streams.
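
The read loop and the finally clause described above could look something like this. It is a sketch, not a definitive version: the names ReadFully, readAll and sizeHint are invented for the example. One extra assumption worth flagging: File.length() counts bytes, not characters, so it is used only as an initial size hint, and the String is built from the number of characters actually read rather than from the whole array.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

class ReadFully {
    // Keeps calling read() until end-of-stream (-1), because a single
    // read() may return fewer characters than were asked for.
    static String readAll(Reader in, int sizeHint) throws IOException {
        char[] buf = new char[Math.max(sizeHint, 1)];
        int total = 0;
        int n;
        while ((n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
            if (total == buf.length) {
                // Buffer is full but the stream has not ended yet: grow it.
                buf = Arrays.copyOf(buf, buf.length * 2);
            }
        }
        // Only the characters actually read end up in the String.
        return new String(buf, 0, total);
    }

    static String readTextFile(String filename) throws IOException {
        File file = new File(filename);
        Reader in = new FileReader(file); // platform default charset, as in the thread
        try {
            return readAll(in, (int) file.length());
        } finally {
            in.close(); // runs whether or not the read succeeded
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.print(readTextFile(args[0]));
        }
    }
}
```

Because the characters are copied as they come, the file's own line breaks are preserved, and no busy-wait on ready() is needed since read() blocks until data or end-of-stream is available.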

Ray

--
XML is the programmer's duct tape.
Jul 17 '05 #7

Raymond,

The non-buffered streams and readers do not go to the filesystem for every
byte/character but for every read action instead. If you plan to read, say, 1M
bytes in a single read, a plain stream will do a single filesystem-level read,
whereas a buffered reader will read in units of its buffer size, several times
over, until the 1M bytes are read. If the raw variants were so dumb, it would
be impossible for the buffered ones to use them efficiently.

As I said, if you intend to read single (or very small counts of)
bytes/characters frequently, the buffered variants will group the reads and
therefore give better performance.

It is a common misconception that you should always use buffered
streams/readers. If this were true, the plain ones would probably have been
left out of the API.
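
The grouping of small reads can be made visible with a counting wrapper, roughly as sketched below. The class ReadCallCounter and the method underlyingCalls are hypothetical names for this example, and an in-memory StringReader stands in for a file, so the sketch counts read requests only and says nothing about real disk timings.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadCallCounter extends FilterReader {
    int calls = 0; // number of read requests that reached the wrapped reader

    ReadCallCounter(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        calls++;
        return super.read(buf, off, len);
    }

    // Consumes the text one character at a time and reports how many
    // read requests reached the underlying reader.
    static int underlyingCalls(String text, boolean buffered) throws IOException {
        ReadCallCounter counter = new ReadCallCounter(new StringReader(text));
        Reader in = buffered ? new BufferedReader(counter) : counter;
        while (in.read() != -1) {
            // discard the character; only the call count matters here
        }
        return counter.calls;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append('x');
        }
        System.out.println("plain:    " + underlyingCalls(sb.toString(), false));
        System.out.println("buffered: " + underlyingCalls(sb.toString(), true));
    }
}
```

With single-character reads the plain reader is hit about once per character, while the buffered one is hit only a handful of times. For one bulk read into a large array the wrapper adds nothing, which is the case discussed in this thread.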

Regards,

Silvio Bierman
Jul 17 '05 #8

Hmm, that makes sense. Thanks for the clarification.

Ray

--
XML is the programmer's duct tape.
Jul 17 '05 #9
