Hi,
I'm sorry if these questions are trivial, but I've searched the net and
haven't had any luck finding the information I need.
I need to perform some regular expression search and replace on a large
text file. The patterns I need to match are multi-line, so I can't do it
one line at a time. Instead, I currently read the entire text file into
a string using the code below.
File fin = new File("input.txt");
FileInputStream fis = new FileInputStream(fin);
BufferedReader in = new BufferedReader(new InputStreamReader(fis));
String aLine = null;
String theText = "";
while((aLine = in.readLine()) != null) {
theText = theText + aLine + "\n";
}
The problem with this is that the first couple of thousand lines read in
very fast, but it gets slower and slower, and as we approach line 4000
it gets really slow per line.
Is there a better way to read in an entire text file into a string?
Is storing the entire text file in a string a bad idea? And if so, what
are the alternatives?
Is it possible to perform multiple-line regular expressions on a text
file without loading the whole text file into memory?
Thanks in advance,
Rune
Rune Johansen wrote:
Is there a better way to read in an entire text file into a string?
Absolutely. Each + in the loop copies everything read so far into a new
String, so the work grows with the square of the file size. Instead of
using + to concatenate strings, you should use a StringBuffer and convert
to a String at the end:
File fin = new File("input.txt");
FileInputStream fis = new FileInputStream(fin);
BufferedReader in = new BufferedReader(new InputStreamReader(fis));
String aLine = null;
StringBuffer theText = new StringBuffer((int)fin.length());
while((aLine = in.readLine()) != null)
{
    // Question: Why are you converting all the line breaks
    // to \n?
    theText.append(aLine).append("\n");
}
in.close();
String result = theText.toString();
Ray
--
XML is the programmer's duct tape.
Raymond DeCampo wrote: Absolutely. Instead of using + to concatenate strings, you should use a StringBuffer and convert to a String at the end
Thanks a lot! This indeed solves the problem.
Question: Why are you converting all the line breaks to \n?
How do I preserve the line breaks? Before I added the \n, the whole
string, when written to a file, was in a single line.
Rune
--
3D images and anims, include files, tutorials and more:
rune|vision: http://runevision.com **updated Apr 27**
POV-Ray Ring: http://webring.povray.co.uk
Read characters (or bytes) instead of lines. Reading lines is useless unless
you want to process individual lines.
Silvio Bierman
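A minimal sketch of what reading characters instead of lines could look like. Because the text is copied in chunks exactly as it comes off the reader, the original line breaks survive without any "\n" being appended. The class name `ChunkedFileReader`, the 8192-character chunk size and the UTF-8 encoding are assumptions made for illustration; none of them come from the thread.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ChunkedFileReader {
    // Reads the whole file as text, leaving its line terminators untouched.
    static String readAll(String filename) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] chunk = new char[8192]; // arbitrary chunk size (assumption)
        // UTF-8 is an assumption; substitute the file's real encoding.
        try (Reader in = new InputStreamReader(
                new FileInputStream(filename), StandardCharsets.UTF_8)) {
            int n;
            // read() returns how many chars it delivered, or -1 at end of file
            while ((n = in.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        }
        return text.toString();
    }
}
```

Calling `ChunkedFileReader.readAll("input.txt")` would return the file contents with "\r\n" or "\n" terminators preserved as found.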
Silvio Bierman wrote: Read characters (or bytes) instead of lines. Reading lines is useless unless you want to process individual lines.
Okay, I now read characters instead, using the following method:
public static String readTextFile(String filename) {
    try {
        File fin = new File(filename);
        FileInputStream fis = new FileInputStream(fin);
        BufferedReader in = new BufferedReader(new InputStreamReader(fis));
        char[] chrArr = new char[(int)fin.length()];
        while(in.ready()==false) {}
        in.read(chrArr);
        in.close();
        return new String(chrArr);
    }
    catch (FileNotFoundException e) { return ""; }
    catch (IOException e) { return ""; }
}
Except for the poor exception handling, is there anything obvious that
could be improved here?
Rune
Rune,
You could drop the BufferedReader and read from the InputStreamReader
directly. This will be somewhat faster. In cases where you would be
processing the file as a character stream you should use the BufferedReader
though.
Silvio Bierman
I definitely disagree with that. My understanding is that
FileInputStream, FileOutputStream, FileReader and FileWriter are not
buffered and will go to the file system for every byte/character. So
they should almost always be wrapped with the appropriate buffered stream.
I would, however, ditch the FileInputStream/InputStreamReader combination
in favor of FileReader.
The other potential issue is that Reader.read(char[]) is not guaranteed to
fill the array (although in practice I think it usually does). So the
paranoid way to do this would be in a loop that ensures that all the
desired characters are read.
Finally, I don't think the while loop you have adds any value and will
just eat CPU cycles if it does anything.
And speaking of finally, you should use a finally clause to close your
streams.
Ray
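The "paranoid" loop and the finally clause described above could be sketched roughly as follows. The class name `FullReadExample` and the helper `readFully` are hypothetical names chosen for this sketch. It also assumes the platform default encoding (which is what FileReader uses) and a file small enough for its length to fit in an int.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

class FullReadExample {
    // Keeps calling read() until the array is full or the stream ends.
    // Returns the number of chars actually stored in buf.
    static int readFully(Reader in, char[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break; // end of stream before the array was filled
            }
            total += n;
        }
        return total;
    }

    static String readTextFile(String filename) throws IOException {
        File fin = new File(filename);
        // File.length() counts bytes; for common encodings that is an
        // upper bound on the number of chars, so the array may not fill.
        char[] chars = new char[(int) fin.length()];
        Reader in = new BufferedReader(new FileReader(fin));
        try {
            int count = readFully(in, chars);
            return new String(chars, 0, count);
        } finally {
            in.close(); // runs whether or not the read succeeded
        }
    }
}
```

Building the String from only the first `count` chars avoids trailing zero characters when the file holds multi-byte characters and the array is therefore larger than the decoded text.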
Raymond,
The non-buffered streams and readers do not go to the filesystem for every
byte/character but for every read action instead. If you plan to read, say, 1M
bytes in a single read, a plain stream will do a single filesystem-level read
where a buffered reader will read multiple times its buffer size until the
1M bytes are read. If the raw variants were so dumb it would be impossible
for the buffered ones to use them efficiently.
As I said, if you intend to read single (or very small counts of)
bytes/characters frequently, the buffered variants will group the reads and
therefore give better performance.
It is a common misconception that you should always use buffered
streams/readers. If this were true the plain ones would probably have been
left out of the API.
Regards,
Silvio Bierman
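One way to see the grouping effect for small reads is to count how many read calls reach the underlying reader. The sketch below is only an illustration: `ReadCallCounter` is a hypothetical wrapper written for this purpose, and an in-memory StringReader stands in for a file, so it shows call counts rather than real disk access.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

class ReadCallCounter extends FilterReader {
    int calls = 0; // number of read calls that reached this reader

    ReadCallCounter(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        calls++;
        return super.read(buf, off, len);
    }

    // Reads 'size' chars one at a time, optionally through a BufferedReader,
    // and reports how many calls hit the underlying (counted) reader.
    static int countCalls(int size, boolean buffered) throws IOException {
        char[] data = new char[size];
        Arrays.fill(data, 'x');
        ReadCallCounter source = new ReadCallCounter(new StringReader(new String(data)));
        Reader in = buffered ? new BufferedReader(source) : source;
        while (in.read() != -1) {
            // consume one char per call
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + countCalls(10000, false));
        System.out.println("buffered calls:   " + countCalls(10000, true));
    }
}
```

With single-character reads, the unbuffered case makes one underlying call per character, while the buffered case makes only a handful. For one large read into a big array, as in the readTextFile method above, that advantage disappears.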
Hmm, that makes sense. Thanks for the clarification.
Ray
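Coming back to the original goal: once the whole file is in a single String, a multi-line search and replace could look something like the sketch below. The BEGIN/END pattern is purely hypothetical, since the thread never shows the real patterns, and `MultiLineReplace` is a made-up class name. The DOTALL flag lets "." match line terminators, which is what allows one match to span several lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiLineReplace {
    // Hypothetical pattern: everything from BEGIN to the nearest END,
    // across line breaks. ".*?" is reluctant so each block matches separately.
    static final Pattern BLOCK = Pattern.compile("BEGIN.*?END", Pattern.DOTALL);

    static String replaceBlocks(String text, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal
        return BLOCK.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }
}
```

For example, `MultiLineReplace.replaceBlocks(theText, "[removed]")` would replace each BEGIN...END block, however many lines it covers, with the literal text "[removed]".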