By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
445,645 Members | 1,065 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 445,645 IT Pros & Developers. It's quick & easy.

Modifying the contents of a file

P: n/a
I would like to modify the contents of a file, replacing all occurances of
one string with another. I wrote these functions:

bool read_file(std::string name, std::string &s);
bool write_file(std::string name, const std::string &s);
void find_replace(std::string &s, std::string first, std::string second);

bool find_replace_file(std::string name, std::string first, std::string
second)
{
std::string s;
if (!read_file(name, s))
return false;
find_replace(s, first, second); // replace first with second in s
return write_file(name, s);
}

Have I got the right idea? If not, what is wrong? Thanks.
Jul 23 '05 #1
Share this Question
Share on Google+
33 Replies


P: n/a
"Jason Heyes" <ja********@optusnet.com.au> wrote in message
news:42***********************@news.optusnet.com.a u :
I would like to modify the contents of a file, replacing all occurances of
one string with another. I wrote these functions:

bool read_file(std::string name, std::string &s);
bool write_file(std::string name, const std::string &s);
void find_replace(std::string &s, std::string first, std::string second);

bool find_replace_file(std::string name, std::string first, std::string
second)
{
std::string s;
if (!read_file(name, s))
return false;
find_replace(s, first, second); // replace first with second in s
return write_file(name, s);
}

Have I got the right idea? If not, what is wrong? Thanks.


If there's more than one line of text in the file you're
reading, I don't see how that would work. I'd think you'd
need to suck the contents of the "raw" file into a container,
such as a vector or list of strings; operate on the strings;
then write the results, either to a new file or over-writing
the old one.

Of course, you can open a file for read-write, and operate
on it one line at a time; but I prefer not to leave files
open that long. What happens if there's a problem? You bump
the reset button, or Windows crashes? You could loose data.
So I tend to suck all the data into RAM, close the file,
operate on the data, open the output file, write, close.

I've written programs that do what you describe, and my basic
modus operandi has been something like this (mixed C++ and
pseudocode):

std::ifstream IFS (InputFile);
std::string Buffer;
std::list<std::string> Text;
while (IFS)
{
getline(IFS, Buffer);
if(IFS.eof()) break;
Text.push_back(Buffer);
}
IFS.close();
typedef std::list<std::string>::const_iterator LSCI;
for (LSCI i = Text.begin(); i != Text.end(); ++i)
{
// iterate through Text one line at a time, replacing
// stuff at will. (*i) will always refer to the
// "current" line of text.
}

// Now write the processed text to an output file:
std::ofstream OFS (OutputFile);
for (LSCI i = Text.begin(); OFS && i != Text.end(); ++i)
{
OFS << (*i) << endl;
}
OFS.close();

Something like that should work for you, I think.
--
Cheers,
Robbie Hatley
Tustin, CA, USA
email: lonewolfintj at pacbell dot net
web: home dot pacbell dot net slant earnur slant

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
Jul 23 '05 #2

P: n/a
Actually, there's one glaring error in my last post:
When iterating through a list of strings, altering them,
one can NOT use a const_iterator; a regular iterator
must be used instead. (The const_iterator is fine for
writing strings to file, because the list is not being
altered.) So I should have written:

std::ifstream IFS (InputFile);
std::string Buffer;
std::list<std::string> Text;
while (IFS)
{
getline(IFS, Buffer);
if(IFS.eof()) break;
Text.push_back(Buffer);
}
IFS.close();

// Note NON-const iterator so we can WRITE to list:
typedef std::list<std::string>::iterator LSI;
for (LSI i = Text.begin(); i != Text.end(); ++i)
{
// iterate through Text one line at a time, replacing
// stuff at will. (*i) will always refer to the
// "current" line of text.
}

// Now write the processed text to an output file:
std::ofstream OFS (OutputFile);
typedef std::list<std::string>::const_iterator LSCI;
for (LSCI i = Text.begin(); OFS && i != Text.end(); ++i)
{
OFS << (*i) << endl;
}
OFS.close();

--
Cheers,
Robbie Hatley
Tustin, CA, USA
email: lonewolfintj at pacbell dot net
web: home dot pacbell dot net slant earnur slant

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
Jul 23 '05 #3

P: n/a
Robbie Hatley wrote:
"Jason Heyes" <ja********@optusnet.com.au> wrote in message
news:42***********************@news.optusnet.com.a u :
I would like to modify the contents of a file, replacing all occurances of
one string with another. I wrote these functions:

bool read_file(std::string name, std::string &s);
bool write_file(std::string name, const std::string &s);
void find_replace(std::string &s, std::string first, std::string second);

bool find_replace_file(std::string name, std::string first, std::string
second)
{
std::string s;
if (!read_file(name, s))
return false;
find_replace(s, first, second); // replace first with second in s
return write_file(name, s);
}

Have I got the right idea? If not, what is wrong? Thanks.

Yes, I think you do have the right idea.
If there's more than one line of text in the file you're
reading, I don't see how that would work. I'd think you'd
need to suck the contents of the "raw" file into a container,
such as a vector or list of strings; operate on the strings;
then write the results, either to a new file or over-writing
the old one.


I don't think there's anything wrong with reading the entire file into a
single string. If your processing needed to treat the file as a
sequence of lines then a container<string> might be appropriate, but in
this case I think it is an unnecessary complication.

--Phil.

Jul 23 '05 #4

P: n/a
"Robbie Hatley" <lonewolfintj at pacbell dot net> wrote in message
news:42********@spool9-west.superfeed.net...

If there's more than one line of text in the file you're
reading, I don't see how that would work. I'd think you'd
need to suck the contents of the "raw" file into a container,
such as a vector or list of strings; operate on the strings;
then write the results, either to a new file or over-writing
the old one.


I read the whole file into a single string instead of a sequence. Here is
the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}

Jul 23 '05 #5

P: n/a
Jason Heyes wrote:
I read the whole file into a single string instead of a sequence. Here is
the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
Why do you duplicate the file contents so often?
return true;
Why no error checking?
}


Jul 23 '05 #6

P: n/a
"Panjandrum" <pa********@spambob.com> wrote in message
news:11**********************@g43g2000cwa.googlegr oups.com...
Jason Heyes wrote:
I read the whole file into a single string instead of a sequence. Here is
the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();


Why do you duplicate the file contents so often?
return true;


Why no error checking?
}


bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

Jul 23 '05 #7

P: n/a
Jason Heyes wrote:
bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}


Much better! Just one unnecessary copy left (file -> stringstream ->
string).

Jul 23 '05 #8

P: n/a
Panjandrum wrote:
Jason Heyes wrote:
bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

Much better! Just one unnecessary copy left (file -> stringstream ->
string).


This would be my preferred solution:

#include <iterator>

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}
Jul 23 '05 #9

P: n/a
Alan Johnson wrote:
This would be my preferred solution:

#include <iterator>

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}


Another interesting solution! But you omit error handling and give
false respone in case of error ('return true'). I also prefer Jason
Heyes' swap-after-success idiom and I'd try to reserve enough space in
the string to avoid unnecessary reallocations.

Jul 23 '05 #10

P: n/a
I'm curious about the amount of copying that is really going on in the
read-from-file-to-string functions that are being posted here. My hope
is that the libraries have a smart string implementation that allows us
to write "naive" code for problems like this while still getting decent
performance. So I'm testing them.

Here's the test platform: /usr/share/dict/words is just under 1
megabyte on my old Celeron-500 Linux PC. I have plenty of RAM though,
so these tests will not be testing the I/O bandwidth after the first
run. Compiler is gcc 3.3.6, optimisation -O9. My main program does a
superficial check that the string is good, and in the process proves
that nothing has been entirely optimised out.

None of these programs should do better than dd to /dev/null, which is a
rather low-level copy program being asked to throw away its output:

$ time dd if=/usr/share/dict/words of=/dev/null bs=8192
110+1 records in
110+1 records out
906950 bytes transferred in 0.080980 seconds (11199684 bytes/sec)

real 0m0.088s
user 0m0.004s
sys 0m0.010s

As you can see the time taken is not really measurable.

OK, now for the first function, from Jason Heyes:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.164s
user 0m0.039s
sys 0m0.053s

Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.

Next up this, using assign and stream iterators, from Alan Johnson:

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}

$ time ./tst /usr/share/dict/words

real 0m2.320s
user 0m1.174s
sys 0m1.048s

Wow! A quick check with some larger files suggests this is > O(n).

OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);
} while(read_this_time>0 && in.good());
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.140s
user 0m0.025s
sys 0m0.034s
Any other suggestions? How does this compare with other compilers /
platforms? Does anyone have any other functions they'd like to compare?

--Phil.

Jul 23 '05 #11

P: n/a
"Phil Endecott" <ph*******@chezphil.org> wrote in message
news:Ri*****************@newsfe3-gui.ntli.net...
I'm curious about the amount of copying that is really going on in the
read-from-file-to-string functions that are being posted here. My hope is
that the libraries have a smart string implementation that allows us to
write "naive" code for problems like this while still getting decent
performance. So I'm testing them.

Here's the test platform: /usr/share/dict/words is just under 1 megabyte
on my old Celeron-500 Linux PC. I have plenty of RAM though, so these
tests will not be testing the I/O bandwidth after the first run. Compiler
is gcc 3.3.6, optimisation -O9. My main program does a superficial check
that the string is good, and in the process proves that nothing has been
entirely optimised out.

None of these programs should do better than dd to /dev/null, which is a
rather low-level copy program being asked to throw away its output:

$ time dd if=/usr/share/dict/words of=/dev/null bs=8192
110+1 records in
110+1 records out
906950 bytes transferred in 0.080980 seconds (11199684 bytes/sec)

real 0m0.088s
user 0m0.004s
sys 0m0.010s

As you can see the time taken is not really measurable.

OK, now for the first function, from Jason Heyes:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.164s
user 0m0.039s
sys 0m0.053s

Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.

Next up this, using assign and stream iterators, from Alan Johnson:

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

s.assign(std::istream_iterator<char>(in),
std::istream_iterator<char>()) ;

return true ;
}

$ time ./tst /usr/share/dict/words

real 0m2.320s
user 0m1.174s
sys 0m1.048s

Wow! A quick check with some larger files suggests this is > O(n).

OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);
} while(read_this_time>0 && in.good());
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.140s
user 0m0.025s
sys 0m0.034s
Any other suggestions? How does this compare with other compilers /
platforms? Does anyone have any other functions they'd like to compare?


Very interesting results. I was surprised at how Alan Johnson's function
performed. I would like to see this next function compared. It is naively
implemented:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

s = "";
char c;
while (in.get(c))
s += c;
return true;
}

I predict a time similar to Alan Johnson's.

Your block-reading function has similar performance to my original function.
Why is that? Are both functions doing the same thing?
Jul 23 '05 #12

P: n/a
Phil Endecott wrote:
Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.
std::string tmp (ss.str());
s.swap (tmp);

....
OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{ // read file_size with stat
std::string tmp;
tmp.reserve (file_size);
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;
const int block_size=BUFSIZ; // the 'right' size
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);
tmp.append (...)
} while(read_this_time>0 && in.good());


if (in.eof() {
tmp.swap (s);
return true;
} else {
return false;
}

Jul 23 '05 #13

P: n/a
"Panjandrum" <pa********@spambob.com> wrote in message
news:11**********************@g14g2000cwa.googlegr oups.com...
Phil Endecott wrote:
Now the next one, also from Jason, using a swap:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
if (!(ss << in.rdbuf()))
return false;

s.swap(ss.str());
return true;
}

That doesn't compile. Swap needs a string&, but ss.str() is not an
lvalue. Oh dear.


std::string tmp (ss.str());
s.swap (tmp);

...
OK, here's my entry. This is modified from something in my mail server,
Decimail. I don't remember why I wrote it like this, but it obviously
made sense at the time.

bool read_file(const std::string& name, std::string& s)
{

// read file_size with stat
std::string tmp;
tmp.reserve (file_size);
std::ifstream in(name.c_str());
if (!in.is_open()) return false;
unsigned int read_this_time;
do {
const int block_size=1024;


const int block_size=BUFSIZ; // the 'right' size
char buf[block_size];
read_this_time=in.readsome(buf,block_size);
s.append(buf,read_this_time);


tmp.append (...)
} while(read_this_time>0 && in.good());


if (in.eof() {
tmp.swap (s);
return true;
} else {
return false;
}


It is not necessary to use a temporary string. The written function is
already correct.
Jul 23 '05 #14

P: n/a
Jason Heyes wrote:
"Panjandrum" <pa********@spambob.com> wrote in message

....
if (in.eof() {
tmp.swap (s);
return true;
} else {
return false;
}


It is not necessary to use a temporary string. The written function is
already correct.


Only in case of no error. Otherwise you destroy the user's string and
give back truncated contents.

Jul 23 '05 #15

P: n/a
"Panjandrum" <pa********@spambob.com> wrote in message
news:11**********************@g43g2000cwa.googlegr oups.com...
Jason Heyes wrote:
"Panjandrum" <pa********@spambob.com> wrote in message

...
> if (in.eof() {
> tmp.swap (s);
> return true;
> } else {
> return false;
> }
>


It is not necessary to use a temporary string. The written function is
already correct.


Only in case of no error. Otherwise you destroy the user's string and
give back truncated contents.


No. The loop only terminates once eof is reached. So it is guarenteed that
in.eof() is true in your if-statement. This is why you don't need to use a
temporary string.
Jul 23 '05 #16

P: n/a
Jason Heyes wrote:
bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

s = "";
char c;
while (in.get(c))
s += c;
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.670s
user 0m0.555s
sys 0m0.030s

I've tried adding s.reserve() at the start and it doesn't seem to make
much difference.

The problem with Alan Johnson's function is that it is O(n^2) in the
file size: if I run it on a double-length input it takes four times as
long. I don't know why this is, because as the function above (which is
O(n)) shows, just appending to a string is not the problem. Any
thoughts? Do other compilers / libraries do better?
Panjandrum, the changes you suggest seem mostly to do with not changing
the initial value of s when an error has occured. I think that in the
cited application it is fine for s to be undefined in this case:

bool find_replace_file(std::string name, std::string first, std::string
second)
{
std::string s;
if (!read_file(name, s))
return false;
find_replace(s, first, second); // replace first with second in s
return write_file(name, s);
}

The question of whether peak memory use is reduced by using swap()
rather than assignment is an interesting one, but I do not have a good
way of measuring peak memory use. Does anyone have any suggestions?

--Phil.

Jul 23 '05 #17

P: n/a
Jason Heyes wrote:
No. The loop only terminates once eof is reached.
Nope. Actually the temination condition is probably wrong anyway (see
below). But who really knows iostream error handling.
So it is guarenteed that
in.eof() is true in your if-statement. This is why you don't need to use a
temporary string.


Nothing is 'guarenteed'. It's probably:
while (read_this_time>0 && in); // not in.good()
if (in.eof()) ... // success
else ... // error

Jul 23 '05 #18

P: n/a
"Phil Endecott" <ph*******@chezphil.org> wrote in message
news:uM*******************@newsfe5-gui.ntli.net...
Jason Heyes wrote:
bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

s = "";
char c;
while (in.get(c))
s += c;
return true;
}

$ time ./tst /usr/share/dict/words

real 0m0.670s
user 0m0.555s
sys 0m0.030s

I've tried adding s.reserve() at the start and it doesn't seem to make
much difference.

The problem with Alan Johnson's function is that it is O(n^2) in the file
size: if I run it on a double-length input it takes four times as long. I
don't know why this is, because as the function above (which is O(n))
shows, just appending to a string is not the problem. Any thoughts? Do
other compilers / libraries do better?


I know why. The assign function used by Alan Johnson does two things, which
are:

1. It computes the number of elements to copy by taking the distance between
two iterators. Since the iterators are not random access iterators, the
difference is calculated by iterating between them. To do that it must read
the entire file.

2. It copies the elements in a for-loop, reading the entire file a second
time.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.

The question of whether peak memory use is reduced by using swap() rather
than assignment is an interesting one, but I do not have a good way of
measuring peak memory use. Does anyone have any suggestions?


What is peak memory use?
Jul 23 '05 #19

P: n/a
Phil Endecott wrote:
Jason Heyes wrote:

The problem with Alan Johnson's function is that it is O(n^2) in the
file size: if I run it on a double-length input it takes four times as
long. I don't know why this is, because as the function above (which is
O(n)) shows, just appending to a string is not the problem. Any
thoughts? Do other compilers / libraries do better?


I'm not able to reproduce a quadratic running time. Here are the
results running on my machine (gcc 3.4.4 in Cygwin) on 1MB, 10MB, and
100MB files (respectively).

$ time ./a.exe testfile1

real 0m1.149s
user 0m1.171s
sys 0m0.015s

$ time ./a.exe testfile2

real 0m11.340s
user 0m11.327s
sys 0m0.046s

$ time ./a.exe testfile3

real 1m53.727s
user 1m52.905s
sys 0m0.796s
This is almost exactly linear growth in running time. You might try the
following to see if it also has quadratic time on your system.

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

std::istream_iterator<char> begin(in) ;
std::istream_iterator<char> end ;

while (begin != end) ++begin ;

return true ;
}

If not, then I would suspect that either your library's string class
poorly handles the "assign" method, or perhaps you are running into some
sort of memory swapping issues on larger files (though it seems you
would have seen the same problems with any of the other functions you
tested).

-Alan
Jul 23 '05 #20

P: n/a
>>The problem with Alan Johnson's function is that it is O(n^2) in the file
size: if I run it on a double-length input it takes four times as long.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.


No, that's still linear, not quadratic.
What is peak memory use?


The maximum amount of memory that the program uses.

--Phil.
Jul 23 '05 #21

P: n/a
>> The problem with Alan Johnson's function is that it is O(n^2) in the
file size
I'm not able to reproduce a quadratic running time.

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

std::istream_iterator<char> begin(in) ;
std::istream_iterator<char> end ;

while (begin != end) ++begin ;

return true ;
}


That's linear.

If I add to the string inside the loop:

while (begin != end) {
++begin;
s += (*begin);
}

That's also linear.

If I replace the loop with a call to append:

s.append(begin,end);

That's quadratic.

But... if I use gcc 3.4 (rather than 3.3) it suddenly becomes linear!

Mystery solved. OK, time to upgrade to 3.4 all round and see how many
of my programs get a sudden speed increase.

Here are the results for my original test code using gcc 3.4:

Jason's original function:
real 0m0.153s
user 0m0.024s
sys 0m0.051s

Alan's original function:
real 0m0.366s
user 0m0.243s
sys 0m0.047s

My one:
real 0m0.143s
user 0m0.021s
sys 0m0.039s
--Phil.
Jul 23 '05 #22

P: n/a
"Phil Endecott" <ph*******@chezphil.org> wrote in message
news:pE*******************@newsfe5-gui.ntli.net...
The problem with Alan Johnson's function is that it is O(n^2) in the file
size: if I run it on a double-length input it takes four times as long.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.


No, that's still linear, not quadratic.


Yes. Its O(2n) performance, not O(n^2). So thats linear.
Jul 23 '05 #23

P: n/a
"Alan Johnson" <al**@undisclosed.com> wrote in message
news:d8**********@news.Stanford.EDU...
Phil Endecott wrote:
Jason Heyes wrote:

The problem with Alan Johnson's function is that it is O(n^2) in the file
size: if I run it on a double-length input it takes four times as long.
I don't know why this is, because as the function above (which is O(n))
shows, just appending to a string is not the problem. Any thoughts? Do
other compilers / libraries do better?


I'm not able to reproduce a quadratic running time. Here are the results
running on my machine (gcc 3.4.4 in Cygwin) on 1MB, 10MB, and 100MB files
(respectively).

$ time ./a.exe testfile1

real 0m1.149s
user 0m1.171s
sys 0m0.015s

$ time ./a.exe testfile2

real 0m11.340s
user 0m11.327s
sys 0m0.046s

$ time ./a.exe testfile3

real 1m53.727s
user 1m52.905s
sys 0m0.796s
This is almost exactly linear growth in running time. You might try the
following to see if it also has quadratic time on your system.

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;

std::istream_iterator<char> begin(in) ;
std::istream_iterator<char> end ;

while (begin != end) ++begin ;

return true ;
}

If not, then I would suspect that either your library's string class
poorly handles the "assign" method, or perhaps you are running into some
sort of memory swapping issues on larger files (though it seems you would
have seen the same problems with any of the other functions you tested).


Assign performs O(2n) when the iterators are not random-access iterators
(e.g., std::istream_iterator).
Jul 23 '05 #24

P: n/a
"Phil Endecott" <ph*******@chezphil.org> wrote in message
news:yU******************@newsfe5-gui.ntli.net...
The problem with Alan Johnson's function is that it is O(n^2) in the
file size


I'm not able to reproduce a quadratic running time.

bool read_file(const std::string &name, std::string &s)
{
std::ifstream in(name.c_str()) ;
if (!in.is_open())
return false ;
in.unsetf(std::ios_base::skipws) ;
std::istream_iterator<char> begin(in) ;
std::istream_iterator<char> end ;

while (begin != end) ++begin ;

return true ;
}


That's linear.

If I add to the string inside the loop:

while (begin != end) {
++begin;
s += (*begin);
}

That's also linear.

If I replace the loop with a call to append:

s.append(begin,end);

That's quadratic.

But... if I use gcc 3.4 (rather than 3.3) it suddenly becomes linear!

Mystery solved. OK, time to upgrade to 3.4 all round and see how many of
my programs get a sudden speed increase.

Here are the results for my original test code using gcc 3.4:

Jason's original function:
real 0m0.153s
user 0m0.024s
sys 0m0.051s

Alan's original function:
real 0m0.366s
user 0m0.243s
sys 0m0.047s

My one:
real 0m0.143s
user 0m0.021s
sys 0m0.039s


Better. I'm still curious how you get quadratic time using gcc 3.3. Perhaps
the for-loop calls distance in it's condition:

template <typename InIt>
void assign(InIt first, InIt second)
{
replace(0, size(), distance(first, second), 0);
for (iterator it = begin(); distance(first, second) > 0; it++, first++)
*it = *first;
}
Jul 23 '05 #25

P: n/a
"Panjandrum" <pa********@spambob.com> wrote in message
news:11**********************@g14g2000cwa.googlegr oups.com...
Jason Heyes wrote:
No. The loop only terminates once eof is reached.


Nope. Actually the temination condition is probably wrong anyway (see
below). But who really knows iostream error handling.
So it is guarenteed that
in.eof() is true in your if-statement. This is why you don't need to use
a
temporary string.


Nothing is 'guarenteed'. It's probably:
while (read_this_time>0 && in); // not in.good()
if (in.eof()) ... // success
else ... // error


If this was your usual I/O that calls formatted input functions, you would
need to test eof. But here we call low-level, unformatted input functions
and the only danger is a loss of integrity in the stream buffer (i.e.,
in.bad() evaluates true). How often does that happen? Hardly seems worth the
effort and run-time required of using a temporary string. What do you think?
Jul 23 '05 #26

P: n/a
Jason Heyes wrote:
[snip]
Assign performs O(2n) when the iterators are not random-access iterators
(e.g., std::istream_iterator).


First, O(2n) is a fairly nonsensical expression, as O(2n) defines
precisely the same set of functions as O(n).

Second, if by O(2n) you mean to imply that the entire file is read twice
(or more precisely, that assign iterates over the range twice), then I
would ask to see where in the standard that behavior is defined. In
fact, tracing through my compiler's library, assign is implemented by
calling replace, which is implemented by constructing a new string using
the InputIterator range, which is implemented by called a function
called _S_construct that iterates over the range only once, allocating
more space as it is needed.

-Alan
Jul 23 '05 #27

P: n/a
"Jason Heyes" <ja********@optusnet.com.au> wrote in message
news:42***********************@news.optusnet.com.a u :
"Robbie Hatley" <lonewolfintj at pacbell dot net> wrote in message
news:42********@spool9-west.superfeed.net...

If there's more than one line of text in the file you're
reading, I don't see how that would work. I'd think you'd
need to suck the contents of the "raw" file into a container,
such as a vector or list of strings; operate on the strings;
then write the results, either to a new file or over-writing
the old one.

I read the whole file into a single string instead of a sequence.


Ok, if that's more convenient for the kind of processing you're
trying to do.

I'm often interesting in processing files with many short lines,
such as killfiles, dictionary files, name-list files, etc.
I'm often interesting in inserting, erasing, altering, sorting,
or de-duping lines of text, so I use std::list<std::string> a lot.
It makes that kind of processing a breeze.

But if you want to suck the whole file into a single std:string,
I can see how that could have its own advantages. (Such as
using the vast assortment of methods -- substr, find, etc. --
that come with std::string.)
Here is the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}


Yikes. What is "rdbuf"? Never heard of that.
Lemme look it up in Bjarne's book...
Hmmm... he doesn't say much about it. In fact, his only
explanation is the cryptic two-word comment, "get buffer".
So what is this "rdbuf" doing?

And how does this function handle newlines? Does it
insert them into the string?

Curious,
Robbie Hatley
Tustin, CA, USA
lo**********@pacbell.net
http://home.pacbell.net/earnur/

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----
Jul 23 '05 #28

P: n/a
Jason Heyes wrote:

I know why. The assign function used by Alan Johnson does two things, which
are:

1. It computes the number of elements to copy by taking the distance between
two iterators. Since the iterators are not random access iterators, the
difference is calculated by iterating between them. To do that it must read
the entire file.

2. It copies the elements in a for-loop, reading the entire file a second
time.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.


From the C++ Standard, Clause 24.1.1.3: "... Algorithms on input
iterators should never attempt to pass through the same iterator twice.
They should be single pass algorithms. ..."

Therefore, in a standard compliant implementation, assign is a single
pass algorithm.

-Alan
Jul 23 '05 #29

P: n/a
Jason Heyes wrote:
If this was your usual I/O that calls formatted input functions, you would
need to test eof.
You need to test for an error (if you need to test for an error).
But here we call low-level, unformatted input functions
and the only danger is a loss of integrity in the stream buffer (i.e.,
in.bad() evaluates true).
You may just lose your user's data.
How often does that happen? Hardly seems worth the
effort and run-time required of using a temporary string. What do you think?


You mean it hardly seems worth the effort to safely handle your user's
data?
BTW, the cost of a temporary string that is swapped later is close to
immeasurable.

Jul 23 '05 #30

P: n/a
"Alan Johnson" <al**@undisclosed.com> wrote in message
news:d8**********@news.Stanford.EDU...
Jason Heyes wrote:

I know why. The assign function used by Alan Johnson does two things,
which are:

1. It computes the number of elements to copy by taking the distance
between two iterators. Since the iterators are not random access
iterators, the difference is calculated by iterating between them. To do
that it must read the entire file.

2. It copies the elements in a for-loop, reading the entire file a second
time.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.


From the C++ Standard, Clause 24.1.1.3: "... Algorithms on input iterators
should never attempt to pass through the same iterator twice. They should
be single pass algorithms. ..."

Therefore, in a standard compliant implementation, assign is a single pass
algorithm.


Does this mean the gcc 3.3 library is not compliant with that clause? How
else do we explain the O(2n) performance?
Jul 23 '05 #31

P: n/a
"Alan Johnson" <al**@undisclosed.com> wrote in message
news:d8**********@news.Stanford.EDU...
Jason Heyes wrote:
[snip]
Assign performs O(2n) when the iterators are not random-access iterators
(e.g., std::istream_iterator).


First, O(2n) is a fairly nonsensical expression, as O(2n) defines
precisely the same set of functions as O(n).

Second, if by O(2n) you mean to imply that the entire file is read twice
(or more precisely, that assign iterates over the range twice), then I
would ask to see where in the standard that behavior is defined. In fact,
tracing through my compiler's library, assign is implemented by calling
replace, which is implemented by constructing a new string using the
InputIterator range, which is implemented by called a function called
_S_construct that iterates over the range only once, allocating more space
as it is needed.


That's all fine. How do you explain your function taking twice as long to
execute when compiled with the gcc 3.3 compiler? Is the compiler's library
not compliant with the standard?
Jul 23 '05 #32

P: n/a
"Robbie Hatley" <lonewolfintj at pacbell dot net> wrote in message
news:42**********@spool9-west.superfeed.net...
"Jason Heyes" <ja********@optusnet.com.au> wrote in message
news:42***********************@news.optusnet.com.a u :
"Robbie Hatley" <lonewolfintj at pacbell dot net> wrote in message
news:42********@spool9-west.superfeed.net...
>
> If there's more than one line of text in the file you're
> reading, I don't see how that would work. I'd think you'd
> need to suck the contents of the "raw" file into a container,
> such as a vector or list of strings; operate on the strings;
> then write the results, either to a new file or over-writing
> the old one.
>


I read the whole file into a single string instead of a sequence.


Ok, if that's more convenient for the kind of processing you're
trying to do.

I'm often interesting in processing files with many short lines,
such as killfiles, dictionary files, name-list files, etc.
I'm often interesting in inserting, erasing, altering, sorting,
or de-duping lines of text, so I use std::list<std::string> a lot.
It makes that kind of processing a breeze.

But if you want to suck the whole file into a single std:string,
I can see how that could have its own advantages. (Such as
using the vast assortment of methods -- substr, find, etc. --
that come with std::string.)
Here is the function to do it:

bool read_file(std::string name, std::string &s)
{
std::ifstream in(name.c_str());
if (!in.is_open())
return false;

std::stringstream ss;
ss << in.rdbuf();
s = ss.str();
return true;
}


Yikes. What is "rdbuf"? Never heard of that.
Lemme look it up in Bjarne's book...
Hmmm... he doesn't say much about it. In fact, his only
explanation is the cryptic two-word comment, "get buffer".
So what is this "rdbuf" doing?

And how does this function handle newlines? Does it
insert them into the string?


rdbuf() returns a stream buffer, which is like a low-level version of an I/O
stream.

All whitespace (including the newline) is inserted into the string during
extraction.
Jul 23 '05 #33

P: n/a
Jason Heyes wrote:
"Alan Johnson" <al**@undisclosed.com> wrote in message
news:d8**********@news.Stanford.EDU...
Jason Heyes wrote:
I know why. The assign function used by Alan Johnson does two things,
which are:

1. It computes the number of elements to copy by taking the distance
between two iterators. Since the iterators are not random access
iterators, the difference is calculated by iterating between them. To do
that it must read the entire file.

2. It copies the elements in a for-loop, reading the entire file a second
time.

Alan Johnson's function reads the whole file twice! This explains
everything, I guess.

From the C++ Standard, Clause 24.1.1.3: "... Algorithms on input iterators
should never attempt to pass through the same iterator twice. They should
be single pass algorithms. ..."

Therefore, in a standard compliant implementation, assign is a single pass
algorithm.

Does this mean the gcc 3.3 library is not compliant with that clause? How
else do we explain the O(2n) performance?


Merging another branch of this thread because it has turned into
essentially the same discussion, Jason Heyes wrote:

That's all fine. How do you explain your function taking twice as long to
execute when compiled with the gcc 3.3 compiler? Is the compiler's library
not compliant with the standard?


You need to explain what you mean by O(2n). I could just as easily say
that all the other algorithms had running times in O(1000000n), and I'd
be correct, because those all define the same set of functions. (i.e.
the constants don't mean anything.)

The behavior that Phil Endecott observed was not that my algorithm took
twice as long, but that it had quadratic (i.e. O(n^2) ) performance on
his system. If in fact his implementation did (in violation of the
standard) iterate over the range twice, that still would only be linear
performance. If I had to take a wild guess at what was making the
performace quadratic, I'd guess that the memory allocation scheme used
in his library's "assign" method was allocating a constant amount of
memory each time it ran out (provably quadratic) instead of doing
something like doubling the buffer size each time it ran out of memory
(provably linear).

I don't have easy access to test on a gcc 3.3 compiler, but I would say
that his results do in fact indicate an error in his library, especially
since it suddenly was corrected when he upgraded compilers.

-Alan
Jul 23 '05 #34

This discussion thread is closed

Replies have been disabled for this discussion.