Efficiently reading a string from a specific point in a file

Hi,

I'm writing a program which creates an index of text files. For each file it
processes, the program records the start and end positions (as returned by
tellg()) of sections of interest, and then some time later uses these positions
to read the interesting sections from the file.

When reading the sections, I'm currently using get() to read characters from the
file one by one and concatenating them to what has already been read. However, I
guess this will be fairly inefficient if the text to extract is long.

Is there a more efficient way to do this, perhaps using an existing library
function? I'd imagine that this question has been asked before, but when
googling for answers I could only find solutions for reading entire files
completely; I can't do that because the files are too large to store in memory.

My code is below; any advice would be gratefully received!

#include <iostream>
#include <string>
#include <fstream>

std::string get_string(std::ifstream &in,
                       std::ifstream::pos_type start,
                       std::ifstream::pos_type end) {

    in.seekg(start);

    std::string s;

    while (in.tellg() != end) {
        s += in.get(); // Not very efficient?
    }

    return s;
}

int main(void) {

    std::ifstream in("test_file", std::ios_base::binary);

    // Hard-coded positions below; these would normally be returned from tellg()
    std::cout << "\"" << get_string(in, 10, 19) << "\"" << std::endl;

    return 0;
}

May 11 '07 #1
7 Replies


On 11 May, 14:11, random guy <r...@mail.com> wrote:
[...]

You can do something like this:

std::string get_string(std::ifstream &in,
                       std::ifstream::pos_type start,
                       std::ifstream::pos_type end)
{
    char* s = new char[end - start + 1];
    in.get(s, end - start + 1);
    return std::string(s);
}

Notice that by default get() stops reading at \n; if you don't want
that behaviour, you need to provide a third argument, which is a
delimiting character. \0 should work if you never want it to stop
reading. I'm not sure what will happen if it reaches EOF.

--
Erik Wikström

May 11 '07 #2

On May 11, 11:01 pm, Erik Wikström <eri...@student.chalmers.se> wrote:
> ...
> std::string get_string(std::ifstream &in,
>                        std::ifstream::pos_type start,
>                        std::ifstream::pos_type end)
> {
>     char* s = new char[end - start + 1];

no corresponding delete[] ...

use std::vector<char> s(end - start + 1);

>     in.get(s, end - start + 1);
>     return std::string(s);
> }
>
> Notice that by default get() stops reading at \n; if you don't want
> that behaviour, you need to provide a third argument, which is a
> delimiting character. \0 should work if you never want it to stop
> reading. I'm not sure what will happen if it reaches EOF.

May 11 '07 #3

In message <11*********************@q75g2000hsh.googlegroups.com>,
Gianni Mariani <gi*******@mariani.ws> writes
> On May 11, 11:01 pm, Erik Wikström <eri...@student.chalmers.se> wrote:
> [...]
>> Notice that by default get() stops reading at \n; if you don't want
>> that behaviour, you need to provide a third argument, which is a
>> delimiting character. \0 should work if you never want it to stop
>> reading.

If you know exactly how many characters you want to read, use in.read().

>> I'm not sure what will happen if it reaches EOF.

--
Richard Herring
May 11 '07 #4

On 2007-05-11 17:28, Richard Herring wrote:
> [...]
> If you know exactly how many characters you want to read, use in.read().

No, read() is for unformatted data (binary); get() should be used for text.

--
Erik Wikström
May 11 '07 #5

On May 11, 7:09 pm, Erik Wikström <Erik-wikst...@telia.com> wrote:
> On 2007-05-11 17:28, Richard Herring wrote:
> [...]
>> If you know exactly how many characters you want to read, use in.read().
>
> No, read() is for unformatted data (binary); get() should be used for text.

What makes you say that? read() works perfectly well for text.

Note, however, that there is not necessarily a relationship
between the number of characters and the difference end -
start, converted to an integral type. It will probably work
under Unix, but will certainly result in too many characters
under Windows, and on some systems, it may result in nothing
even remotely usable.

Also, of course, on a lot of systems, you can't necessarily
allocate a buffer this big anyway.

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 11 '07 #6

On 2007-05-11 21:56, James Kanze wrote:
> On May 11, 7:09 pm, Erik Wikström <Erik-wikst...@telia.com> wrote:
> [...]
>> No, read() is for unformatted data (binary); get() should be used for text.
>
> What makes you say that? read() works perfectly well for text.
>
> Note, however, that there is not necessarily a relationship
> between the number of characters and the difference end -
> start, converted to an integral type. It will probably work
> under Unix, but will certainly result in too many characters
> under Windows, and on some systems, it may result in nothing
> even remotely usable.

Well, you can of course use whichever one you like, but with get() you
get the null-character at the end of the array for free, which you don't
with read().

--
Erik Wikström
May 11 '07 #7

On May 11, 10:54 pm, Erik Wikström <Erik-wikst...@telia.com> wrote:
> On 2007-05-11 21:56, James Kanze wrote:
> [...]
> Well, you can of course use whichever one you like, but with get() you
> get the null-character at the end of the array for free, which you don't
> with read().

He's using it to construct a string, so he doesn't need the null
character.

FWIW: the next version of the standard will allow reading the
string "in place". Something like:

    std::string result ;
    result.resize( size ) ;
    if ( ! in.read( &result[ 0 ], result.size() ) ) {
        result.resize( in.gcount() ) ;
    }

This will also work with all current implementations, and since
it will be standard in the future, the probability of an
implementation changing so that it won't work is pretty small.

The real problem in his code, of course, was the arithmetic on
streampos, which isn't guaranteed to give anything usable for
anything other than positioning in a file. (In particular, under most
systems---Unix is the only exception I know of---the difference
between two streampos will *not* result in the number of char
that can be read between those two positions. Under Windows,
the number will typically be somewhat larger, and on other
systems, who knows.)

--
James Kanze (Gabi Software) email: ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

May 12 '07 #8
