
How to store the data of a large map<string, int>?

Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

Aug 6 '07 #1
13 Replies


liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
How are you reading the data and where is your current bottleneck?

--
Ian Collins.
Aug 6 '07 #2

liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary form, built as an
indexed dictionary file.

You can use a structure like this to serialize your data into the
dictionary file:

struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};

and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.
Aug 6 '07 #3

On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

How are you reading the data and where is your current bottleneck?

--
Ian Collins.
I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}

Aug 6 '07 #4

liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
>How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
>
I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?

--
Ian Collins.
Aug 6 '07 #5

On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

I think you have to convert your text file into binary form, built as an
indexed dictionary file.

You can use a structure like this to serialize your data into the
dictionary file:

struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};

and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.
Thanks for your advice. But how do I write the struct Foo to a binary
file? Can the function fwrite() do that? And given the binary file, how
do I read it back into the struct Foo? Is there any example of this?
Thank you~

Aug 6 '07 #6

liujiaping wrote:
On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
>liujiaping wrote:
>>Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary form, built as an
indexed dictionary file.

You can use a structure like this to serialize your data into the
dictionary file:

struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};

and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.

Thanks for your advice. But how do I write the struct Foo to a binary
file?
since you load the text file into a map<string, int>, say word_map:

struct Serializer {
    Serializer(ofstream& ofs) : ofs(ofs) {}
    void operator() (pair<string, int> const& p) const {
        string::size_type len = p.first.size();
        ofs.write((char const*)&len, sizeof(string::size_type));
        ofs.write(p.first.data(), len);
        ofs.write((char const*)&p.second, sizeof(int));
    }
    ofstream& ofs;
};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do
you read from it to the struct Foo? Is there any example about it?
map<string, int> word_map;

void load(ifstream& ifs) {
    while (!ifs.eof()) {
        string::size_type len = -1;
        ifs.read((char*)&len, sizeof(string::size_type));
        assert(len <= 1024);
        char buf[1024]; // maximum buffer for a word
        ifs.read(buf, len);
        string word(buf, len);
        int key;
        ifs.read((char*)&key, sizeof(int));
        word_map.insert(make_pair(word, key));
    }
}

////////

In addition, if you want an index file for your data (so you can load
just one specific word's data), write each word's hash code (it has to
be unique) together with its offset in the dict data file. When you
load one word, look it up in the index file, then seek to that offset
in the data file.
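
A rough sketch of that index idea, with illustrative names (this is not
StarDict's actual format, just the mechanism):

#include <fstream>
#include <map>
#include <string>

// While serializing, remember where each record starts:
// inside the Serializer loop, before writing record p, do
//     offset_index[p.first] = ofs.tellp();
// then dump the (word, offset) pairs to a separate index file.
std::map<std::string, std::streampos> offset_index;

// Later, fetch a single word's value without loading the whole file:
int fetch(std::ifstream& ifs, std::streampos pos) {
    ifs.seekg(pos);
    std::string::size_type len = 0;
    ifs.read((char*)&len, sizeof len);  // stored word length
    char buf[1024];                     // same 1024-byte cap as above
    ifs.read(buf, len);                 // the word itself (could verify it)
    int key = 0;
    ifs.read((char*)&key, sizeof key);  // the associated value
    return key;
}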
Aug 6 '07 #7

On Aug 6, 12:27 pm, Barry <dh...@126.com> wrote:
liujiaping wrote:
On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary form, built as an
indexed dictionary file.
You can use a structure like this to serialize your data into the
dictionary file:
struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};
and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.
You can look at StarDict, an open-source dictionary; it will give you
some hints.
Thanks for your advice. But how do I write the struct Foo to a binary
file?

since you load the text file into a map<string, int>, say word_map:

struct Serializer {
    Serializer(ofstream& ofs) : ofs(ofs) {}
    void operator() (pair<string, int> const& p) const {
        string::size_type len = p.first.size();
        ofs.write((char const*)&len, sizeof(string::size_type));
        ofs.write(p.first.data(), len);
        ofs.write((char const*)&p.second, sizeof(int));
    }
    ofstream& ofs;
};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do
you read from it to the struct Foo? Is there any example about it?

map<string, int> word_map;

void load(ifstream& ifs) {
    while (!ifs.eof()) {
        string::size_type len = -1;
        ifs.read((char*)&len, sizeof(string::size_type));
        assert(len <= 1024);
        char buf[1024]; // maximum buffer for a word
        ifs.read(buf, len);
        string word(buf, len);
        int key;
        ifs.read((char*)&key, sizeof(int));
        word_map.insert(make_pair(word, key));
    }
}

////////

In addition, if you want an index file for your data (so you can load
just one specific word's data), write each word's hash code (it has to
be unique) together with its offset in the dict data file. When you
load one word, look it up in the index file, then seek to that offset
in the data file.
Nice! Thanks for your help. I'll try it.

Aug 6 '07 #8

On Aug 6, 11:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?

Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}

Have you profiled the reading to see where most of the time is being used?

--
Ian Collins.
Nope. Is there any other way to read this file?

Aug 6 '07 #9

In article <11**********************@q3g2000prf.googlegroups.com>,
lj******@gmail.com says...

[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.
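
For example, a minimal fscanf version might look like this (assuming no
word exceeds a fixed buffer; the 256-byte cap is an assumption, not
something from the original post):

#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<std::string, int> key_map;
    std::FILE* f = std::fopen("data.txt", "r");
    if (f) {
        char buf[256];  // assumed upper bound on a word's length
        int key;
        // %255s stops at whitespace and guards the buffer
        while (std::fscanf(f, "%255s %d", buf, &key) == 2)
            key_map[buf] = key;
        std::fclose(f);
    }
    return 0;
}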

Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.
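
Something along these lines, as a sketch (the -1 "not found" sentinel
and the names are just illustrative):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> Entry;

// Compares a stored entry against a bare word, for lower_bound.
struct ByWord {
    bool operator()(Entry const& e, std::string const& w) const {
        return e.first < w;
    }
};

// dict must already be sorted by word; std::sort with the default
// pair ordering is fine, since the words are unique.
int lookup(std::vector<Entry> const& dict, std::string const& w) {
    std::vector<Entry>::const_iterator it =
        std::lower_bound(dict.begin(), dict.end(), w, ByWord());
    return (it != dict.end() && it->first == w) ? it->second : -1;
}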

--
Later,
Jerry.

The universe is a figment of its own imagination.
Aug 6 '07 #10


liujiaping <lj******@gmail.com> wrote in message...
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?

I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
/*
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
*/
while( data_file >> word >> key ){
    key_map[word] = key;
}

--
Bob R
POVrookie
Aug 6 '07 #11

liujiaping wrote:
On Aug 6, 11:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
>liujiaping wrote:
>>On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
>>I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?
As I said before, *please don't quote signatures*.
>
Nope. Is there any other way to read this file?
Why do you refuse to profile it? Before you do, any optimisation
suggestions are just guesses. The bottleneck might be in the I/O, or it
might be in loading the map. How you optimise your code should be based
on real data, not speculation.

--
Ian Collins.
Aug 6 '07 #12

On Aug 6, 5:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is
being used?
Before profiling it, he should probably modify it to ensure that
it is correct. As it is, it happens to work if the file has no
syntax errors in it, but only because processing the last line
twice doesn't cause an error. I'd write something more along
the lines of:

std::string line ;
while ( std::getline( data_file, line ) ) {
    std::istringstream s( line ) ;
    s >> word >> key >> std::ws ;
    if ( ! s || s.get() != EOF ) {
        // Syntax error...
    } else {
        key_map.insert( Map::value_type( word, key ) ) ;
    }
}

That's likely to be even slower:-). But at least it's correct.

Once that's done, of course, there's no way to reliably speed it
up without the profiling data. And depending on how much
speeding up is necessary, there may not even be a portable
solution.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 6 '07 #13

On Aug 6, 7:06 am, Jerry Coffin <jcof...@taeus.com> wrote:
In article <1186371485.256316.249...@q3g2000prf.googlegroups.com>,
ljiap...@gmail.com says...
[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.
That's not been my experience (except with Sun CC), but it
depends on the implementation. Given that he's having trouble
getting iostream input correct, however, I definitely wouldn't
suggest his trying stdio.h.
Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.
Another question is where the file is coming from. If he could
generate it sorted to begin with, or sort it off line, that
could speed up the initialization of the std::vector a lot. Or
even of std::map, since he could give an accurate hint for each
insertion.
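
With pre-sorted input the hinted loop would look something like this (a
sketch, reusing the stream and map names from the earlier snippets):

// When data.txt is already sorted by word, hinting with end() makes
// each insertion amortized constant time instead of O(log n).
std::map<std::string, int> key_map;
std::string word;
int key;
while (data_file >> word >> key)
    key_map.insert(key_map.end(), std::make_pair(word, key));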

But before considering any of that, we really need profiling
data. It's also possible (or even probable) that his
initialization is IO bound anyway, so no changes in code can
improve it.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 6 '07 #14
