How to store the data of a large map<string, int>?

Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer. The file is a text file,
and it contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
so large, reading it is very slow. Is there a more efficient way to
organize the data in the file so it can be read quickly?
Any help is appreciated. Thank you.

Aug 6 '07 #1
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
How are you reading the data and where is your current bottleneck?

--
Ian Collins.
Aug 6 '07 #2
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you should convert your text file into a binary, indexed
dictionary file.

You could use a structure like this to serialize your data into the
dictionary file:

struct Foo {
unsigned int word_len;
char* word;
int key;
};

and map each Foo object to an integral value (a hash, for example) so
you can look it up quickly, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.
Aug 6 '07 #3
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

How are you reading the data and where is your current bottleneck?

--
Ian Collins.
I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}

Aug 6 '07 #4
liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
>How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
>
I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?

--
Ian Collins.
Aug 6 '07 #5
On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

I think you have to convert your text file into binary mode, built as a
dictionary indexed file.

You can have such structure to serialize your data into the dictionary file

struct Foo {
unsigned int word_len;
char* word;
int key;

};

and index to Foo object into an integral value so you can search it
fast, like hashing, to build a index file.

You can reference StarDict, an open source dictionary, it will give you
some hints.
Thanks for your advice. But how do I write the struct Foo to a binary
file? Can the function fwrite() do that? And given the binary file,
how do I read it back into a struct Foo? Is there an example of this?
Thank you~

Aug 6 '07 #6
liujiaping wrote:
On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
>liujiaping wrote:
>>Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary mode, built as a
dictionary indexed file.

You can have such structure to serialize your data into the dictionary file

struct Foo {
unsigned int word_len;
char* word;
int key;

};

and index to Foo object into an integral value so you can search it
fast, like hashing, to build a index file.

You can reference StarDict, an open source dictionary, it will give you
some hints.

Thanks for your advice. But how do I write the struct Foo to a binary
file?
Since you load the text file into a map<string, int>, say word_map:

struct Serializer {
    Serializer(ofstream& ofs) : ofs(ofs) {}
    void operator() (pair<string, int> const& p) const {
        string::size_type len = p.first.size();
        ofs.write((char const*)&len, sizeof(string::size_type));
        ofs.write(p.first.data(), len);
        ofs.write((char const*)&p.second, sizeof(int));
    }
    ofstream& ofs;
};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do
I read it back into a struct Foo? Is there an example of this?
map<string, int> word_map;

void load(ifstream& ifs) {
    while (!ifs.eof()) {
        string::size_type len = -1;
        ifs.read((char*)&len, sizeof(string::size_type));
        assert(len <= 1024);
        char buf[1024]; // maximum buffer for a word
        ifs.read(buf, len);
        string word(buf, len);
        int key;
        ifs.read((char*)&key, sizeof(int));
        word_map.insert(make_pair(word, key));
    }
}

////////

In addition, if you want an index file for your data (so you can load
just one specific word's data), you have to write the hash code (it has
to be unique) and the file offset of each entry in the dict data file.
When loading one word, look it up in the index file, then locate the
word data by its offset in the data file.
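
A rough sketch of that offset-index idea follows. It assumes the
(length, characters, int) record layout written above; the word-keyed
(rather than hash-keyed) in-memory index and the function names are
illustrative choices, not a fixed format. In practice you would write
the index out to its own small file once, so later runs only scan that.

#include <fstream>
#include <map>
#include <string>

// Scan the dict file once and remember where each record starts.
std::map<std::string, std::streamoff> build_index(std::ifstream& dict)
{
    std::map<std::string, std::streamoff> index;
    for (;;) {
        std::streamoff offset = dict.tellg();
        std::string::size_type len = 0;
        if (!dict.read(reinterpret_cast<char*>(&len), sizeof len))
            break;
        std::string word(len, '\0');
        dict.read(&word[0], len);
        int key;
        dict.read(reinterpret_cast<char*>(&key), sizeof key);
        index[word] = offset;
    }
    return index;
}

// Fetch a single word's value by seeking straight to its record.
bool lookup(std::ifstream& dict,
            std::map<std::string, std::streamoff> const& index,
            std::string const& word, int& key)
{
    std::map<std::string, std::streamoff>::const_iterator it = index.find(word);
    if (it == index.end())
        return false;
    dict.clear();                      // reset eof left over from the scan
    dict.seekg(it->second);
    std::string::size_type len = 0;
    dict.read(reinterpret_cast<char*>(&len), sizeof len);
    dict.seekg(len, std::ios::cur);    // skip the characters of the word
    dict.read(reinterpret_cast<char*>(&key), sizeof key);
    return dict.good();
}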
Aug 6 '07 #7
On Aug 6, 12:27 pm, Barry <dh...@126.com> wrote:
liujiaping wrote:
On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary mode, built as a
dictionary indexed file.
You can have such structure to serialize your data into the dictionary file
struct Foo {
unsigned int word_len;
char* word;
int key;
};
and index to Foo object into an integral value so you can search it
fast, like hashing, to build a index file.
You can reference StarDict, an open source dictionary, it will give you
some hints.
Thanks for your advice. But how do I write the struct Foo to a binary
file?

Since you load the text file into a map<string, int>, say word_map:

struct Serializer {
Serializer(ofstream& ofs) : ofs(ofs) {}
void operator() (pair<string, int> const& p) const {
string::size_type len = p.first.size();
ofs.write((char const*)&len, sizeof(string::size_type));
ofs.write(p.first.data(), len);
ofs.write((char const*)&p.second, sizeof(int));
}
ofstream& ofs;

};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do
I read it back into a struct Foo? Is there an example of this?

map<string, int> word_map;
void load(ifstream& ifs) {
while (!ifs.eof()) {
string::size_type len = -1;
ifs.read((char*)&len, sizeof(string::size_type));
assert(len <= 1024);
char buf[1024]; // maximum buffer for a word
ifs.read(buf, len);
string word(buf, len);
int key;
ifs.read((char*)&key, sizeof(int));
word_map.insert(make_pair(word, key));
}

}

////////

In addition, if you want an index file for your data (so you can load
just one specific word's data), you have to write the hash code (it has
to be unique) and the file offset of each entry in the dict data file.
When loading one word, look it up in the index file, then locate the
word data by its offset in the data file.
Nice! Thanks for your help. I'll try it.

Aug 6 '07 #8
On Aug 6, 11:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?

Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}

Have you profiled the reading to see where most of the time is being used?

--
Ian Collins.
Nope. Is there any other way to read this file?

Aug 6 '07 #9
In article <11**********************@q3g2000prf.googlegroups.com>,
lj******@gmail.com says...

[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.
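
As a rough illustration of the fscanf route (the file name and the
64-character word limit here are assumptions, not part of the original
code):

#include <cstdio>
#include <map>
#include <string>

std::map<std::string, int> read_with_stdio(const char* filename)
{
    std::map<std::string, int> key_map;
    std::FILE* f = std::fopen(filename, "r");
    if (f) {
        char word[64];
        int key;
        // %63s stops one short of the buffer size, leaving room for '\0'.
        while (std::fscanf(f, "%63s %d", word, &key) == 2)
            key_map[word] = key;
        std::fclose(f);
    }
    return key_map;
}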

Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.
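
A sketch of that sorted-vector approach, with made-up names and
std::lower_bound for the lookup (a guide to the shape of the idea, not
a drop-in solution):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> Entry;

bool by_word(Entry const& a, Entry const& b) { return a.first < b.first; }

int main()
{
    std::vector<Entry> dict;
    std::ifstream in("data.txt");
    std::string word;
    int key;
    while (in >> word >> key)
        dict.push_back(Entry(word, key));
    std::sort(dict.begin(), dict.end(), by_word);     // build once

    // Later lookups use binary search instead of walking a tree.
    Entry probe("column", 0);
    std::vector<Entry>::const_iterator it =
        std::lower_bound(dict.begin(), dict.end(), probe, by_word);
    if (it != dict.end() && it->first == probe.first)
        std::cout << probe.first << " -> " << it->second << "\n";
    return 0;
}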

--
Later,
Jerry.

The universe is a figment of its own imagination.
Aug 6 '07 #10

liujiaping <lj******@gmail.com> wrote in message...
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?

I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
/*
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
*/
while( data_file >> word >> key ){
    key_map[word] = key;
}

--
Bob R
POVrookie
Aug 6 '07 #11
liujiaping wrote:
On Aug 6, 11:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
>liujiaping wrote:
>>On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
>>I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?
As I said before, *please don't quote signatures*.
>
Nope. Is there any other way to read this file?
Why do you refuse to profile it? Before you do, any optimisation
suggestions are just guesses. The bottleneck might be in the I/O, or it
might be in loading the map. How you optimise your code should be based
on real data, not speculation.
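
Short of a real profiler, even a crude timer around the load gives you
something to reason from; a minimal sketch using std::clock (note that
clock() measures CPU time, so an I/O-bound load may call for a
wall-clock timer instead):

#include <ctime>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::clock_t start = std::clock();

    std::ifstream data_file("data.txt");
    std::map<std::string, int> key_map;
    std::string word;
    int key;
    while (data_file >> word >> key)
        key_map[word] = key;

    double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
    std::cout << key_map.size() << " entries loaded in "
              << seconds << "s\n";
    return 0;
}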

--
Ian Collins.
Aug 6 '07 #12
On Aug 6, 5:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is
being used?
Before profiling it, he should probably modify it to ensure that
it is correct. As it is, it happens to work if the file has no
syntax errors in it, but only because processing the last line
twice doesn't cause an error. I'd write something more along
the lines of:

std::string line ;
while ( std::getline( data_file, line ) ) {
std::istringstream s( line ) ;
s >> word >> key >> std::ws ;
if ( ! s || s.get() != EOF ) {
// Syntax error...
} else {
key_map.insert( Map::value_type( word, key ) ) ;
}
}

That's likely to be even slower:-). But at least it's correct.

Once that's done, of course, there's no way to reliably speed it
up without the profiling data. And depending on how much
speeding up is necessary, there may not even be a portable
solution.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 6 '07 #13
On Aug 6, 7:06 am, Jerry Coffin <jcof...@taeus.com> wrote:
In article <1186371485.256316.249...@q3g2000prf.googlegroups.com>,
ljiap...@gmail.com says...
[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.
That's not been my experience (except with Sun CC), but it
depends on the implementation. Given that he's having trouble
getting iostream input correct, however, I definitely wouldn't
suggest he try stdio.h.
Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.
Another question is where the file is coming from. If he could
generate it sorted to begin with, or sort it offline, that
could speed up the initialization of the std::vector a lot. Or
even of std::map, since he could give an accurate hint for each
insertion.
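
A small illustration of that hinted insertion, assuming the input file
is already sorted by word so that each new element belongs at the end
(the file name is made up):

#include <fstream>
#include <map>
#include <string>

int main()
{
    std::map<std::string, int> key_map;
    std::ifstream in("sorted_data.txt");   // assumed sorted by word
    std::string word;
    int key;
    while (in >> word >> key) {
        // With sorted input the end() hint is always right, so each
        // insertion is amortized constant time instead of logarithmic.
        key_map.insert(key_map.end(),
                       std::map<std::string, int>::value_type(word, key));
    }
    return 0;
}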

But before considering any of that, we really need profiling
data. It's also possible (or even probable) that his
initialization is IO bound anyway, so no changes in code can
improve it.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 6 '07 #14

