
How to store the data of a large map<string, int>?

Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

Aug 6 '07 #1
13 Replies


liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
How are you reading the data and where is your current bottleneck?

--
Ian Collins.
Aug 6 '07 #2

liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary form, built as an
indexed dictionary file.

You can use a structure like this to serialize your data into the
dictionary file:

struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};

and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.
Aug 6 '07 #3

On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

How are you reading the data and where is your current bottleneck?

--
Ian Collins.
I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}

Aug 6 '07 #4

liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
>How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
>
I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?

--
Ian Collins.
Aug 6 '07 #5

On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.

I think you have to convert your text file into binary form, built as an
indexed dictionary file.

You can use a structure like this to serialize your data into the
dictionary file:

struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};

and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.
Thanks for your advice. But how do I write the struct Foo to a binary
file? Can the function fwrite() do that? And given the binary file, how
do I read it back into the struct Foo? Is there any example of this?
Thank you~

Aug 6 '07 #6

liujiaping wrote:
On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
>liujiaping wrote:
>>Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary form, built as an
indexed dictionary file.

You can use a structure like this to serialize your data into the
dictionary file:

struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};

and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.

Thanks for your advice. But how do I write the struct Foo to a binary
file?
since you load the text file into a map<string, int>, say word_map:

struct Serializer {
    Serializer(ofstream& ofs) : ofs(ofs) {}
    void operator() (pair<string, int> const& p) const {
        string::size_type len = p.first.size();
        ofs.write((char const*)&len, sizeof(string::size_type));
        ofs.write(p.first.data(), len);
        ofs.write((char const*)&p.second, sizeof(int));
    }
    ofstream& ofs;
};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do
you read from it to the struct Foo? Is there any example about it?
map<string, int> word_map;

void load(ifstream& ifs) {
    while (!ifs.eof()) {
        string::size_type len = -1;
        ifs.read((char*)&len, sizeof(string::size_type));
        assert(len <= 1024);
        char buf[1024]; // maximum buffer for a word
        ifs.read(buf, len);
        string word(buf, len);
        int key;
        ifs.read((char*)&key, sizeof(int));
        word_map.insert(make_pair(word, key));
    }
}

////////

In addition, if you want an index file for your data (so you can load
just one specific word's data), write each word's hash code (it has to
be unique) together with its offset in the dict data file. When you
load one word, look it up in the index file, then seek to that offset
in the data file.
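
A rough sketch of that index idea, with illustrative names (this is not
StarDict's actual format, just the mechanism):

#include <fstream>
#include <map>
#include <string>

// While serializing, remember where each record starts:
// inside the Serializer loop, before writing record p, do
//     offset_index[p.first] = ofs.tellp();
// then dump the (word, offset) pairs to a separate index file.
std::map<std::string, std::streampos> offset_index;

// Later, fetch a single word's value without loading the whole file:
int fetch(std::ifstream& ifs, std::streampos pos) {
    ifs.seekg(pos);
    std::string::size_type len = 0;
    ifs.read((char*)&len, sizeof len);  // stored word length
    char buf[1024];                     // same 1024-byte cap as above
    ifs.read(buf, len);                 // the word itself (could verify it)
    int key = 0;
    ifs.read((char*)&key, sizeof key);  // the associated value
    return key;
}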
Aug 6 '07 #7

On Aug 6, 12:27 pm, Barry <dh...@126.com> wrote:
liujiaping wrote:
On Aug 6, 11:31 am, Barry <dh...@126.com> wrote:
liujiaping wrote:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...
Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
too large, the speed is very slow. Is there any other efficient way to
organize the data of the file to make it fast to read?
Any help is appreciated. Thank you.
I think you have to convert your text file into binary form, built as an
indexed dictionary file.
You can use a structure like this to serialize your data into the
dictionary file:
struct Foo {
    unsigned int word_len;
    char* word;
    int key;
};
and map each Foo object to an integral value (a hash, for instance) so
you can search it fast, building an index file.
You can look at StarDict, an open-source dictionary; it will give you
some hints.
Thanks for your advice. But how do I write the struct Foo to a binary
file?

since you load the text file into a map<string, int>, say word_map:

struct Serializer {
    Serializer(ofstream& ofs) : ofs(ofs) {}
    void operator() (pair<string, int> const& p) const {
        string::size_type len = p.first.size();
        ofs.write((char const*)&len, sizeof(string::size_type));
        ofs.write(p.first.data(), len);
        ofs.write((char const*)&p.second, sizeof(int));
    }
    ofstream& ofs;
};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do
you read from it to the struct Foo? Is there any example about it?

map<string, int> word_map;

void load(ifstream& ifs) {
    while (!ifs.eof()) {
        string::size_type len = -1;
        ifs.read((char*)&len, sizeof(string::size_type));
        assert(len <= 1024);
        char buf[1024]; // maximum buffer for a word
        ifs.read(buf, len);
        string word(buf, len);
        int key;
        ifs.read((char*)&key, sizeof(int));
        word_map.insert(make_pair(word, key));
    }
}

////////

In addition, if you want an index file for your data (so you can load
just one specific word's data), write each word's hash code (it has to
be unique) together with its offset in the dict data file. When you
load one word, look it up in the index file, then seek to that offset
in the data file.
Nice! Thanks for your help. I'll try it.

Aug 6 '07 #8

On Aug 6, 11:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?

Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}

Have you profiled the reading to see where most of the time is being used?

--
Ian Collins.
Nope. Is there any other way to read this file?

Aug 6 '07 #9

In article <11**********************@q3g2000prf.googlegroups.com>,
lj******@gmail.com says...

[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.
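
For example, a minimal fscanf version might look like this (assuming no
word exceeds a fixed buffer; the 256-byte cap is an assumption, not
something from the original post):

#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<std::string, int> key_map;
    std::FILE* f = std::fopen("data.txt", "r");
    if (f) {
        char buf[256];  // assumed upper bound on a word's length
        int key;
        // %255s stops at whitespace and guards the buffer
        while (std::fscanf(f, "%255s %d", buf, &key) == 2)
            key_map[buf] = key;
        std::fclose(f);
    }
    return 0;
}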

Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.
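
Something along these lines, as a sketch (the -1 "not found" sentinel
and the names are just illustrative):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> Entry;

// Compares a stored entry against a bare word, for lower_bound.
struct ByWord {
    bool operator()(Entry const& e, std::string const& w) const {
        return e.first < w;
    }
};

// dict must already be sorted by word; std::sort with the default
// pair ordering is fine, since the words are unique.
int lookup(std::vector<Entry> const& dict, std::string const& w) {
    std::vector<Entry>::const_iterator it =
        std::lower_bound(dict.begin(), dict.end(), w, ByWord());
    return (it != dict.end() && it->first == w) ? it->second : -1;
}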

--
Later,
Jerry.

The universe is a figment of its own imagination.
Aug 6 '07 #10


liujiaping <lj******@gmail.com> wrote in message...
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?

I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
/*
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
*/
while( data_file >> word >> key ){
    key_map[word] = key;
}

--
Bob R
POVrookie
Aug 6 '07 #11

liujiaping wrote:
On Aug 6, 11:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
>liujiaping wrote:
>>On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
>>I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?
As I said before, *please don't quote signatures*.
>
Nope. Is there any other way to read this file?
Why do you refuse to profile it? Before you do, any optimisation
suggestions are just guesses. The bottleneck might be in the I/O, or it
might be in loading the map. How you optimise your code should be based
on real data, not speculation.

--
Ian Collins.
Aug 6 '07 #12

On Aug 6, 5:41 am, Ian Collins <ian-n...@hotmail.com> wrote:
liujiaping wrote:
On Aug 6, 11:25 am, Ian Collins <ian-n...@hotmail.com> wrote:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
Have you profiled the reading to see where most of the time is
being used?
Before profiling it, he should probably modify it to ensure that
it is correct. As it is, it happens to work if the file has no
syntax errors in it, but only because processing the last line
twice doesn't cause an error. I'd write something more along
the lines of:

std::string line ;
while ( std::getline( data_file, line ) ) {
    std::istringstream s( line ) ;
    s >> word >> key >> std::ws ;
    if ( ! s || s.get() != EOF ) {
        // Syntax error...
    } else {
        key_map.insert( Map::value_type( word, key ) ) ;
    }
}

That's likely to be even slower:-). But at least it's correct.

Once that's done, of course, there's no way to reliably speed it
up without the profiling data. And depending on how much
speeding up is necessary, there may not even be a portable
solution.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 6 '07 #13

On Aug 6, 7:06 am, Jerry Coffin <jcof...@taeus.com> wrote:
In article <1186371485.256316.249...@q3g2000prf.googlegroups.com>,
ljiap...@gmail.com says...
[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
    data_file >> word;
    data_file >> key;
    key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.
That's not been my experience (except with Sun CC), but it
depends on the implementation. Given that he's having trouble
getting iostream input correct, however, I definitely wouldn't
suggest his trying stdio.h.
Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.
Another question is where the file is coming from. If he could
generate it sorted to begin with, or sort it off line, that
could speed up the initialization of the std::vector a lot. Or
even of std::map, since he could give an accurate hint for each
insertion.
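
With pre-sorted input the hinted loop would look something like this (a
sketch, reusing the stream and map names from the earlier snippets):

// When data.txt is already sorted by word, hinting with end() makes
// each insertion amortized constant time instead of O(log n).
std::map<std::string, int> key_map;
std::string word;
int key;
while (data_file >> word >> key)
    key_map.insert(key_map.end(), std::make_pair(word, key));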

But before considering any of that, we really need profiling
data. It's also possible (or even probable) that his
initialization is IO bound anyway, so no changes in code can
improve it.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Aug 6 '07 #14
