gather information from various files efficiently

Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================
The straightforward way to solve this problem is to create a
dictionary. Like so:
[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]

Yet, I have got 43 such files. Together they are 4.1 MB
large. In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?
Klaus
Jul 18 '05 #1
Klaus Neuner wrote:
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================
The straightforward way to solve this problem is to create a
dictionary. Like so:
[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

Since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and in the
exceptional case where it doesn't yet exist you add it.


--
Keith Dart <kd***@kdart.com>
public key: ID: F3D288E4
Jul 18 '05 #2
Keith Dart wrote:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]


or my favorite Python shortcut:

    dict.setdefault(a, []).append(b)

Kent
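
[Aside, not from the thread: for readers new to setdefault, a quick
interactive sketch of what that one-liner does. It returns the existing
list for the key, or installs and returns the fresh default:

    >>> d = {}
    >>> d.setdefault("foo", []).append(1)
    >>> d.setdefault("foo", []).append(2)
    >>> d
    {'foo': [1, 2]}
]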
Jul 18 '05 #3
Keith Dart wrote:
Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

since you expect the key to be there most of the time, this method is
most efficient. You optomistically get the dictionary entry, and on the
exceptional case where it doesn't yet exist you add it.


I wonder if

    dct.setdefault(a,[]).append(b)

wouldn't be even faster. It saves setting up the try/except frame handling in
python (I assume the C implementation of dicts achieves similar results with
much less overhead).

Cheers,

f

ps. I changed dict->dct because it's a generally Bad Idea (TM) to name local
variables after builtin types. This, for the benefit of the OP (I know you were
just following his code conventions).

Jul 18 '05 #4
Kent Johnson wrote:
Keith Dart wrote:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

or my favorite Python shortcut:

    dict.setdefault(a, []).append(b)

Kent


Hey, when did THAT get in there? ;-) That's nice. However, the
try..except block is a useful pattern for many similar situations that
the OP might want to keep in mind. It is usually better than the
following, also:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]

which is a pattern I have seen often.


--
Keith Dart <kd***@kdart.com>
vcard: <http://www.kdart.com/~kdart/kdart.vcf>
public key: ID: F3D288E4 URL: <http://www.kdart.com/~kdart/public.key>
Jul 18 '05 #5
Keith Dart wrote:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

the drawback here is that exceptions are relatively expensive; if the
number of collisions is small, you end up throwing and catching lots
of exceptions. in that case, there are better ways to do this.
    dict.setdefault(a, []).append(b)

the drawback here is that you create a new object for each call, but
if the number of collisions is high, you end up throwing most of them
away. in that case, there are better ways to do this.

(gotta love that method name, btw. a serious candidate for the "most
confusing name in the standard library" contest... or maybe even the
"most confusing name in the history of python" contest...)
Hey, when did THAT get in there? ;-) That's nice. However, the try..except block is a useful
pattern for many similar situations that the OP might want to keep in mind. It is usually better
than the following, also:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]


the drawback here is that if the number of collisions is high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.

</F>
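
[Aside, not from the thread: one such "better way" that avoids both the
exception machinery and the double lookup is a single dct.get() call.
A minimal sketch, reusing the same dct/a/b names as the posts above:

    lst = dct.get(a)       # one lookup; returns None on a miss
    if lst is None:
        lst = dct[a] = []  # install the list once, keep a reference
    lst.append(b)          # append without a second lookup on hits

This pays one extra store on a miss, but it never raises, never builds
a throwaway list, and does a single lookup on the common hit path.]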

Jul 18 '05 #6
Fredrik Lundh wrote:
...
if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]

the drawback here is that if the number of collisions is high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.


Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilities of
your data, and code according to the likely scenario. 8-) Now, perhaps
we could write an adaptive data analyzer-code-generator... ;-)



--
Keith Dart <kd***@kdart.com>
public key: ID: F3D288E4
Jul 18 '05 #7

[Keith]
Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilities of
your data, and code according to the likely scenario. 8-)


s/Python //g

--
Richie Hindle
ri****@entrian.com

Jul 18 '05 #9
Keith Dart wrote:
Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilities of
your data, and code according to the likely scenario.


And this is different from optimizing in *any* other language
in what way?

-Peter
Jul 18 '05 #10
On Tue, 14 Dec 2004 10:40:56 -0500, Peter Hansen <pe***@engcorp.com> wrote:
Keith Dart wrote:
Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilities of
your data, and code according to the likely scenario.


And this is different from optimizing in *any* other language
in what way?


In other languages, by the time you get the bloody thing working it's
time to ship, and you don't have to bother worrying about making it
optimal.

--
Cheers,
Simon B,
si***@brunningonline.net,
http://www.brunningonline.net/simon/blog/
Jul 18 '05 #11
Klaus Neuner wrote:
The straightforward way to solve this problem is to create a
dictionary. Like so:
[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]

So I timed the three suggestions with a few different datasets:

$ cat builddict.py
def askpermission(d, k, v):
    if k in d:
        d[k].append(v)
    else:
        d[k] = [v]

def askforgiveness(d, k, v):
    try:
        d[k].append(v)
    except KeyError:
        d[k] = [v]

def default(d, k, v):
    d.setdefault(k, []).append(v)

def test(items, func):
    d = {}
    for k, v in items:
        func(d, k, v)

Dataset where every value causes a collision:

$ python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.62 msec per loop
$ python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.58 msec per loop
$ python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.03 msec per loop
Dataset where no value causes a collision:

$ python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askpermission)"
100 loops, best of 3: 2.29 msec per loop
$ python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askforgiveness)"
100 loops, best of 3: 9.96 msec per loop
$ python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.98 msec per loop
Dataset where one of every 5 values causes a collision:

$ python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.82 msec per loop
$ python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.79 msec per loop
$ python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.2 msec per loop
So, when there are lots of collisions, you may get a small benefit from
the try/except solution. If there are very few collisions, you probably
would rather use the if/else solution. The setdefault solution performs
about like the if/else solution, but is somewhat slower.

I will probably continue to use setdefault, because I think it's
prettier =) but if you're running into a speed bottleneck in this sort
of situation, you might consider one of the other solutions.

Steve
Jul 18 '05 #12
Klaus Neuner wrote:
Yet, I have got 43 such files. Together they are 4.1 MB
large. In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.


Others have shown how you can make your dictionary code more efficient,
which should provide a big speed boost, especially if there are many
keys in your dicts.

However, if you're taking this long to read files each time, perhaps
there's a better high-level approach than just a brute-force scan of
every file every time. You don't say anything about where those files
are coming from, or how they're created. Are they relatively static?
(That is to say, are they (nearly) the same files being read on each
run?) Do you control the process that creates the files? Given the
right conditions, you may be able to store your data in a shelve, or
even proper database, saving you lots of time in parsing through these
files on each run. Even if it's entirely new data on each run, you may
be able to find a more efficient way of transferring data from whatever
the source is into your program.

Jeff Shannon
Technician/Programmer
Credit International
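
[Aside, not from the thread: if the files really are mostly static
between runs, a cache keyed on filename and modification time means only
changed files get re-parsed. A minimal sketch with the standard shelve
module; the cache filename "parsed.cache" and the parse logic are
illustrative assumptions, not the OP's actual code:

    import os
    import shelve

    def parse(fname):
        # build {key: [values]} from one "key : v1 v2 ..." file
        d = {}
        for line in open(fname):
            if ":" not in line:
                continue  # skip separator and blank lines
            a, b = [s.strip() for s in line.split(":", 1)]
            d.setdefault(a, []).extend(b.split())
        return d

    def load(fname, cache):
        # re-parse only if the file changed since it was cached
        key = "%s|%s" % (fname, os.path.getmtime(fname))
        if key not in cache:
            cache[key] = parse(fname)
        return cache[key]

    cache = shelve.open("parsed.cache")

One wrinkle: entries for old mtimes accumulate and need occasional
pruning, but unchanged files are never parsed twice.]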

Jul 18 '05 #13
Simon Brunning <si************@gmail.com> writes:
On Tue, 14 Dec 2004 10:40:56 -0500, Peter Hansen <pe***@engcorp.com> wrote:
Keith Dart wrote:
> Sigh, this reminds me of a discussion I had at my work once... It seems
> to write optimal Python code one must understand various probabilities of
> your data, and code according to the likely scenario.


And this is different from optimizing in *any* other language
in what way?


In other languages, by the time you get the bloody thing working it's
time to ship, and you don't have to bother worrying about making it
optimal.


+1 QOTW.

<mike

--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jul 18 '05 #14
"Klaus Neuner" <kl************@yahoo.de> wrote in message
news:3e**************************@posting.google.com...
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================
The straightforward way to solve this problem is to create a
dictionary. Like so:
[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]
Yet, I have got 43 such files. Together they are 4.1 MB
large. In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?
Klaus


You have gotten a number of suggestions on the relative improvements for
updating your global dictionary of values. My business partner likens code
optimization to lowering the water in a river. Performance bottlenecks
stick out like rocks sticking out of a river. Once you resolve one problem
(remove the rock), you lower the water level, and the next rock/bottleneck
appears. Have you looked at what is happening in your get_information
method? If you are still taking long periods of time to scan through these
files, you should look into what get_information is doing. In working with
my pyparsing module, I've seen people scan multimegabyte files in seconds,
so taking hours to sift through 4Mb of data sounds like there may be other
problems going on.

With input this clean, something like:

    def get_information(line):
        return map(str.strip, line.split(":",1))

should do the trick. For that matter, you could get rid of the function
call (calls are expensive in Python), and just inline this to:

    a, b = map(str.strip, line.split(":",1))
    if a in dct:
        dct[a] += b.split()
    else:
        dct[a] = b.split()

(I'm guessing you want to convert b values that have multiple numbers to a
list, based on your "dict[a] = [b]" source line.)
I also renamed dict to dct, per Fernando Perez's suggestion.

-- Paul
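
[Aside, not from the thread: pulling the suggestions together, a minimal
end-to-end sketch of the whole job. The glob pattern "file*" and the
skip-lines-without-a-colon guard are assumptions about the OP's data
layout:

    import glob

    dct = {}
    for fname in glob.glob("file*"):
        f = open(fname)
        for line in f:
            if ":" not in line:
                continue  # skip "=====" separators and blank lines
            a, b = [s.strip() for s in line.split(":", 1)]
            dct.setdefault(a, []).extend(b.split())
        f.close()

On 4 MB of input this should run in seconds, not hours; if it doesn't,
the bottleneck is elsewhere, as Paul suggests.]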
Jul 18 '05 #15
