gather information from various files efficiently

Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=============== ======
foo : 1 2
bar : 2 4
baz : 3
=============== ======

file2:
=============== ======
foo : 5
bar : 6
baz : 7
=============== ======

file3:
=============== ======
foo : 4 18
bar : 8
=============== ======
The straightforward way to solve this problem is to create a
dictionary. Like so:
[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]

Yet, I have got 43 such files. Together they are 4,1M
large. In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?
Klaus
Jul 18 '05 #1
Klaus Neuner wrote:
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=============== ======
foo : 1 2
bar : 2 4
baz : 3
=============== ======

file2:
=============== ======
foo : 5
bar : 6
baz : 7
=============== ======

file3:
=============== ======
foo : 4 18
bar : 8
=============== ======
The straightforward way to solve this problem is to create a
dictionary. Like so:
[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Aye...

The dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

Since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and in the
exceptional case where it doesn't yet exist you add it.
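
For the OP's benefit, here is a minimal sketch of the whole gathering loop written with the try/except pattern. The OP's get_information() was not posted, so parse_line() below is a hypothetical stand-in that assumes every data line looks like "key : values"; the sample lines are copied from the three files in the question.

```python
def parse_line(line):
    """Hypothetical stand-in for the OP's get_information():
    split 'foo : 1 2' into the key 'foo' and the value text '1 2'."""
    key, _, rest = line.partition(":")
    return key.strip(), rest.strip()


def gather(lines):
    """Group the value text of every data line under its key."""
    table = {}
    for line in lines:
        if ":" not in line:
            continue  # separator or blank line
        key, value = parse_line(line)
        if not value:
            continue  # header such as 'file1:' carries no data
        try:
            # Optimistic path: the key has been seen before.
            table[key].append(value)
        except KeyError:
            # First time this key appears.
            table[key] = [value]
    return table


sample = [
    "file1:", "=====", "foo : 1 2", "bar : 2 4", "baz : 3", "=====",
    "file2:", "=====", "foo : 5", "bar : 6", "baz : 7", "=====",
    "file3:", "=====", "foo : 4 18", "bar : 8", "=====",
]
result = gather(sample)
```

With the sample above, result maps "foo" to ["1 2", "5", "4 18"]. The lookup goes straight to the dictionary, so no temporary key list is built.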


--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <kd***@kdart.com>
public key: ID: F3D288E4
=============== =============== =============== =============== =============== =
Jul 18 '05 #2
Keith Dart wrote:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]


or my favorite Python shortcut:
dict.setdefault(a, []).append(b)
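
A small sketch of what that one-liner does, in case the method is new to anyone reading along. The name groups and the sample values are only for illustration: setdefault returns the list already stored under the key, or stores the supplied default and returns that.

```python
groups = {}

# Key is missing: the empty list is stored under "foo" and returned.
first = groups.setdefault("foo", [])
first.append("1 2")

# Key is present: the stored list comes back and the new [] is discarded.
second = groups.setdefault("foo", [])
second.append("5")

# Both calls handed back the very same list object.
same_list = first is second
```

After these calls groups is {"foo": ["1 2", "5"]}, which is why chaining .append(b) onto the call works in a single line.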

Kent
Jul 18 '05 #3
Keith Dart wrote:
Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and on the
exceptional case where it doesn't yet exist you add it.


I wonder if

dct.setdefault(a, []).append(b)

wouldn't be even faster. It saves setting up the try/except frame handling in
Python (I assume the C implementation of dicts achieves similar results with
much less overhead).

Cheers,

f

ps. I changed dict->dct because it's generally a Bad Idea (TM) to name local
variables after builtin types. This is for the benefit of the OP (I know you
were just following his code conventions).

Jul 18 '05 #4
Kent Johnson wrote:
Keith Dart wrote:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

or my favorite Python shortcut:
dict.setdefault(a, []).append(b)

Kent


Hey, when did THAT get in there? ;-) That's nice. However, the
try..except block is a useful pattern for many similar situations that
the OP might want to keep in mind. It is also usually better than the
following:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]

Which is a pattern I have seen often.


--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <kd***@kdart.com>
vcard: <http://www.kdart.com/~kdart/kdart.vcf>
public key: ID: F3D288E4 URL: <http://www.kdart.com/~kdart/public.key>
=============== =============== =============== =============== =============== =
Jul 18 '05 #5
Keith Dart wrote:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

the drawback here is that exceptions are relatively expensive; if the
number of collisions is small, you end up throwing and catching lots
of exceptions. in that case, there are better ways to do this.

dict.setdefault(a, []).append(b)

the drawback here is that you create a new object for each call, but
if the number of collisions is high, you end up throwing most of them
away. in that case, there are better ways to do this.
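
To make the "new object for each call" point concrete, here is a rough sketch. The helper new_list() and the key sequence are made up for illustration; the helper records every list it builds so the waste can be counted. The default argument is evaluated before setdefault runs, whether or not the key already exists.

```python
created = []  # every list built as a default ends up recorded here


def new_list():
    """Hypothetical helper: build a fresh list and remember that we did."""
    fresh = []
    created.append(fresh)
    return fresh


dct = {}
for key in ["foo", "foo", "foo", "bar"]:
    # new_list() runs on every iteration, even when key is already in dct.
    dct.setdefault(key, new_list()).append(1)

# Count how many of the built lists actually ended up stored in dct.
kept = sum(1 for lst in created if any(lst is v for v in dct.values()))
```

Four lists are built for four calls, but only two are kept (one per distinct key); the other two are thrown away immediately.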

(gotta love that method name, btw. a serious candidate for the "most
confusing name in the standard library" contest... or maybe even the
"most confusing name in the history of python" contest...)

Hey, when did THAT get in there? ;-) That's nice. However, the try..except block is a useful
pattern for many similar situations that the OP might want to keep in mind. It is usually better
than the following, also:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]


the drawback here is that if the number of collisions is high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.
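
Rather than guess which drawback matters for a given data set, one can measure. Below is a rough timing sketch, not a definitive benchmark: the function names, data sizes and repeat count are arbitrary assumptions, "collision" is taken to mean a key that has already been seen, and the membership test is written as "k in d" because has_key() no longer exists in Python 3.

```python
import timeit


def with_try(pairs):
    d = {}
    for k, v in pairs:
        try:
            d[k].append(v)
        except KeyError:
            d[k] = [v]
    return d


def with_setdefault(pairs):
    d = {}
    for k, v in pairs:
        d.setdefault(k, []).append(v)
    return d


def with_membership(pairs):
    d = {}
    for k, v in pairs:
        if k in d:  # modern spelling of d.has_key(k)
            d[k].append(v)
        else:
            d[k] = [v]
    return d


def compare(pairs, repeat=5):
    """Return the best wall-clock time in seconds for each strategy."""
    timings = {}
    for fn in (with_try, with_setdefault, with_membership):
        runs = timeit.repeat(lambda fn=fn: fn(pairs), number=1, repeat=repeat)
        timings[fn.__name__] = min(runs)
    return timings


# Few collisions: every key is new, so with_try raises on every pair.
few_collisions = [(i, i) for i in range(20000)]
# Many collisions: only ten distinct keys, so nearly every lookup succeeds.
many_collisions = [(i % 10, i) for i in range(20000)]

for label, pairs in (("few", few_collisions), ("many", many_collisions)):
    print(label, compare(pairs))
```

The numbers depend on the interpreter version and the machine, so it is worth running this against a sample of the real files before choosing.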

</F>

Jul 18 '05 #6
Fredrik Lundh wrote:
...
if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]

the drawback here is that if the number of collisions is high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.


Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand the various
probabilities of your data, and code according to the likely scenario. 8-)
Now, perhaps we could write an adaptive data analyzer-code-generator... ;-)



--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <kd***@kdart.com>
public key: ID: F3D288E4
=============== =============== =============== =============== =============== =
Jul 18 '05 #7

[Keith]
Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilities of
your data, and code according to the likely scenario. 8-)


s/Python //g

--
Richie Hindle
ri****@entrian.com

Jul 18 '05 #9
Keith Dart wrote:
Sigh, this reminds me of a discussion I had at my work once... It seems
to write optimal Python code one must understand various probabilities of
your data, and code according to the likely scenario.


And this is different from optimizing in *any* other language
in what way?

-Peter
Jul 18 '05 #10
