Hello,
I need to gather information that is contained in various files.
Like so:
file1:
=============== ======
foo : 1 2
bar : 2 4
baz : 3
=============== ======
file2:
=============== ======
foo : 5
bar : 6
baz : 7
=============== ======
file3:
=============== ======
foo : 4 18
bar : 8
=============== ======
The straightforward way to solve this problem is to create a
dictionary. Like so:
[...]
    a, b = get_information(line)
    if a in dict.keys():
        dict[a].append(b)
    else:
        dict[a] = [b]
Yet I have 43 such files. Together they are 4.1M
in size. In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.
How could the problem be solved more efficiently?
Klaus
Klaus Neuner wrote:
I need to gather information that is contained in various files.
[...]
The straightforward way to solve this problem is to create a dictionary. Like so:
[...]
    a, b = get_information(line)
    if a in dict.keys():
        dict[a].append(b)
    else:
        dict[a] = [b]
Aye...
the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:
    try:
        dict[a].append(b)
    except KeyError:
        dict[a] = [b]
since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and in the
exceptional case where it doesn't yet exist you add it.
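A minimal sketch of the two lookups being contrasted here, written so that it runs on current Python 3. Note that in Python 3 `keys()` returns a view rather than a list, so the linear-search cost described above applies to the Python 2 of this thread. The function names and the sample `pairs` are made up for illustration; `b` is assumed to be the raw text after the colon.

```python
def group_try(pairs):
    """Group values by key using the optimistic try/except pattern."""
    dct = {}
    for a, b in pairs:
        try:
            dct[a].append(b)      # common case: key already present
        except KeyError:
            dct[a] = [b]          # first time this key is seen
    return dct


def group_in(pairs):
    """Same result, testing membership on the dict itself (a hash lookup)."""
    dct = {}
    for a, b in pairs:
        if a in dct:              # no temporary list, no linear scan
            dct[a].append(b)
        else:
            dct[a] = [b]
    return dct


# Hypothetical sample modelled on the files shown in the question.
pairs = [("foo", "1 2"), ("bar", "2 4"), ("baz", "3"), ("foo", "5")]
print(group_try(pairs))
print(group_in(pairs))
```

Both functions build the same dictionary; they differ only in how the "key not there yet" case is detected.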
--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <kd***@kdart.co m>
public key: ID: F3D288E4
=============== =============== =============== =============== =============== =
Keith Dart wrote:
    try:
        dict[a].append(b)
    except KeyError:
        dict[a] = [b]
or my favorite Python shortcut:
    dict.setdefault(a, []).append(b)
Kent
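For readers who have not met it, a small sketch of what `setdefault` does: it returns the value stored under the key, and if the key is missing it first stores the given default and returns that. The key `"foo"` and the values are only sample data.

```python
dct = {}

# Key is missing: the empty list is stored under "foo" and returned.
first = dct.setdefault("foo", [])
first.append("1 2")

# Key exists: the stored list is returned, the new [] is discarded.
second = dct.setdefault("foo", [])
second.append("5")

print(dct)
print(first is second)
```

Because the same list object comes back on the second call, appending to the return value updates the dictionary entry in place.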
Keith Dart wrote:
Aye...
the dict.keys() line creates a temporary list, and then the 'in' does a linear search of the list. Better would be:
    try:
        dict[a].append(b)
    except KeyError:
        dict[a] = [b]
since you expect the key to be there most of the time, this method is most efficient. You optimistically get the dictionary entry, and in the exceptional case where it doesn't yet exist you add it.
I wonder if
    dct.setdefault(a, []).append(b)
wouldn't be even faster. It saves setting up the try/except frame handling in
python (I assume the C implementation of dicts achieves similar results with
much less overhead).
Cheers,
f
ps. I changed dict->dct because it's a generally Bad Idea (TM) to name local
variables as builtin types. This, for the benefit of the OP (I know you were
just following his code conventions).
Kent Johnson wrote:
Keith Dart wrote:
    try:
        dict[a].append(b)
    except KeyError:
        dict[a] = [b]
or my favorite Python shortcut:
    dict.setdefault(a, []).append(b)
Kent
Hey, when did THAT get in there? ;-) That's nice. However, the
try..except block is a useful pattern for many similar situations that
the OP might want to keep in mind. It is usually better than the
following, also:
    if dct.has_key(a):
        dct[a].append(b)
    else:
        dct[a] = [b]
Which is a pattern I have seen often.
Keith Dart
Keith Dart wrote:
    try:
        dict[a].append(b)
    except KeyError:
        dict[a] = [b]
the drawback here is that exceptions are relatively expensive; if the
number of collisions is small, you end up throwing and catching lots
of exceptions. in that case, there are better ways to do this.
    dict.setdefault(a, []).append(b)
the drawback here is that you create a new object for each call, but
if the number of collisions is high, you end up throwing most of them
away. in that case, there are better ways to do this.
(gotta love that method name, btw. a serious candidate for the "most
confusing name in the standard library" contest... or maybe even the
"most confusing name in the history of python" contest...)
Keith Dart wrote:
Hey, when did THAT get in there? ;-) That's nice. However, the try..except block is a useful pattern for many similar situations that the OP might want to keep in mind. It is usually better than the following, also:
    if dct.has_key(a):
        dct[a].append(b)
    else:
        dct[a] = [b]
the drawback here is that if the number of collisions is high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.
</F>
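The post does not say which "better ways" are meant. One option that postdates this thread is `collections.defaultdict` (added in Python 2.5), which builds the empty list only when a key is actually missing. The sketch below is offered as an illustration under that assumption, not as what the poster had in mind; `group_pairs` and the sample data are hypothetical.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group values by key; a new list is created only for unseen keys."""
    dct = defaultdict(list)
    for a, b in pairs:
        dct[a].append(b)          # missing key -> list() is called once
    return dict(dct)              # plain dict, so later lookups raise KeyError


# Hypothetical sample modelled on the files shown in the question.
pairs = [("foo", "1 2"), ("bar", "2 4"), ("foo", "5"), ("bar", "6")]
print(group_pairs(pairs))
```

Converting back to a plain `dict` at the end is a design choice: it prevents a later read of a missing key from silently creating an empty entry.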
Fredrik Lundh wrote:
...
    if dct.has_key(a):
        dct[a].append(b)
    else:
        dct[a] = [b]
the drawback here is that if the number of collisions is high, you end up doing lots of extra dictionary lookups. in that case, there are better ways to do this.
Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand various probabilities of
your data, and code according to the likely scenario. 8-) Now, perhaps
we could write an adaptive data analyzer-code-generator... ;-)
Keith Dart
[Keith] Sigh, this reminds me of a discussion I had at my work once... It seems that to write optimal Python code one must understand various probabilities of your data, and code according to the likely scenario. 8-)
s/Python //g
--
Richie Hindle ri****@entrian. com
Keith Dart wrote:
Sigh, this reminds me of a discussion I had at my work once... It seems that to write optimal Python code one must understand various probabilities of your data, and code according to the likely scenario.
And this is different from optimizing in *any* other language
in what way?
-Peter
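Since the thread concludes that the best variant depends on how often keys repeat, one way to settle it for a given data set is simply to measure. The following is a rough sketch using the standard `timeit` module; the two synthetic data sets and all function names are assumptions chosen for illustration, and the timings will differ between machines and Python versions.

```python
import timeit


def with_try(pairs):
    dct = {}
    for a, b in pairs:
        try:
            dct[a].append(b)
        except KeyError:
            dct[a] = [b]
    return dct


def with_setdefault(pairs):
    dct = {}
    for a, b in pairs:
        dct.setdefault(a, []).append(b)
    return dct


def with_in(pairs):
    dct = {}
    for a, b in pairs:
        if a in dct:
            dct[a].append(b)
        else:
            dct[a] = [b]
    return dct


def compare(pairs, repeat=3):
    """Return the best wall-clock time in seconds for each strategy."""
    results = {}
    for func in (with_try, with_setdefault, with_in):
        times = timeit.repeat(lambda: func(pairs), number=1, repeat=repeat)
        results[func.__name__] = min(times)
    return results


# Two synthetic extremes: keys that repeat a lot, and keys that never repeat.
many_repeats = [(i % 10, i) for i in range(20000)]
all_unique = [(i, i) for i in range(20000)]

for label, data in (("many repeats", many_repeats), ("all unique", all_unique)):
    print(label, compare(data))
```

Running this on a sample of the real files, rather than on synthetic data, would show which pattern suits the actual key distribution.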