Bytes | Software Development & Data Engineering Community
File to dict

Hello everyone,

I have written this small utility function for transforming a legacy
file to a Python dict:

def lookupdmo(domain):
    lines = open('/etc/virtual/domainowners','r').readlines()
    lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in lines ]
    lines = [ x for x in lines if len(x) == 2 ]
    d = dict()
    for line in lines:
        d[line[0]] = line[1]
    return d[domain]

The /etc/virtual/domainowners file contains colon-separated
entries:
domain1.tld: owner1
domain2.tld: own2
domain3.another: somebody
....

Now, the above lookupdmo function works. However, it's rather tedious
to transform files into dicts this way and I have quite a lot of such
files to transform (like custom 'passwd' files for virtual email
accounts etc).

Is there any more clever / more pythonic way of parsing files like
this? Say, I would like to transform a file containing entries like
the following into a list of lists, with the colon treated as the
separator, i.e. this:

tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin

would get transformed into this:

[ ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm', '/sbin/nologin'], [...], [...] ]

Dec 7 '07 #1
On Dec 7, 1:31 pm, mrk...@gmail.com wrote:
Hello everyone,
(snip)
For the first one, you are parsing the entire file every time you want
to look up just one domain. If it is something reused several times
during your code's execution you could think of storing it so it's
just a simple lookup away, for e.g.

_domain_dict = dict()
def generate_dict(input_file):
    finput = open(input_file, 'rb')
    global _domain_dict
    for each_line in enumerate(finput):
        line = each_line.strip().split(':')
        if len(line) == 2:
            _domain_dict[line[0]] = line[1]
    finput.close()

def domain_lookup(domain_name):
    global _domain_dict
    try:
        return _domain_dict[domain_name]
    except KeyError:
        return 'Unknown.Domain'
Your second parsing example would be a simple case of:

finput = open('input_file.ext', 'rb')
results_list = []
for each_line in enumerate(finput.readlines()):
    results_list.append( each_line.strip().split(':') )
finput.close()
Dec 7 '07 #2
mr****@gmail.com wrote:
def lookupdmo(domain):
    lines = open('/etc/virtual/domainowners','r').readlines()
    lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in lines ]
    lines = [ x for x in lines if len(x) == 2 ]
    d = dict()
    for line in lines:
        d[line[0]] = line[1]
    return d[domain]
Just some minor points without changing the basis of what you have done
here:

Don't bother with 'readlines', file objects are directly iterable.
Why are you calling both lstrip and rstrip? The strip method strips
whitespace from both ends for you.

It is usually a good idea with code like this to limit the split method to
a single split in case there is more than one colon on the line: i.e.
x.split(':',1)

When you have a sequence whose elements are sequences with two elements
(which is what you have here), you can construct a dict directly from the
sequence.

But why do you construct a dict from that input data simply to throw it
away? If you only want 1 domain from the file just pick it out of the list.
If you want to do multiple lookups build the dict once and keep it around.

So something like the following (untested code):

from __future__ import with_statement

def loaddomainowners():
    with open('/etc/virtual/domainowners','r') as infile:
        pairs = [ line.split(':',1) for line in infile if ':' in line ]
        pairs = [ (domain.strip(), owner.strip())
                  for (domain,owner) in pairs ]
        return dict(pairs)

DOMAINOWNERS = loaddomainowners()

def lookupdmo(domain):
    return DOMAINOWNERS[domain]
Dec 7 '07 #3
Chris wrote:
(snip)

_domain_dict = dict()
def generate_dict(input_file):
    finput = open(input_file, 'rb')
    global _domain_dict
    for each_line in enumerate(finput):
        line = each_line.strip().split(':')
        if len(line) == 2:
            _domain_dict[line[0]] = line[1]
    finput.close()

def domain_lookup(domain_name):
    global _domain_dict
    try:
        return _domain_dict[domain_name]
    except KeyError:
What about this?

_domain_dict = dict()
def generate_dict(input_file):
    global _domain_dict
    # If it's already been run, do nothing. You might want to change
    # this.
    if _domain_dict:
        return
    fh = open(input_file, 'rb')
    try:
        for line in fh:
            line = line.strip().split(':', 1)
            if len(line) == 2:
                _domain_dict[line[0]] = line[1]
    finally:
        fh.close()

def domain_lookup(domain_name):
    return _domain_dict.get(domain_name)

I changed generate_dict to do nothing if it's already been run. (You
might want it to run again with a fresh dict, or throw an error or
something.)

I removed enumerate() because it's unnecessary (and wrong -- you were
trying to split a tuple of (index, line)).

I also changed the split to only split once, like Duncan Booth suggested.

The try-finally is to ensure that the file is closed if an exception is
thrown for some reason.

domain_lookup doesn't need to declare _domain_dict as global because
it's not assigning to it. .get() returns None if the key doesn't exist,
so now the function returns None. You might want to use a different
value or throw an exception (use _domain_dict[domain_name] and not catch
the KeyError if it doesn't exist, perhaps).
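For reference, the difference in behaviour can be seen in a couple of lines (a plain illustrative dict, not the thread's data; written for a modern Python 3 interpreter, which postdates the thread):

```python
d = {'domain1.tld': 'owner1'}

# .get() returns None (or a supplied default) for a missing key
print(d.get('missing'))                    # None
print(d.get('missing', 'Unknown.Domain'))  # Unknown.Domain

# plain indexing raises KeyError instead
try:
    d['missing']
except KeyError:
    print('KeyError raised')
```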

Other than that, I just reformatted it and renamed variables, because I
do that. :-P
--
Dec 7 '07 #4
Duncan Booth wrote:
(snip)
So something like the following (untested code):

from __future__ import with_statement

def loaddomainowners():
    with open('/etc/virtual/domainowners','r') as infile:
        pairs = [ line.split(':',1) for line in infile if ':' in line ]
        pairs = [ (domain.strip(), owner.strip())
                  for (domain,owner) in pairs ]
        return dict(pairs)

DOMAINOWNERS = loaddomainowners()

def lookupdmo(domain):
    return DOMAINOWNERS[domain]
Using two list comprehensions means you construct two lists, which sucks
if it's a large file.

Also, you could pass the list comprehension (or better yet a generator
expression) directly to dict() without saving it to a variable:

with open('/etc/virtual/domainowners','r') as fh:
    return dict(line.strip().split(':', 1) for line in fh)

(Argh, that doesn't .strip() the key and value, which means it won't
work, but it's so simple and elegant and I'm tired enough that I'm not
going to add that. :-P Just use another genexp. Makes for a line
complicated enough that it could be turned into a for loop, though.)
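For completeness, the stripped variant alluded to above might look like this (untested against the real file; the sample lines here are made up, and it's written for a modern Python 3 interpreter):

```python
# Hypothetical sample lines standing in for the domainowners file.
lines = ["domain1.tld: owner1\n", "domain2.tld: own2\n", "malformed line\n"]

# Inner genexp splits each line once; the wrapping genexp strips
# whitespace from both key and value.
d = dict(
    (k.strip(), v.strip())
    for k, v in (line.split(':', 1) for line in lines if ':' in line)
)
print(d)  # {'domain1.tld': 'owner1', 'domain2.tld': 'own2'}
```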
--
Dec 7 '07 #5
Ta Matt, wasn't paying attention to what I typed. :)
And didn't know that about .get() and not having to declare the
global.
Thanks for my mandatory new thing for the day ;)
Dec 7 '07 #6
mr****@gmail.com wrote:
Hello everyone,
(snip)
Say, I would like to transform a file containing entries like
the following into a list of lists, with the colon treated as the
separator, i.e. this:

tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin

would get transformed into this:

[ ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm', '/sbin/nologin'], [...], [...] ]
The csv module is your friend.
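A minimal sketch of the csv approach on the passwd-style record from the original post, reading the sample line as if from a file (written for a modern Python 3 interpreter, which postdates the thread):

```python
import csv
import io

# The sample record from the original post, wrapped as a file-like object.
data = io.StringIO(
    "tm:$1$aaaa$bbbb:1010:6::/home/owner1/imap/domain1.tld/tm:/sbin/nologin\n"
)

# csv.reader splits on the delimiter, preserving the empty field from "::".
reader = csv.reader(data, delimiter=':')
rows = list(reader)
print(rows[0])
# ['tm', '$1$aaaa$bbbb', '1010', '6', '', '/home/owner1/imap/domain1.tld/tm', '/sbin/nologin']
```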

Dec 7 '07 #7
Chris wrote:
Ta Matt, wasn't paying attention to what I typed. :)
And didn't know that about .get() and not having to declare the
global.
Thanks for my mandatory new thing for the day ;)
:-)
--
Dec 7 '07 #8

Duncan Booth wrote:
Just some minor points without changing the basis of what you have done
here:
All good points, thanks. Phew, there's nothing like peer review for
your code...
But why do you construct a dict from that input data simply to throw it
away?
Because comparing strings for equality in a loop is writing C in
Python, and that's
exactly what I'm trying to unlearn.

The proper way to do it is to produce a dictionary and look up a value
using a key.
> If you only want 1 domain from the file just pick it out of the list.

for item in list:
    if item == 'searched.domain':
        return item...

Yuck.

with open('/etc/virtual/domainowners','r') as infile:
    pairs = [ line.split(':',1) for line in infile if ':' in line ]
Didn't think about doing it this way. Good point. Thx

Dec 7 '07 #9

The csv module is your friend.
(slapping forehead) why the Holy Grail didn't I think about this? That
should be much simpler than using SimpleParse or SPARK.

Thx Bruno & everyone.
Dec 7 '07 #10
On Fri, 07 Dec 2007 04:44:25 -0800, mrkafk wrote:
Duncan Booth wrote:
>But why do you construct a dict from that input data simply to throw it
away?

Because comparing strings for equality in a loop is writing C in
Python, and that's exactly what I'm trying to unlearn.

The proper way to do it is to produce a dictionary and look up a value
using a key.
>>If you only want 1 domain from the file just pick it out of the list.

for item in list:
    if item == 'searched.domain':
        return item...

Yuck.
I guess Duncan's point wasn't the construction of the dictionary but the
throw it away part. If you don't keep it, the loop above is even more
efficient than building a dictionary with *all* lines of the file, just to
pick one value afterwards.

Ciao,
Marc 'BlackJack' Rintsch
Dec 7 '07 #11
mr****@gmail.com wrote:
>
> The csv module is your friend.

(slapping forehead) why the Holy Grail didn't I think about this?
If that can make you feel better, a few years ago, I spent two days
writing my own (SquaredWheel(tm) of course) csv reader/writer... before
realizing there was such a thing as the csv module :-/

Should have known better...
Dec 7 '07 #12
I guess Duncan's point wasn't the construction of the dictionary but the
throw it away part. If you don't keep it, the loop above is even more
efficient than building a dictionary with *all* lines of the file, just to
pick one value afterwards.
Sure, but I have two options here, neither of them nice: either "write C
in Python" or do it in an inefficient and still elaborate way.

Anyway, I found my nirvana at last:
>>> def shelper(line):
...     return x.replace(' ','').strip('\n').split(':',1)
...
>>> ownerslist = [ shelper(x)[1] for x in it if len(shelper(x)) == 2 and shelper(x)[0] == domain ]
>>> ownerslist
['da2']
Python rulez. :-)


Dec 7 '07 #13
>>> def shelper(line):
...     return x.replace(' ','').strip('\n').split(':',1)
Argh, typo, should be def shelper(x) of course.
Dec 7 '07 #14
On 2007-12-07, Duncan Booth <du**********@invalid.invalid> wrote:
from __future__ import with_statement

def loaddomainowners(domain):
    with open('/etc/virtual/domainowners','r') as infile:
I've been thinking I have to use contextlib.closing for
auto-closing files. Is that not so?

--
Neil Cerutti
Dec 7 '07 #15
On 2007-12-07, Bruno Desthuilliers <br********************@wtf.websiteburo.oops.com> wrote:
> mr****@gmail.com wrote:
>>
>>The csv module is your friend.

(slapping forehead) why the Holy Grail didn't I think about this?

If that can make you feel better, a few years ago, I spent two
days writing my own (SquaredWheel(tm) of course) csv
reader/writer... before realizing there was such a thing as the
csv module :-/

Should have known better...
But probably it has made you a better person. ;)

--
Neil Cerutti
Dec 7 '07 #16
Neil Cerutti <ho*****@yahoo.com> wrote:
> On 2007-12-07, Duncan Booth <du**********@invalid.invalid> wrote:
> from __future__ import with_statement
>
> def loaddomainowners(domain):
>     with open('/etc/virtual/domainowners','r') as infile:

I've been thinking I have to use contextlib.closing for
auto-closing files. Is that not so?
That is not so.

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import with_statement
>>> with open('diffs.txt') as f:
...     print len(list(f))
...
40
>>> f
<closed file 'diffs.txt', mode 'r' at 0x00AA0698>
>>>
Dec 7 '07 #17
mr****@gmail.com wrote:
Hello everyone,

I have written this small utility function for transforming legacy
file to Python dict:
def lookupdmo(domain):
    lines = open('/etc/virtual/domainowners','r').readlines()
    lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in lines ]
    lines = [ x for x in lines if len(x) == 2 ]
    d = dict()
    for line in lines:
        d[line[0]] = line[1]
    return d[domain]
cache = None

def lookup( domain ):
    if not cache:
        cache = dict( [map( lambda x: x.strip(), x.split(':')) for x in
                       open('/etc/virtual/domainowners','r').readlines()])
    return cache.get(domain)

Glauco
Dec 7 '07 #18
On 2007-12-07, Duncan Booth <du**********@invalid.invalid> wrote:
> Neil Cerutti <ho*****@yahoo.com> wrote:
>> I've been thinking I have to use contextlib.closing for
>> auto-closing files. Is that not so?
>
> That is not so.
(snip)
Thanks. After seeing your answer I managed to find what I'd
overlooked before, in the docs for file.close:

As of Python 2.5, you can avoid having to call this method
explicitly if you use the with statement. For example, the
following code will automatically close f when the with block
is exited:

from __future__ import with_statement

with open("hello.txt") as f:
    for line in f:
        print line

--
Neil Cerutti
Dec 7 '07 #19
On Fri, 07 Dec 2007 16:46:56 +0100, Glauco wrote:
mr****@gmail.com wrote:
> Hello everyone,
>
> I have written this small utility function for transforming legacy file
> to Python dict:
>
> def lookupdmo(domain):
>     lines = open('/etc/virtual/domainowners','r').readlines()
>     lines = [ [y.lstrip().rstrip() for y in x.split(':')] for x in lines ]
>     lines = [ x for x in lines if len(x) == 2 ]
>     d = dict()
>     for line in lines:
>         d[line[0]] = line[1]
>     return d[domain]

cache = None

def lookup( domain ):
    if not cache:
        cache = dict( [map( lambda x: x.strip(), x.split(':')) for x in
                       open('/etc/virtual/domainowners','r').readlines()])
    return cache.get(domain)
>>> lookup(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in lookup
UnboundLocalError: local variable 'cache' referenced before assignment

You miss the:

def lookup(domain):
    global cache
    ...

bye
Dec 7 '07 #20
david wrote:
> On Fri, 07 Dec 2007 16:46:56 +0100, Glauco wrote:
(snip)
> You miss the:
>
> def lookup(domain):
>     global cache
>     ...
>
> bye
yezzz!

you can use global or static
Gla
Dec 7 '07 #21


Glauco wrote:
> cache = None
>
> def lookup( domain ):
>     if not cache:
>         cache = dict( [map( lambda x: x.strip(), x.split(':')) for x in
>                        open('/etc/virtual/domainowners','r').readlines()])
>     return cache.get(domain)
Neat solution! It just needs a small correction for empty or badly
formed lines:

dict([map( lambda x: x.strip(), x.split(':')) for x in open('/etc/virtual/domainowners','r') if ':' in x])

Dec 7 '07 #22
Chris <cw****@gmail.com> wrote:
For the first one you are parsing the entire file everytime you want
to lookup just one domain...
Is the file sorted? If so wouldn't it be easier either to read the whole
thing and then binary-chop search it, or if the file is vast to use seek
creatively to binary chop it but only read particular records from disk?

(One could chop by using seek to locate to a particular byte location in the
file then read to end of that record then read the next complete record.)

If the file isn't sorted, then .... why not? Also if the file is vast
surely there's some or a lot of point in breaking it up into a group of
smaller files?
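A rough sketch of the binary-chop idea, using bisect on an in-memory sorted list rather than seek-based chopping of the file (the sample lines are hypothetical, and the snippet targets a modern Python 3 interpreter):

```python
import bisect

# Hypothetical sorted "domain: owner" lines, as if read from a sorted file.
lines = sorted([
    "domain1.tld: owner1",
    "domain2.tld: own2",
    "domain3.another: somebody",
])
keys = [line.split(':', 1)[0] for line in lines]

def lookup(domain):
    # Binary search for the domain among the sorted keys.
    i = bisect.bisect_left(keys, domain)
    if i < len(keys) and keys[i] == domain:
        return lines[i].split(':', 1)[1].strip()
    return None

print(lookup('domain2.tld'))  # own2
```

The seek-based variant for very large files follows the same logic, but bisects on byte offsets and re-aligns to the next newline after each seek.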

--
Jeremy C B Nicoll - my opinions are my own.
Dec 9 '07 #23
