Re: converting a sed / grep / awk / . . . bash pipe line into python

hofer wrote:
> Something I have to do very often is filtering / transforming
> line-based file contents and storing the result in an array or a
> dictionary.
>
> Very often the functionality already exists in the form of a shell
> script with sed / awk / grep, ...
> and I would like to have the same implementation in my script.
>
> What's a compact, efficient (no intermediate arrays generated /
> regexps compiled only once) way in Python
> for this kind of 'pipeline'?
>
> Example 1 (in bash), annotated with comments (and thus not working if
> copied / pasted):
>
> cat file \                            ### read from file
> | sed 's/\/\/.*//' \                  ### remove '//' comments
> | sed 's/#.*//' \                     ### remove '#' comments
> | grep -v '^\s*$' \                   ### get rid of empty lines
> | awk '{ print $1 + $2 " " $2 }' \    ### knowing that all remaining
>                                       ### lines always contain at least
>                                       ### two integers, calculate
>                                       ### sum and 'keep' second number
> | grep '^42 ' \                       ### keep lines for which sum is 42
> | awk '{ print $2 }'                  ### print number
>
> Thanks in advance for any suggestions of how to code this (keeping the
> comments).

for line in open("file"):  # read from file
    try:
        a, b = map(int, line.split(None, 2)[:2])  # remove extra columns,
                                                  # convert to integer
    except ValueError:
        pass  # remove comments, get rid of empty lines,
              # skip lines with less than two integers
    else:
        # line did start with two integers
        if a + b == 42:  # keep lines for which the sum is 42
            print b  # print number

The hard part was keeping the comments ;)

Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print b

Peter
Sep 3 '08 #1
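
For readers on Python 3, the shell pipeline can also be mirrored stage by stage with generators, which keeps the comments, compiles each regular expression once, and builds no intermediate lists. This is only a sketch, not code from the thread: the names `strip_comments`, `non_blank`, `second_if_sum_is_42` and `pipeline` are made up for illustration, and it assumes the first sed stage was meant to remove `//` comments and that every remaining line starts with two integers.

```python
import re

# Compiled once at import time, not once per line.
SLASH_COMMENT = re.compile(r"//.*")
HASH_COMMENT = re.compile(r"#.*")


def strip_comments(lines):
    """Remove '//' and '#' comments (the two sed stages)."""
    for line in lines:
        line = SLASH_COMMENT.sub("", line)
        yield HASH_COMMENT.sub("", line)


def non_blank(lines):
    """Drop empty and whitespace-only lines (the grep -v stage)."""
    return (line for line in lines if line.strip())


def second_if_sum_is_42(lines):
    """Yield the second integer of each line whose first two sum to 42."""
    for line in lines:
        a, b = map(int, line.split()[:2])
        if a + b == 42:
            yield b


def pipeline(lines):
    """Chain the stages lazily; nothing is materialised in between."""
    return second_if_sum_is_42(non_blank(strip_comments(lines)))


sample = [
    "# header comment\n",
    "40 2 trailing // note\n",
    "\n",
    "1 2\n",
    "20 22 # answer\n",
]
result = list(pipeline(sample))
print(result)  # [2, 22]
```

Because every stage accepts any iterable of lines, `pipeline(open("file"))` or `pipeline(sys.stdin)` would work the same way as the list used here.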
In article <g9************ *@news.t-online.com>,
Peter Otten <__*******@web.de> wrote:
> Without them it looks better:
>
> import sys
> for line in sys.stdin:
>     try:
>         a, b = map(int, line.split(None, 2)[:2])
>     except ValueError:
>         pass
>     else:
>         if a + b == 42:
>             print b

I'm philosophically opposed to one-liners like:

    a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as well.
So, you immediately get down to:

    a, b = map(int, line.split()[:2])

which isn't too bad. I might take it one step further, however, and do:

    fields = line.split()[:2]
    a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

    a, b = line.split()[:2]
    a = int(a)
    b = int(b)
Sep 3 '08 #2
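
A quick way to check the claim that the split limit is redundant once the result is sliced: the sketch below (Python 3, with a made-up sample line) compares the two spellings directly.

```python
line = "40 2 some trailing text\n"

limited = line.split(None, 2)[:2]  # at most two splits, then slice
simple = line.split()[:2]          # split everything, then slice

print(limited)  # ['40', '2']
print(simple)   # ['40', '2']

a, b = map(int, simple)
```

Both forms hand the same two strings to `int`; they differ only in how much of the line gets split before the slice throws the rest away.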
Roy Smith wrote:
> In article <g9************ *@news.t-online.com>,
> Peter Otten <__*******@web.de> wrote:
>> Without them it looks better:
>>
>> import sys
>> for line in sys.stdin:
>>     try:
>>         a, b = map(int, line.split(None, 2)[:2])
>>     except ValueError:
>>         pass
>>     else:
>>         if a + b == 42:
>>             print b
>
> I'm philosophically opposed to one-liners

I'm not, as long as you don't /force/ the code into one line.

> like:
>
>     a, b = map(int, line.split(None, 2)[:2])
>
> because they're difficult to understand at a glance. You need to visually
> parse it and work your way out from the inside to figure out what's going
> on. Better to keep it longer and simpler.
>
> Now that I've got my head around it, I realized there's no reason to make
> the split part so complicated. No reason to limit how many splits get
> done if you're explicitly going to slice the first two. And since you
> don't need to supply the second argument, the first one can be defaulted
> as well. So, you immediately get down to:
>
>     a, b = map(int, line.split()[:2])

I agree that the above is an improvement.

> which isn't too bad. I might take it one step further, however, and do:
>
>     fields = line.split()[:2]
>     a, b = map(int, fields)
>
> in fact, I might even get rid of the very generic, but conceptually
> overkill, use of map() and just write:
>
>     a, b = line.split()[:2]
>     a = int(a)
>     b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Peter
Sep 3 '08 #3
Roy Smith:
> No reason to limit how many splits get done if you're
> explicitly going to slice the first two.

You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have less parts at the
end.

Bye,
bearophile
Sep 3 '08 #4
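
To make the "fewer parts at the end" point concrete, here is a small Python 3 sketch with an artificially long line (the line itself is invented for the example). It only counts the resulting strings; whether the difference matters for speed would have to be measured, for instance with `timeit`.

```python
# Two leading integers followed by 1000 more fields.
line = "1 41 " + " ".join(str(n) for n in range(1000))

all_parts = line.split()      # every field becomes its own string
capped = line.split(None, 2)  # two fields plus the unsplit remainder

print(len(all_parts), len(capped))  # 1002 3
```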
In article <g9************ *@news.t-online.com>,
Peter Otten <__*******@web.de> wrote:
>> I might take it one step further, however, and do:
>>
>>     fields = line.split()[:2]
>>     a, b = map(int, fields)
>>
>> in fact, I might even get rid of the very generic, but conceptually
>> overkill, use of map() and just write:
>>
>>     a, b = line.split()[:2]
>>     a = int(a)
>>     b = int(b)
>
> If you go that route your next step is to introduce another try...except,
> one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.
Sep 3 '08 #5
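
The point that one handler covers both failures can be seen in a short Python 3 sketch: a failed unpacking and a failed `int()` conversion both raise `ValueError`. The function name `parse` and the sample lines are made up for illustration.

```python
def parse(line):
    """Return (a, b), or the message of the ValueError that was raised."""
    try:
        a, b = line.split()[:2]  # unpacking fails with fewer than two fields
        a = int(a)               # conversion fails on non-numeric text
        b = int(b)
    except ValueError as exc:
        return str(exc)
    return a, b


print(parse("40 2 extra"))  # (40, 2)
print(parse("40"))          # message from the failed unpacking
print(parse("forty 2"))     # message from the failed int() call
```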
In article
<7f************ *************** *******@34g2000 hsh.googlegroup s.com>,
be************@lycos.com wrote:
> Roy Smith:
>> No reason to limit how many splits get done if you're
>> explicitly going to slice the first two.
>
> You are probably right for this problem, because most lines are 2
> items long, but in scripts that have to process lines potentially
> composed of many parts, setting a max number of parts speeds up your
> script and reduces memory used, because you have less parts at the
> end.
>
> Bye,
> bearophile

Sounds like premature optimization to me. Make it work and be easy to
understand first. Then worry about how fast it is.

But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster :-)
Sep 3 '08 #6
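
Until split() grows such an option, the limit and the slice can be wrapped together in a small helper. This is a hypothetical function, not part of the standard library; the name `first_fields` is invented here.

```python
def first_fields(line, n):
    """Return at most the first n whitespace-separated fields of line."""
    # Limit the splitting to n cuts, then drop the unsplit remainder.
    return line.split(None, n)[:n]


print(first_fields("40 2 and a long tail", 2))  # ['40', '2']
print(first_fields("40", 2))                    # ['40']
```

The caller still has to cope with lines that have fewer than n fields, as the second call shows.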
Roy Smith wrote:
> In article <g9************ *@news.t-online.com>,
> Peter Otten <__*******@web.de> wrote:
>>> I might take it one step further, however, and do:
>>>
>>>     fields = line.split()[:2]
>>>     a, b = map(int, fields)
>>>
>>> in fact, I might even get rid of the very generic, but conceptually
>>> overkill, use of map() and just write:
>>>
>>>     a, b = line.split()[:2]
>>>     a = int(a)
>>>     b = int(b)
>>
>> If you go that route your next step is to introduce another try...except,
>> one for the unpacking and another for the integer conversion...
>
> Why another try/except? The potential unpack and conversion errors exist
> in both versions, and the existing try block catches them all. Splitting
> the one line up into three with some intermediate variables doesn't change
> that.

As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter
Sep 3 '08 #7
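
What separate error handling for the two steps might look like, as a Python 3 sketch: each step gets its own try block, so the caller can tell which one failed. The function name `classify` and its string return values are made up for the example.

```python
def classify(line):
    """Parse two leading integers, reporting which step failed."""
    # Step 1: unpacking. Fails when the line has fewer than two fields.
    try:
        a, b = line.split()[:2]
    except ValueError:
        return "too few fields"
    # Step 2: conversion. Fails when a field is not an integer literal.
    try:
        return int(a), int(b)
    except ValueError:
        return "not an integer"


for line in ["40 2 extra", "40", "forty 2"]:
    print(repr(line), "->", classify(line))
```

The price is roughly twice the code of the single try block, which is the trade-off the two posters disagree about.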
In article <g9************ *@news.t-online.com>,
Peter Otten <__*******@web.de> wrote:
> Roy Smith wrote:
>> In article <g9************ *@news.t-online.com>,
>> Peter Otten <__*******@web.de> wrote:
>>>> I might take it one step further, however, and do:
>>>>
>>>>     fields = line.split()[:2]
>>>>     a, b = map(int, fields)
>>>>
>>>> in fact, I might even get rid of the very generic, but conceptually
>>>> overkill, use of map() and just write:
>>>>
>>>>     a, b = line.split()[:2]
>>>>     a = int(a)
>>>>     b = int(b)
>>>
>>> If you go that route your next step is to introduce another
>>> try...except, one for the unpacking and another for the integer
>>> conversion...
>>
>> Why another try/except? The potential unpack and conversion errors exist
>> in both versions, and the existing try block catches them all. Splitting
>> the one line up into three with some intermediate variables doesn't
>> change that.
>
> As I understood it you didn't just split a line of code into three, but
> wanted two processing steps. These logical steps are then somewhat remixed
> by the shared error handling. You lose the information which step failed.
> In the general case you may even mask a bug.
>
> Peter

Well, what I really wanted was two conceptual steps, to make it easier for
a reader of the code to follow what it's doing. My standard for code being
adequately comprehensible is not that the reader *can* figure it out, but
that the reader doesn't have to exert any effort to figure it out. Or even
be aware that there's any figuring-out going on. He or she just reads it.
Sep 3 '08 #8
Roy Smith:
> But, along those lines, I've often thought that split() needed a way to not
> just limit the number of splits, but to also throw away the extra stuff.
> Getting the first N fields of a string is something I've done often enough
> that refactoring the slicing operation right into the split() code seems
> worthwhile. And, it would be even faster :-)

Given the hypothetical .xsplit() string method I was talking about,
it's then easy to use islice() on it to skip the first items:

    islice(sometext.xsplit(), 10, None)

Bye,
bearophile
Sep 3 '08 #9
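
Strings have no `.xsplit()` method, so the idea can only be approximated. The Python 3 sketch below uses a plain generator function named `xsplit` (a hypothetical stand-in, written as a function rather than a method) built on `re.finditer`, and assumes that "fields" means runs of non-whitespace characters.

```python
import re
from itertools import islice

_FIELD = re.compile(r"\S+")  # compiled once


def xsplit(text):
    """Lazily yield the whitespace-separated fields of text, one at a time."""
    for match in _FIELD.finditer(text):
        yield match.group()


sometext = " ".join(str(n) for n in range(15))

first_two = list(islice(xsplit(sometext), 2))
after_ten = list(islice(xsplit(sometext), 10, None))

print(first_two)  # ['0', '1']
print(after_ten)  # ['10', '11', '12', '13', '14']
```

With `islice(xsplit(text), n)` the scan stops after n fields, so the rest of the line is never split, which is close to what the earlier wish for a "split and discard" option was after.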
