473,378 Members | 1,411 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,378 software developers and data experts.

Re: converting a sed / grep / awk / . . . bash pipe line into python

hofer wrote:
Something I have to do very often is filtering / transforming line
based file contents and storing the result in an array or a
dictionary.

Very often the functionallity exists already in form of a shell script
with sed / awk / grep , . . .
and I would like to have the same implementation in my script

What's a compact, efficient (no intermediate arrays generated /
regexps compiled only once) way in python
for such kind of 'pipe line'

Example 1 (in bash): (annotated with comment (thus not working) if
copied / pasted
cat file \ ### read from file
| sed 's/\.\..*//' \ ### remove '//' comments
| sed 's/#.*//' \ ### remove '#' comments
| grep -v '^\s*$' \ ### get rid of empty lines
| awk '{ print $1 + $2 " " $2 }' \ ### knowing, that all remaining
lines contain always at least
\ ### two integers calculate
sum and 'keep' second number
| grep '^42 ' ### keep lines for which sum is 42
| awk '{ print $2 }' ### print number
thanks in advance for any suggestions of how to code this (keeping the
comments)
for line in open("file"): # read from file
try:
a, b = map(int, line.split(None, 2)[:2]) # remove extra columns,
# convert to integer
except ValueError:
pass # remove comments, get rid of empty lines,
# skip lines with less than two integers
else:
# line did start with two integers
if a + b == 42: # keep lines for which the sum is 42
print b # print number

The hard part was keeping the comments ;)

Without them it looks better:

import sys
for line in sys.stdin:
try:
a, b = map(int, line.split(None, 2)[:2])
except ValueError:
pass
else:
if a + b == 42:
print b

Peter
Sep 3 '08 #1
8 3982
In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.dewrote:
Without them it looks better:

import sys
for line in sys.stdin:
try:
a, b = map(int, line.split(None, 2)[:2])
except ValueError:
pass
else:
if a + b == 42:
print b
I'm philosophically opposed to one-liners like:
a, b = map(int, line.split(None, 2)[:2])
because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as well.
So, you immediately get down to:
a, b = map(int, line.split()[:2])
which isn't too bad. I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)
in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)
Sep 3 '08 #2
Roy Smith wrote:
In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.dewrote:
>Without them it looks better:

import sys
for line in sys.stdin:
try:
a, b = map(int, line.split(None, 2)[:2])
except ValueError:
pass
else:
if a + b == 42:
print b

I'm philosophically opposed to one-liners
I'm not, as long as you don't /force/ the code into one line.
like:
> a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get
done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as
well. So, you immediately get down to:
> a, b = map(int, line.split()[:2])
I agree that the above is an improvement.
which isn't too bad. I might take it one step further, however, and do:
> fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:
> a, b = line.split()[:2]
a = int(a)
b = int(b)
If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Peter
Sep 3 '08 #3
Roy Smith:
No reason to limit how many splits get done if you're
explicitly going to slice the first two.
You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have less parts at the
end.

Bye,
bearophile
Sep 3 '08 #4
In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.dewrote:
I might take it one step further, however, and do:
fields = line.split()[:2]
a, b = map(int, fields)
in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:
a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...
Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.
Sep 3 '08 #5
In article
<7f**********************************@34g2000hsh.g ooglegroups.com>,
be************@lycos.com wrote:
Roy Smith:
No reason to limit how many splits get done if you're
explicitly going to slice the first two.

You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have less parts at the
end.

Bye,
bearophile
Sounds like premature optimization to me. Make it work and be easy to
understand first. Then worry about how fast it is.

But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster :-)
Sep 3 '08 #6
Roy Smith wrote:
In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.dewrote:
I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.
As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter
Sep 3 '08 #7
In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.dewrote:
Roy Smith wrote:
In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.dewrote:
I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...
Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.

As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter
Well, what I really wanted was two conceptual steps, to make it easier for
a reader of the code to follow what it's doing. My standard for code being
adequately comprehensible is not that the reader *can* figure it out, but
that the reader doesn't have to exert any effort to figure it out. Or even
be aware that there's any figuring-out going on. He or she just reads it.
Sep 3 '08 #8
Roy Smith:
But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster :-)
Given the hypothetical .xsplit() string method I was talking about,
it's then easy to use islice() on it to skip the first items:

islice(sometext.xsplit(), 10, None)

Bye,
bearophile
Sep 3 '08 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: Bernhard Kuemel | last post by:
Hi! To relief the problems of accessing a unix machine from behind a restrictive firewall or from an internet cafe I started to make a PHP web interface to bash. I'd like to hear your opinions...
13
by: j. del | last post by:
I am just beginning to write programs... and my first task that I have set myself is to write a little program that will generate cryptic bywords from a source of text. A cryptic byword is...
5
by: Steven Woody | last post by:
i wrote the following bash script: ,---- | #!/bin/bash | | cat - | ./test-eof << EOF | hello | world | EOF |
6
by: Tim Chase | last post by:
While working on a Jumble-esque program, I was trying to get a string into a character array. Unfortunately, it seems to choke on the following import random s = "abcefg" random.shuffle(s) ...
4
by: 4zumanga | last post by:
I have a bunch of really horrible hacked-up bash scripts which I would really like to convert to python, so I can extend and neaten them. However, I'm having some trouble mapping some constructs...
3
by: Daniel Klein | last post by:
Here's a c routine that prints a single line : #include <stdio.h> main() { printf ("Hello World!\n"); } And now the Python program (called 'po.py') that uses 'popen2' :
13
by: Anton Slesarev | last post by:
I've read great paper about generators: http://www.dabeaz.com/generators/index.html Author say that it's easy to write analog of common linux tools such as awk,grep etc. He say that performance...
0
by: Marc 'BlackJack' Rintsch | last post by:
On Tue, 02 Sep 2008 10:36:50 -0700, hofer wrote: Comment does not match the code. Or vice versa. :-) Untested: from __future__ import with_statement from itertools import ifilter,...
0
by: Paul McGuire | last post by:
On Sep 2, 12:36 pm, hofer <bla...@dungeon.dewrote: All that sed'ing, grep'ing and awk'ing, you might want to take a look at pyparsing. Here is a pyparsing take on your posted problem: from...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.