Re: converting a sed / grep / awk / . . . bash pipe line into python

hofer wrote:

> Something I have to do very often is filtering / transforming
> line-based file contents and storing the result in an array or a
> dictionary.
>
> Very often the functionality already exists in the form of a shell
> script with sed / awk / grep , . . .
> and I would like to have the same implementation in my script.
>
> What's a compact, efficient (no intermediate arrays generated /
> regexps compiled only once) way in Python
> for this kind of 'pipe line'?
>
> Example 1 (in bash), annotated with comments (and thus not working if
> copied / pasted):
>
> cat file \                          ### read from file
> | sed 's/\.\..*//' \                ### remove '//' comments
> | sed 's/#.*//' \                   ### remove '#' comments
> | grep -v '^\s*$' \                 ### get rid of empty lines
> | awk '{ print $1 + $2 " " $2 }' \  ### knowing that all remaining lines
>                                     ### always contain at least two integers,
>                                     ### calculate sum and 'keep' second number
> | grep '^42 '                       ### keep lines for which sum is 42
> | awk '{ print $2 }'                ### print number
>
> Thanks in advance for any suggestions of how to code this (keeping the
> comments).

for line in open("file"):                         # read from file
    try:
        a, b = map(int, line.split(None, 2)[:2])  # remove extra columns,
                                                  # convert to integer
    except ValueError:
        pass    # remove comments, get rid of empty lines,
                # skip lines with less than two integers
    else:
        # line did start with two integers
        if a + b == 42:  # keep lines for which the sum is 42
            print(b)     # print number

The hard part was keeping the comments ;)

Without them it looks better:

import sys
for line in sys.stdin:
    try:
        a, b = map(int, line.split(None, 2)[:2])
    except ValueError:
        pass
    else:
        if a + b == 42:
            print(b)

Peter

Sep 3 '08 #1
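
For readers who want to keep the one-stage-per-line shape of the shell pipeline, a chain of generators is one option. The following is only a sketch for Python 3, not code from the thread: the function names are made up for illustration, the regular expressions are compiled once at import time, and the '//' pattern follows the comment in the original post rather than its sed expression. Unlike the try/except version above, it assumes (as the question does) that every remaining line starts with two integers.

```python
import re

# Compiled once, as the original question asks.
SLASH_COMMENT = re.compile(r"//.*")   # assumed meaning of the first sed stage
HASH_COMMENT = re.compile(r"#.*")


def strip_comments(lines):
    """sed stages: drop '//' and '#' comments."""
    for line in lines:
        yield HASH_COMMENT.sub("", SLASH_COMMENT.sub("", line))


def non_blank(lines):
    """grep -v stage: get rid of empty (whitespace-only) lines."""
    return (line for line in lines if line.strip())


def sum_and_second(lines):
    """First awk stage: yield (first + second, second) for each line."""
    for line in lines:
        first, second = line.split()[:2]
        yield int(first) + int(second), int(second)


def pipeline(lines, wanted=42):
    """Chain the stages lazily; no intermediate lists are built."""
    pairs = sum_and_second(non_blank(strip_comments(lines)))
    return (second for total, second in pairs if total == wanted)


sample = [
    "40 2 trailing column\n",
    "# only a comment\n",
    "\n",
    "1 2 // sum is not 42\n",
    "20 22 # keep\n",
]
result = list(pipeline(sample))
print(result)
```

Any iterable of lines works as input, so `pipeline(open("file"))` or `pipeline(sys.stdin)` would process a file one line at a time.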

Roy Smith

In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.de> wrote:

> Without them it looks better:
>
> import sys
> for line in sys.stdin:
>     try:
>         a, b = map(int, line.split(None, 2)[:2])
>     except ValueError:
>         pass
>     else:
>         if a + b == 42:
>             print(b)

I'm philosophically opposed to one-liners like:

a, b = map(int, line.split(None, 2)[:2])

because they're difficult to understand at a glance. You need to visually
parse it and work your way out from the inside to figure out what's going
on. Better to keep it longer and simpler.

Now that I've got my head around it, I realized there's no reason to make
the split part so complicated. No reason to limit how many splits get done
if you're explicitly going to slice the first two. And since you don't
need to supply the second argument, the first one can be defaulted as well.
So, you immediately get down to:

a, b = map(int, line.split()[:2])

which isn't too bad. I might take it one step further, however, and do:

fields = line.split()[:2]
a, b = map(int, fields)

in fact, I might even get rid of the very generic, but conceptually
overkill, use of map() and just write:

a, b = line.split()[:2]
a = int(a)
b = int(b)

Sep 3 '08 #2
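
The claim that the maxsplit argument is redundant once the result is sliced can be checked directly. A small sketch, with sample lines invented for illustration:

```python
# Sample lines are made up; they cover extra columns, odd spacing,
# a single field, and an empty line.
lines = ["40 2", "40 2 extra", "  40   2   extra   columns  ", "40", ""]

results = []
for line in lines:
    limited = line.split(None, 2)[:2]   # at most two splits, then slice
    plain = line.split()[:2]            # split everything, then slice
    results.append((limited, plain))
    print(repr(line), "->", limited, plain)

all_equal = all(limited == plain for limited, plain in results)
print(all_equal)
```

Both expressions return the same first two fields; they differ only in how much work is done on the rest of the line.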

Peter Otten

Roy Smith wrote:

> In article <g9*************@news.t-online.com>,
> Peter Otten <__*******@web.de> wrote:
>
>> Without them it looks better:
>>
>> import sys
>> for line in sys.stdin:
>>     try:
>>         a, b = map(int, line.split(None, 2)[:2])
>>     except ValueError:
>>         pass
>>     else:
>>         if a + b == 42:
>>             print(b)
>
> I'm philosophically opposed to one-liners

I'm not, as long as you don't /force/ the code into one line.

> like:
>
>     a, b = map(int, line.split(None, 2)[:2])
>
> because they're difficult to understand at a glance. You need to visually
> parse it and work your way out from the inside to figure out what's going
> on. Better to keep it longer and simpler.
>
> Now that I've got my head around it, I realized there's no reason to make
> the split part so complicated. No reason to limit how many splits get
> done if you're explicitly going to slice the first two. And since you
> don't need to supply the second argument, the first one can be defaulted
> as well. So, you immediately get down to:
>
>     a, b = map(int, line.split()[:2])

I agree that the above is an improvement.

> which isn't too bad. I might take it one step further, however, and do:
>
>     fields = line.split()[:2]
>     a, b = map(int, fields)
>
> in fact, I might even get rid of the very generic, but conceptually
> overkill, use of map() and just write:
>
>     a, b = line.split()[:2]
>     a = int(a)
>     b = int(b)

If you go that route your next step is to introduce another try...except,
one for the unpacking and another for the integer conversion...

Peter

Sep 3 '08 #3

bearophileHUGS

Roy Smith:

> No reason to limit how many splits get done if you're
> explicitly going to slice the first two.

You are probably right for this problem, because most lines are 2
items long, but in scripts that have to process lines potentially
composed of many parts, setting a max number of parts speeds up your
script and reduces memory used, because you have less parts at the
end.

Bye,
bearophile

Sep 3 '08 #4
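
The effect of a split limit on the number of parts is easy to see on a long line. This is a rough sketch only: the 1000-field line is invented for illustration, and the timings it prints depend on the machine, so no numbers are claimed here.

```python
import timeit

line = " ".join(str(n) for n in range(1000))   # made-up line with 1000 fields

all_parts = line.split()           # builds 1000 small strings
few_parts = line.split(None, 2)    # builds 3 strings: two fields plus the rest

print(len(all_parts), len(few_parts))          # prints: 1000 3

# Rough timing of both variants; results vary from run to run.
t_all = timeit.timeit(lambda: line.split()[:2], number=200)
t_few = timeit.timeit(lambda: line.split(None, 2)[:2], number=200)
print("split all: %.6fs, split limited: %.6fs" % (t_all, t_few))
```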

Roy Smith

In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.de> wrote:

>> I might take it one step further, however, and do:
>>
>>     fields = line.split()[:2]
>>     a, b = map(int, fields)
>>
>> in fact, I might even get rid of the very generic, but conceptually
>> overkill, use of map() and just write:
>>
>>     a, b = line.split()[:2]
>>     a = int(a)
>>     b = int(b)
>
> If you go that route your next step is to introduce another try...except,
> one for the unpacking and another for the integer conversion...

Why another try/except? The potential unpack and conversion errors exist
in both versions, and the existing try block catches them all. Splitting
the one line up into three with some intermediate variables doesn't change
that.

Sep 3 '08 #5
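
To illustrate the point that one try block covers both failure modes: a failed unpack and a failed int() conversion both raise ValueError. A minimal sketch, where the `parse` helper is a hypothetical name and returning the error message is done only for demonstration:

```python
def parse(line):
    """Return (a, b) from the first two fields, or the ValueError message."""
    try:
        a, b = line.split()[:2]   # unpacking fails if fewer than two fields
        a = int(a)                # conversion fails on non-numeric text
        b = int(b)
    except ValueError as exc:
        return str(exc)
    return a, b


print(parse("40 2 extra"))   # two integers, extra column ignored
print(parse("40"))           # unpacking error
print(parse("forty 2"))      # conversion error
```

The three-line version behaves the same as the one-liner here; only the message text tells the two failures apart.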

Roy Smith

In article
<7f**********************************@34g2000hsh.googlegroups.com>,
be************@lycos.com wrote:

> Roy Smith:
>> No reason to limit how many splits get done if you're
>> explicitly going to slice the first two.
>
> You are probably right for this problem, because most lines are 2
> items long, but in scripts that have to process lines potentially
> composed of many parts, setting a max number of parts speeds up your
> script and reduces memory used, because you have less parts at the
> end.
>
> Bye,
> bearophile

Sounds like premature optimization to me. Make it work and be easy to
understand first. Then worry about how fast it is.

But, along those lines, I've often thought that split() needed a way to not
just limit the number of splits, but to also throw away the extra stuff.
Getting the first N fields of a string is something I've done often enough
that refactoring the slicing operation right into the split() code seems
worthwhile. And, it would be even faster :-)

Sep 3 '08 #6
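
Until split() grows such an option, the "first N fields, discard the rest" idea can be wrapped in a small helper. This is a sketch only; `first_fields` is a hypothetical name, not an existing string method.

```python
def first_fields(text, count, sep=None):
    """Return at most `count` leading fields of `text`, discarding the rest."""
    if count <= 0:
        return []
    # Limit the number of splits, then drop the unsplit remainder.
    return text.split(sep, count)[:count]


print(first_fields("40 2 extra columns here", 2))
print(first_fields("40", 2))
print(first_fields("a,b,c,d", 3, ","))
```

Because the split is limited to `count`, the remainder of the line is kept as a single string and then dropped, instead of being broken into many small pieces first.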

Peter Otten

Roy Smith wrote:

> In article <g9*************@news.t-online.com>,
> Peter Otten <__*******@web.de> wrote:
>
>>> I might take it one step further, however, and do:
>>>
>>>     fields = line.split()[:2]
>>>     a, b = map(int, fields)
>>>
>>> in fact, I might even get rid of the very generic, but conceptually
>>> overkill, use of map() and just write:
>>>
>>>     a, b = line.split()[:2]
>>>     a = int(a)
>>>     b = int(b)
>>
>> If you go that route your next step is to introduce another try...except,
>> one for the unpacking and another for the integer conversion...
>
> Why another try/except? The potential unpack and conversion errors exist
> in both versions, and the existing try block catches them all. Splitting
> the one line up into three with some intermediate variables doesn't change
> that.

As I understood it you didn't just split a line of code into three, but
wanted two processing steps. These logical steps are then somewhat remixed
by the shared error handling. You lose the information which step failed.
In the general case you may even mask a bug.

Peter

Sep 3 '08 #7
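
What "one try...except per step" could look like is sketched below. The `parse_pair` name and the string tags it returns are hypothetical, chosen only to make the failing step visible:

```python
def parse_pair(line):
    """Return (a, b), or a short tag naming the step that failed."""
    # Step 1: take the first two fields.
    try:
        first, second = line.split()[:2]
    except ValueError:
        return "too few fields"
    # Step 2: convert them to integers.
    try:
        return int(first), int(second)
    except ValueError:
        return "not an integer"


print(parse_pair("40 2"))
print(parse_pair("40"))
print(parse_pair("40 x"))
```

With a single shared handler both failures look alike; with one handler per step the caller can tell a short line from a malformed number.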

Roy Smith

In article <g9*************@news.t-online.com>,
Peter Otten <__*******@web.de> wrote:

> Roy Smith wrote:
>
>> In article <g9*************@news.t-online.com>,
>> Peter Otten <__*******@web.de> wrote:
>>
>>>> I might take it one step further, however, and do:
>>>>
>>>>     fields = line.split()[:2]
>>>>     a, b = map(int, fields)
>>>>
>>>> in fact, I might even get rid of the very generic, but conceptually
>>>> overkill, use of map() and just write:
>>>>
>>>>     a, b = line.split()[:2]
>>>>     a = int(a)
>>>>     b = int(b)
>>>
>>> If you go that route your next step is to introduce another try...except,
>>> one for the unpacking and another for the integer conversion...
>>
>> Why another try/except? The potential unpack and conversion errors exist
>> in both versions, and the existing try block catches them all. Splitting
>> the one line up into three with some intermediate variables doesn't change
>> that.
>
> As I understood it you didn't just split a line of code into three, but
> wanted two processing steps. These logical steps are then somewhat remixed
> by the shared error handling. You lose the information which step failed.
> In the general case you may even mask a bug.
>
> Peter

Well, what I really wanted was two conceptual steps, to make it easier for
a reader of the code to follow what it's doing. My standard for code being
adequately comprehensible is not that the reader *can* figure it out, but
that the reader doesn't have to exert any effort to figure it out. Or even
be aware that there's any figuring-out going on. He or she just reads it.

Sep 3 '08 #8

bearophileHUGS

Roy Smith:

> But, along those lines, I've often thought that split() needed a way to not
> just limit the number of splits, but to also throw away the extra stuff.
> Getting the first N fields of a string is something I've done often enough
> that refactoring the slicing operation right into the split() code seems
> worthwhile. And, it would be even faster :-)

Given the hypothetical .xsplit() string method I was talking about,
it's then easy to use islice() on it to skip the first items:

islice(sometext.xsplit(), 10, None)

Bye,
bearophile

Sep 3 '08 #9
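
Strings have no .xsplit() method, so the idea can only be approximated. The sketch below uses a free function named `xsplit` (a stand-in for the hypothetical method) built on re.finditer, which yields fields lazily; treating a field as a run of non-whitespace characters is an assumption.

```python
import re
from itertools import islice

_FIELD = re.compile(r"\S+")   # a field is assumed to be a run of non-whitespace


def xsplit(text):
    """Lazily yield whitespace-separated fields, one at a time."""
    for match in _FIELD.finditer(text):
        yield match.group()


sometext = " ".join(str(n) for n in range(15))    # made-up sample text

tail = list(islice(xsplit(sometext), 10, None))   # skip the first ten items
head = list(islice(xsplit(sometext), 2))          # stop after two items
print(tail)
print(head)
```

Taking the first two items this way stops scanning after the second field, which is the "throw away the extra stuff" behaviour asked for above.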
