
Splitting on a word

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

# SplitMultichar.py

import re

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

lHref=p.findall(s) # lHref=['href','HREF']
# for normal html files the lHref list has more elements
# (more web references)

c='~' # char to be used as delimiter
# c=chr(127) # char to be used as delimiter
for i in lHref:
    s=s.replace(i,c)

# s ='ffy: ytrty <a ~="www.python.org">python</a> fyt <A ~="wwwx">wx</A> dtrtf'

list=s.split(c)
# list=['ffy: ytrty <a ', '="www.python.org">python</a> fyt <A ',
#       '="wwwx">wx</A> dtrtf']
#=-----------------------------------------------------

If you save the original s string to xxx.html, any browser
can visualize it.
To be safe, I chose chr(127) as the delimiter,
which surely is not present in the html file.
Bye.

Jul 21 '05 #1
7 Replies


qw******@yahoo.it wrote:
> Hi all,
> I am writing a script to visualize (and print)
> the web references hidden in the html files as:
> '<a href="web reference"> underlined reference</a>'
> Optimizing my code, I found that an essential step is:
> splitting on a word (in this case 'href').
>
> I am asking if there is some alternative (more pythonic...):


For *this* particular task, certainly. It begins with

import BeautifulSoup

The rest is left as a (brief) exercise for the reader. :-)

As for the more general task of splitting strings using regular
expressions, see re.split().
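
A minimal sketch of what re.split() does with a multi-character, case-insensitive delimiter, using the sample string from the question. It is written in Python 3 syntax, and the name `pieces` is only illustrative.

```python
import re

# Sample string from the question, simulating a small HTML file.
s = ('ffy: ytrty <a href="www.python.org">python</a> fyt '
     '<A HREF="wwwx">wx</A> dtrtf')

# Split wherever the whole word "href" occurs, ignoring case.
pieces = re.split(r'\bhref\b', s, flags=re.IGNORECASE)
print(pieces)
# ['ffy: ytrty <a ', '="www.python.org">python</a> fyt <A ', '="wwwx">wx</A> dtrtf']
```

No placeholder character is needed: the pattern itself is the delimiter.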

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Jul 21 '05 #2

On Wed, 13 Jul 2005 06:19:54 -0700, qwweeeit wrote:
> Hi all,
> I am writing a script to visualize (and print)
> the web references hidden in the html files as:
> '<a href="web reference"> underlined reference</a>'
> Optimizing my code,

[red rag to bull]
Because it was too slow? Or just to prove what a macho programmer you are?

Is your code even working yet? If it isn't working, you shouldn't be
trying to optimize buggy code.

> I found that an essential step is:
> splitting on a word (in this case 'href').

Then just do it:

py> '<a href="web reference"> underlined reference</a>'.split('href')
['<a ', '="web reference"> underlined reference</a>']

If you are concerned about case issues, you can either convert the
entire HTML file to lowercase, or you might write a case-insensitive
regular expression to replace any "href" regardless of case with the
lowercase version.
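
Both options can be sketched as follows (Python 3 syntax). The sample string and the variable names are invented for illustration; note that lowercasing the whole document also changes the URLs, which may matter on case-sensitive servers.

```python
import re

html = '<a href="Page.html">one</a> <A HREF="Other.HTML">two</A>'

# Option 1: lowercase everything. Simple, but it also alters the URLs.
lowered = html.lower().split('href')

# Option 2: rewrite only the word "href" to lowercase, then split on it.
normalized = re.sub(r'\bhref\b', 'href', html, flags=re.IGNORECASE)
parts = normalized.split('href')

print(lowered)
# ['<a ', '="page.html">one</a> <a ', '="other.html">two</a>']
print(parts)
# ['<a ', '="Page.html">one</a> <A ', '="Other.HTML">two</A>']
```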

[snip]
> To be safe, I chose chr(127) as the delimiter,
> which surely is not present in the html file.


I wouldn't bet my life on that. I've found some weird characters in HTML
files.
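
A small sketch of how a placeholder delimiter can go wrong (Python 3 syntax). The input string is hypothetical and already contains the '~' used in the original script; the same reasoning applies to chr(127) or any other single character.

```python
import re

c = '~'  # delimiter character, as in the original script
s = 'see ~user/page <a href="x">x</a>'  # hypothetical input containing '~'

# Replace-then-split: the pre-existing '~' causes an extra, unintended split.
via_sentinel = s.replace('href', c).split(c)

# Splitting on the word directly needs no placeholder at all.
via_regex = re.split(r'\bhref\b', s, flags=re.IGNORECASE)

print(via_sentinel)  # ['see ', 'user/page <a ', '="x">x</a>']
print(via_regex)     # ['see ~user/page <a ', '="x">x</a>']
```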
--
Steven.

Jul 21 '05 #3

Joe
import re

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

list=p.split(s) #<<<<<<<<<<<<<<<<< gets you your final list.
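
For splitting on an arbitrary word rather than only 'href', the same idea might be wrapped in a small helper. This is a sketch in Python 3 syntax: `split_on_word` and its parameters are hypothetical names, and the `\b` anchors assume the word starts and ends with a letter, digit, or underscore.

```python
import re

def split_on_word(text, word, ignore_case=True):
    """Split text on whole-word occurrences of word (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as '.' or '+' inside the word literal.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', flags)
    return pattern.split(text)

print(split_on_word('one AND two and three', 'and'))
# ['one ', ' two ', ' three']
```

Because of the word boundaries, 'and' inside 'band' is left alone.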

good luck,

Joe

Jul 21 '05 #4

qw******@yahoo.it wrote:
> Hi all,
> I am writing a script to visualize (and print)
> the web references hidden in the html files as:
> '<a href="web reference"> underlined reference</a>'
> Optimizing my code, I found that an essential step is:
> splitting on a word (in this case 'href').
>
> I am asking if there is some alternative (more pythonic...):


Sure. The htmllib module provides HTMLParser.
Here's an example; run it with your HTML file as argument
and you'll see a list of all hrefs in the document.

#------------------------------------------------
#!/usr/bin/python
import htmllib

def test():
    import sys, formatter

    file = sys.argv[1]
    f = open(file, 'r')
    data = f.read()
    f.close()

    f = formatter.NullFormatter()
    p = htmllib.HTMLParser(f)
    p.feed(data)

    for a_link in p.anchorlist:
        print a_link

    p.close()

test()
#------------------------------------------------

I'm sure that this is far more Pythonic!
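
The htmllib module is not available in Python 3. A rough equivalent there can be sketched with the standard html.parser module; `AnchorCollector` is an illustrative class name, and the sample string is the one from the question rather than a file read from disk.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag (illustrative class name)."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> matches too.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.anchorlist.append(value)

sample = ('ffy: ytrty <a href="www.python.org">python</a> fyt '
          '<A HREF="wwwx">wx</A> dtrtf')
parser = AnchorCollector()
parser.feed(sample)
parser.close()
print(parser.anchorlist)
# ['www.python.org', 'wwwx']
```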

Bernhard
Jul 21 '05 #5

Hi all,
thanks for your contributions. To Robert Kern I can reply that I know
BeautifulSoup, but my question was meant as a "generalization" (only
incidentally used in a web parsing application). The fact is that,
being a "macho newbie" programmer (the "macho" is from Steven
D'Aprano), I wanted to show what beautiful solutions I can find...
Luckily there is Joe, who shows me that most of my "beautiful" code
(working, of course!) can be replaced by:
list=p.split(s)
Bernhard... don't get angry, but I prefer Joe's solution. It is
more general, and, besides that, for me "pythonic" means simple and
short (I may be wrong...).
By the way, I have found an alternative solution to the problem of
making lists "unique" without sorting, but not being "macho" enough...
Bye.

Jul 21 '05 #6

qw******@yahoo.it wrote:
> Bernhard... don't get angry, but I prefer Joe's solution.

Oh. If I got angry in such a case, I would have stopped responding to such
posts long ago.
You know the background... and you'll have to bear the consequences. ;-)

> ...
> for me "pythonic" means simple and short (I may be wrong...).


It's your definition, isn't it?
One of the most important advantages of Python (for me!) besides its
readability is that it comes with "batteries included", which means that I
can benefit from the work others did before, and that I can rely on its
quality.

The solution I proposed is nothing but the test code from htmllib,
stripped down to the absolute minimum and extended with a print statement
to show the anchor list.

If I had to write production-level code of your sort, I'd take such an
off-the-shelf solution, because it minimizes the risk of failures.

Just think of issues like these:
- does your code find a tag like <A HREF= (capital letters)?
- does your code correctly handle incomplete tags like
<a href="linkadr"></a> or references with/without " ...?
- does it survive ill-coded html after all?
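
One way to probe these questions is to run the word-splitting approach over a few awkward inputs. The following is only a sketch in Python 3 syntax, and all sample strings are invented.

```python
import re

HREF = re.compile(r'\bhref\b', re.IGNORECASE)

# Invented sample inputs, one per concern.
samples = {
    'capital letters': '<A HREF="wwwx">wx</A>',
    'no quotes': '<a href=wwwx>wx</a>',
    'word in plain text': '<p>the href attribute</p>',
    'inside a comment': '<!-- <a href="old">gone</a> -->',
}

# Number of split points found in each sample.
hits = {label: len(HREF.split(html)) - 1 for label, html in samples.items()}

for label, count in hits.items():
    print(label, count)
```

Every sample reports one split point, including the last two, where there is no live link at all. The pieces therefore still need checking before they are treated as references.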

In my experience, it's usually better to rely on such
"library" code than to reinvent the wheel.

There's often a reason to take another approach.
I'd agree that a simple and short solution is fascinating.
However, every simple and short solution should be readable.
As a terrific example, here's a very tiny piece of code,
which does nothing but calculate the prime numbers up to 1000:

print filter(None,map(lambda y:y*reduce(lambda x,y:x*y!=0,
map(lambda x,y=y:y%x,range(2,int(pow(y,0.5)+1))),1),
range(2,1000)))

- simple (depends on your familiarity with stuff like map and lambda)
- short (compared with different solutions)
- and veeerrrryyy pythonic!
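
For comparison, the same trial-division test can be spelled out in a longer but more readable form. This is a sketch in Python 3 syntax; `primes_below` is an illustrative name, not part of the one-liner above.

```python
def primes_below(limit):
    """Return all primes p with 2 <= p < limit, by trial division."""
    primes = []
    for candidate in range(2, limit):
        # Only divisors up to the square root need to be tried.
        divisors = range(2, int(candidate ** 0.5) + 1)
        if all(candidate % d != 0 for d in divisors):
            primes.append(candidate)
    return primes

print(primes_below(20))
# [2, 3, 5, 7, 11, 13, 17, 19]
```

Calling `primes_below(1000)` covers the same range as the one-liner, `range(2,1000)`.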

Bernhard

Jul 21 '05 #7

Hi Bernhard,
first, you must excuse my English ("angry" is a little ...strong, but
my vocabulary is limited). I hope that the experts keep on helping us
newbies.
Even though I am a newbie (in Python), I disagree with you: my solution
(with Joe's help) answers the problem of splitting a string
using a delimiter of more than one character (sometimes a word as
delimiter, but that is not required).
The code I supplied can be misleading because it is centered on web
parsing, but my request is more general (next time I will just ask the
question without examples!).
If I were a professional programmer, I could agree with you on the
"batteries included" concept and all the other considerations
("off-the-shelf solutions" and ...not reinventing the wheel).
Also, the terrific example you supply to caution me against
following dully (found in the dictionary) the "simple & short" concept
doesn't apply to me (too complicated!).
I am so far from a real programmer that when an error occurs, I use
try/except (if it solves the problem) without caring about the source of
the mistake (...EAFP!).
So I don't care too much about possible future mistakes (even though the
code does take capital letters into account).
For the specific case I mentioned, if the closing ">" of a tag is
missing, perhaps I will get wrong results... I will worry when necessary
(even if Murphy's law...).
Bye.

Jul 21 '05 #8
