Splitting on a word

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

# SplitMultichar.py

import re

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

lHref=p.findall(s) # lHref=['href','HREF']
# for normal html files the lHref list has more elements
# (more web references)

c='~' # char to be used as delimiter
# c=chr(127) # char to be used as delimiter
for i in lHref:
    s=s.replace(i,c)

# s ='ffy: ytrty <a ~="www.python.org">python</a> fyt <A ~="wwwx">wx</A> dtrtf'

list=s.split(c)
# list=['ffy: ytrty <a ', '="www.python.org">python</a> fyt <A ',
#       '="wwwx">wx</A> dtrtf']
#=-----------------------------------------------------

If you save the original s string to xxx.html, any browser
can visualize it.
To be sure as delimiter I choose chr(127)
which surely is not present in the html file.
Bye.

Jul 21 '05 #1
qw******@yahoo.it wrote:
Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):


For *this* particular task, certainly. It begins with

import BeautifulSoup

The rest is left as a (brief) exercise for the reader. :-)

As for the more general task of splitting strings using regular
expressions, see re.split().
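
As a rough illustration of that pointer (a sketch in Python 3 syntax, reusing the sample string from the original post; the variable names are only for illustration), re.split() can split on the word case-insensitively, and a capturing group keeps the matched delimiters in the result:

```python
import re

# Sample text standing in for an HTML file (taken from the original post).
s = ('ffy: ytrty <a href="www.python.org">python</a> fyt '
     '<A HREF="wwwx">wx</A> dtrtf')

# Split on the whole word "href", ignoring case.
parts = re.split(r'\bhref\b', s, flags=re.IGNORECASE)

# With a capturing group, the matched delimiters are kept in the result,
# interleaved with the pieces of text around them.
parts_with_delims = re.split(r'\b(href)\b', s, flags=re.IGNORECASE)

print(parts)
print(parts_with_delims)
```

The second form is handy when you need to know which spelling of the delimiter was actually found.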

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Jul 21 '05 #2
On Wed, 13 Jul 2005 06:19:54 -0700, qwweeeit wrote:
Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code,
[red rag to bull]
Because it was too slow? Or just to prove what a macho programmer you are?

Is your code even working yet? If it isn't working, you shouldn't be
trying to optimize buggy code.

I found that an essential step is:
splitting on a word (in this case 'href').
Then just do it:

py> '<a href="web reference"> underlined reference</a>'.split('href')
['<a ', '="web reference"> underlined reference</a>']

If you are concerned about case issues, you can either convert the
entire HTML file to lowercase, or you might write a case-insensitive
regular expression to replace any "href" regardless of case with the
lowercase version.
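
A small sketch of the two options just described, in Python 3 syntax and with a made-up sample string (the mixed-case URL is an assumption added to show the difference). Lowercasing the whole file also changes the URLs, while the case-insensitive substitution only touches the word itself:

```python
import re

# Hypothetical sample with a mixed-case URL and an upper-case HREF.
s = 'x <a href="www.Python.org">python</a> y <A HREF="wwwx">wx</A> z'

# Option 1: lowercase everything, then split. Simple, but it also
# lowercases the link targets and the surrounding text.
lowered_parts = s.lower().split('href')

# Option 2: normalise only the word "href" with a case-insensitive
# regular expression, then split on the lowercase version.
href = re.compile(r'\bhref\b', re.IGNORECASE)
parts = href.sub('href', s).split('href')

print(lowered_parts)
print(parts)
```

Note that the plain str.split('href') in both options splits on the substring wherever it occurs, even inside a longer word; only the regular expression respects the word boundaries.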

[snip]
To be sure as delimiter I choose chr(127)
which surely is not present in the html file.


I wouldn't bet my life on that. I've found some weird characters in HTML
files.
--
Steven.

Jul 21 '05 #3
Joe
import re

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

list=p.split(s) #<<<<<<<<<<<<<<<<< gets you your final list.

good luck,

Joe

Jul 21 '05 #4
qw******@yahoo.it wrote:
Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):


Sure. The htmllib module provides HTMLParser.
Here's an example; run it with your HTML file as argument
and you'll see a list of all hrefs in the document.

#------------------------------------------------
#!/usr/bin/python
import htmllib

def test():
    import sys, formatter

    file = sys.argv[1]
    f = open(file, 'r')
    data = f.read()
    f.close()

    f = formatter.NullFormatter()
    p = htmllib.HTMLParser(f)
    p.feed(data)

    for a_link in p.anchorlist:
        print a_link

    p.close()

test()
#------------------------------------------------

I'm sure that this is far more Pythonic!
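
The script above is Python 2 code, and htmllib no longer exists in Python 3. A rough equivalent using the standard html.parser module might look like the sketch below. The class name AnchorCollector and the helper collect_links are made up for this illustration; only the anchorlist attribute name is borrowed from the htmllib version.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []  # same attribute name as in htmllib, by choice

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.anchorlist.append(value)


def collect_links(html_text):
    """Return the list of link targets found in html_text."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.anchorlist


sample = ('ffy: ytrty <a href="www.python.org">python</a> fyt '
          '<A HREF="wwwx">wx</A> dtrtf')
for link in collect_links(sample):
    print(link)
```

To run it on a file instead of the sample string, read the file contents and pass them to collect_links().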

Bernhard
Jul 21 '05 #5
Hi all,
thanks for your contributions. To Robert Kern I can reply that I know
BeautifulSoup, but mine was meant to be a "generalization" (only
incidentally used in a web parsing application). The fact is that,
being a "macho newbie" programmer (the "macho" is from Steven
D'Aprano), I wanted to show what beautiful solutions I can find...
Luckily there is Joe, who shows me that most of my "beautiful" code
(working, of course!) can be replaced by:
list=p.split(s)
Bernhard... don't get angry, but I prefer the solution of Joe. It is
more general and, besides that, for me "pythonic" means simple and
short (I may be wrong...).
By the way, I have found an alternative solution to the problem of
making lists "unique" without sorting, but not being "macho" enough...
Bye.

Jul 21 '05 #6
qw******@yahoo.it wrote:
Bernhard... don't get angry, but I prefer the solution of Joe.
Oh. If I got angry in such a case, I would have stopped responding to such
posts long ago
You know the background... and you'll have to bear the consequences. ;-)
...
for me "pythonic" means simple and short (I may be wrong...).


It's your definition, isn't it?
One of the most important advantages of Python (for me!) besides its
readability is that it comes with "Batteries included", which means that I
can benefit from the work others did before, and that I can rely on its
quality.

The solution which I proposed is nothing but the test code from htmllib,
stripped down to the absolute minimum, enriched with the print command
to show the anchor list.

If I had to write production-level code of your sort, I'd take such an
off-the-shelf solution, because it minimizes the risk of failures.

Think only of issues like these:
- does your code find a tag like <A HREF= (capital letters)?
- does your code correctly handle incomplete tags like
<a href="linkadr"></a> or references with/without " ...?
- does it survive ill-coded html after all?
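
One way to take these questions seriously is to turn them into small test inputs and run an extractor against them. The sketch below (Python 3 syntax) does that for a deliberately naive splitter; naive_links and the sample inputs are hypothetical, written only for this illustration, and are not the original poster's code.

```python
import re

HREF = re.compile(r'\bhref\b', re.IGNORECASE)


def naive_links(text):
    """Split on the word 'href' and keep a double-quoted value that follows."""
    links = []
    # The first piece is the text before the first "href", so skip it.
    for piece in HREF.split(text)[1:]:
        match = re.match(r'\s*=\s*"([^"]*)"', piece)
        if match:
            links.append(match.group(1))
    return links


# Made-up inputs, one per question in the list above, plus one extra.
cases = {
    'capital letters': '<A HREF="upper.html">caps</A>',
    'empty anchor': '<a href="linkadr"></a>',
    'no quotes': '<a href=bare.html>bare</a>',
    'word in body text': '<p>the href attribute</p>',
}

for label, html_text in cases.items():
    print(label, '->', naive_links(html_text))
```

With these inputs the naive version copes with capital letters and the empty anchor but silently misses the unquoted reference, which is the kind of gap a library parser is meant to cover.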

In my experience it's usually better to rely on such
"library" code than to reinvent the wheel.

There's often a reason to take another approach.
I'd agree that a simple and short solution is fascinating.
However, every simple and short solution should be readable.
As a terrific example, here's a very tiny piece of code,
which does nothing but calculate the prime numbers up to 1000:

print filter(None,map(lambda y:y*reduce(lambda x,y:x*y!=0,
map(lambda x,y=y:y%x,range(2,int(pow(y,0.5)+1))),1),
range(2,1000)))

- simple (depends on your familiarity with stuff like map and lambda)
- short (compared with different solutions)
- and veeerrrryyy pythonic!
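
For comparison, here is a longer but more readable sketch of the same trial-division idea, in Python 3 syntax. The function name primes_below is made up for this example.

```python
def primes_below(limit):
    """Return the primes p with 2 <= p < limit, found by trial division."""
    primes = []
    for candidate in range(2, limit):
        # A number is prime if no integer from 2 up to its square root
        # divides it evenly (the same test as in the one-liner).
        largest_divisor = int(candidate ** 0.5)
        if all(candidate % d != 0 for d in range(2, largest_divisor + 1)):
            primes.append(candidate)
    return primes


print(primes_below(1000))
```

It is several lines longer than the one-liner, but each step can be read and checked on its own.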

Bernhard

Jul 21 '05 #7
Hi Bernhard,
first of all, please excuse my English ("angry" is a little... strong, but
my vocabulary is limited). I hope that the experts keep on helping us
newbies.
Even though I am a newbie (in Python), I disagree with you: my solution
(with the help of Joe) answers the problem of splitting a string
using a delimiter of more than one character (sometimes a word as
delimiter, but it is not required).
The code I supplied can be misleading because it is centered on web
parsing, but my request is more general (next time I will only ask the
question, without examples!)
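
For the general case described here, a small helper along these lines could wrap re.split(). This is only a sketch in Python 3 syntax; split_on and its parameters are hypothetical names, not something from the thread. For a plain, case-sensitive delimiter, str.split() already accepts more than one character.

```python
import re


def split_on(text, delimiter, ignore_case=False, whole_word=False):
    """Split text on a delimiter of any length, treated as literal text."""
    # re.escape makes sure characters like '.' or '*' in the delimiter
    # are matched literally rather than as regex syntax.
    pattern = re.escape(delimiter)
    if whole_word:
        pattern = r'\b' + pattern + r'\b'
    flags = re.IGNORECASE if ignore_case else 0
    return re.split(pattern, text, flags=flags)


print('a--b--c'.split('--'))
print(split_on('one AND two and three', 'and',
               ignore_case=True, whole_word=True))
print(split_on('sand and land', 'and', whole_word=True))
```

The whole_word option only makes sense when the delimiter starts and ends with a letter, digit or underscore, because it relies on the regex word boundary.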
If I were a professional programmer I could agree with you about the
"Batteries included" concept and all the other considerations
("off-the-shelf solutions" and... not reinventing the wheel).
Also, the terrific example you supply in order to caution me not to
follow dully (found in the dictionary) the "simple & short" concept
doesn't apply to me (too complicated!).
I am so far from being a real programmer that when an error occurs, I use
try/except (if they solve the problem) without caring about the source of
the mistake (...EAFP!).
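
As a side note on EAFP, the style usually works best when the except clause names the one error that is expected, so that other mistakes still show up. A minimal sketch in Python 3 syntax, with a made-up helper first_quoted applied to pieces like those produced by the split:

```python
def first_quoted(piece):
    """Return the text after the first double quote, up to the next one.

    Returns None when the piece contains no double quote at all.
    """
    try:
        # EAFP: just index the result and deal with the failure if it comes.
        return piece.split('"')[1]
    except IndexError:
        # Only the expected failure is caught; anything else still raises.
        return None


print(first_quoted('="www.python.org">python</a> fyt <A '))
print(first_quoted('no quotes here'))
```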
So I don't care too much about possible future mistakes (even though the code
does take capital letters into account).
For the specific case I mentioned, if the closing ">" of the tag is
missing I may actually obtain wrong results... I will worry when necessary
(even if Murphy's law...).
Bye.

Jul 21 '05 #8
