Splitting on a word

qwweeeit

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

# SplitMultichar.py

import re

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

lHref=p.findall(s) # lHref=['href','HREF']
# for normal html files the lHref list has more elements
# (more web references)

c='~' # char to be used as delimiter
# c=chr(127) # char to be used as delimiter
for i in lHref:
    s=s.replace(i,c)

# s ='ffy: ytrty <a ~="www.python.org">python</a> fyt <A ~="wwwx">wx</A> dtrtf'

list=s.split(c)
# list=['ffy: ytrty <a ', '="www.python.org">python</a> fyt <A ',
#       '="wwwx">wx</A> dtrtf']
#=-----------------------------------------------------

If you save the original s string to xxx.html, any browser
can visualize it.
To be sure as delimiter I choose chr(127)
which surely is not present in the html file.
Bye.

Jul 21 '05 #1


Robert Kern

qw******@yahoo.it wrote:

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

For *this* particular task, certainly. It begins with

import BeautifulSoup

The rest is left as a (brief) exercise for the reader. :-)

As for the more general task of splitting strings using regular
expressions, see re.split().
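
A minimal sketch of what re.split() does with the string from the original post (written in Python 3 syntax; the variable names parts and parts_with_delims are only for illustration). The second call shows a documented feature of re.split(): when the pattern contains a capturing group, the matched delimiters are returned as well.

```python
import re

# Sample text modelled on the string in the original post.
s = 'ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'

# Plain split: the matched delimiter is dropped from the result.
parts = re.split(r'\bhref\b', s, flags=re.IGNORECASE)
print(parts)

# With a capturing group, re.split() also returns each matched delimiter.
parts_with_delims = re.split(r'\b(href)\b', s, flags=re.IGNORECASE)
print(parts_with_delims)
```

Keeping the delimiters can be handy when the original spelling ('href' versus 'HREF') matters later on.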

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Jul 21 '05 #2

Steven D'Aprano

On Wed, 13 Jul 2005 06:19:54 -0700, qwweeeit wrote:

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code,

[red rag to bull]
Because it was too slow? Or just to prove what a macho programmer you are?

Is your code even working yet? If it isn't working, you shouldn't be
trying to optimize buggy code.

I found that an essential step is:
splitting on a word (in this case 'href').

Then just do it:

py> '<a href="web reference"> underlined reference</a>'.split('href')
['<a ', '="web reference"> underlined reference</a>']

If you are concerned about case issues, you can either convert the
entire HTML file to lowercase, or you might write a case-insensitive
regular expression to replace any "href" regardless of case with the
lowercase version.
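
The two options just mentioned might look roughly like this (a sketch in Python 3 syntax; the sample string and variable names are made up for illustration). Note that lowercasing the whole file also changes the URLs, which may matter on case-sensitive servers.

```python
import re

# Hypothetical sample with mixed-case tags and mixed-case URLs.
html = '<a href="Page.html">one</a> <A HREF="Other.html">two</A>'

# Option 1: lowercase everything. Simple, but it also alters the URLs.
lowered_parts = html.lower().split('href')
print(lowered_parts)

# Option 2: normalise only the word "href", leaving the rest untouched.
normalised = re.sub(r'\bhref\b', 'href', html, flags=re.IGNORECASE)
normalised_parts = normalised.split('href')
print(normalised_parts)
```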

[snip]
To be sure as delimiter I choose chr(127)
which surely is not present in the html file.

I wouldn't bet my life on that. I've found some weird characters in HTML
files.
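
To illustrate the risk, here is a small sketch (Python 3 syntax) with a made-up input that already contains chr(127). The replace-then-split approach yields a bogus extra piece, while splitting on the pattern directly needs no sentinel character at all.

```python
import re

DELIM = chr(127)  # the sentinel character proposed in the original post

# Hypothetical input that already contains the sentinel inside the link text.
s = '<a href="x">odd' + DELIM + 'text</a>'

# Replace-then-split: the stray sentinel produces an extra, bogus piece.
via_sentinel = re.sub(r'\bhref\b', DELIM, s, flags=re.IGNORECASE).split(DELIM)

# Splitting on the pattern directly needs no sentinel at all.
via_regex = re.split(r'\bhref\b', s, flags=re.IGNORECASE)

print(len(via_sentinel), len(via_regex))
```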
--
Steven.

Jul 21 '05 #3

Joe

import re

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

list=p.split(s) #<<<<<<<<<<<<<<<<< gets you your final list.

good luck,

Joe

Jul 21 '05 #4

Bernhard Holzmayer

qw******@yahoo.it wrote:

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

Sure. The htmllib module provides HTMLParser.
Here's an example; run it with your HTML file as argument
and you'll see a list of all hrefs in the document.

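#------------------------------------------------
#!/usr/bin/python
# Python 2 only: htmllib was removed in Python 3.
import htmllib

def test():
    import sys, formatter

    # Read the HTML file named on the command line.
    file = sys.argv[1]
    f = open(file, 'r')
    data = f.read()
    f.close()

    # NullFormatter discards rendered output; only the parse matters here.
    f = formatter.NullFormatter()
    p = htmllib.HTMLParser(f)
    p.feed(data)

    # anchorlist holds the href of every <a> tag the parser saw.
    for a_link in p.anchorlist:
        print a_link

    p.close()

test()
#------------------------------------------------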

I'm sure that this is far more Pythonic!
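
For readers on Python 3, where htmllib is no longer available, the same idea can be sketched with the standard html.parser module. This is only an illustration, not the code from the post: LinkCollector and extract_links are names invented here, and the sample string is adapted from the original question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from the parser.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Feed the text to a fresh parser and return the collected hrefs."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = 'ffy <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A>'
print(extract_links(sample))
```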

Bernhard

Jul 21 '05 #5

qwweeeit

Hi all,
thanks for your contributions. To Robert Kern I can reply that I know
BeautifulSoup, but mine was meant to be a "generalization" (only
incidentally used in a web parsing application). The fact is that,
being a "macho newbie" programmer (the "macho" is from Steven
D'Aprano), I wanted to show what beautiful solutions I can find...
Luckily there is Joe, who shows me that most of my "beautiful" code
(working, of course!) can be replaced by:
list=p.split(s)
Bernhard... don't get angry, but I prefer the solution of Joe. It is
more general, and, besides that, for me "pythonic" means simple and
short (I may be wrong...).
By the way, I have found an alternative solution to the problem of
making lists "unique" without sorting, but not being "macho" enough...
Bye.

Jul 21 '05 #6

Bernhard Holzmayer

qw******@yahoo.it wrote:

Bernhard... don't get angry, but I prefer the solution of Joe.

Oh. If I got angry in such a case, I would have stopped responding to such
posts long ago.
You know the background... and you'll have to bear the consequences. ;-)
...
for me "pythonic" means simple and short (I may be wrong...).

It's your definition, isn't it?
One of the most important advantages of Python (for me!) besides its
readability is that it comes with "Batteries included", which means that I
can benefit from the work others did before, and that I can rely on its
quality.

The solution which I proposed is nothing but the test code from htmllib,
stripped down to the absolute minimum, enriched with the print command
to show the anchor list.

If I had to write production-level code of your sort, I'd take such an
off-the-shelf solution, because it minimizes the risk of failures.

Just think of issues like these:
- does your code find a tag like <A HREF= (capital letters)?
- does your code correctly handle incomplete tags like
<a href="linkadr"></a> or references with/without " ...?
- does it survive ill-coded html after all?
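
A rough way to probe these questions, sketched in Python 3 with the standard html.parser module (HrefCollector, hrefs_in and the test inputs are all invented for this illustration). For each input it prints how many times a case-insensitive split on the word "href" fires, next to the hrefs a real parser finds. The last input shows the word appearing in ordinary prose, where the split still fires although there is no link.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.extend(v for k, v in attrs if k == 'href' and v is not None)


def hrefs_in(text):
    collector = HrefCollector()
    collector.feed(text)
    collector.close()
    return collector.hrefs


cases = [
    '<A HREF="upper.html">capital letters</A>',
    '<a href=unquoted.html>no quotes</a>',
    '<a href="linkadr"></a>',
    '<p>prose that mentions the href attribute</p>',
]

for case in cases:
    # Number of split points found by the word-based approach.
    pieces = re.split(r'\bhref\b', case, flags=re.IGNORECASE)
    print(len(pieces) - 1, hrefs_in(case))
```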

In my experience, it's usually better to rely on such
"library" code than to reinvent the wheel.

There's often a reason to take another approach.
I'd agree that a simple and short solution is fascinating.
However, every simple and short solution should be readable.
As a terrific example, here's a very tiny piece of code,
which does nothing but calculate the prime numbers up to 1000:

print filter(None,map(lambda y:y*reduce(lambda x,y:x*y!=0,
map(lambda x,y=y:y%x,range(2,int(pow(y,0.5)+1))),1),
range(2,1000)))

- simple (depends on your familiarity with stuff like map and lambda)
- short (compared with different solutions)
- and veeerrrryyy pythonic!
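
For contrast, a more readable sketch of the same computation in Python 3 syntax (is_prime is a name chosen here; like the one-liner, it uses trial division up to the square root and covers range(2, 1000)):

```python
def is_prime(n):
    """Trial division by every candidate divisor up to sqrt(n)."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))


primes = [n for n in range(2, 1000) if is_prime(n)]
print(len(primes), primes[:5], primes[-1])
```

It is a few lines longer, but each step can be read and tested on its own.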

Bernhard

Jul 21 '05 #7

qwweeeit

Hi Bernhard,
firstly you must excuse my English ("angry" is a little ...strong, but
my vocabulary is limited). I hope that the experts keep on helping us
newbies.
Even though I am a newbie (in Python), I disagree with you: my solution
(with the help of Joe) answers the problem of splitting a string
using a delimiter of more than one character (sometimes a word as
delimiter, but it is not required).
The code I supplied can be misleading because it is centered on web
parsing, but my request is more general. (Next time I will only ask the
question without examples!)
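
For the general case of a delimiter longer than one character, a sketch in Python 3 syntax might look like this. str.split() already accepts a separator of any length; split_on is a hypothetical helper added here for the case-insensitive variant, and it uses re.escape() so that delimiters containing regex metacharacters are treated literally.

```python
import re

text = 'alpha<->beta<->gamma'

# str.split() accepts a separator of any length, not just one character.
print(text.split('<->'))


def split_on(delimiter, s, ignore_case=False):
    """Hypothetical helper: split s on a literal delimiter of any length."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape() makes characters such as '(' or '+' match literally.
    return re.split(re.escape(delimiter), s, flags=flags)


print(split_on('and', 'salt AND pepper and sugar', ignore_case=True))
print(split_on('(+)', 'x(+)y(+)z'))
```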
If I were a professional programmer I could agree with you about the
"Batteries included" concept and all the other considerations
("off-the-shelf solutions" and ...not reinventing the wheel).
Also the terrific example you supply in order to caution me not to
follow dully (found in the dictionary) the "simple & short" concept
doesn't apply to me (too complicated!).
I am so far from a real programmer that when an error occurs, I use
try/except (if they solve the problem) without caring about the source of
the mistake (...EAFP!).
So I don't care too much about possible future mistakes (even though the
code does take capital letters into account).
For the specific case I mentioned, actually if the closing tag ">" is
missing perhaps I obtain wrong results... I will worry when necessary
(even if Murphy's law...).
Bye.

Jul 21 '05 #8
