
Regular Expression problem

P: n/a
(I don't know if this is the right place. If I am wrong, please point
me in the right direction.
If this post is read by you masters, I'm honoured. If I get even a
mere response, I'm blessed!)

Hi,

I'm a newbie regular expression user. I use regex in my Python
programs. I have a strange (sometimes not strange, but please bear in
mind that I'm a newbie ;) problem using regex: I want a particular
attribute value from a tag in one of my HTML files.

That is, I want only the value after 'href=' in this tag:

'<link href="mystylesheet.css" rel="stylesheet" type="text/css">'

Here it would be 'mystylesheet.css'. I used the following regex to get
this value (I don't know if it is good):

"<link\s+href=["]?(.*?)["]?\s+rel=["]?stylesheet["]?\s+type=["]?text/css["]?>"

I thought I was doing fine until I got stuck on this tag:

<link rel="stylesheet" href="mystylesheet.css" type="text/css" : same
tag, but with the 'href=' part in a different place. I think you get
the point!

So what should I do to get the exact value (here, the value after
'href=') in every case, even if the tags look like any of these?

<link rel="stylesheet" href="mystylesheet.css" type="text/css">
-OR-
<link href="mystylesheet.css" rel="stylesheet" type="text/css">
-OR-
<link type="text/css" href="mystylesheet.css" rel="stylesheet">

Jul 13 '06 #1
5 Replies


P: n/a
Hey,

I'm new to regexes as well, but here is my idea. Since you don't know
which attribute will come first, why not structure your regex like
this?

(First off, I'll assume that \s == ' '. Actually, now that I think of
it, isn't \s any whitespace character? Anyway, \s == ' ' for now.)

'<link\s*((\s*attribute1\s*)|(\s*attribute2\s*)|(\s*attribute3\s*))+>'

I think that should just about do it.
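
For what it's worth, here is a minimal sketch of that alternation idea in Python. It assumes the only attributes are href, rel and type, all double-quoted, and the name LINK_RE and the named group 'href' are just illustrative choices, not anything from the original post. Only the href alternative captures its value.

```python
import re

# Hypothetical pattern: each alternative is one attribute, in any order.
# Only the href alternative captures its value (named group 'href').
LINK_RE = re.compile(
    r'<link\s+'
    r'(?:(?:href="(?P<href>[^"]*)"|rel="[^"]*"|type="[^"]*")\s*)+'
    r'>'
)

tags = [
    '<link rel="stylesheet" href="mystylesheet.css" type="text/css">',
    '<link href="mystylesheet.css" rel="stylesheet" type="text/css">',
    '<link type="text/css" href="mystylesheet.css" rel="stylesheet">',
]

# The attribute group repeats, so href can sit in any position.
hrefs = [LINK_RE.search(tag).group('href') for tag in tags]
print(hrefs)
```

A tag with any other attribute (say, media=) would not match this sketch, which is the main weakness of listing the attributes explicitly.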

Hope this helped,

Colin

Jul 13 '06 #2

P: n/a
John Blogger wrote:
That I want a particular tag value of one of my HTML files.

ie: I want only the value after 'href=' in the tag >>

'<link href="mystylesheet.css" rel="stylesheet" type="text/css">'

here it would be 'mystylesheet.css'. I used the following regex to get
this value(I dont know if it is good).
No matter how good it is, you should still use something that
understands HTML:

>>> from BeautifulSoup import BeautifulSoup
>>> html='<link href="mystylesheet.css" rel="stylesheet" type="text/css">'
>>> page=BeautifulSoup(html)
>>> page.link.get('href')
'mystylesheet.css'

--
- Justin

Jul 14 '06 #3

P: n/a
Justin Azoff wrote:
>>> from BeautifulSoup import BeautifulSoup
>>> html='<link href="mystylesheet.css" rel="stylesheet" type="text/css">'
>>> page=BeautifulSoup(html)
>>> page.link.get('href')
'mystylesheet.css'

On second thought, you will probably want something like

>>> [link.get('href') for link in page.fetch('link',{'type':'text/css'})]
['mystylesheet.css']

which will properly handle multiple link tags.
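
If installing a third-party library is not an option, roughly the same thing can be sketched with the standard library's html.parser module. This is only an illustration under some assumptions: the class name StylesheetLinks and the sample file names (first.css and so on) are made up for the example, and the filter on type="text/css" mirrors the one in the list comprehension above.

```python
from html.parser import HTMLParser


class StylesheetLinks(HTMLParser):
    """Collect href values from <link type="text/css"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order is irrelevant.
        attributes = dict(attrs)
        if (tag == 'link'
                and attributes.get('type') == 'text/css'
                and 'href' in attributes):
            self.hrefs.append(attributes['href'])


# Made-up sample input: three stylesheet links in different attribute
# orders, plus one link that is not a stylesheet.
page_source = '''
<link rel="stylesheet" href="first.css" type="text/css">
<link href="second.css" rel="stylesheet" type="text/css">
<link type="text/css" href="third.css" rel="stylesheet">
<link rel="icon" href="favicon.ico" type="image/x-icon">
'''

parser = StylesheetLinks()
parser.feed(page_source)
parser.close()
print(parser.hrefs)
```

Because the parser hands over the attributes as name/value pairs, the order in which they appear in the tag no longer matters.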

--
- Justin

Jul 14 '06 #4

P: n/a
Ant
So What should I do to get the exact value(here the value after
'href=') in any case even if the

tags are like these? >>

<link rel="stylesheet" href="mystylesheet.css" type="text/css">
-OR-
<link href="mystylesheet.css" rel="stylesheet" type="text/css">
-OR-
<link type="text/css" href="mystylesheet.css" rel="stylesheet">
The following should do it:

expr = r'<link .*?href="(.*?)"'

or if single quotes might have been used:

expr = r'''<link .*?href=["'](.*?)['"]'''

But like the others have said, beautiful soup is very good for things
like this.
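
A quick way to try the second expression, as a rough sketch: the sample text below is made up from the three tags in the question, with one of them switched to single quotes to exercise that case.

```python
import re

# The expression from above, accepting either quote style around the value.
expr = r'''<link .*?href=["'](.*?)['"]'''

# Sample text: the three attribute orders from the question; the last tag
# uses single quotes around the href value (an assumption for the demo).
text = '''
<link rel="stylesheet" href="mystylesheet.css" type="text/css">
<link href="mystylesheet.css" rel="stylesheet" type="text/css">
<link type="text/css" href='mystylesheet.css' rel="stylesheet">
'''

# findall returns the contents of the single capturing group per match.
matches = re.findall(expr, text)
print(matches)
```

One caveat: if a link tag has no href at all, the non-greedy .*? can run on into a later tag on the same line, so this is best kept for simple, well-formed input.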

Jul 14 '06 #5

P: n/a
Pyparsing is also good for recognizing basic HTML tags and their
attributes, regardless of the order of the attributes.

-- Paul

testText = """sldkjflsa;faj

<link href="mystylesheet.css" rel="stylesheet" type="text/css">

here it would be 'mystylesheet.css'. I used the following regex to get
this value(I dont know if it

I thought I was doing fine until I got stuck by this tag >>

<link rel="stylesheet" href="mystylesheet.css" type="text/css" : same

tag but with 'href=' part

tags are like these? >>

<link rel="stylesheet" href="mystylesheet.css" type="text/css">
-OR-
<link href="mystylesheet.css" rel="stylesheet" type="text/css">
-OR-
<link type="text/css" href="mystylesheet.css" rel="stylesheet">

"""
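from pyparsing import makeHTMLTags, line

# makeHTMLTags returns (start tag, end tag) expressions; only the start tag is needed
linkTag = makeHTMLTags("link")[0]
for toks, s, e in linkTag.scanString(testText):
    print(toks.href)          # value of the href attribute
    print(line(s, testText))  # the source line containing the match
    print("")                 # blank line between results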

Prints out:

mystylesheet.css
<link href="mystylesheet.css" rel="stylesheet" type="text/css">

mystylesheet.css
<link rel="stylesheet" href="mystylesheet.css" type="text/css" : same

mystylesheet.css
<link rel="stylesheet" href="mystylesheet.css" type="text/css">

mystylesheet.css
<link href="mystylesheet.css" rel="stylesheet" type="text/css">

mystylesheet.css
<link type="text/css" href="mystylesheet.css" rel="stylesheet">

Jul 14 '06 #6
