regexp

Hello

I need to apply some regular expressions across more than one line,
and I would like to use modifiers like /m or /s in Perl.
For example:
re.sub("<script.*>.*</script>","",data)

will not cut out all the JavaScript code if it is spread over many lines.
I could use something like Perl's /s, which makes . match every character
(including newline). How can I do that?

Maybe there is another way to achieve the same result?

Thanx
Dec 19 '06 #1
vertigo wrote:
> I need to use some regular expressions for more than one line.
> And i would like to use some modifiers like: /m or /s in perl.
> For example:
> re.sub("<script.*>.*</script>","",data)
>
> will not cut out all javascript code if it's spread on many lines.

that won't cut out all javascript code period.

</F>

Dec 19 '06 #2
On Tuesday 19 December 2006 13:15, vertigo wrote:
> Hello
>
> I need to use some regular expressions for more than one line.
> And i would like to use some modifiers like: /m or /s in perl.
> For example:
> re.sub("<script.*>.*</script>","",data)
>
> will not cut out all javascript code if it's spread on many lines.
> I could use something like /s from perl which treats . as all signs
> (including new line). How can i do that ?
>
> Maybe there is other way to achieve the same results ?
>
> Thanx

Take a look at Chapter 8 of 'Dive Into Python.'
http://diveintopython.org/toc/index.html

You can modify the code there and get the results that you need. Buy the book
if you can :) It has lots of neat examples.

- Jonathan Curran
Dec 19 '06 #3

> vertigo wrote:
>> I need to use some regular expressions for more than one line.
>> And i would like to use some modifiers like: /m or /s in perl.
>> For example:
>> re.sub("<script.*>.*</script>","",data)
>> will not cut out all javascript code if it's spread on many lines.
>
> that won't cut out all javascript code period.

Do you have any idea what will?
I need to cut everything but the pure text data.

Thanx
Dec 19 '06 #4
Hello

> On Tuesday 19 December 2006 13:15, vertigo wrote:
>> Hello
>>
>> I need to use some regular expressions for more than one line.
>> And i would like to use some modifiers like: /m or /s in perl.
>> For example:
>> re.sub("<script.*>.*</script>","",data)
>>
>> will not cut out all javascript code if it's spread on many lines.
>> I could use something like /s from perl which treats . as all signs
>> (including new line). How can i do that ?
>>
>> Maybe there is other way to achieve the same results ?
>>
>> Thanx
>
> Take a look at Chapter 8 of 'Dive Into Python.'
> http://diveintopython.org/toc/index.html

I read the whole regexp chapter, but there was no solution for my problem.
Example:

re.sub("<!--.*-->","",htmldata)
would remove only comments that fit on one line.
If a comment spans several lines, like this:
<!--start
of
comment, end-->

it would not work, because '.' does not match '\n'.

Does anybody know a solution for this particular problem?

Thanx
Dec 19 '06 #5
You want re.sub("(?s)<!--.*?-->", "", htmldata)

Explanation: To make the dot match all characters, including newlines,
you need to set the DOTALL flag. You can set the flag inline using the
(?s) syntax, which is explained in section 4.2.1 of the Python Library
Reference.

A more readable way to do this is:

obj = re.compile("<!--.*?-->", re.DOTALL)
re.sub("", htmldata)
Dec 19 '06 #6
Oops, I mean obj.sub("", htmldata)
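
For reference, here is a minimal sketch of both forms side by side, with the correction applied. The sample string and the variable names (htmldata contents, comment_re, unchanged, inline, compiled) are made up for illustration; only re.sub, re.compile and re.DOTALL come from the posts above.

```python
import re

# Hypothetical sample: one HTML comment spread over three lines.
htmldata = "before<!--start\nof\ncomment, end-->after"

# Without DOTALL, '.' stops at '\n', so the multi-line comment is left alone.
unchanged = re.sub("<!--.*?-->", "", htmldata)

# Inline flag: (?s) at the start of the pattern switches DOTALL on.
inline = re.sub("(?s)<!--.*?-->", "", htmldata)

# Compiled pattern: pass re.DOTALL to re.compile, then call sub() on the
# compiled object (not on the re module).
comment_re = re.compile("<!--.*?-->", re.DOTALL)
compiled = comment_re.sub("", htmldata)

print(repr(unchanged))
print(repr(inline))
print(repr(compiled))
```

The two DOTALL variants should behave the same; the compiled form is mostly a matter of readability and of reusing the pattern.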

Dec 19 '06 #7
Hello

Thanks for the help. I have one more question:

I noticed that when matching a regexp, Python tries to match as widely as
possible.
For example:
re.sub("<!--.*-->","",htmldata)
would cut out everything between the first "<!--" and the last "-->" in the
document.
Can I force re to match as narrowly as possible?
(That is, to match the first "<!--" with the first "-->" after it, and to repeat
this procedure while the pattern is still found?)

Thanx

Dec 19 '06 #8

vertigo> i noticed that while matching regexp python tries to match as wide as it's
vertigo> possible,
vertigo> for example:
vertigo> re.sub("<!--.*-->","",htmldata)
vertigo> would cut out everything before first "<!--" and last "-->" in the
vertigo> document.
vertigo> Can i force re to match as narrow as possible ?

http://docs.python.org/lib/re-syntax.html

Search for "greedy".

Skip
Dec 19 '06 #9
On Tuesday 19 December 2006 15:32, Paul Arthur wrote:
> On 2006-12-19, vertigo <sp**@spam.pl> wrote:
>> Hello
>>> Take a look at Chapter 8 of 'Dive Into Python.'
>>> http://diveintopython.org/toc/index.html
>> i read whole regexp chapter -
>
> Did you read Chapter 8? Regexes are 7; 8 is about processing HTML.
> Regexes are not well suited to this type of processing.
>
>> but there was no solution for my problem.
>> Example:
>>
>> re.sub("<!--.*-->","",htmldata)
>> would remove only comments which are in one line.
>> If comment is in many lines like this:
>> <!--start
>> of
>> comment, end-->
>>
>> it would not work. It's because '.' sign does not match '\n' sign.
>>
>> Does anybody know a solution for this particular problem ?
>
> Yes. Use DOTALL mode.

Paul, I mentioned Chapter 8 so that vertigo would take a look at the HTML
processing section. What vertigo wants can be done with relative ease with sgmllib.

Anyway, if you (Vertigo) want to use regular expressions to do this, you can
try and use some regular expression testing programs. I'm not quite sure of
the name but there is one that comes with KDE.

- Jonathan Curran
Dec 20 '06 #10
Not just Python: most regex engines work this way. You want a ?
after your *, as in <!--(.*?)-->, if you want it to stop at the first
available "-->".
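
A small sketch of the difference, assuming a made-up input with two comments on one line (the names htmldata, greedy, lazy and bodies are only for illustration):

```python
import re

# Hypothetical input with two separate comments.
htmldata = "a<!-- one -->b<!-- two -->c"

# Greedy: .* runs to the LAST "-->", so the "b" between the comments is lost.
greedy = re.sub("<!--.*-->", "", htmldata)

# Non-greedy: .*? stops at the FIRST "-->" after each "<!--".
lazy = re.sub("<!--.*?-->", "", htmldata)

# With a group, findall returns each comment body separately.
bodies = re.findall("<!--(.*?)-->", htmldata)

print(greedy)   # ac
print(lazy)     # abc
print(bodies)   # [' one ', ' two ']
```

re.sub already replaces every non-overlapping match, so there is no need to loop until the pattern is no longer found. For comments that span lines, the non-greedy quantifier would be combined with the DOTALL flag discussed earlier in the thread.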

At this point in your adventure, you might be wondering whether regular
expressions are more trouble than they are worth. They are. There are
two libraries you need to take a look at, and soon: BeautifulSoup for
parsing HTML, and PyParsing for parsing everything else. Take the time
you were planning to spend on deciphering regexes like
"(\d{1,3}\.){3}\d{1,3}" and spend it learning the basics of those
libraries instead -- you will not regret it.
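
To give a rough idea of the parser-based approach, here is a sketch that keeps only the text and drops comments and script contents. Note the assumptions: it does not use BeautifulSoup or PyParsing but the standard library's html.parser module as a stand-in, and the names TextExtractor and extract_text are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data; they go to handle_comment,
        # which is left as the default no-op here.
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the text content of html without scripts, styles or comments."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<p>Hello</p><!-- a\ncomment --><script>\nalert('x');\n</script><p>world</p>"
text = extract_text(sample)
print(text)
```

Because the parser tracks tags itself, multi-line comments and scripts need no special flags. This is only a sketch for well-formed input; real-world pages may need a more forgiving parser such as the BeautifulSoup library mentioned above.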

Dec 20 '06 #11
