RE Module

Roman

I am trying to filter a column in a list of all html tags.

To do that, I have set up the following statement.

row[0] = re.sub(r'<.*?>', '', row[0])

The results I get are sporadic. Sometimes two tags are removed.
Sometimes one tag is removed. Sometimes no tags are removed. Could
somebody tell me where I have gone wrong here?

Thanks in advance

Aug 25 '06 #1

Simon Forman

Roman wrote:

I am trying to filter a column in a list of all html tags.

What?

To do that, I have set up the following statement.

row[0] = re.sub(r'<.*?>', '', row[0])

The results I get are sporadic. Sometimes two tags are removed.
Sometimes one tag is removed. Sometimes no tags are removed. Could
somebody tell me where I have gone wrong here?

Thanks in advance

I'm no re expert, so I won't try to advise you on your re, but it might
help those who are if you gave examples of your input and output data.
What results are you getting for what input strings?

Also, if you're just trying to strip html markup to get plain text from
a file, "w3m -dump some.html" works great. ;-)

HTH,
~Simon

Aug 25 '06 #2

Anthra Norell

Roman,

Your re works for me. I suspect you have tags spanning lines, a thing you get more often than not. If so, processing linewise
doesn't work. You need to catch the tags like this:

>>> text = re.sub ('<(.|\n)*?>', '', text)
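A minimal sketch of the multi-line case, assuming Python 3 and a made-up sample string (the `text` value and the variable names are illustrative, not from the original post). It compares the line-bound pattern with the `(.|\n)` form, and with the `re.DOTALL` flag, which should give the same result here:

```python
import re

# Hypothetical input: the opening <a ...> tag is split across two lines.
text = 'Hello <a\nhref="x.html">world</a>!'

# '.' does not match a newline by default, so the split tag is left in place.
line_bound = re.sub(r'<.*?>', '', text)

# Alternating '.' with '\n' lets the match cross line boundaries.
cross_lines = re.sub(r'<(.|\n)*?>', '', text)

# re.DOTALL makes '.' match newlines as well.
dotall = re.sub(r'<.*?>', '', text, flags=re.DOTALL)

print(repr(line_bound))   # 'Hello <a\nhref="x.html">world!'
print(repr(cross_lines))  # 'Hello world!'
print(repr(dotall))       # 'Hello world!'
```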

If your text is reasonably small I would recommend this solution. Else you might want to take a look at SE which is a stream editor
that does the buffering for you:

http://cheeseshop.python.org/pypi/SE/2.2%20beta

>>> import SE
>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~~=" ')
>>> print Tag_Stripper (text)

(... your text without tags ...)

The Tag_Stripper is made up of two regexes. The second one catches comments which may nest tags. The first expression alone would
also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and quit prematurely. The example
"re.sub ('<(.|\n)*?>', '', text)" above would misperform in this respect.
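The comment problem can be reproduced with plain `re`. This is only a sketch, assuming Python 3; the sample string and the comment pattern `<!--.*?-->` are my own choices for illustration and are not taken from the SE definitions above:

```python
import re

# Hypothetical input: an HTML comment that contains a nested tag.
text = 'keep<!-- <b>hidden</b> --> this'

# One pass: the match for the comment stops at the '>' of the nested <b>,
# so part of the comment body and its closing '-->' leak into the output.
naive = re.sub(r'<(.|\n)*?>', '', text)

# Two passes: remove whole comments first, then the remaining tags.
no_comments = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
two_pass = re.sub(r'<(.|\n)*?>', '', no_comments)

print(repr(naive))     # 'keephidden --> this'
print(repr(two_pass))  # 'keep this'
```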

Your Tag_Stripper takes input from files directly:

>>> Tag_Stripper ('name_of_file.htm', 'name_of_output_file')

'name_of_output_file'

Or if you want to view the output:

>>> Tag_Stripper ('name_of_file.htm', '')

(... your text without tags ...)

If you want to keep the definitions for later use, do this:

>>> Tag_Stripper.save ('[your_path/]tag_stripper.se')

Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can
make it simply by naming the file:

>>> Tag_Stripper = SE.SE ('[your_path/]tag_stripper.se')

You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (&nbsp;
etc.) you'd simply add the name of the file that defines the ampersand replacements:

>>> Tag_Stripper = SE.SE ('tag_stripper.se htm2iso.se')

'htm2iso.se' comes with the SE package ready to use and as an example for writing one's own replacement sets.
Frederic

--
http://mail.python.org/mailman/listinfo/python-list

Aug 25 '06 #3

Roman

Thanks for your help.

A thing I didn't mention is that before the statement row[0] =
re.sub(r'<.*?>', '', row[0]), I have row[0]=re.sub(r'[^
0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]', '', row[0])
statement. Hence, the line separators are going to be gone. You
mentioned the size of the string could be a factor. If so what is the
max size before I see problems?

Thanks again

Aug 25 '06 #4

tobiah

Roman wrote:

I am trying to filter a column in a list of all html tags.

To do that, I have setup the following statement.

row[0] = re.sub(r'<.*?>', '', row[0])

The regex will be 'greedy' and match through one tag
all the way to the end of another on the same line.
There are more complete suggestions offered, but
it seems to me that the simple fix here is to not
match through the end of the tag, like this:

"<[^>]*>"
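A quick sketch of what the negated character class does, assuming Python 3 and made-up sample strings. For contrast it also runs a genuinely greedy pattern, `<.*>`, which shows the run-through-to-the-last-tag behaviour described above:

```python
import re

TAG = re.compile(r'<[^>]*>')  # '<', then anything that is not '>', then '>'

same_line = '<b>bold</b> and <i>italic</i>'
multi_line = '<a\nhref="x.html">link</a>'

# Each match stops at the first '>', so tags are removed one at a time.
stripped_same = TAG.sub('', same_line)

# '[^>]' also matches '\n', so a tag split across lines needs no extra flag.
stripped_multi = TAG.sub('', multi_line)

# A greedy '.*' runs from the first '<' to the last '>' on the line.
greedy = re.sub(r'<.*>', '', same_line)

print(repr(stripped_same))   # 'bold and italic'
print(repr(stripped_multi))  # 'link'
print(repr(greedy))          # ''
```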

--
Posted via a free Usenet account from http://www.teranews.com

Aug 25 '06 #5

Roman

This is excellent. Thanks a lot.

Also, what made the expression greedy?

Aug 25 '06 #6

Anthra Norell

Roman,

I don't quite understand what you mean. Line separators gone? That would be the '\n', right? What of it if you process line by line,
as your variable name 'row' suggests?
As to the maximum size re can handle, I have no idea. I vaguely remember the topic being discussed. You should be able to find
the discussions in the archives, if a knowledgeable soul doesn't volunteer the info right away. With SE it is of no concern.

Anyway, I think the best thing to do is to just try with a real page:

>>> import urllib
>>> f = urllib.urlopen (r'http://www.python.org')
>>> page = f.read (); f.close ()
>>> import SE
>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~~=" ')
>>> Tag_Stripper (page)

( ... page without tags, but lots of empty lines ...)

If you want to take the empty lines out, do this:

>>> Tag_Stripper = SE.SE (' "~<(.|\n)*?>~=" "~~=" | "~\r?\n\s+?(?=\r?\n)~=" | "~(\r?\n)+~=\n" ')

"|" means do the preceding replacements (which happen to be deletions: replace with nothing) and go on from there. The expressions
we added say: delete lines that contain only spaces. Do that (another "|"). And finally replace multiple consecutive line feeds with
a single line feed.
So you can develop interactively. Add a definition. See what it does. Add another one. One little step at a time. Hacking at
its best!
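For readers without SE, the two clean-up expressions can be tried with plain `re`. This is a sketch only, assuming Python 3; the function name `collapse_blank_lines` and the sample string are hypothetical, and the two patterns are copied from the SE definition above and applied one after the other:

```python
import re

def collapse_blank_lines(text):
    """Apply the two clean-up patterns from the SE definition in sequence."""
    # Step 1: delete lines that contain only whitespace.
    text = re.sub(r'\r?\n\s+?(?=\r?\n)', '', text)
    # Step 2: replace runs of line feeds with a single line feed.
    text = re.sub(r'(\r?\n)+', '\n', text)
    return text

# Hypothetical sample: one whitespace-only line and one empty line.
sample = 'a\n   \n\nb\n'
cleaned = collapse_blank_lines(sample)
print(repr(cleaned))  # 'a\nb\n'
```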

Frederic

Aug 25 '06 #7

tobiah

Roman wrote:

This is excellent. Thanks a lot.

Also, what made the expression greedy?

They usually are, by default. It means that when there
is more than one way to match the pattern, the engine chooses
the one that matches the most text. Often there are flags
available to change that behavior. I'm not sure offhand
how to do it with the re module.

Aug 25 '06 #8

Tim Chase

Also, what made the expression greedy?

They usually are, by default. It means that when there
are more than one ways to match the pattern, choose the
one that matches the most text. Often there are flags
available to change that behavior. I'm not sure off hand
how to do it with the re module.

In Python's re module, they're like Perl:

Greedy: "<.*>"
Nongreedy: "<.*?>"

By appending a question mark to the repeat operator, one makes it a
non-greedy repeat. This also applies to the plus ("one or more")
and the question mark ("zero or one").
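A short sketch of the greedy and non-greedy forms of each repeat operator, assuming Python 3; the sample strings are made up for illustration:

```python
import re

# '*' versus '*?'
star_greedy = re.search(r'<.*>', '<b>bold</b>').group(0)
star_lazy = re.search(r'<.*?>', '<b>bold</b>').group(0)

# '+' versus '+?'
plus_greedy = re.search(r'\d+', '2006').group(0)
plus_lazy = re.search(r'\d+?', '2006').group(0)

# '?' versus '??'
opt_greedy = re.search(r'https?', 'https').group(0)
opt_lazy = re.search(r'https??', 'https').group(0)

print(star_greedy, star_lazy)  # <b>bold</b> <b>
print(plus_greedy, plus_lazy)  # 2006 2
print(opt_greedy, opt_lazy)    # https http
```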

-tkc

Aug 25 '06 #9

tobiah

In python's RE module, they're like Perl:

Greedy: "<.*>"
Nongreedy: "<.*?>"

Oh, I have never seen that. In that case, why
did Roman's first example not work well for
HTML tags?

'<.*?>'

Also, how does the engine decide whether I am adjusting
the greed of the previous operator, or just asking
for another possible character?

Suppose I want:

"x*?" to match "xxxxxxxO"

If the '?' means non greedy, then I should get 'x' back.
If the '?' means optional character then I should get
the full string back.

Checking in python:

######################################
import re

s = 'xxxxxxx0'

m = re.search("x*", s)
print "First way", m.group(0)

m = re.search("x*?", s)
print "Second way", m.group(0)
#####################################
First way xxxxxxx
Second way

So now I'm really confused. It didn't do a non-greedy
'x' match, nor did it allow the '?' to match the '0'.

Aug 25 '06 #10

Fredrik Lundh

tobiah wrote:

Also, how does the engine decide whether I am adjusting
the greed of the previous operator, or just asking
for another possible character?

"?" always modifies the *preceding* RE element.

if the preceding element is a pattern (e.g. a character or group), it
means that the pattern is optional.

if the preceding element is a repeat modifier (*, +, or ?), it changes
the greediness.

Suppose I want:

"x*?" to match "xxxxxxxO"

If the '?' means non greedy, then I should get 'x' back.

no, because "*" means *ZERO* or more matches, not one or more.

If the '?' means optional character then I should get
the full string back.

no, because "?" never means anything on its own; it's a pattern
modifier, not a pattern.
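The two roles of "?" can be checked directly. A small sketch, assuming Python 3; the `colou?r` pattern is an illustrative example of my own, while the test string is the one from the question:

```python
import re

s = 'xxxxxxx0'

# After a plain pattern element, '?' makes that element optional.
optional = re.search(r'colou?r', 'color').group(0)

# After '*', '?' makes the repeat non-greedy; zero repeats is the
# shortest possible match, so the result is the empty string.
lazy_star = re.search(r'x*?', s).group(0)

# '+' needs at least one repeat, so the non-greedy form yields one 'x'.
lazy_plus = re.search(r'x+?', s).group(0)

# A non-greedy repeat still expands as far as the rest of the pattern requires.
lazy_then_zero = re.search(r'x*?0', s).group(0)

print(repr(optional))        # 'color'
print(repr(lazy_star))       # ''
print(repr(lazy_plus))       # 'x'
print(repr(lazy_then_zero))  # 'xxxxxxx0'
```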

Checking in python:

######################################
import re

s = 'xxxxxxx0'

m = re.search("x*", s)
print "First way", m.group(0)

m = re.search("x*?", s)
print "Second way", m.group(0)
#####################################
First way xxxxxxx
Second way

So now I'm really confused. It didn't do a non-greedy
'x' match, nor did it allow the '?' to match the '0'.

see above. reading the RE documentation again may also help.

</F>

Aug 25 '06 #11

Tim Chase

######################################

import re

s = 'xxxxxxx0'

m = re.search("x*", s)
print "First way", m.group(0)

m = re.search("x*?", s)
print "Second way", m.group(0)
#####################################
First way xxxxxxx
Second way

So now I'm really confused. It didn't do a non-greedy
'x' match, nor did it allow the '?' to match the '0'.

It did do a non-greedy match. It found as few "x"s as possible:
it found 0 of them, and quit. For a better test, use

s = '<tag 1><tag 2>'
print re.search('<tag.*?>',s).group(0)
print re.search('<tag.*>',s).group(0)

(the question/problem at hand)

-tkc

Aug 25 '06 #12

Roman

I looked at a book called Beginning Python and it claims that <.*?> is
a non-greedy match.

Aug 27 '06 #13

tobiah

Roman wrote:

I looked at a book called Beginning Python and it claims that <.*?> is
a non-greedy match.

Yeah, I get that now, but why didn't it work for you in
the first place?

Aug 28 '06 #14

Roman

It turns out it was a false alarm. It works. I had other logic in the
expression involving punctuation marks and got all confused with the
escape characters. It becomes a mess trying to keep track of all the
reserved characters as you go from module to module.

Aug 29 '06 #15
