Clean "Durty" strings

Hello,

I need to clean a string like this:

string =
"""
bonne mentalit&eacute; mec!:) \n <br>bon pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n <br>mais pour avoir des resultats probant il
faut pas faire les mariolles, comme le &quot;fondateur&quot; de bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le &quot;fondateur&quot; de bvs krew \n
"""

into:
bonne mentalité mec!:) bon pour info moi je suis un serial posteur
arceleur dictateur ^^* mais pour avoir des resultats probant il faut
pas faire les mariolles, comme le "fondateur" de bvs krew
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew

To do this I would like to use only standard libraries.

Thanks

Apr 1 '07 #1
The obvious way, which has already been suggested to you elsewhere, is to use
BeautifulSoup.

> To do this I would like to use only standard libraries.

Then you need to reprogram what BeautifulSoup does. Happy hacking!

Diez

Apr 2 '07 #2
"Diez B. Roggisch" <de***@nospam.web.de> wrote in
news:57*************@mid.uni-berlin.de:
> The obvious way that has been suggested to you at other places
> is to use BeautifulSoup.
>
>> To do this I would like to use only standard libraries.
>
> Then you need to reprogram what BeautifulSoup does. Happy
> hacking!

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a half-
dozen lines of code, plus the from/to table definition.

--
rzed

Apr 2 '07 #3
The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
code to clean up. Because your simple replacement-approach won't help here:

<br>foo <br> bar </br>

Which is perfectly legal HTML, but nasty to parse.

Diez
Apr 2 '07 #4
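
To illustrate the point about tag variants, here is a small sketch in Python 3 syntax (which postdates this thread). The spellings in `samples` and the helper name `strip_br_naive` are made up for illustration and are not taken from the OP's data; the idea is only that a literal from/to replacement removes the one spelling it was given and nothing else.

```python
# Hypothetical spellings of a line-break tag; only the first one
# matches the literal string "<br>".
samples = ["<br>foo", "<BR>foo", "<br/>foo", "<br />foo", "foo</br>"]

def strip_br_naive(text):
    # Literal replacement: removes only the exact string "<br>".
    return text.replace("<br>", "")

for sample in samples:
    print(repr(sample), "->", repr(strip_br_naive(sample)))
```

Every variant except the first comes back unchanged, so the from/to table would need one entry per spelling that can occur in the input.
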
On Apr 2, 4:05 pm, "Diez B. Roggisch" <d...@nospam.web.de> wrote:
> The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
> code to clean up. Because your simple replacement-approach won't help here:
>
> <br>foo <br> bar </br>
>
> Which is perfectly legal HTML, but nasty to parse.
>
> Diez

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s). For whitespace, re.sub(r'\s+', ' ', s). For
character entities like &eacute;,
re.sub(r'&(\w+);', lambda mo: unichr(htmlentitydefs.name2codepoint[mo.group(1)]), s)
and re.sub(r'&#(\d+);', lambda mo: unichr(int(mo.group(1))), s). That's
pretty much it.

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s,convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Apr 2 '07 #5
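
For readers on a current interpreter, the regular expressions from the post above can be sketched in Python 3 as follows. This is an adaptation, not the poster's code: `unichr` becomes `chr`, `htmlentitydefs` becomes `html.entities`, and the function name `clean` plus the guard for unknown entity names are assumptions added here.

```python
import re
from html.entities import name2codepoint

def clean(s):
    # Drop anything that looks like a tag (naive: assumes no '>' inside
    # attribute values).
    s = re.sub(r'<[^>]+>', '', s)
    # Named entities such as &eacute;; unknown names are left untouched.
    s = re.sub(
        r'&(\w+);',
        lambda mo: chr(name2codepoint[mo.group(1)])
        if mo.group(1) in name2codepoint else mo.group(0),
        s,
    )
    # Decimal numeric references such as &#233;.
    s = re.sub(r'&#(\d+);', lambda mo: chr(int(mo.group(1))), s)
    # Collapse whitespace runs and trim the ends.
    return re.sub(r'\s+', ' ', s).strip()

print(clean('bonne mentalit&eacute; mec!:) \n <br>bon pour info'))
# prints: bonne mentalité mec!:) bon pour info
```

In Python 3.4 and later, the two entity substitutions could probably be replaced by a single call to the standard library's `html.unescape`.
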
In <11**********************@d57g2000hsg.googlegroups.com>, irstas wrote:
> I'd like to see how this transformation can be done with
> BeautifulSoup. Well, the last two regexps can be replaced with this:
>
> unicode(BeautifulStoneSoup(s,convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Completely without regular expressions:

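from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x, Python 2

def main():
    soup = BeautifulSoup(source, convertEntities=BeautifulSoup.HTML_ENTITIES)
    # Join all text nodes of `source`, then collapse whitespace runs.
    print(' '.join(''.join(soup(text=True)).split()))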

Ciao,
Marc 'BlackJack' Rintsch
Apr 2 '07 #6
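
Since the OP asked for standard libraries only, the same "collect the text nodes, then collapse whitespace" idea can be sketched with `html.parser.HTMLParser` from the Python 3 standard library. This is a modern stand-in rather than anything proposed in the thread; the class name `TextExtractor` and the function name `clean` are hypothetical, and the sample input is a shortened variant of the OP's string.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute;
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean(source):
    parser = TextExtractor()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes, then collapse whitespace runs to single spaces.
    return ' '.join(''.join(parser.chunks).split())

source = 'bonne mentalit&eacute; mec!:) \n <br>bon pour &quot;info&quot;'
print(clean(source))
# prints: bonne mentalité mec!:) bon pour "info"
```

Because the parser, not a fixed string, decides what counts as a tag, spellings such as `<br/>` or `</br>` are dropped without extra table entries.
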
ir****@gmail.com wrote:
> But it could be that he just wants all HTML tags to disappear, like in
> his example. A code like this might be sufficient then:
> re.sub(r'<[^>]+>', '', s).

Won't work for, say, this:

<img src="src" alt="<text>">
--
Michael Hoffman
Apr 2 '07 #7
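
The counterexample above can be checked directly. The sketch below (Python 3 syntax; the names `TAG_RE`, `html_text` and `leftover` are chosen here for illustration) applies the quoted pattern to that `img` tag.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')  # the pattern quoted in the post above

html_text = '<img src="src" alt="<text>">'
leftover = TAG_RE.sub('', html_text)
# The match stops at the first '>', which sits inside the alt attribute,
# so the tail of the tag survives as stray text.
print(repr(leftover))
# prints: '">'
```

The pattern has no notion of quoted attribute values, which is why the closing quote and bracket are left behind.
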
On Apr 2, 10:08 pm, Michael Hoffman <cam.ac...@mh391.invalid> wrote:
> irs...@gmail.com wrote:
>> But it could be that he just wants all HTML tags to disappear, like in
>> his example. A code like this might be sufficient then:
>> re.sub(r'<[^>]+>', '', s).
>
> Won't work for, say, this:
>
> <img src="src" alt="<text>">
> --
> Michael Hoffman

True, but is that legal? I think the alt attribute needs to use &lt;
and &gt;. Although I know what you're going to reply: that
BeautifulSoup probably parses it even if it's invalid HTML. And I
agree: using BeautifulSoup is a better solution than custom
regexps.

Apr 2 '07 #8
"Diez B. Roggisch" <de***@nospam.web.de> wrote in
news:57*************@mid.uni-berlin.de:
> The OP had <br>-tags in his text. Which is _more_ than a half
> dozen lines of code to clean up. Because your simple
> replacement-approach won't help here:
>
> <br>foo <br> bar </br>
>
> Which is perfectly legal HTML, but nasty to parse.

Well, as I said, given the input the OP supplied, it's not even
necessary to parse it. It isn't clear what the true desired
operation is, but this seems to meet the criteria given:

<code -- the string 's' is wrapped nastily, but ...>
s ="""\
bonne mentalit&eacute; mec!:) \n <br>bon
pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n <br>mais pour avoir des resultats
probant il
faut pas faire les mariolles, comme le &quot;fondateur&quot; de
bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les
mariolles,
comme le &quot;fondateur&quot; de bvs krew \n"""

fromlist = ['<br>', '&eacute;', '&quot;']
tolist = ['', 'é', '"']

def withReplacements(s, flist, tlist):
    # Apply each from/to replacement in turn.
    for f, t in zip(flist, tlist):
        s = s.replace(f, t)
    return s

# Collapse whitespace first, then apply the replacements.
print(withReplacements(' '.join(s.split()), fromlist, tolist))

</code>

If the question is about efficiency or robustness or generality,
then that's another set of issues, but that's for the 1.1 version
to handle.

--
rzed

Apr 2 '07 #9
