
Clean "Dirty" strings

Hello,

I need to clean a string like this:

string =
"""
bonne mentalit&eacute; mec!:) \n <br>bon pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n <br>mais pour avoir des resultats probant il
faut pas faire les mariolles, comme le &quot;fondateur&quot; de bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le &quot;fondateur&quot; de bvs krew \n
"""

into:
bonne mentalité mec!:) bon pour info moi je suis un serial posteur
arceleur dictateur ^^* mais pour avoir des resultats probant il faut
pas faire les mariolles, comme le "fondateur" de bvs krew
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew

To do this I would like to use only standard libraries.

Thanks

Apr 1 '07 #1
8 Replies


Ulysse wrote:
> Hello,
>
> I need to clean a string like this:
>
> string =
> """
> bonne mentalit&eacute; mec!:) \n <br>bon pour
> info moi je suis un serial posteur arceleur dictateur ^^*
> \n <br>mais pour avoir des resultats probant il
> faut pas faire les mariolles, comme le &quot;fondateur&quot; de bvs
> krew \n
> mais pour avoir des resultats probant il faut pas faire les mariolles,
> comme le &quot;fondateur&quot; de bvs krew \n
> """
>
> into:
> bonne mentalité mec!:) bon pour info moi je suis un serial posteur
> arceleur dictateur ^^* mais pour avoir des resultats probant il faut
> pas faire les mariolles, comme le "fondateur" de bvs krew
> mais pour avoir des resultats probant il faut pas faire les mariolles,
> comme le "fondateur" de bvs krew

The obvious way that has been suggested to you at other places is to use
BeautifulSoup.

> To do this I would like to use only standard libraries.

Then you need to reprogram what BeautifulSoup does. Happy hacking!

Diez

Apr 2 '07 #2

"Diez B. Roggisch" <de***@nospam.web.de> wrote in
news:57*************@mid.uni-berlin.de:
> Ulysse wrote:
>> Hello,
>>
>> I need to clean a string like this:
>>
>> string =
>> """
>> bonne mentalit&eacute; mec!:) \n <br>bon
>> pour info moi je suis un serial posteur arceleur dictateur ^^*
>> \n <br>mais pour avoir des resultats
>> probant il faut pas faire les mariolles, comme le
>> &quot;fondateur&quot; de bvs krew \n
>> mais pour avoir des resultats probant il faut pas faire les
>> mariolles, comme le &quot;fondateur&quot; de bvs krew \n
>> """
>>
>> into:
>> bonne mentalité mec!:) bon pour info moi je suis un serial
>> posteur arceleur dictateur ^^* mais pour avoir des resultats
>> probant il faut pas faire les mariolles, comme le "fondateur"
>> de bvs krew mais pour avoir des resultats probant il faut pas
>> faire les mariolles, comme le "fondateur" de bvs krew
>
> The obvious way that has been suggested to you at other places
> is to use BeautifulSoup.
>
>> To do this I would like to use only standard libraries.
>
> Then you need to reprogram what BeautifulSoup does. Happy
> hacking!

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a
half-dozen lines of code, plus the from/to table definition.

--
rzed

Apr 2 '07 #3

> If the OP is constrained to standard libraries, then it may be a
> question of defining what should be done more clearly. The extraneous
> spaces can be removed by tokenizing the string and rejoining the
> tokens. Replacing portions of a string with equivalents is standard
> stuff. It might be preferable to create a function that will accept
> lists of from and to strings and translate the entire string by
> successively applying the replacements. From what I've seen so far,
> that would be all the OP needs for this task. It might take a
> half-dozen lines of code, plus the from/to table definition.

The OP had <br> tags in his text, which take _more_ than a half dozen lines
of code to clean up, because your simple replacement approach won't help here:

<br>foo <br> bar </br>

Which is perfectly legal HTML, but nasty to parse.
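
To illustrate the point, here is a small sketch (Python 3, standard library only) of why a literal replace of '<br>' falls short once the tag shows up in other spellings. The names BR_TAG and strip_br and the sample string are made up for illustration; the pattern still ignores <br> tags that carry attributes, so it is not a general solution.

```python
import re

# Hypothetical helper: matches <br>, <BR>, <br/>, <br />, </br> and <br >,
# but not <br> tags that carry attributes.
BR_TAG = re.compile(r'<\s*/?\s*br\s*/?\s*>', re.IGNORECASE)

def strip_br(text):
    """Replace every simple <br> variant with a single space."""
    return BR_TAG.sub(' ', text)

sample = '<br>foo <BR />bar </br>baz<br >'

# A literal replacement only removes the exact spelling '<br>'.
naive = sample.replace('<br>', '')

# The tolerant pattern removes all four spellings; split/join then
# collapses the leftover whitespace.
tolerant = ' '.join(strip_br(sample).split())

print(naive)
print(tolerant)
```

The naive result still contains three of the four tags, which is the gap a from/to table would have to cover spelling by spelling.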

Diez
Apr 2 '07 #4

On Apr 2, 4:05 pm, "Diez B. Roggisch" <d...@nospam.web.de> wrote:
>> If the OP is constrained to standard libraries, then it may be a
>> question of defining what should be done more clearly. The extraneous
>> spaces can be removed by tokenizing the string and rejoining the
>> tokens. Replacing portions of a string with equivalents is standard
>> stuff. It might be preferable to create a function that will accept
>> lists of from and to strings and translate the entire string by
>> successively applying the replacements. From what I've seen so far,
>> that would be all the OP needs for this task. It might take a
>> half-dozen lines of code, plus the from/to table definition.
>
> The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
> code to clean up. Because your simple replacement-approach won't help here:
>
> <br>foo <br> bar </br>
>
> Which is perfectly legal HTML, but nasty to parse.
>
> Diez

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s). For whitespace, re.sub(r'\s+', ' ', s). For
HTML entities like &eacute;, re.sub(r'&(\w+);', lambda mo:
unichr(htmlentitydefs.name2codepoint[mo.group(1)]), s) and
re.sub(r'&#(\d+);', lambda mo: unichr(int(mo.group(1))), s). That's
pretty much it.
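
Put together, the three steps might look like the sketch below. It assumes Python 3, where unichr and htmlentitydefs no longer exist; html.unescape (Python 3.4+) is swapped in for the two entity regexps and handles both named and numeric references. The function name clean and the sample string are only illustrative.

```python
import html
import re

def clean(s):
    """Strip tag-like runs, decode entities, collapse whitespace."""
    # Drop anything that looks like a tag. This is the simple regexp
    # approach, not a real HTML parser.
    s = re.sub(r'<[^>]+>', '', s)
    # Decode named and numeric character references in one pass.
    s = html.unescape(s)
    # Collapse runs of whitespace (including newlines) to one space.
    return re.sub(r'\s+', ' ', s).strip()

dirty = 'bonne mentalit&eacute; mec!:) \n <br>bon pour le &quot;fondateur&quot; \n'
cleaned = clean(dirty)
print(cleaned)
```

Tags are removed before entities are decoded so that escaped text such as &lt;b&gt; survives as literal characters instead of being mistaken for markup.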

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Apr 2 '07 #5

In <11**********************@d57g2000hsg.googlegroups.com>, irstas wrote:
> I'd like to see how this transformation can be done with
> BeautifulSoup. Well, the last two regexps can be replaced with this:
>
> unicode(BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Completely without regular expressions:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

def main():
    # source is assumed to hold the markup to clean
    soup = BeautifulSoup(source, convertEntities=BeautifulSoup.HTML_ENTITIES)
    print ' '.join(''.join(soup(text=True)).split())

Ciao,
Marc 'BlackJack' Rintsch
Apr 2 '07 #6

ir****@gmail.com wrote:
> But it could be that he just wants all HTML tags to disappear, like in
> his example. A code like this might be sufficient then:
> re.sub(r'<[^>]+>', '', s).

Won't work for, say, this:

<img src="src" alt="<text>">
--
Michael Hoffman
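
For readers who want to see the failure, the sketch below (Python 3, standard library only) runs the regexp on that tag and compares it with a small text extractor built on html.parser.HTMLParser. The class TextExtractor, the function extract_text and the trailing word "caption" are hypothetical and added for illustration; this is not what BeautifulSoup does internally.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper that keeps only the text between tags."""

    def __init__(self):
        # convert_charrefs=True makes handle_data receive decoded entities.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    # Same whitespace trick as elsewhere in the thread: split and rejoin.
    return ' '.join(''.join(parser.parts).split())

markup = '<img src="src" alt="<text>">caption'

# The regexp stops at the first '>' inside the quoted attribute value,
# so the tail of the tag leaks into the output.
regex_result = re.sub(r'<[^>]+>', '', markup)

# The parser treats the quoted value as part of the tag.
parser_result = extract_text(markup)

print(repr(regex_result))
print(repr(parser_result))
```

Note that this extractor also keeps the contents of script and style elements as text, so it is a starting point under the OP's standard-library constraint and not a full replacement for a dedicated library.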
Apr 2 '07 #7

On Apr 2, 10:08 pm, Michael Hoffman <cam.ac...@mh391.invalid> wrote:
> irs...@gmail.com wrote:
>> But it could be that he just wants all HTML tags to disappear, like in
>> his example. A code like this might be sufficient then:
>> re.sub(r'<[^>]+>', '', s).
>
> Won't work for, say, this:
>
> <img src="src" alt="<text>">
> --
> Michael Hoffman

True, but is that legal? I think the alt attribute needs to use &lt;
and &gt;. Although I know what you're going to reply: that
BeautifulSoup probably parses it even if it's invalid HTML. And I'd
say that I agree, using BeautifulSoup is a better solution than custom
regexps.

Apr 2 '07 #8

"Diez B. Roggisch" <de***@nospam.web.de> wrote in
news:57*************@mid.uni-berlin.de:
>> If the OP is constrained to standard libraries, then it may be
>> a question of defining what should be done more clearly. The
>> extraneous spaces can be removed by tokenizing the string and
>> rejoining the tokens. Replacing portions of a string with
>> equivalents is standard stuff. It might be preferable to create
>> a function that will accept lists of from and to strings and
>> translate the entire string by successively applying the
>> replacements. From what I've seen so far, that would be all the
>> OP needs for this task. It might take a half-dozen lines of
>> code, plus the from/to table definition.
>
> The OP had <br>-tags in his text. Which is _more_ than a half
> dozen lines of code to clean up. Because your simple
> replacement-approach won't help here:
>
> <br>foo <br> bar </br>
>
> Which is perfectly legal HTML, but nasty to parse.

Well, as I said, given the input the OP supplied, it's not even
necessary to parse it. It isn't clear what the true desired
operation is, but this seems to meet the criteria given:

<code -- the string 's' is wrapped nastily, but ...>
s ="""\
bonne mentalit&eacute; mec!:) \n <br>bon
pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n <br>mais pour avoir des resultats
probant il
faut pas faire les mariolles, comme le &quot;fondateur&quot; de
bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les
mariolles,
comme le &quot;fondateur&quot; de bvs krew \n"""

fromlist = ['<br>', '&eacute;', '&quot;']
tolist = ['', 'é', '"' ]
def withReplacements( s, flist,tlist ):
    for ix, f in enumerate(flist):
        t = tlist[ix]
        s = s.replace( f,t )
    return s

print withReplacements(' '.join(s.split()),fromlist,tolist)

</code>

If the question is about efficiency or robustness or generality,
then that's another set of issues, but that's for the 1.1 version
to handle.

--
rzed

Apr 2 '07 #9
