Clean "Durty" strings

Ulysse

Hello,

I need to clean the string like this :

string =
"""
bonne mentalité mec!:) \n bon pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n mais pour avoir des resultats probant il
faut pas faire les mariolles, comme le "fondateur" de bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew \n
"""

into :
bonne mentalité mec!:) bon pour info moi je suis un serial posteur
arceleur dictateur ^^* mais pour avoir des resultats probant il faut
pas faire les mariolles, comme le "fondateur" de bvs krew
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew

To do this I would like to use only standard libraries.

Thanks

Apr 1 '07 #1


Diez B. Roggisch

Ulysse wrote:

Hello,

I need to clean the string like this :

string =
"""
bonne mentalité mec!:) \n bon pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n mais pour avoir des resultats probant il
faut pas faire les mariolles, comme le "fondateur" de bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew \n
"""

into :
bonne mentalité mec!:) bon pour info moi je suis un serial posteur
arceleur dictateur ^^* mais pour avoir des resultats probant il faut
pas faire les mariolles, comme le "fondateur" de bvs krew
mais pour avoir des resultats probant il faut pas faire les mariolles,
comme le "fondateur" de bvs krew

The obvious way that has been suggested to you at other places is to use
BeautifulSoup.

To do this I would like to use only standard libraries.

Then you need to reprogram what BeautifulSoup does. Happy hacking!

Diez

Apr 2 '07 #2

rzed

"Diez B. Roggisch" <de***@nospam.web.de> wrote in
news:57*************@mid.uni-berlin.de:

Ulysse wrote:

Hello,

I need to clean the string like this :

string =
"""
bonne mentalité mec!:) \n bon
pour info moi je suis un serial posteur arceleur dictateur ^^*
\n mais pour avoir des resultats
probant il faut pas faire les mariolles, comme le
"fondateur" de bvs krew \n
mais pour avoir des resultats probant il faut pas faire les
mariolles, comme le "fondateur" de bvs krew \n
"""

into :
bonne mentalité mec!:) bon pour info moi je suis un serial
posteur arceleur dictateur ^^* mais pour avoir des resultats
probant il faut pas faire les mariolles, comme le "fondateur"
de bvs krew mais pour avoir des resultats probant il faut pas
faire les mariolles, comme le "fondateur" de bvs krew

The obvious way that has been suggested to you at other places
is to use BeautifulSoup.

To do this I would like to use only standard libraries.

Then you need to reprogram what BeautifulSoup does. Happy
hacking!

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a half-
dozen lines of code, plus the from/to table definition.

--
rzed

Apr 2 '07 #3

Diez B. Roggisch

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a half-
dozen lines of code, plus the from/to table definition.

The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
code to clean up. Because your simple replacement-approach won't help here:

 foo <brbar 

Which is perfectly legal HTML, but nasty to parse.

Diez

Apr 2 '07 #4

irstas

On Apr 2, 4:05 pm, "Diez B. Roggisch" <d...@nospam.web.de> wrote:

If the OP is constrained to standard libraries, then it may be a
question of defining what should be done more clearly. The extraneous
spaces can be removed by tokenizing the string and rejoining the
tokens. Replacing portions of a string with equivalents is standard
stuff. It might be preferable to create a function that will accept
lists of from and to strings and translate the entire string by
successively applying the replacements. From what I've seen so far,
that would be all the OP needs for this task. It might take a half-
dozen lines of code, plus the from/to table definition.

The OP had <br>-tags in his text. Which is _more_ than a half dozen lines of
code to clean up. Because your simple replacement-approach won't help here:

 foo <brbar 

Which is perfectly legal HTML, but nasty to parse.

Diez

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:

re.sub(r'<[^>]+>', '', s)

For whitespace:

re.sub(r'\s+', ' ', s)

For character entities like &eacute;:

re.sub(r'&(\w+);', lambda mo: unichr(htmlentitydefs.name2codepoint[mo.group(1)]), s)

and for numeric character references:

re.sub(r'&#(\d+);', lambda mo: unichr(int(mo.group(1))), s)

That's pretty much it.
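The substitutions above can be combined into a single standard-library function. The following is only a minimal sketch, assuming Python 3, where `html.unescape` stands in for the two entity-decoding substitutions; the function name `clean` and the sample input are illustrative and do not come from the thread.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')   # naive tag pattern, same idea as in the post
WS_RE = re.compile(r'\s+')        # any run of whitespace


def clean(s):
    """Strip tags, decode entities, then collapse whitespace."""
    s = TAG_RE.sub('', s)         # remove tags first, so decoded '<' is never treated as markup
    s = html.unescape(s)          # handles both named (&eacute;) and numeric (&#233;) references
    return WS_RE.sub(' ', s).strip()


if __name__ == '__main__':
    sample = 'bonne mentalit&eacute; mec!:) <br>\n bon pour &quot;info&quot;'
    print(clean(sample))
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in the text survives as literal `<b>` instead of being stripped.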

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Apr 2 '07 #5

Marc 'BlackJack' Rintsch

In <11**********************@d57g2000hsg.googlegroups.com>, irstas wrote:

I'd like to see how this transformation can be done with
BeautifulSoup. Well, the last two regexps can be replaced with this:

unicode(BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])

Completely without regular expressions:

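from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

def main():
    soup = BeautifulSoup(source, convertEntities=BeautifulSoup.HTML_ENTITIES)
    # Join all text nodes, then collapse runs of whitespace
    print ' '.join(''.join(soup(text=True)).split())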

Ciao,
Marc 'BlackJack' Rintsch

Apr 2 '07 #6

Michael Hoffman

ir****@gmail.com wrote:

But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s).

Won't work for, say, this:

<img src="src" alt="<text>">
--
Michael Hoffman

Apr 2 '07 #7

irstas

On Apr 2, 10:08 pm, Michael Hoffman <cam.ac...@mh391.invalid> wrote:

irs...@gmail.com wrote:
But it could be that he just wants all HTML tags to disappear, like in
his example. Code like this might be sufficient then:
re.sub(r'<[^>]+>', '', s).

Won't work for, say, this:

<img src="src" alt="<text>">
--
Michael Hoffman

True, but is that legal? I think the alt attribute needs to use &lt;
and &gt;. Although I know what you're going to reply: that
BeautifulSoup probably parses it even if it's invalid HTML. And I'd
say that I agree, using BeautifulSoup is a better solution than custom
regexps.
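For anyone limited to the standard library, a real parser sidesteps the quoted-attribute problem without a third-party package. This is a minimal sketch, assuming Python 3's `html.parser.HTMLParser`; the names `TextExtractor` and `strip_tags` are hypothetical and not from the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, ignoring all tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; in text nodes
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Same whitespace trick used elsewhere in the thread: split, then rejoin
    return ' '.join(''.join(parser.chunks).split())


if __name__ == '__main__':
    print(strip_tags('before <img src="src" alt="<text>"> after'))
```

Because the parser understands quoted attribute values, the `>` inside `alt="<text>"` does not end the tag early, which is where the `<[^>]+>` pattern goes wrong.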

Apr 2 '07 #8

rzed

"Diez B. Roggisch" <de***@nospam.web.de> wrote in
news:57*************@mid.uni-berlin.de:

If the OP is constrained to standard libraries, then it may be
a question of defining what should be done more clearly. The
extraneous spaces can be removed by tokenizing the string and
rejoining the tokens. Replacing portions of a string with
equivalents is standard stuff. It might be preferable to create
a function that will accept lists of from and to strings and
translate the entire string by successively applying the
replacements. From what I've seen so far, that would be all the
OP needs for this task. It might take a half-dozen lines of
code, plus the from/to table definition.

The OP had <br>-tags in his text. Which is _more_ than a half
dozen lines of code to clean up. Because your simple
replacement-approach won't help here:

 foo <brbar 

Which is perfectly legal HTML, but nasty to parse.

Well, as I said, given the input the OP supplied, it's not even
necessary to parse it. It isn't clear what the true desired
operation is, but this seems to meet the criteria given:

<code -- the string 's' is wrapped nastily, but ...>
s ="""\
bonne mentalité mec!:) \n bon
pour
info moi je suis un serial posteur arceleur dictateur ^^*
\n mais pour avoir des resultats
probant il
faut pas faire les mariolles, comme le "fondateur" de
bvs
krew \n
mais pour avoir des resultats probant il faut pas faire les
mariolles,
comme le "fondateur" de bvs krew \n"""

# The '<br>' entry is an assumed spelling of the tag in the OP's text
fromlist = ['<br>', '&eacute;', '&quot;']
tolist = ['', 'é', '"']

def withReplacements(s, flist, tlist):
    # Apply each from/to pair in order
    for f, t in zip(flist, tlist):
        s = s.replace(f, t)
    return s

print withReplacements(' '.join(s.split()), fromlist, tolist)

</code>

If the question is about efficiency or robustness or generality,
then that's another set of issues, but that's for the 1.1 version
to handle.
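On the robustness point, applying replacements one after another makes the result depend on their order, because an earlier replacement can produce text that a later one then matches. A hedged sketch of a single-pass alternative in Python 3 follows; `replace_all` and the table entries (including the `<br>` spelling) are illustrative assumptions.

```python
import re


def replace_all(text, table):
    """Replace every key of `table` found in `text` in a single pass."""
    if not table:
        return text
    # Longer keys first, so that overlapping keys prefer the longest match
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # Replaced text is never rescanned, so replacements cannot chain
    return pattern.sub(lambda mo: table[mo.group(0)], text)


if __name__ == '__main__':
    table = {'<br>': '', '&eacute;': 'é', '&quot;': '"', '&amp;': '&'}
    print(replace_all('&amp;quot; caf&eacute;<br>', table))
```

With successive `str.replace` calls, decoding `&amp;` before `&quot;` would turn `&amp;quot;` into a double quote; the single pass leaves it as the literal text `&quot;`.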

--
rzed

Apr 2 '07 #9
