
file.getvalue() with _ or other characters

Hi!

I do this to convert HTML to text:

import StringIO, htmllib, formatter

class mvbHTMLParser(htmllib.HTMLParser):

    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)
        self.imglist = []

    def handle_image(self, src, alt, *args):
        # record the image source
        self.imglist.append(src)

file = StringIO.StringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

But then the _ characters are gone.
Is it possible to keep that character in file.getvalue()?

[p.anchorlist is OK: test_bla.html]

Jul 18 '05 #1
ma*****@gamecreators.nl wrote:
file = StringIO.StringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

But then the _ characters are gone.
Is it possible to keep that character in file.getvalue()?


I consider this a defect in StringIO, but it's pretty easy to
fix it, at least for the narrow usage you describe:

class PreservingStringIO(StringIO.StringIO):
    def close(self):
        pass

file = PreservingStringIO()
....etc

The problem is (if I'm right about this) that the close()
method on the object returned by mvbHTMLParser() will actually
call close() on the file object in the formatter (whether
directly or not, I don't know). One might consider _this_
a bug as well, but if the above approach works, in the end
it's no big deal.

So basically redefine close() to do nothing (or have it save
a copy of the buffer's getvalue() results first) and you
should be good to go.

-Peter
Jul 18 '05 #2
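
For readers on Python 3, where StringIO lives in the io module, the "save a copy of the buffer first" variant suggested above could be sketched roughly as follows. This is only an illustration under assumptions: the attribute name `saved` is made up here, and nothing in the thread confirms that the parser actually closes the file object.

```python
import io


class PreservingStringIO(io.StringIO):
    """StringIO that remembers its contents when it is closed."""

    saved = None  # hypothetical attribute: contents captured at close()

    def close(self):
        # Capture the buffer only on the first close; getvalue() raises
        # ValueError once the buffer has been closed.
        if not self.closed:
            self.saved = self.getvalue()
        super().close()


buf = PreservingStringIO()
buf.write("text_text\n")
buf.close()

# For comparison: a plain StringIO refuses getvalue() after close().
plain = io.StringIO()
plain.write("text_text\n")
plain.close()
try:
    plain.getvalue()
    plain_error = None
except ValueError as exc:
    plain_error = exc
```

After close(), `buf.saved` still holds the text, whereas the plain object raises ValueError. Saving a copy rather than making close() a no-op keeps the normal close semantics for everything else.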
Hmm, I'm a newbie with Python.

I did this, but it doesn't work:

class mvbHTMLParser(htmllib.HTMLParser):

    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)
        self.imglist = []

    def handle_image(self, src, alt, *args):
        self.imglist.append(src)

class PreservingStringIO(StringIO.StringIO):
    def close(self):
        pass

file = PreservingStringIO()
f = formatter.AbstractFormatter(formatter.DumbWriter(file))
p = mvbHTMLParser(f)
p.feed(html)
p.close()

print file.getvalue()

I will try some things.

Jul 18 '05 #3
ma*****@gamecreators.nl wrote:
I do this to convert HTML to text
[...]
But then the _ characters are gone.
Is it possible to keep that character in file.getvalue()?


Just to make sure: did you look into the HTML file and verify that there
are actually underscores, and not spaces that are merely _rendered_ like "_"
via <u>some text</u> or CSS?

Peter
Jul 18 '05 #4
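
One way to run the check suggested above is to extract only the text nodes and see whether any underscore survives. The sketch below uses Python 3's html.parser rather than the htmllib module from the thread, and the names `TextCollector` and `text_of` are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of a document, ignoring all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_of(html):
    """Return the concatenated text content of an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# A literal underscore is part of the text and survives extraction.
literal = text_of("<p>text_text</p>")

# An underlined space only looks like "_" in a browser; the text holds a space.
underlined = text_of("<p>text<u> </u>text</p>")
```

If the page only uses underlining, `"_" in underlined` is false, and no text converter can be expected to produce the character.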
ma*****@gamecreators.nl wrote:
I did this, but it doesn't work:


It is quite possible I misunderstood the problem you
were having. I am familiar with a problem with StringIO
whereby if you call close() on it, you can no longer call
getvalue() afterwards. Perhaps that's not the problem
you were seeing.

Can you clarify your comment "But then the _ characters are
gone. Is it possible to keep that character in file.getvalue()"?

Please show actual (small!) examples of the sort of input
you are dealing with, and the output which you are getting
(if any).

-Peter
Jul 18 '05 #5
Sorry, I needed some sleep.
It works OK now.

But if you don't mind, I have one more question.

I use this code:
----------------------------------------------------------
import StringIO
import re
import urllib2, htmllib, formatter

class mvbHTMLParser(htmllib.HTMLParser):
    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)

def getContent(url):
    try:
        line = urllib2.urlopen(url)
        htmlToText(line.read().lower())
    except IOError, strerror:
        print strerror

def htmlToText(html):
    file = StringIO.StringIO()
    f = formatter.AbstractFormatter(formatter.DumbWriter(file))
    p = mvbHTMLParser(f)
    p.feed(html)
    p.close()

    print file.getvalue()

getContent('http://www.zquare.nl/test.html')
----------------------------------------------------------
Then the output is:
text_text
a_link[1]

That's OK, but how do I delete the [n]?
Like this?: del = re.compile(r'[0-9]',).sub

Thanks for the quick help,
GC-Martijn

Jul 18 '05 #6
ma*****@gamecreators.nl wrote:
class mvbHTMLParser(htmllib.HTMLParser):
    def __init__(self, formatter, verbose=0):
        htmllib.HTMLParser.__init__(self, formatter, verbose)

    def anchor_end(self):
        # skip the default behaviour of writing "[n]" after a link
        self.anchor = None

[...]
Then the output is:
text_text
a_link[1]

That's OK, but how do I delete the [n]?
Like this?: del = re.compile(r'[0-9]',).sub


Overriding the anchor_end() method as shown above will suppress the [n]
suffix after links.

Peter

Jul 18 '05 #7
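
A note on the regex attempt from #6: `del` is a reserved word in Python and cannot be used as a variable name, and `[0-9]` is a character class that matches a single digit, so it would strip digits anywhere in the text and leave the brackets behind. If post-processing the output is preferred over overriding anchor_end(), a minimal sketch could look like this (the names `ANCHOR_REF` and `strip_anchor_refs` are made up for illustration):

```python
import re

# Matches a literal "[", one or more digits, and a literal "]".
ANCHOR_REF = re.compile(r"\[\d+\]")


def strip_anchor_refs(text):
    """Remove [n] link markers such as "[1]" from converted text."""
    return ANCHOR_REF.sub("", text)


cleaned = strip_anchor_refs("text_text\na_link[1]\n")
```

Note that this also removes any "[1]"-style text that was genuinely part of the page, which the anchor_end() override avoids.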
