Using utidylib, empty string returned in some cases

Hello,

I'm using Debian Linux, Python 2.4.4, and utidylib
(http://utidylib.berlios.de/). I wrote some simple functions to fetch a web
page and convert it from windows-1251 to UTF-8, and then I'd like to clean
up the HTML with utidylib.

Here are the two pages I use to check my program:
http://www.ya.ru/ (in this case everything works OK)
http://www.yellow-pages.ru/rus/nd2/qu5/ru15632 (in this case tidy returns
nothing, just an empty string)
code:

--------------

# coding: utf-8
import urllib2
import tidy


def get_page(url):
    """Fetch the raw bytes of a page, sending a browser-like User-Agent."""
    user_agent = ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; '
                  '.NET CLR 1.1.4322; .NET CLR 2.0.50727)')
    headers = {'User-Agent': user_agent}
    # Note: urllib2 issues a POST whenever data is not None, even if it is empty.
    data = {}

    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    page = response.read()

    return page


def convert_1251(page):
    """Re-encode a windows-1251 byte string as UTF-8."""
    p = page.decode('windows-1251')
    u = p.encode('utf-8')
    return u


def clean_html(page):
    """Run the page through tidy and return the resulting document."""
    tidy_options = {
        'output_xhtml': 1,
        'add_xml_decl': 1,
        'indent': 1,
        'input-encoding': 'utf8',
        'output-encoding': 'utf8',
        'tidy_mark': 1,
    }
    cleaned_page = tidy.parseString(page, **tidy_options)
    return cleaned_page


test_url = 'http://www.yellow-pages.ru/rus/nd2/qu5/ru15632'
#test_url = 'http://www.ya.ru/'

#f = open('yp.html', 'r')
#p = f.read()

print clean_html(convert_1251(get_page(test_url)))

--------------

What am I doing wrong? Can anyone help, please?
Jan 22 '08 #1
On Tue, 22 Jan 2008 15:35:16 -0200, Boris <sa**********@gmail.com> wrote:

> I'm using debian linux, Python 2.4.4, and utidylib
> (http://utidylib.berlios.de/). I wrote simple functions to get a web page,
> convert it from windows-1251 to utf8 and then I'd like to clean html
> with it.

Why the intermediate conversion? I don't know utidylib, but can't you feed
it the original page in its original encoding? If the page itself contains
a "meta http-equiv" tag stating its content type and charset, that
declaration will no longer be valid once you re-encode the page.

--
Gabriel Genellina

Jan 23 '08 #2
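
To illustrate the point about the "meta http-equiv" tag, here is a minimal sketch of re-encoding a page while keeping its declared charset in sync. It assumes Python 3 and uses only the standard library, so it is not the poster's Python 2.4 code and does not call utidylib. The names META_CHARSET_RE, declared_charset and reencode_html are hypothetical helpers invented for this example, and the regular expression only covers simple <meta> tags and ASCII-compatible encodings such as windows-1251 and UTF-8.

```python
import re

# Matches the charset value inside a <meta ...> tag, for example:
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
# Group 1 is everything up to the charset value, group 2 is the value itself.
META_CHARSET_RE = re.compile(
    br'(<meta\b[^>]*?\bcharset\s*=\s*["\']?)([\w.:-]+)',
    re.IGNORECASE,
)


def declared_charset(raw_html):
    """Return the charset named in a <meta> tag, or None if there is none."""
    match = META_CHARSET_RE.search(raw_html)
    return match.group(2).decode('ascii').lower() if match else None


def reencode_html(raw_html, source_encoding, target_encoding='utf-8'):
    """Re-encode a page and rewrite its <meta> charset declaration to match.

    Assumes target_encoding is ASCII-compatible, so the tag can still be
    found in the converted bytes.
    """
    text = raw_html.decode(source_encoding)
    converted = text.encode(target_encoding)
    # Rewrite only the first declaration found in the page.
    return META_CHARSET_RE.sub(
        lambda m: m.group(1) + target_encoding.encode('ascii'),
        converted,
        count=1,
    )
```

With a page fetched as bytes, reencode_html(page, 'windows-1251') would return UTF-8 bytes whose meta tag also says utf-8, so the declaration and the actual encoding no longer disagree. Whether that mismatch is what makes tidy return an empty string for the second URL is not confirmed in this thread.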
