Replacing utf-8 characters

Hi, I am using Python to scrape web pages and I do not have a problem
unless I run into a site that is utf-8. It seems & is changed to &amp;
when the site is utf-8.

If I try to replace it with .replace('&amp;','&') it for some reason
does not replace it.

For example: http://today.reuters.co.uk/news/default.aspx

The url in the page looks like this

http://today.reuters.co.uk/news/News...SERVATIVES.xml

However when I pull it into python the URL ends up looking like this
(notice the &amp; instead of just & in the URL)

http://today.reuters.co.uk/news/news...B-STGOBAIN.xml

Any ideas?
Oct 5 '05 #1

"Mike" <no@spam> wrote in message news:11**************@nntp.acecape.com...
However when I pull it into python the URL ends up looking like this
(notice the &amp; instead of just & in the URL)

Any ideas?


Some code would be helpful: the "&amp;" is in the page source to start
with (which is as it ought to be). What are you using to parse the HTML?
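
A quick way to confirm that (a minimal sketch; it just fetches the page
mentioned in the original post and counts occurrences, nothing more) is
to look at the raw markup before any parsing:

import urllib

raw = urllib.urlopen('http://today.reuters.co.uk/news/default.aspx').read()
print '&amp;' in raw      # True if the ampersands in the links are stored as entities
print raw.count('&amp;')  # how many times the entity appears in the source
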
Oct 5 '05 #2
For example, this is what I am trying to do that is not working.

The contents of link come from the Reuters web page and contain

"/news/newsArticle.aspx?type=businessNews&amp;amp;storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml"

link = link.replace('&amp;amp;','&')

But if I now view the contents of link it shows the same as when it
was assigned.


Richard Brodie wrote:
"Mike" <no@spam> wrote in message news:11**************@nntp.acecape.com...

However when I pull it into python the URL ends up looking like this
(notice the &amp; instead of just & in the URL)

Any ideas?

Some code would be helpful: the "&amp;" is in the page source to start
with (which is as it ought to be). What are you using to parse the HTML?

Oct 5 '05 #3
Unknown wrote:
For example, this is what I am trying to do that is not working.

The contents of link come from the Reuters web page and contain

"/news/newsArticle.aspx?type=businessNews&amp;amp;storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml"

link = link.replace('&amp;amp;','&')

But if I now view the contents of link it shows the same as when it
was assigned.


You must be doing *something* wrong:
link = "/news/newsArticle.aspx?type=businessNews&amp;amp;storyID =2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml" link = link.replace('&amp;amp;','&')
link '/news/newsArticle.aspx?type=businessNews&storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml'


regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Oct 5 '05 #4
Steve Holden wrote:
You must be doing *something* wrong:
>>> link = "/news/newsArticle.aspx?type=businessNews&amp;amp;storyID =2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml"
>>> link = link.replace('&amp;amp;','&')
>>> link '/news/newsArticle.aspx?type=businessNews&storyID=2005-10-05T151245Z_01_HO548006_RTRUKOC_0_UK-AIRLINES-BA.xml'
>>>


regards
Steve


What you and I typed was ascii. The value of link came from importing
that utf-8 web page into that variable. That is why I think it is not
working. But not sure what the solution is.
Oct 5 '05 #5
Mike wrote:
Hi, I am using Python to scrape web pages and I do not have a problem
unless I run into a site that is utf-8. It seems & is changed to
&amp; when the site is utf-8.

[...] Any ideas?


How about using the universal feedparser from feedparser.org to fetch
and parse the RSS from Reuters? That's what I do and it works like a
charm.

#v+
>>> import feedparser
>>> rss = feedparser.parse('http://today.reuters.com/rss/topNews')
>>> for what in ('link', 'title', 'summary'):
...     print rss.entries[0][what]
...     print
...
http://today.reuters.com/news/newsar...RT-SUICIDE.xml

Top court seems closely divided on suicide law

During arguments, the justices sharply questioned both sides on whether then-Attorney General John Ashcroft had the power under federal law in 2001 to bar distribution of controlled drugs to assist suicides, regardless of state law.

#v-

Cheers,

--
Klaus Alexander Seistrup
Magnetic Ink, Copenhagen, Denmark
http://magnetic-ink.dk/
Oct 5 '05 #6
In playing with this I found link.replace does work but when I use

link.replace('&amp;','&')

it replaces it with &amp; instead of just &. I know link.replace is
working, since if I change the second option from & to something else I
do see the change.

So it seems the link.replace() function detects whether the first option
is utf-8 and automatically converts the second option to utf-8? How do I
prevent that?
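
One way to see exactly what is in the string, with no browser or
terminal re-encoding in the way, is to check its repr() before and
after the call (a quick sketch; replace() returns a new string, so the
result has to be reassigned):

print repr(link)                   # shows the exact characters, e.g. any leftover &amp;
link = link.replace('&amp;', '&')  # replace() builds a new string; keep the result
print repr(link)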

Thanks again.
Oct 5 '05 #7
Mike <no@spam> writes:
What you and I typed was ascii. The value of link came from importing
that utf-8 web page into that variable. That is why I think it is not
working. But not sure what the solution is.


Are you sure you're asking what you think you are asking? Both the
ampersand character (&) and the characters within the ampersand entity
character reference (&amp;) are ASCII. As it turns out they are also
legal UTF-8, but I would not call a web page UTF-8 just because I saw
the sequence of characters "&amp;" within the stream. (That's not to
say it isn't UTF-8 encoded, just that I don't think that's the issue)

I'm just guessing, but you do realize that legal HTML should quote all
uses of the ampersand character with an entity reference, since the
ampersand itself is reserved for use in such references. This
includes URL references whether inside attributes or in the body of
the text.

So when you see something in a browser in a web page that shows a URL
that includes "&" such as for separating parameters, internally that
page is (or should be) stored with "&amp;" for that character. Thus
if you retrieve the page in code, that's what you'll find. It's the
browser processing that entity reference that turns it back into the
"&" for presentation.

Note that whether or not the page in question is encoded as UTF-8 is a
completely distinct question - whatever encoding the page is in would
be used to encode the characters in the entity reference (namely
"&amp;").

I'm assuming that in scraping the page you want to reverse the process
(e.g., perform the interpretation of the entity references much as a
browser would) before using that URL for other purposes. If so, the
string replacement you tried should handle the replacement just fine,
at least within the value of the URL as managed by your code.
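
For the common entities the standard library also has a small helper
(a sketch only; xml.sax.saxutils.unescape covers &amp;, &lt; and &gt;
by default, and the storyID value below is made up):

from xml.sax.saxutils import unescape

link = '/news/newsArticle.aspx?type=businessNews&amp;storyID=12345'
print unescape(link)   # /news/newsArticle.aspx?type=businessNews&storyID=12345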

You then mention it being the same when you view the contents of the
link, which isn't quite clear to me, but if that means retrieving
another copy of the link as embedded in an HTML page then yes, it'll
get quoted again, since, just as at the start, an ampersand has to be
quoted as an entity reference within HTML.

What did you mean by "view the contents link"?

-- David

Oct 5 '05 #8
Mike wrote:
So it seems link.replace() function reads whether the first option is
utf-8 and converts the second option automatically to utf-8? How do I
prevent that?


Not sure what an option is... if you are talking about parameters,
rest assured that <string>.replace does not know or care whether any
of its parameters is encoded in UTF-8. Also not sure where you got
the impression UTF-8 could have anything to do with this.
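
A quick interactive check (a sketch with a made-up string rather than
the real URL) shows the same literal behaviour whether the data is a
byte string or a unicode string:

>>> s = 'businessNews&amp;storyID=123'
>>> s.replace('&amp;', '&')
'businessNews&storyID=123'
>>> s.decode('utf-8').replace(u'&amp;', u'&')   # a unicode string behaves the same way
u'businessNews&storyID=123'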

Regards,
Martin
Oct 5 '05 #9
