bad data from urllib when run from MS .bat file

Stuart McGraw

I just spent a $*#@!*&^&% hour registering at ^$#@#%^
SourceForge and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

================================================

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Note: To reproduce this problem, it helps to have East Asian font
support installed on the test system. In Windows 2000:
Control Panel,
Regional Options, General tab
check mark in Japanese in the "Language settings..." area.
Python also needs either the cjkcodecs (http://cjkpython.berlios.de/)
or Tamito KAJIYAMA's japanese codecs
(http://www.asahi-net.or.jp/~rd6t-kjym/python/)
installed.

To reproduce the problem...

1. Create a python file, test.py:
test.py:
----------------
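import sys, urllib, cjkcodecs
# Fetch the page whose URL is given as the first argument.
f = urllib.urlopen (sys.argv[1])
for ln in f:
    # Decode each EUC-JP line to unicode, then re-encode it as UTF-8.
    ln = ln.decode ("cjkcodecs.euc-jp")
    # The trailing comma suppresses print's own newline.
    print ln.encode("utf-8"),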
----------------

2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not.

The URL used will return an EUC-JP encoded page with some Japanese
characters in it. test.py reads the page line by line, decodes
the lines to unicode, re-encodes them to UTF-8, and writes to a file.
Thus the output file should be a UTF-8 version of the EUC-JP web page.
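As an offline illustration of the decode/re-encode step just described, the sketch below runs the same conversion on a fixed byte string instead of a live page. It assumes a current Python 3, where the `euc_jp` codec ships with the standard library (unlike the Python 2.3.4 plus cjkcodecs setup in this post); `transcode_line` is a hypothetical helper name, not part of test.py.

```python
# Sketch for Python 3 (assumption); the original script targets Python 2.3.4.
# "euc_jp" is the standard-library codec name, standing in for the
# "cjkcodecs.euc-jp" codec used in test.py.

def transcode_line(raw_line):
    """Decode one EUC-JP encoded line and re-encode it as UTF-8."""
    text = raw_line.decode("euc_jp")
    return text.encode("utf-8")


# The six EUC-JP bytes given in this post for the word "taberu",
# followed by a newline.
sample = b"\xbf\xa9\xa4\xd9\xa4\xeb\n"
utf8_line = transcode_line(sample)
print(repr(utf8_line))
```

Pure ASCII input passes through unchanged, so only the Japanese portions of a page differ between the EUC-JP and UTF-8 versions.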

The first command runs test.py directly. The second command runs
the identical command from a Windows batch file. One should expect
out1.txt and out2.txt to be identical.

out1.txt (created by running test.py from the command line) is
correct (verify by opening out1.txt in notepad, and selecting a
Japanese-capable font, e.g. Lucida Sans Unicode). The string in
the first cell of the HTML table is the three Japanese characters
for the word "taberu".

But in out2.txt (created by running test.py from a Windows .bat
file), instead of Japanese characters there, we see the ASCII text
string "A9D9EB". (The EUC-JP value of the actual Japanese characters
that should be there is \xBF\xA9\xA4\xD9\xA4\xEB, so the printed
hex digits seem to come from alternate bytes of the EUC-JP string.)
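The "alternate bytes" observation can be checked with a few lines of arithmetic. This is a small Python 3 sketch (an assumption; the post itself uses Python 2.3.4) that starts from the six EUC-JP bytes quoted above, shows their percent-encoded form, whose tail matches the visible end of the URL in test.bat, and then keeps only every second byte.

```python
# Sketch for Python 3 (assumption); uses only built-in operations.

# EUC-JP bytes of the three expected characters, as quoted in the post.
expected = b"\xbf\xa9\xa4\xd9\xa4\xeb"

# Percent-encoded form, as these bytes would appear in a URL.
percent_encoded = "".join("%%%02X" % b for b in expected)

# Keep only every second byte (the 2nd, 4th and 6th) and print them as hex.
alternate_hex = "".join("%02X" % b for b in expected[1::2])

print(percent_encoded)  # %BF%A9%A4%D9%A4%EB
print(alternate_hex)    # A9D9EB
```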

In other lines with Japanese characters a similar effect is seen:
the first two Japanese characters are replaced with a string of
hex digits. Strangely, the remaining Japanese characters on the line
are not corrupted.

Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.

So it looks like some bad mojo between urllib and the Windows
batch environment.

Jul 18 '05 #1


John J. Lee

"Stuart McGraw" <sm******@frii.RimoovThisToReply.com> writes:
[...]

2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not.
[...]
Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.

Hmm...

So it looks like some bad mojo between urllib and the Windows
batch environment.

Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:
----------------
python -u test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

Note the -u switch (for 'unbuffered', but also 'um, binary mode'
<wink>).
John

Jul 18 '05 #2

Stuart McGraw

"John J. Lee" <jj*@pobox.com> wrote in message news:87************@pobox.com...

"Stuart McGraw" <sm******@frii.RimoovThisToReply.com> writes:
[...]
So it looks like some bad mojo between urllib and the Windows
batch environment.

Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:
----------------
python -u test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

Note the -u switch (for 'unbuffered', but also 'um, binary mode'
<wink>).

Did you try doing that? Did it work for you? I just tried here, and
still have the same problem.

Even worse, in the original script that the test script is derived from
I encountered a new problem. Intermixed with the web page data
returned by urllib are bits and pieces (10-20 characters long) of local
file and directory names. It only happens when reading some web pages
(EUC-JP encoded, as with the original problem), but I'm wondering
if there are some single-byte/double-byte character issues with urllib.
That would be surprising to me, given that urllib is shipped with the
Python distribution; I would think that any core libs would be pretty
bombproof. (Am I being naive? :-) Of course, it's still possible I hosed
something in my script, so I will double check...

Jul 18 '05 #3

Bengt Richter

On Sat, 18 Sep 2004 16:23:40 -0600, "Stuart McGraw" <sm******@frii.RimoovThisToReply.com> wrote:

I just spent a $*#@!*&^&% hour registering at ^$#@#%^
SourceForge and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

================================================

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD
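The same check can be done from Python. This is only a sketch, in Python 3 (an assumption), with a hypothetical helper `extension_order` that splits a PATHEXT-style string so the relative position of .BAT and .CMD can be compared.

```python
import os


def extension_order(pathext):
    """Split a PATHEXT-style string into lower-cased extensions, in order."""
    return [ext.lower() for ext in pathext.split(";") if ext]


# Value shown in the console output above.
order = extension_order(".COM;.EXE;.BAT;.CMD")
print(order.index(".bat") < order.index(".cmd"))  # True: .bat comes first

# On a real Windows system the live value comes from the environment;
# elsewhere PATHEXT is normally unset and this is an empty list.
local_order = extension_order(os.environ.get("PATHEXT", ""))
print(local_order)
```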

Regards,
Bengt Richter

Jul 18 '05 #4

Stuart McGraw

"Bengt Richter" <bo**@oz.net> wrote in message news:ci*************************@theriver.com...

Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD

Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single-byte data.

Jul 18 '05 #5

Bengt Richter

On Mon, 20 Sep 2004 07:33:13 -0600, "Stuart McGraw" <sm******@frii.RimoovThisToReply.com> wrote:

Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single byte data.

Hm, what happens if you make a test2.py and pass it the name of an output
file instead of piping the output from print? In fact, eliminate the
encoding and the line generator and everything, and just let test2 copy the entire
server data in one single read and write it in binary. I.e.,
open(sys.argv[2],'wb').write(urllib.urlopen(...).read())

That should show whether python is seeing the identical input from the server.
Then you could do it line-wise (not with a print line ending in ",", but with
a binary file write). That would say whether line generation chunking on input
was doing anything to the data -- if possibly urllib is buffering/chunking
differently for interactive vs bat file. Just grasping at straws, but eliminating
chunking, piping, re/encoding, binary vs text mode doubts from the test should
show why interactive vs .bat is different IWT.
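A minimal sketch of that experiment follows, written for Python 3 (an assumption: `urllib.request.urlopen` stands in for the Python 2 `urllib.urlopen` used in this thread). The helper names `fetch_whole`, `fetch_linewise`, `save_raw` and `first_difference` are made up for illustration. Nothing is decoded or printed; bytes go straight from the response to a file opened in binary mode.

```python
# Sketch for Python 3 (assumption): urllib.request.urlopen plays the role of
# the Python 2 urllib.urlopen call used in this thread.
from urllib.request import urlopen


def fetch_whole(url):
    """Return the response body from a single read(), untouched."""
    with urlopen(url) as response:
        return response.read()


def fetch_linewise(url):
    """Return the response body reassembled from line-by-line iteration."""
    with urlopen(url) as response:
        return b"".join(line for line in response)


def save_raw(data, path):
    """Write bytes in binary mode so no newline translation can occur."""
    with open(path, "wb") as out:
        out.write(data)


def first_difference(a, b):
    """Return the offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

Running a script built on these helpers once from the command line and once from the .bat file, saving to two different files, and then calling `first_difference` on their contents would show where (if anywhere) the raw server data diverges.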

Also, your mention of two-character errors made me wonder about spurious BOMs
or such from encoding file substrings as though they were entire files?
Would a final print for a final '\n' do anything that might trigger a final flush
differently, with potential cooking consequences? (BTW, why the print with space instead?)
What if you just do your own file.write output in binary and control everything?

Just some additional thoughts. Sorry the cmd vs bat thing didn't do anything.
BTW, what command line options are in use to start your interactive session
(it is console, not idle, right?). You didn't seem to have any (e.g. -u) in test.py.
Could the .BAT file be seeing a different environment? Could the http://.. need quoting?
I.e., could the server be seeing a glitched url tail and be sending the same file but with some
different option?
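One way to test the "glitched url tail" idea is to look at the exact argument the script receives in each case. The sketch below is Python 3 and runs offline. The URL is a made-up placeholder (the real one is elided in this thread), and `batch_percent_model` is a deliberately rough model of how a batch file treats `%name%` pairs, assuming no environment variables with those names are defined. Batch files do expand `%name%` and turn `%%` into a literal `%`, but whether that is what happened here is not confirmed anywhere in the thread.

```python
import re


def batch_percent_model(text):
    """Rough model (assumption): inside a .bat file, '%%' yields a literal
    '%', and each '%name%' pair names an undefined variable and expands to
    nothing. Real cmd.exe parsing has more rules than this."""
    return re.sub(r"%([^%]*)%", lambda m: "%" if m.group(1) == "" else "", text)


def escape_for_batch(text):
    """Double every '%' so a batch file passes it through literally."""
    return text.replace("%", "%%")


# Hypothetical URL; only the percent-encoded tail mirrors the thread.
url = "http://example.invalid/lookup?word=%BF%A9%A4%D9%A4%EB_v1"

# What the script might receive if the URL is written unescaped in a .bat file.
print(batch_percent_model(url))

# With every '%' doubled, the model gives back the original URL.
print(batch_percent_model(escape_for_batch(url)) == url)
```

Under this model the unescaped URL loses its `%BF%`, `%A4%` and `%A4%` pairs, leaving `A9D9EB` in the tail. Adding `sys.stderr.write(repr(sys.argv[1]) + "\n")` to test.py would show directly whether the two invocations pass the same URL.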

Hope something gives you a useful idea. That's all I can think of for the moment ;-)

Regards,
Bengt Richter

Jul 18 '05 #6

John J. Lee

"Stuart McGraw" <sm******@frii.RimoovThisToReply.com> writes:

"John J. Lee" <jj*@pobox.com> wrote in message news:87************@pobox.com... [...]
Just a guess

[...] Did you try doing that?

No. It was a guess.

[...] That would be surprising to me given that urllib is shipped with the
Python distribution, I would think that any core libs would be pretty
bombproof. (Am I being naive? :-) [...]

Possibly:

http://article.gmane.org/gmane.comp.python.devel/63911

But I have a fairly strong suspicion that this *isn't* a bug in urllib
or Python: I think urllib regards HTTP response data simply as a
binary string (as opposed to the case of URLs, where things are... uh,
complicated).

*I'm* certainly more naive about encodings, character sets &c. than
any Python module, though <wink>...

Of course, still possible I hosed

Yes, that's possible :-)
John

Jul 18 '05 #7

Stuart McGraw

Thanks everyone for all the suggestions. I will follow up on them,
but not right now. I am about to move halfway around the world
for a few months so I will need to get settled in before I have
time to look into this more.

Jul 18 '05 #8
