bad data from urllib when run from MS .bat file

I just spent a $*#@!*&^&% hour registering at ^$#@#%^
SourceForge and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

================================================

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Note: To reproduce this problem, it helps to have East Asian font
support installed on the test system. In Windows 2000:
Control Panel,
Regional Options, General tab
check mark next to Japanese in the "Language settings..." area.
Python also needs either the cjkcodecs (http://cjkpython.berlios.de/)
or Tamito KAJIYAMA's japanese codecs
(http://www.asahi-net.or.jp/~rd6t-kjym/python/)
installed.

To reproduce the problem...

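1. Create a Python file, test.py:
----------------
# Python 2.3 syntax; needs the cjkcodecs package for the EUC-JP codec.
import sys, urllib, cjkcodecs
f = urllib.urlopen(sys.argv[1])
for ln in f:
    # Decode each raw line from EUC-JP, then re-encode it as UTF-8.
    ln = ln.decode("cjkcodecs.euc-jp")
    # The trailing comma stops print from adding a second newline.
    print ln.encode("utf-8"),
----------------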

2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not.

The URL used will return an EUC-JP encoded page with some Japanese
characters in it. test.py reads the page line by line, decodes
the lines to Unicode, re-encodes them to UTF-8, and writes to a file.
Thus the output file should be a UTF-8 version of the EUC-JP web page.
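
As a rough illustration of the pipeline just described, the sketch below transcodes a byte stream line by line. It is not the original test.py: it assumes a modern Python, where the standard library's own `euc_jp` codec stands in for the third-party `cjkcodecs.euc-jp` name, and it writes bytes straight to a binary stream instead of using `print`. The function name `transcode_lines` and the sample page bytes are made up for the example.

```python
import io


def transcode_lines(src, dst, src_encoding="euc_jp", dst_encoding="utf-8"):
    """Decode each line from src_encoding and write it to dst re-encoded."""
    for raw_line in src:
        text = raw_line.decode(src_encoding)
        dst.write(text.encode(dst_encoding))


# In-memory stand-in for the web page: one table cell holding the
# EUC-JP bytes quoted in the post, followed by a plain ASCII line.
page = io.BytesIO(b"<td>\xbf\xa9\xa4\xd9\xa4\xeb</td>\n<td>taberu</td>\n")
out = io.BytesIO()
transcode_lines(page, out)
utf8_bytes = out.getvalue()
```

Splitting on newlines before decoding is safe here because the newline byte never occurs inside an EUC-JP multi-byte character.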

The first command runs test.py directly. The second command runs
the identical command from a Windows batch file. One should expect
out1.txt and out2.txt to be identical.

out1.txt (created by running test.py from the command line) is
correct (verify by opening out1.txt in Notepad, and selecting a
Japanese-capable font, e.g. Lucida Sans Unicode). The string in
the first cell of the HTML table is the three Japanese characters
for the word "taberu".

But in out2.txt (created by running test.py from a Windows .bat
file), instead of Japanese characters there, we see the ASCII text
string "A9D9EB". (The EUC-JP bytes of the actual Japanese characters
that should be there are \xBF\xA9\xA4\xD9\xA4\xEB, so the printed
hex digits seem to come from alternate bytes of the EUC-JP string.)
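
The "alternate bytes" observation can be checked with a few lines of Python. This is only a sanity check of the arithmetic, assuming a modern Python with the built-in `euc_jp` codec; the variable names are illustrative.

```python
import binascii

# EUC-JP bytes for the three characters, as quoted in the post.
expected = b"\xbf\xa9\xa4\xd9\xa4\xeb"

# Keep only the second byte of each two-byte character.
second_bytes = expected[1::2]

# Render those bytes as upper-case hex digits, the way they show up
# as plain ASCII text in out2.txt.
hex_text = binascii.hexlify(second_bytes).decode("ascii").upper()

# Decode the full sequence to confirm it really is three characters.
word = expected.decode("euc_jp")
```

Here `hex_text` comes out as `A9D9EB`, the same string seen in out2.txt, and `word` holds the three characters of "taberu".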

In other lines with Japanese characters a similar effect is seen:
the first two Japanese characters are replaced with a string of
hex digits. Strangely, the remaining Japanese characters on the line
are not corrupted.

Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.

So it looks like some bad mojo between urllib and the Windows
batch environment.
Jul 18 '05 #1
"Stuart McGraw" <sm******@frii.RimoovThisToReply.com> writes:
[...]
2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not. [...] Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.
Hmm...

So it looks like some bad mojo between urllib and the Windows
batch environment.


Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:
----------------
python -u test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

Note the -u switch (for 'unbuffered', but also 'um, binary mode'
<wink>).
John
Jul 18 '05 #2
"John J. Lee" <jj*@pobox.com> wrote in message news:87************@pobox.com...
"Stuart McGraw" <sm******@frii.RimoovThisToReply.com> writes:
[...]
So it looks like some bad mojo between urllib and the Windows
batch environment.


Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:
----------------
python -u test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

Note the -u switch (for 'unbuffered', but also 'um, binary mode'
<wink>).


Did you try doing that? Did it work for you? I just tried here, and
still have the same problem.

Even worse, in the original script that the test script is derived from
I encountered a new problem. Intermixed with the web page data
returned by urllib are bits and pieces (10-20 characters long) of local
file and directory names. It only happens when reading some web pages
(EUC-JP encoded, as with the original problem), but I'm wondering
if there are some single-byte/double-byte character issues with urllib.
That would be surprising to me given that urllib is shipped with the
Python distribution; I would think that any core libs would be pretty
bombproof. (Am I being naive? :-) Of course, it's still possible I hosed
something in my script, so I will double check...

Jul 18 '05 #3
On Sat, 18 Sep 2004 16:23:40 -0600, "Stuart McGraw" <sm******@frii.RimoovThisToReply.com> wrote:
I just spent a $*#@!*&^&% hour registering at ^$#@#%^
SourceForge and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

================================================

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD

Regards,
Bengt Richter
Jul 18 '05 #4
"Bengt Richter" <bo**@oz.net> wrote in message news:ci*************************@theriver.com...
Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD


Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single-byte data.

Jul 18 '05 #5
On Mon, 20 Sep 2004 07:33:13 -0600, "Stuart McGraw" <sm******@frii.RimoovThisToReply.com> wrote:
"Bengt Richter" <bo**@oz.net> wrote in message news:ci*************************@theriver.com...
Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD


Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single-byte data.

Hm, what happens if you make a test2.py and pass it the name of an output
file instead of piping the output from print? In fact, eliminate the
encoding and the line generator and everything, and just let test2 copy the entire
server data in one single read and write it in binary. I.e.,
open(sys.argv[2],'wb').write(urllib.urlopen(...).read())

That should show whether Python is seeing the identical input from the server.
Then you could do it line-wise (not with a print line ending in ",", but with
a binary file write). That would say whether line generation chunking on input
was doing anything to the data -- if possibly urllib is buffering/chunking
differently for interactive vs bat file. Just grasping at straws, but eliminating
chunking, piping, re/encoding, binary vs text mode doubts from the test should
show why interactive vs .bat is different IWT.
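
A minimal sketch of that test2 idea follows, written for a modern Python and not taken from the thread. The helper names (`save_whole`, `save_linewise`, `same_bytes`) are hypothetical, and an in-memory buffer stands in for the server response so the example runs without a network connection.

```python
import filecmp
import io
import os
import tempfile


def save_whole(response, out_path):
    """Write the entire response body with a single read, in binary mode."""
    with open(out_path, "wb") as out:
        out.write(response.read())


def save_linewise(response, out_path):
    """Write the body one line at a time, still in binary mode (no print)."""
    with open(out_path, "wb") as out:
        for line in response:
            out.write(line)


def same_bytes(path_a, path_b):
    """Compare two files by content, not just by size and timestamp."""
    return filecmp.cmp(path_a, path_b, shallow=False)


# Stand-in for the server response; a real run would pass the object
# returned by urlopen instead of this in-memory buffer.
body = b"<td>\xbf\xa9\xa4\xd9\xa4\xeb</td>\r\n<td>second row</td>\n"
workdir = tempfile.mkdtemp()
whole_path = os.path.join(workdir, "whole.bin")
lines_path = os.path.join(workdir, "lines.bin")

save_whole(io.BytesIO(body), whole_path)
save_linewise(io.BytesIO(body), lines_path)
identical = same_bytes(whole_path, lines_path)
```

Running the capture once from the prompt and once from the .bat file, then comparing the two outputs with `same_bytes`, would show whether the bytes already differ before any decoding or printing takes place.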

Also, your mention of two-character errors made me wonder about spurious BOMs
or such from encoding file substrings as though they were entire files?
Would a final print for a final '\n' do anything that might trigger a final flush
differently with potential cooking consequence? (Why the print with a space instead, BTW?)
What if you just do your own file.write output in binary and control everything?

Just some additional thoughts. Sorry the cmd vs bat thing didn't do anything.
BTW, what command line options are in use to start your interactive session
(it is console, not IDLE, right?). You didn't seem to have any (e.g. -u) in test.py.
Could the .BAT file be seeing a different environment? Could the http://.. need quoting?
I.e., could the server be seeing a glitched URL tail and be sending the same file but with some
different option?
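
One way to explore the quoting question is to model how cmd.exe treats percent signs on a line read from a batch file. There, %NAME% is replaced by the value of the variable NAME (or by nothing if it is undefined), %0 to %9 are script arguments, and %% yields a literal percent sign; at the interactive prompt an undefined %NAME% is left as typed. The sketch below is a simplified model of those rules, not cmd.exe itself, and `expand_batch_percents` is a made-up helper. It assumes the elided URL carries the percent-encoded form of the EUC-JP bytes quoted in the first post.

```python
def expand_batch_percents(line, env=None, args=()):
    """Rough model of '%' handling for a line read from a .bat file.

    env maps upper-case variable names to values; args holds %0..%9.
    """
    env = env or {}
    out = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch != "%":
            out.append(ch)
            i += 1
            continue
        nxt = line[i + 1:i + 2]
        if nxt == "%":
            # "%%" is a literal percent sign.
            out.append("%")
            i += 2
        elif nxt.isdigit():
            # "%0" to "%9" are script arguments; missing ones are empty.
            idx = int(nxt)
            out.append(args[idx] if idx < len(args) else "")
            i += 2
        else:
            end = line.find("%", i + 1)
            if end == -1:
                # An unmatched "%" is dropped.
                i += 1
            else:
                # "%NAME%": undefined names expand to nothing.
                out.append(env.get(line[i + 1:end].upper(), ""))
                i = end + 1
    return "".join(out)


# Percent-encoded form of the bytes \xBF\xA9\xA4\xD9\xA4\xEB (assumed).
encoded = "%BF%A9%A4%D9%A4%EB"
from_bat = expand_batch_percents(encoded)
escaped = expand_batch_percents(encoded.replace("%", "%%"))
```

Under this model the six encoded bytes collapse to the text A9D9EB, which matches the string reported in out2.txt, and doubling each percent sign keeps the sequence intact. Whether this accounts for the actual failure would need checking on the affected machine.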

Hope something gives you a useful idea. That's all I can think of for the moment ;-)

Regards,
Bengt Richter
Jul 18 '05 #6
"Stuart McGraw" <sm******@frii.RimoovThisToReply.com> writes:
"John J. Lee" <jj*@pobox.com> wrote in message news:87************@pobox.com... [...]
Just a guess

[...] Did you try doing that?
No. It was a guess.
[...] That would be surprising to me given that urllib is shipped with the
Python distribution, I would think that any core libs would be pretty
bombproof. (Am I being naive? :-) [...]

Possibly:

http://article.gmane.org/gmane.comp.python.devel/63911
But I have a fairly strong suspicion that this *isn't* a bug in urllib
or Python: I think urllib regards HTTP response data simply as a
binary string (as opposed to the case of URLs, where things are... uh,
complicated).

*I'm* certainly more naive about encodings, character sets &c. than
any Python module, though <wink>...

Of course, still possible I hosed


Yes, that's possible :-)
John
Jul 18 '05 #7
Thanks everyone for all the suggestions. I will follow up on them,
but not right now. I am about to move halfway around the world
for a few months so I will need to get settled in before I have
time to look into this more.
Jul 18 '05 #8
