bad data from urllib when run from MS .bat file

I just spent a $*#@!*&^&% hour registering at ^$#@#%^
SourceForge and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

================================================

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Note: To reproduce this problem, it helps to have East Asian font
support installed on the test system. In Windows 2000:
Control Panel,
Regional Options, General tab
check mark in Japanese in the "Language settings..." area.
Python also needs either the cjkcodecs (http://cjkpython.berlios.de/)
or Tamito KAJIYAMA's japanese codecs
(http://www.asahi-net.or.jp/~rd6t-kjym/python/)
installed.

To reproduce the problem...

1. Create a python file, test.py:
test.py:
----------------
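import sys, urllib, cjkcodecs
f = urllib.urlopen (sys.argv[1])
for ln in f:
    # Decode each raw EUC-JP line to unicode, then print it as UTF-8.
    # The trailing comma suppresses print's own newline.
    ln = ln.decode ("cjkcodecs.euc-jp")
    print ln.encode("utf-8"),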
----------------

2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not.

The URL used will return an EUC-JP encoded page with some Japanese
characters in it. test.py reads the page line by line, decodes
the lines to Unicode, re-encodes them to UTF-8, and writes to a file.
Thus the output file should be a UTF-8 version of the EUC-JP web page.

The first command runs test.py directly. The second command runs
the identical command from a Windows batch file. One should expect
out1.txt and out2.txt to be identical.
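
A byte-level comparison is a quick way to see exactly where out1.txt and out2.txt diverge. The following is only a sketch, written for a current Python 3 rather than the 2.3.4 used in this thread; `first_difference` and `compare_files` are hypothetical helper names, not part of any library.

```python
def first_difference(data_a, data_b):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(data_a, data_b)):
        if a != b:
            return offset
    # No mismatch in the shared prefix: one input may be longer than the other.
    if len(data_a) != len(data_b):
        return min(len(data_a), len(data_b))
    return None


def compare_files(path_a, path_b, context=16):
    """Read both files in binary mode and describe the first mismatch."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        data_a, data_b = fa.read(), fb.read()
    offset = first_difference(data_a, data_b)
    if offset is None:
        return None
    # Offset plus a few bytes from each file, starting at the mismatch.
    return offset, data_a[offset:offset + context], data_b[offset:offset + context]
```

Calling `compare_files("out1.txt", "out2.txt")` would return `None` for identical files, or the offset and a short slice of each file otherwise.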

out1.txt (created by running test.py from the command line) is
correct (verify by opening out1.txt in Notepad and selecting a
Japanese-capable font, e.g. Lucida Sans Unicode). The string in
the first cell of the HTML table is the three Japanese characters
for the word "taberu".

But in out2.txt (created by running test.py from a Windows .bat
file), instead of Japanese characters there, we see the ASCII text
string "A9D9EB". (The EUC-JP bytes of the actual Japanese characters
that should be there are \xBF\xA9\xA4\xD9\xA4\xEB, so the printed
hex digits seem to come from alternate bytes of the EUC-JP string.)
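
The "alternate bytes" observation can be checked directly. This is a small Python 3 sketch (not the Python 2.3.4 from the post) that uses the built-in `euc_jp` codec in place of cjkcodecs; the byte string is the one quoted above.

```python
# EUC-JP bytes quoted in the post for the word "taberu".
euc_bytes = b"\xBF\xA9\xA4\xD9\xA4\xEB"

# Every second byte (2nd, 4th, 6th), rendered as upper-case hex.
alternate = euc_bytes[1::2].hex().upper()

# Decode with the built-in codec, then re-encode as UTF-8,
# which is the same conversion test.py performs per line.
text = euc_bytes.decode("euc_jp")
utf8_bytes = text.encode("utf-8")

print(alternate)   # A9D9EB
print(len(text))   # 3
```

The six bytes decode to three characters, and the second byte of each pair spells out exactly the ASCII string seen in out2.txt.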

In other lines with Japanese characters a similar effect is seen:
the first two Japanese characters are replaced with a string of
hex digits. Strangely, the remaining Japanese characters on the line
are not corrupted.

Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.

So it looks like some bad mojo between urllib and the Windows
batch environment.
Jul 18 '05 #1
"Stuart McGraw" <sm******@frii. RimoovThisToRep ly.com> writes:
[...]
2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt

4. out1.txt and out2.txt should be identical. But they are not. [...] Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.
Hmm...

So it looks like some bad mojo between urllib and the Windows
batch environment.


Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:
----------------
python -u test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

Note the -u switch (for 'unbuffered', but also 'um, binary mode'
<wink>).
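
For readers unfamiliar with the "binary mode" remark: in Python 2 on Windows, `-u` also put stdout into binary mode, so "\n" was no longer rewritten to "\r\n" on output. The sketch below only models that difference in Python 3 with in-memory streams; `as_text_mode` and `as_binary_mode` are hypothetical names, and `newline="\r\n"` stands in for what a Windows text-mode stream does.

```python
import io


def as_text_mode(chunks):
    """Bytes produced when "\\n" is translated the way Windows text mode does."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n")
    for chunk in chunks:
        stream.write(chunk)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # stop the wrapper from closing raw when it is discarded
    return data


def as_binary_mode(chunks):
    """Bytes produced when the encoded text is written untouched."""
    raw = io.BytesIO()
    for chunk in chunks:
        raw.write(chunk.encode("utf-8"))
    return raw.getvalue()
```

Newline translation alone would add "\r" bytes but would not replace Japanese characters with hex digits, which fits the later report that `-u` made no difference.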
John
Jul 18 '05 #2
"John J. Lee" <jj*@pobox.co m> wrote in message news:87******** ****@pobox.com. ..
"Stuart McGraw" <sm******@frii. RimoovThisToRep ly.com> writes:
[...]
So it looks like some bad mojo between urllib and the Windows
batch environment.


Just a guess, without actually bothering to think about the numerology
in detail:

test.bat:
----------------
python -u test.py http://etext.lib.virginia.edu/cgi-lo...A4%D9%A4%EB_v1
----------------

Note the -u switch (for 'unbuffered', but also 'um, binary mode'
<wink>).


Did you try doing that? Did it work for you? I just tried here, and
still have the same problem.

Even worse, in the original script that the test script is derived from
I encountered a new problem. Intermixed with the web page data
returned by urllib are bits and pieces (10-20 characters long) of local
file and directory names. This only happens when reading some web pages
(EUC-JP encoded as with the original problem) but I'm wondering
if there are some single-byte/double-byte character issues with urllib.
That would be surprising to me given that urllib is shipped with the
Python distribution, I would think that any core libs would be pretty
bombproof. (Am I being naive? :-) Of course, still possible I hosed
something in my script, so I will double check...

Jul 18 '05 #3
On Sat, 18 Sep 2004 16:23:40 -0600, "Stuart McGraw" <sm******@frii. RimoovThisToRep ly.com> wrote:
I just spent a $*#@!*&^&% hour registering at ^$#@#%^
Sourceforce and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

============== =============== =============== ====

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD
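
The priority rule described here can be modelled in a few lines. This is a simplified Python 3 sketch for a single directory only (the real cmd.exe lookup also walks PATH and treats explicit extensions differently); `resolve_command` is a hypothetical helper.

```python
def resolve_command(name, files_present, pathext=".COM;.EXE;.BAT;.CMD"):
    """Return the first existing file named name + extension, in PATHEXT order."""
    present = {f.lower() for f in files_present}
    for ext in pathext.split(";"):
        candidate = (name + ext).lower()
        if candidate in present:
            return candidate
    return None


# With both files in the directory, plain "test" picks the .bat first.
print(resolve_command("test", ["test.cmd", "test.bat"]))  # test.bat
```

That is why the suggestion is to type `test.cmd` in full rather than just `test`.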

Regards,
Bengt Richter
Jul 18 '05 #4
"Bengt Richter" <bo**@oz.net> wrote in message news:ci******** *************** **@theriver.com ...
On Sat, 18 Sep 2004 16:23:40 -0600, "Stuart McGraw" <sm******@frii. RimoovThisToRep ly.com> wrote:
I just spent a $*#@!*&^&% hour registering at ^$#@#%^
Sourceforce and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...

============== =============== =============== ====

Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)

Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page, from a python script run from a MS Windows .BAT
file, but not when the same script is run from the command line.

Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD


Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single byte data.

Jul 18 '05 #5
On Mon, 20 Sep 2004 07:33:13 -0600, "Stuart McGraw" <sm******@frii. RimoovThisToRep ly.com> wrote:
"Bengt Richter" <bo**@oz.net> wrote in message news:ci******** *************** **@theriver.com ...
On Sat, 18 Sep 2004 16:23:40 -0600, "Stuart McGraw" <sm******@frii. RimoovThisToRep ly.com> wrote:
>I just spent a $*#@!*&^&% hour registering at ^$#@#%^
>Sourceforce and trying to submit a Python bug report
>but it still won't let me. I give up. Maybe someone who
>cares will see this post, or maybe it will save time for
>someone else who runs into this problem...
>
>============== =============== =============== ====
>
>Environment:
>- Microsoft Windows 2000 Pro
>- Python 2.3.4
>- urllib (version shipped with Python-2.3.4)
>
>Problem:
> urllib returns corrupted data when reading an EUC-JP encoded
> web page, from a python script run from a MS Windows .BAT
>file, but not when the same script is run from the command line.

Just a thought: in case your command line is being interpreted
by cmd.exe and .bat by something else (command.com?) you could
check if it makes a difference, e.g.,

copy test.bat test.cmd

and try it again? (explicitly as test.cmd, not just test, since any
same-name .com or .exe or .bat may have priority over .cmd)
You can probably investigate the latter by something like

[21:54] C:\pywk\junk>echo %pathext%
.COM;.EXE;.BAT;.CMD


Well, I'm pretty sure cmd.exe was executing it, but I tried your
suggestion to make absolutely sure. Same results :-(
Given the other (seeming) urllib problem I mentioned in another
post in this thread, which appeared without any involvement
of batch scripts, I am getting more and more suspicious that
urllib is buggy, at least with non-single byte data.

Hm, what happens if you make a test2.py and pass it the name of an output
file instead of piping the output from print? In fact, eliminate the
encoding and the line generator and everything, and just let test2 copy the entire
server data in one single read and write it in binary. I.e.,
open(sys.argv[2],'wb').write(urllib.urlopen(...).read())

That should show whether python is seeing the identical input from the server.
Then you could do it line-wise (not with a print line ending in ",", but with
a binary file write). That would say whether line generation chunking on input
was doing anything to the data -- if possibly urllib is buffering/chunking
differently for interactive vs bat file. Just grasping at straws, but eliminating
chunking, piping, re/encoding, binary vs text mode doubts from the test should
show why interactive vs .bat is different IWT.
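
The test2.py idea above could look roughly like this in a current Python 3, where `urllib.request.urlopen` replaces the Python 2 `urllib.urlopen`. It is a sketch under that assumption; `fetch_raw`, `fetch_lines` and `save_raw` are hypothetical names.

```python
import urllib.request


def fetch_raw(url):
    """Return the whole response body as bytes, with no decoding at all."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_lines(url):
    """Return the body as raw byte lines, as a line-wise loop would see them."""
    with urllib.request.urlopen(url) as response:
        return list(response)


def save_raw(url, out_path):
    """Write the body to out_path in binary mode; return the byte count."""
    data = fetch_raw(url)
    with open(out_path, "wb") as out:
        out.write(data)
    return len(data)
```

Running `save_raw(sys.argv[1], sys.argv[2])` once from the prompt and once from the .bat file, then comparing the two output files, would show whether Python receives identical bytes in both cases. Joining the result of `fetch_lines` should give the same bytes as `fetch_raw` if line chunking is harmless.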

Also, your mention of two-character errors made me wonder about spurious BOMs
or such from encoding file substrings as though they were entire files?
Would a final print for a final '\n' do anything that might trigger a final flush
differently with potential cooking consequence? (why the print with space instead BTW)?
What if you just do your own file.write output in binary and control everything?

Just some additional thoughts. Sorry the cmd vs bat thing didn't do anything.
BTW, what command line options are in use to start your interactive session
(it is console, not idle, right?). You didn't seem to have any (e.g. -u) in test.py.
Could the .BAT file be seeing a different environment? could the http://.. need quoting?
I.e., could the server be seeing a glitched url tail and be sending the same file but with some
different option?
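
The "glitched url tail" question can be explored without a network connection. In a batch file, cmd.exe treats `%NAME%` as a variable reference and substitutes an empty string when the variable is undefined, whereas at the interactive prompt an undefined reference is left as typed. The Python 3 sketch below is a deliberately simplified model of that substitution (`expand_like_batch` is a hypothetical helper and the real parser has more rules). It assumes the full URL contains the word's EUC-JP bytes in percent-escaped form, as the visible tail of the URL suggests, and it is offered as one possible explanation, not something confirmed in this thread.

```python
import re


def expand_like_batch(line, env=None):
    """Replace each %NAME% pair with env[NAME], or with nothing if undefined."""
    env = env or {}
    # Simplified: ignores %1-style arguments, %% escapes and other parser rules.
    return re.sub(r"%([^%]+)%", lambda m: env.get(m.group(1), ""), line)


# Percent-escaped form of the bytes \xBF\xA9\xA4\xD9\xA4\xEB from the post.
escaped_word = "%BF%A9%A4%D9%A4%EB"
print(expand_like_batch(escaped_word))  # A9D9EB
```

If this is what is happening, printing `repr(sys.argv[1])` at the top of test.py in both runs would show two different URLs, and doubling each percent sign (`%%`) in test.bat would be the usual way to pass the escapes through unchanged.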

Hope something gives you a useful idea. That's all I can think of for the moment ;-)

Regards,
Bengt Richter
Jul 18 '05 #6
"Stuart McGraw" <sm******@frii. RimoovThisToRep ly.com> writes:
"John J. Lee" <jj*@pobox.co m> wrote in message news:87******** ****@pobox.com. .. [...]
Just a guess

[...] Did you try doing that?
No. It was a guess.
[...] That would be surprising to me given that urllib is shipped with the
Python distribution, I would think that any core libs would be pretty
bombproof. (Am I being naive? :-) [...]

Possibly:

http://article.gmane.org/gmane.comp.python.devel/63911
But I have a fairly strong suspicion that this *isn't* a bug in urllib
or Python: I think urllib regards HTTP response data simply as a
binary string (as opposed to the case of URLs, where things are... uh,
complicated).

*I'm* certainly more naive about encodings, character sets &c. than
any Python module, though <wink>...

Of course, still possible I hosed


Yes, that's possible :-)
John
Jul 18 '05 #7
Thanks everyone for all the suggestions. I will follow up on them,
but not right now. I am about to move halfway around the world
for a few months so I will need to get settled in before I have
time to look into this more.
Jul 18 '05 #8
