Issue with unicode

4 New Member
Hi guys,
I have written a library for reading XML files, using the SAX library.
I take the strings (unicode strings) returned by the parser and pass them on to a fuzzing library that I also wrote.
The strings contain hex escape sequences. For example:
String = "a\x01\x02\x0c"
The problem is that SAX returns such a string as a series of literal characters (backslash, "x", digits) rather than as escaped hex characters.
If you do len(String) you get 13 instead of 4.
I have tried many combinations of encoding and decoding to get the original string back, without success.
I have been stuck on this for about two days and it is blocking my coding.
Please help me with this ASAP.
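
A minimal sketch of the symptom, written for Python 3 (the post is about Python 2.4.3, where the same comparison applies). The name `from_parser` is hypothetical: it only simulates what the parser appears to hand back, assuming the XML file contains the backslash sequences as literal text.

```python
# A string literal with escapes: Python turns each \xNN into one character.
escaped = "a\x01\x02\x0c"

# Hypothetical stand-in for the parser output: a raw string, so each
# backslash, "x" and hex digit stays an ordinary character.
from_parser = r"a\x01\x02\x0c"

print(len(escaped))            # 4
print(len(from_parser))        # 13
print(escaped == from_parser)  # False
```

The two strings look alike when typed but hold different characters, which is why no plain type conversion can turn one into the other.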

Details:
OS: Linux
Python: 2.4.3
XML library: SAX
Jul 24 '07 #1
bartonc
6,596 Recognized Expert Expert
It should be as easy as converting to a string. Given a unicode string:
>>> a = "ÀÈÌÒÙ"
the built-in str() should convert it:
>>> b = str(a)
>>> b
'\xc0\xc8\xcc\xd2\xd9'
>>>
Jul 24 '07 #2
anandrage
4 New Member
Thanks for the reply.
This won't work, however. str() is the first thing everyone tries.
The problem is not the conversion itself but its result: a plain typecast changes the type while the contents stay the same.
That is, the characters are still contiguous and the "escaped" meaning is missing.
I want the unicode string to be treated as escaped hex characters, not as one character each for "\", "x" and so on.
This happens because the SAX parser returns the input that way.

When the string is u"\x01\x02", I want len(string) to return 2 and not 8, so that \x01 is treated as one character.

If you make a plain declaration at the Python command prompt, such as
a="\x01\x02"
then len(a) returns 2 and not 8. You can check this yourself. I want the same behavior from my string, because that is what my fuzzing library expects.

I hope this makes the issue a little clearer.

Thanks a lot for the prompt reply. I need a solution ASAP or else I'm in deep trouble.
Jul 25 '07 #3
bvdet
2,851 Recognized Expert Moderator Specialist


If I understand your problem correctly, you can do something like this:
>>> b
u'\x01\x02'
>>> str(b)
'\x01\x02'
>>> repr(str(b))
"'\\x01\\x02'"
>>> len(repr(str(b)))
10
>>> s = repr(str(b)).replace("'", "")
>>> s
'\\x01\\x02'
>>> len(s)
8
Jul 25 '07 #4
anandrage
4 New Member
Hi,
Thanks for the reply.
I'm afraid you still haven't understood the actual issue.
I want the length of the string to be 2, not 10 or 8.
len(string) should count each escaped character once, rather than counting the string as a series of raw characters as it does here.

I want something like this.

What normally happens on the command line:
>>> s = u"\x01\x02"
>>> len(s)
2

However, if I do the same with the unicode string sent back by the SAX parser, I get this:
>>> s
"\x01\x02"
>>> len(s)
8
>>> type(s)
<type 'unicode'>
>>> s = str(s)    # or: s = s.encode("utf-8")
>>> s
"\x01\x02"
>>> len(s)
8

This is not what I want. What I want is this:
>>> s = <some function>(s)
>>> len(s)
2
>>> s
"\x01\x02"

I think this makes it a bit clearer.

If someone can give me that function (<some function> above), my problem will be solved.

Thanks for trying anyhow.
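
One possible shape for the requested function, as a sketch only. It is written for Python 3 and uses the standard `unicode_escape` codec. The helper name `unescape` and the sample value `from_parser` are hypothetical, and the sketch assumes the parser really returns literal backslash text. On Python 2.4 the rough equivalent should be `s.encode("ascii").decode("string_escape")`, which has not been tested here.

```python
def unescape(text):
    # Encode to ASCII bytes first; backslashreplace rewrites any non-ASCII
    # character as an escape so it survives the round trip.
    raw = text.encode("ascii", "backslashreplace")
    # unicode_escape then interprets every backslash escape in the bytes.
    return raw.decode("unicode_escape")

from_parser = r"\x01\x02"   # hypothetical parser output: 8 literal characters
fixed = unescape(from_parser)

print(len(from_parser))     # 8
print(len(fixed))           # 2
print(fixed == "\x01\x02")  # True
```

Note that the codec interprets every escape it recognises, so literal text such as `\n` or `\t` in the input would be converted as well. If only `\xNN` sequences should change, a narrower approach is needed.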
Jul 26 '07 #5
bvdet
2,851 Recognized Expert Moderator Specialist
Sorry, but I took a part of your previous post out of context.
I cannot explain why your string is represented as unicode but returns a length of 8. You might try something like this:
import re

def lenUstr(s):
    # Count the literal \xNN escape sequences in s (two hex digits each).
    patt = r'\\x[0-9a-fA-F]{2}'
    return len(re.findall(patt, s))

s = u'\x01\x02'
print type(s)

# repr() exposes the escapes as literal backslash text, which the pattern counts.
print lenUstr(repr(str(s)).replace("'", ""))

Output:
<type 'unicode'>
2
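
The counting approach above reports the intended length but leaves the string unchanged. A variation that converts the string is sketched below, in Python 3 and using only the standard `re` module. The names `HEX_ESCAPE`, `decode_hex_escapes` and `from_parser` are hypothetical, and the sketch assumes the input holds literal `\xNN` sequences with exactly two hex digits.

```python
import re

# Matches a literal backslash, "x", and exactly two hex digits.
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")

def decode_hex_escapes(text):
    # Replace each literal \xNN with the single character of code point 0xNN.
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

from_parser = r"a\x01\x02\x0c"   # hypothetical parser output: 13 characters
fixed = decode_hex_escapes(from_parser)

print(len(fixed))                 # 4
print(fixed == "a\x01\x02\x0c")   # True
```

Unlike a general escape decoder, this touches only `\xNN` sequences, so other backslash text in the input passes through unchanged.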
Jul 26 '07 #6
