Issue with unicode

Hi guys,
I have written a library for reading XML files, using the SAX library.
I take the strings (unicode strings) returned by the parser and pass them on to a fuzzing library that I also wrote.
The strings contain hex escape sequences.
For example, they can be something like
String = "a\x01\x02\x0c"
The problem is that SAX returns such a string as a series of contiguous literal characters, not as escaped hex characters.
If you do len(String) you get 13 instead of 4.
I have tried many combinations of encoding and decoding to get the original string back, but without success.
I have been stuck on this for around two days now and it is blocking my coding.
Please help me with this ASAP.

Details:
OS:Linux
Python:Version 2.4.3.
XML Library:sax
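
A minimal sketch that reproduces the symptom described above. It is written for a current Python 3 interpreter rather than the 2.4.3 used in this thread, and the `TextCollector` class and the `<payload>` element are made up for illustration. XML 1.0 does not allow most control characters such as 0x01 in a document, so the file presumably stores them as visible backslash text, and that text is what the parser hands back.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects all character data it is given."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so accumulate them.
        self.chunks.append(content)


# The document holds a backslash, an 'x' and two hex digits per escape,
# not the control characters themselves.
xml_doc = b"<payload>a\\x01\\x02\\x0c</payload>"

handler = TextCollector()
xml.sax.parseString(xml_doc, handler)
parsed = "".join(handler.chunks)

# In a string literal the interpreter resolves the escapes.
literal = "a\x01\x02\x0c"

print(len(parsed))   # 13
print(len(literal))  # 4
```

The two values differ because the escapes in `literal` are resolved when the source code is compiled, while the parser only ever sees the 13 characters stored in the file.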
Jul 24 '07 #1
bartonc
6,596 Expert 4TB
It should be as easy as converting to a string. Given a unicode string:
>>> a = ""
the built-in str() should convert it
>>> b = str(a)
>>> b
'\xc0\xc8\xcc\xd2\xd9'
>>>
Jul 24 '07 #2
Thanks for the reply.
This won't work, however.
str() is the first thing everyone tries for the conversion.
The problem is not the conversion itself but the format of the result.
A typecast like the one you mention converts the type, but the content stays the same:
the characters are still contiguous and the "escaped" meaning is not there.
I want the unicode string to be treated as escaped hex characters, not as one character each for "\", "x" and so on.
This happens because the SAX parser returns the input in that form.

When the string is u"\x01\x02",
I want len(string) to return 2 and not 8,
so that \x01 is treated as one character.

If you use the Python command prompt and make a plain declaration like
a="\x01\x02"
len(a) will return 2 and not 8. You can check this yourself. I want the same behavior from my string, because that is how my fuzzing library expects the string to arrive.

I hope I have made the issue a little clearer this time.

Thanks a lot for the prompt reply. I need a solution ASAP or else I'm in deep trouble.
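
To make the distinction concrete, here is a small sketch (Python 3 syntax, not the thread's 2.4.3). The raw string `from_parser` is only a stand-in for the text the SAX parser returns, which is an assumption about what that text contains. It shows that changing the type leaves the eight characters untouched.

```python
# Escapes in an ordinary literal are resolved by the interpreter: 2 characters.
from_source = "\x01\x02"

# A raw string keeps the backslashes, like text read from a file: 8 characters.
from_parser = r"\x01\x02"

print(len(from_source))  # 2
print(len(from_parser))  # 8

# Encoding changes the type (text to bytes), not the content.
as_bytes = from_parser.encode("utf-8")
print(len(as_bytes))     # 8

# Each backslash, 'x' and digit is its own character.
print(list(from_parser))
```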
Jul 25 '07 #3
bvdet
2,851 Expert Mod 2GB


If I understand your problem correctly, you can do something like this:
>>> b
u'\x01\x02'
>>> str(b)
'\x01\x02'
>>> repr(str(b))
"'\\x01\\x02'"
>>> len(repr(str(b)))
10
>>> s = repr(str(b)).replace("'", "")
>>> s
'\\x01\\x02'
>>> len(s)
8
Jul 25 '07 #4
Hi,
Thanks for the reply.
I'm afraid you still didn't understand the actual issue.
I want the length of the string to be 2, not 10 or 8.
len(string) should count the escaped characters, not treat the string as a series of raw characters as it does here.

I want something like this:

Code:
"What normally happens on the command line":
>>> s = u"\x01\x02"
>>> len(s)
2

However, if I do the same with the unicode string sent back by the SAX parser, I get this:
>>> s
"\x01\x02"
>>> len(s)
8
>>> type(s)
<type 'unicode'>
>>> s = str(s)    # or: s = s.encode("utf-8")
>>> s
"\x01\x02"
>>> len(s)
8

>>> "THIS IS NOT WHAT I WANT"
>>> "What I want follows"
>>> s = <some function>(s)
>>> len(s)
2
>>> s
"\x01\x02"
>>> "I THINK THIS MAKES IT A BIT CLEARER"

If someone can give me that function (<some function> above),
then my problem will be solved.

Thanks for trying anyhow.
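
One possible shape for the requested function, as a sketch rather than a confirmed fix for this setup. It relies on the standard `unicode_escape` codec, which interprets backslash escapes the way a string literal would. The name `unescape_hex` is hypothetical, and the code assumes the parser's text is pure ASCII. It is written and checked for Python 3; the same call chain is expected to work on Python 2 as well, where it returns a unicode object, but that has not been checked against 2.4.3 here.

```python
def unescape_hex(text):
    """Turn literal escapes such as backslash-x-0-1 into single characters.

    Hypothetical helper. Assumes `text` is pure ASCII; anything else makes
    encode() raise UnicodeEncodeError.
    """
    # encode() gives the raw bytes; the 'unicode_escape' codec then
    # resolves backslash escapes as a string literal would.
    return text.encode("ascii").decode("unicode_escape")


s = r"\x01\x02"            # stand-in for the text returned by the parser
print(len(s))              # 8

decoded = unescape_hex(s)
print(len(decoded))        # 2
print(decoded == "\x01\x02")  # True
```

Note that this codec also resolves other escapes such as `\n` or `\u0041`, which may or may not be wanted for fuzzing input.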
Jul 26 '07 #5
bvdet
2,851 Expert Mod 2GB
Sorry, but I took a part of your previous post out of context.
I cannot explain why your string is represented as unicode but returns a length of 8. You might try something like this:
import re

def lenUstr(s):
    # Count literal \xHH escape sequences (two hex digits each).
    patt = r'\\x[0-9a-fA-F]{2}'
    return len(re.findall(patt, s))

s = u'\x01\x02'
print type(s)

# repr() turns the control characters back into literal \xHH text.
print lenUstr(repr(str(s)).replace("'", ""))

>>> <type 'unicode'>
2
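
The counting approach above reports a length but leaves the string itself unchanged. As a variation, the sketch below uses the same regular-expression idea to build the decoded string, touching only `\xHH` sequences and nothing else. The names `HEX_ESCAPE` and `decode_hex_escapes` are hypothetical, and the code is Python 3 (on Python 2, `unichr` would take the place of `chr` for unicode input).

```python
import re

# A literal backslash, an 'x', then exactly two hex digits.
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def decode_hex_escapes(text):
    """Replace each literal backslash-x-HH with the character it denotes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


raw = r"a\x01\x02\x0c"     # stand-in for the parser's text
fixed = decode_hex_escapes(raw)

print(len(raw))    # 13
print(len(fixed))  # 4
```

Unlike a general escape codec, this leaves sequences such as a literal backslash followed by `n` as they are, which may be preferable when the input is meant to be passed through mostly untouched.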
Jul 26 '07 #6
