Issue with unicode

4 New Member

Hi guys,
I have written a library for reading XML files, using the SAX library.
I take the strings (unicode strings) returned by the parser and pass them on to a fuzzing library that I wrote.
The strings contain hex escape sequences, for example:
String = "a\x01\x02\x0c"
The problem is that SAX returns a string like this as a series of literal characters (backslash, "x", digit, digit) and not as escaped hex characters.
If you do len(String) you get 13 instead of 4.
I have tried many combinations of encoding and decoding to get the original string back, but without success.
I have been stuck on this for about two days now and it is blocking my coding.
Please help me with this as soon as possible.
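
The length difference described above can be reproduced without SAX. This is only a minimal sketch in Python 3 syntax, assuming the XML text holds the literal characters backslash, "x" and two digits; the names `interpreted` and `literal` are illustrative and not part of the original library.

```python
# Escape sequences are interpreted only when Python parses a string literal
# in source code. Text that arrives from a file or a parser is not re-parsed.
interpreted = "a\x01\x02\x0c"   # 'a' plus three control characters
literal = r"a\x01\x02\x0c"      # raw string: every backslash, 'x' and digit is kept

print(len(interpreted))         # 4
print(len(literal))             # 13
print(interpreted == literal)   # False
```

So the parser is probably handing back the second form, and no encode() or str() call will change it, because those calls change the encoding and not the characters themselves.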

Details:
OS: Linux
Python: version 2.4.3
XML library: sax

Jul 24 '07 #1

bartonc

6,596

Recognized Expert Expert

It should be as easy as converting to a string. Given a unicode string:
>>> a = "ÀÈÌÒÙ"
the built in str() should convert it
>>> b = str(a)
>>> b
'\xc0\xc8\xcc\xd2\xd9'
>>>

Jul 24 '07 #2

anandrage

New Member

Thanks for the reply.
This won't work, however.
str() is the first thing everyone tries in order to convert.
The problem is not the conversion itself but the format of the result. A typecast like the one you mention does a regular conversion, but the format stays the same.
That is, the characters are still contiguous and the "escaped" meaning is not present.
I want the escape sequences in the unicode string to be treated as escaped hex characters, not as one character each for "\", "x" and so on.
This happens because the SAX parser returns the input in that form.

When the string is u"\x01\x02",
I want len(string) to return 2 and not 8,
so that \x01 is treated as one character.

If you use the Python command prompt and make a plain declaration like
a="\x01\x02"
then len(a) returns 2 and not 8. You can check this yourself. I want the same behavior from my string, because that is how my fuzzing library expects the string to arrive.

I hope I have made the issue a little clearer this time.

Thanks a lot for the prompt reply. I need a solution ASAP or else I'm in deep trouble.

Jul 25 '07 #3

bvdet

2,851

Recognized Expert Moderator Specialist

If I understand your problem correctly, you can do something like this:

>>> b
u'\x01\x02'
>>> str(b)
'\x01\x02'
>>> repr(str(b))
"'\\x01\\x02'"
>>> len(repr(str(b)))
10
>>> s = repr(str(b)).replace("'", "")
>>> s
'\\x01\\x02'
>>> len(s)
8

Jul 25 '07 #4

anandrage

New Member

Hi,
Thanks for the reply.
I'm afraid you still didn't understand the actual issue.
I want the length of the string to be 2, not 10 or 8.
len(string) should count each escape sequence as one character, instead of treating the string as a series of raw characters as we can see here.

I want something like this:

Code:
# What normally happens on the command line:
>>> s = u"\x01\x02"
>>> len(s)
2

# However, if I do the same with the unicode string sent back by the SAX parser, I get this:
>>> s
"\x01\x02"
>>> len(s)
8
>>> type(s)
<type 'unicode'>
>>> s = str(s)   # or: s = s.encode("utf-8")
>>> s
"\x01\x02"
>>> len(s)
8

# This is not what I want. What I want is this:
>>> s = <some Function>(s)
>>> len(s)
2
>>> s
"\x01\x02"
# I think this makes it a bit clearer.

If someone can give me that function (<some Function> above),
then my problem will be solved.

Thanks for trying anyhow.
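
One possible shape for the function asked for here is a pass through the `unicode_escape` codec. This is a hedged sketch in Python 3 syntax, not a confirmed fix for the SAX setup in this thread; `unescape`, `raw` and `fixed` are hypothetical names, and the sketch assumes the parser output contains only ASCII characters.

```python
def unescape(text):
    """Turn literal escape sequences such as \\x01 into the characters they name."""
    # unicode_escape decodes bytes, so encode first. Using "ascii" makes the
    # call fail loudly on non-ASCII input, which this sketch does not handle.
    return text.encode("ascii").decode("unicode_escape")

raw = r"a\x01\x02\x0c"   # stand-in for what the parser hands back
fixed = unescape(raw)

print(len(raw))                   # 13
print(len(fixed))                 # 4
print(fixed == "a\x01\x02\x0c")   # True
```

Note that this codec also interprets other escapes such as \n, \t and \uXXXX, which may or may not be wanted for fuzzing input. On Python 2, the `string_escape` codec plays a similar role for byte strings.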

Jul 26 '07 #5

bvdet

2,851

Recognized Expert Moderator Specialist

Sorry, but I took a part of your previous post out of context.
I cannot explain why your string is represented as unicode but returns a length of 8. You might try something like this:

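import re

def lenUstr(s):
    # Count literal \xHH sequences: a backslash, 'x', then two hex digits.
    # Matching hex digits (not only \d) also covers values such as \x0c.
    patt = r'\\x[0-9a-fA-F]{2}'
    return len(re.findall(patt, s))

s = u'\x01\x02'
print type(s)
# repr() spells each character out as an escape sequence; strip the quotes.
print lenUstr(repr(str(s)).replace("'", ""))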

>>> <type 'unicode'>
2
>>>
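
Counting the escape sequences gives the length but not the converted string. The same kind of pattern can also drive a substitution that rewrites only \xHH sequences and leaves every other backslash alone. This is a sketch in Python 3 syntax; `HEX_ESCAPE` and `decode_hex_escapes` are hypothetical names, and on Python 2 a unicode input would need `unichr` in place of `chr`.

```python
import re

# A backslash, 'x', then exactly two hex digits (captured for conversion).
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")

def decode_hex_escapes(text):
    """Replace each literal \\xHH sequence with the single character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw = r"a\x01\x02\x0c and a path C:\temp"
decoded = decode_hex_escapes(raw)

print(len(HEX_ESCAPE.findall(raw)))          # 3
print(decoded.startswith("a\x01\x02\x0c"))   # True
print(decoded.endswith(r"C:\temp"))          # True
```

Compared with a general escape codec, this keeps sequences like \t in the data untouched, which can matter when the input is meant to be passed through as-is.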

Jul 26 '07 #6
