473,387 Members | 1,904 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

byte count unicode string

>willie wrote:
>Marc 'BlackJack' Rintsch:
> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie
wrote:
> ># What's the correct way to get the
# byte count of a unicode (UTF-8) string?
# I couldn't find a builtin method
# and the following is memory inefficient.
> >ustr = "example\xC2\x9D".decode('UTF-8')
> >num_chars = len(ustr) # 8
> >buf = ustr.encode('UTF-8')
> >num_bytes = len(buf) # 9
> >That is the correct way.
># Apologies if I'm being dense, but it seems
# unusual that I'd have to make a copy of a
# unicode string, converting it into a byte
# string, before I can determine the size (in bytes)
# of the unicode string. Can someone provide the rational
# for that or correct my misunderstanding?
>You initially asked "What's the correct way to get the byte countof a
unicode (UTF-8) string".

It appears you meant "How can I find how many bytes there are in the
UTF-8 representation of a Unicode string without manifesting the UTF-8
representation?".

The answer is, "You can't", and the rationale would have to be that
nobody thought of a use case for counting the length of the UTF-8 form
but not creating the UTF-8 form. What is your use case?
# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.

name = post.input('name') # utf-8 string

# preferable
if bytes(name) 50:
send_http_headers()
display_page_begin()
display_error_msg('the name is too long')
display_form(name)
display_page_end()

# If I have a form with many input elements,
# I have to convert each to a byte string
# before i can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

buf = name.encode('UTF-8')
num_bytes = len(buf)
# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.
Sep 20 '06 #1
1 8294
willie wrote:
willie wrote:
>Marc 'BlackJack' Rintsch:
>>
> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie
wrote:
> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.
> >ustr = "example\xC2\x9D".decode('UTF-8')
> >num_chars = len(ustr) # 8
> >buf = ustr.encode('UTF-8')
> >num_bytes = len(buf) # 9
> >That is the correct way.
># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rational
># for that or correct my misunderstanding?
>You initially asked "What's the correct way to get the byte countof a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.
What is the database API expecting to get as an arg: a Python unicode
object, or a Python str (8-bit, presumably encoded in utf-8) ?
>
name = post.input('name') # utf-8 string
You are confusing the hell out of yourself. You say that your web app
deals only with UTF-8 strings. Where do you get "the unicode string"
from??? If name is a utf-8 string, as your comment says, then len(name)
is all you need!!!

*PLEASE* print type(name), repr(name) so that we can see what type it
is!!
If it says the type is str, then it's an 8-bit string, (presumably)
encoded in utf-8.
If it says the type is unicode, then please explain "web app that only
deals with UTF-8 strings" ...
>
# preferable
if bytes(name) 50:
send_http_headers()
display_page_begin()
display_error_msg('the name is too long')
display_form(name)
display_page_end()

# If I have a form with many input elements,
# I have to convert each to a byte string
# before i can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:
They'd be garbage collected unless you worked very hard to hang on to
them. How large is "large"?

Sep 20 '06 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: Robin Siebler | last post by:
I have no idea what is causing this error, or how to fix it. The full error is: Traceback (most recent call last): File "D:\ScriptRuntime\PS\Automation\Handlers\SCMTestToolResourceToolsBAT.py",...
0
by: Phillip Farber | last post by:
Hello, I'm posting here with a somewhat technical question in the hope of finding someone with experience coding C++ against the SP_API in OpenSP 1.5. I have an app that uses the SP_API to...
43
by: Vladimir | last post by:
Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2. Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4. But why that? Look: /* Each Unicode character...
3
by: Imran Aziz | last post by:
Hello All, I am getting the following error on our production server, and I dont get the same error on the development box. Unable to cast object of type 'System.Byte' to type 'System.String'. ...
4
by: ThunderMusic | last post by:
Hi, I have to go from Byte() to String, do some processing then reconvert the String to byte() but using ascii format, not unicode. I currently use a stream to write the char()...
3
by: chrisperkins99 | last post by:
It seems to me that str.count is awfully slow. Is there some reason for this? Evidence: ######## str.count time test ######## import string import time import array s = string.printable *...
2
by: willie | last post by:
Martin v. Löwis: Thanks for the thorough explanation. One last question about terminology then I'll go away :) What is the proper way to describe "ustr" below? <type 'unicode'>
5
by: jeremyje | last post by:
I'm writing some code that will convert a regular string to a byte for compression and then beable to convert that compressed string back into original form. Conceptually I have.... For...
4
by: Oleg Parashchenko | last post by:
Hello, I'm working on an unicode-aware application. I like to use "print" to debug programs, but in this case it was nightmare. The most popular result of "print" was: UnicodeDecodeError:...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.