byte count unicode string

>willie wrote:

>Marc 'BlackJack' Rintsch:

> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:

> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.

> >ustr = "example\xC2\x9D".decode('UTF-8')

> >num_chars = len(ustr) # 8

> >buf = ustr.encode('UTF-8')

> >num_bytes = len(buf) # 9

> >That is the correct way.

># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rationale
># for that or correct my misunderstanding?

>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.

name = post.input('name') # utf-8 string

# preferable
if bytes(name) > 50:
    send_http_headers()
    display_page_begin()
    display_error_msg('the name is too long')
    display_form(name)
    display_page_end()
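
A runnable sketch of the length check above, written for Python 3, where text is `str` and encoded data is `bytes` (in the Python 2 of this thread, the same call would be made on a `unicode` object). The helper name `fits_in_bytes` and the 50-byte limit are illustrative assumptions, not part of any framework or database API.

```python
def fits_in_bytes(text, limit, encoding="utf-8"):
    """Return True if `text`, once encoded, occupies at most `limit` bytes."""
    # Encoding builds a temporary bytes object. Nothing keeps a reference
    # to it after len() returns, so it does not accumulate across fields.
    return len(text.encode(encoding)) <= limit


# 50 ASCII characters fit a 50-byte column; 50 two-byte characters do not.
print(fits_in_bytes("a" * 50, 50))
print(fits_in_bytes("\u00e9" * 50, 50))
```

The check compares bytes, not characters, so a name of 50 accented characters is rejected even though `len()` of the text is 50.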

# If I have a form with many input elements,
# I have to convert each to a byte string
# before I can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

buf = name.encode('UTF-8')
num_bytes = len(buf)
# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.

Sep 20 '06 #1


John Machin

willie wrote:
>Marc 'BlackJack' Rintsch:
>>
> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:

> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.

> >ustr = "example\xC2\x9D".decode('UTF-8')

> >num_chars = len(ustr) # 8

> >buf = ustr.encode('UTF-8')

> >num_bytes = len(buf) # 9

> >That is the correct way.

># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rationale
># for that or correct my misunderstanding?

>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.

What is the database API expecting to get as an arg: a Python unicode
object, or a Python str (8-bit, presumably encoded in utf-8) ?

>
name = post.input('name') # utf-8 string

You are confusing the hell out of yourself. You say that your web app
deals only with UTF-8 strings. Where do you get "the unicode string"
from??? If name is a utf-8 string, as your comment says, then len(name)
is all you need!!!

*PLEASE* print type(name), repr(name) so that we can see what type it
is!!
If it says the type is str, then it's an 8-bit string, (presumably)
encoded in utf-8.
If it says the type is unicode, then please explain "web app that only
deals with UTF-8 strings" ...
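
The type check asked for above can be sketched as follows. This is a Python 3 rendering, which is an assumption about the reader's interpreter: the thread's Python 2 `str` (8-bit string) corresponds to Python 3 `bytes`, and Python 2 `unicode` corresponds to Python 3 `str`. The function name `describe` is hypothetical, and the sample value is the one from the first post.

```python
def describe(value):
    """Return the kind of string and the length that len() measures for it."""
    if isinstance(value, bytes):
        # Encoded data: len() is already the byte count.
        return "bytes", len(value)
    if isinstance(value, str):
        # Text: len() counts code points, not bytes.
        return "str", len(value)
    raise TypeError("expected bytes or str, got %r" % type(value))


encoded = b"example\xc2\x9d"       # UTF-8 encoded data
decoded = encoded.decode("utf-8")  # the corresponding text

print(type(encoded), repr(encoded), describe(encoded))
print(type(decoded), repr(decoded), describe(decoded))
```

If the value arriving from the form is already encoded, `len()` gives the byte count directly and no extra copy is needed.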

>
# preferable
if bytes(name) > 50:
    send_http_headers()
    display_page_begin()
    display_error_msg('the name is too long')
    display_form(name)
    display_page_end()

# If I have a form with many input elements,
# I have to convert each to a byte string
# before I can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

They'd be garbage collected unless you worked very hard to hang on to
them. How large is "large"?
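
If the temporary copy really is a concern, the UTF-8 length can be computed from the code points alone. The following is only a sketch in Python 3, not a builtin and not something proposed in this thread; the name `utf8_len` is hypothetical. It relies on the UTF-8 rule that code points below U+0080 take 1 byte, below U+0800 take 2, below U+10000 take 3, and the rest take 4.

```python
def utf8_len(text):
    """Count the bytes `text` would occupy in UTF-8 without encoding it."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3
        else:
            total += 4
    return total


sample = "example\u009d"
# Both expressions measure the same quantity; only the second builds a copy.
print(utf8_len(sample), len(sample.encode("utf-8")))
```

A pure-Python loop like this is typically slower than `encode()`, so it trades speed for memory. It also counts a lone surrogate as 3 bytes, whereas `encode('utf-8')` would raise an error for one.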

Sep 20 '06 #2
