byte count unicode string

>willie wrote:

>Marc 'BlackJack' Rintsch:

> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:

> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.

> >ustr = "example\xC2\x9D".decode('UTF-8')

> >num_chars = len(ustr) # 8

> >buf = ustr.encode('UTF-8')

> >num_bytes = len(buf) # 9

> >That is the correct way.

># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rationale
># for that or correct my misunderstanding?

>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.

name = post.input('name') # utf-8 string

# preferable
if bytes(name) > 50:
    send_http_headers()
    display_page_begin()
    display_error_msg('the name is too long')
    display_form(name)
    display_page_end()
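A minimal sketch of that check, written for Python 3 (where `str` is Unicode text and `bytes` is the 8-bit type) rather than the Python 2 used in this thread. The helper names `utf8_size` and `name_fits`, the `MAX_NAME_BYTES` constant, and the sample names are hypothetical, and it assumes the database column really does count UTF-8 bytes:

```python
MAX_NAME_BYTES = 50  # assumed limit of the varchar(50) column, in bytes


def utf8_size(text):
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def name_fits(text, limit=MAX_NAME_BYTES):
    """Return True if the UTF-8 form of text fits in limit bytes."""
    return utf8_size(text) <= limit


short_name = "example\u009d"  # 8 characters, 9 bytes (the sample from this thread)
long_name = "\u00e9" * 30     # 30 characters, but 60 bytes in UTF-8

# Character count, byte count, and whether the value fits the column.
print(len(short_name), utf8_size(short_name), name_fits(short_name))
print(len(long_name), utf8_size(long_name), name_fits(long_name))
```

The second value is the case the character count would miss: 30 characters looks safe for a 50-unit column, yet the encoded form is too long.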

# If I have a form with many input elements,
# I have to convert each to a byte string
# before i can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

buf = name.encode('UTF-8')
num_bytes = len(buf)
# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.
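If the temporary copy really is a concern, the UTF-8 length can be worked out from the code points alone, because the encoded width of each code point depends only on its value. The following Python 3 sketch is an illustration, not a builtin: `utf8_len` is a hypothetical name, it assumes the string contains no lone surrogates (which `encode` would reject), and a pure-Python loop like this will usually be slower than simply encoding.

```python
def utf8_len(text):
    """Count the bytes of the UTF-8 form of text without building it."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:         # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:      # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:    # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                 # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


sample = "example\u009d"
# Both counts should agree; the second one allocates the encoded copy.
print(utf8_len(sample), len(sample.encode("utf-8")))
```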

Sep 20 '06 #1


John Machin

willie wrote:

>Marc 'BlackJack' Rintsch:
>>
> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:

> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.

> >ustr = "example\xC2\x9D".decode('UTF-8')

> >num_chars = len(ustr) # 8

> >buf = ustr.encode('UTF-8')

> >num_bytes = len(buf) # 9

> >That is the correct way.

># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rationale
># for that or correct my misunderstanding?

>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

># Sorry for the confusion. My use case is a web app that
># only deals with UTF-8 strings. I want to prevent silent
># truncation of the data, so I want to validate the number
># of bytes that make up the unicode string before sending
># it to the database to be written.

># For instance, say I have a name column that is varchar(50).
># The 50 is in bytes not characters. So I can't use the length of
># the unicode string to check if it's over the maximum allowed bytes.

What is the database API expecting to get as an arg: a Python unicode
object, or a Python str (8-bit, presumably encoded in utf-8) ?

>name = post.input('name') # utf-8 string

You are confusing the hell out of yourself. You say that your web app
deals only with UTF-8 strings. Where do you get "the unicode string"
from??? If name is a utf-8 string, as your comment says, then len(name)
is all you need!!!

*PLEASE* print type(name), repr(name) so that we can see what type it
is!!
If it says the type is str, then it's an 8-bit string, (presumably)
encoded in utf-8.
If it says the type is unicode, then please explain "web app that only
deals with UTF-8 strings" ...
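The distinction being asked about, sketched in Python 3 terms: there `bytes` plays the role of Python 2's 8-bit `str`, and `str` plays the role of `unicode`. The helper `byte_length` is hypothetical and assumes UTF-8 is the encoding the application uses:

```python
def byte_length(value, encoding="utf-8"):
    """Return the size in bytes of value, whichever string type it is."""
    if isinstance(value, bytes):
        # Already encoded: len() counts bytes directly.
        return len(value)
    if isinstance(value, str):
        # Unicode text: len() would count code points, so encode first.
        return len(value.encode(encoding))
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# The same name as an encoded byte string and as Unicode text.
for name in (b"example\xc2\x9d", "example\u009d"):
    print(type(name), repr(name), len(name), byte_length(name))
```

For the byte string, `len(name)` is already the answer; for the Unicode string it is one short, which is why knowing the type of `name` matters.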

># preferable
>if bytes(name) > 50:
>    send_http_headers()
>    display_page_begin()
>    display_error_msg('the name is too long')
>    display_form(name)
>    display_page_end()

># If I have a form with many input elements,
># I have to convert each to a byte string
># before i can see how many bytes make up the
># unicode string. That's very memory inefficient
># with large text fields - having to duplicate each
># one to get its size in bytes:

They'd be garbage collected unless you worked very hard to hang on to
them. How large is "large"?
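One way to check this on CPython is to watch allocations with the standard `tracemalloc` module. This Python 3 sketch uses a made-up field of one million two-byte characters and a hypothetical helper `utf8_byte_count`; exact figures depend on the interpreter, so it only compares the peak against what is still held after the call returns:

```python
import tracemalloc


def utf8_byte_count(text):
    """Return the UTF-8 size of text; the encoded copy is only a temporary."""
    return len(text.encode("utf-8"))


# Hypothetical large form field, created before tracing starts.
text = "\u00e9" * 1_000_000

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()
count = utf8_byte_count(text)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

held_after = current - baseline  # memory still allocated after the call
peak_extra = peak - baseline     # high-water mark during the call

print("bytes in UTF-8 form:", count)
print("peak extra allocation:", peak_extra)
print("still held afterwards:", held_after)
```

The temporary copy shows up in the peak but not in what remains allocated once the call returns, since nothing keeps a reference to it.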

Sep 20 '06 #2
