byte count unicode string

>willie wrote:
>Marc 'BlackJack' Rintsch:
> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:
> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.
> >ustr = "example\xC2\x9D".decode('UTF-8')
> >num_chars = len(ustr) # 8
> >buf = ustr.encode('UTF-8')
> >num_bytes = len(buf) # 9
> >That is the correct way.
># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rationale
># for that or correct my misunderstanding?
>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.

name = post.input('name') # utf-8 string

# preferable
if bytes(name) > 50:
    send_http_headers()
    display_page_begin()
    display_error_msg('the name is too long')
    display_form(name)
    display_page_end()

# If I have a form with many input elements,
# I have to convert each to a byte string
# before I can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

buf = name.encode('UTF-8')
num_bytes = len(buf)
# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.
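
The length check described above can be sketched as follows. This is only an illustration and assumes Python 3, where the thread's `unicode` type is `str` and the 8-bit `str` is `bytes`. The names `MAX_NAME_BYTES` and `fits_in_column` are hypothetical and do not come from the thread or from any database API.

```python
# Assumes Python 3: str holds code points, bytes holds encoded data.
MAX_NAME_BYTES = 50  # hypothetical limit, mirroring the varchar(50) example


def fits_in_column(value: str, max_bytes: int = MAX_NAME_BYTES) -> bool:
    """Return True if the UTF-8 encoding of value fits in max_bytes bytes."""
    # The encoded copy is a temporary; nothing keeps a reference to it
    # once len() has returned.
    return len(value.encode("utf-8")) <= max_bytes


name = "example\u009d"          # 8 code points, 9 bytes in UTF-8
print(len(name))                # 8
print(fits_in_column(name, 9))  # True
print(fits_in_column(name, 8))  # False
```

The check still builds the encoded copy, as the thread says it must, but the copy only lives for the duration of the call.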
Sep 20 '06 #1
willie wrote:
>willie wrote:
>Marc 'BlackJack' Rintsch:
>>
> >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:
> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.
> >ustr = "example\xC2\x9D".decode('UTF-8')
> >num_chars = len(ustr) # 8
> >buf = ustr.encode('UTF-8')
> >num_bytes = len(buf) # 9
> >That is the correct way.
># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rationale
># for that or correct my misunderstanding?
>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?

># Sorry for the confusion. My use case is a web app that
># only deals with UTF-8 strings. I want to prevent silent
># truncation of the data, so I want to validate the number
># of bytes that make up the unicode string before sending
># it to the database to be written.
>
># For instance, say I have a name column that is varchar(50).
># The 50 is in bytes not characters. So I can't use the length of
># the unicode string to check if it's over the maximum allowed bytes.

What is the database API expecting to get as an arg: a Python unicode
object, or a Python str (8-bit, presumably encoded in UTF-8)?

>name = post.input('name') # utf-8 string

You are confusing the hell out of yourself. You say that your web app
deals only with UTF-8 strings. Where do you get "the unicode string"
from??? If name is a utf-8 string, as your comment says, then len(name)
is all you need!!!

*PLEASE* print type(name), repr(name) so that we can see what type it
is!!
If it says the type is str, then it's an 8-bit string, (presumably)
encoded in utf-8.
If it says the type is unicode, then please explain "web app that only
deals with UTF-8 strings" ...
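
For readers on Python 3, where `unicode` became `str` and the 8-bit `str` became `bytes`, the inspection asked for above might look roughly like this. `describe` is a hypothetical helper introduced only for illustration; the point is that `len()` already gives the byte count when the value is an encoded byte string, and the code point count when it is decoded text.

```python
def describe(value):
    """Return (type name, repr, length) for a str or bytes value."""
    return type(value).__name__, repr(value), len(value)


raw = b"example\xc2\x9d"    # UTF-8 encoded bytes, e.g. as read from a request
text = raw.decode("utf-8")  # the same data decoded to text

kind, shown, size = describe(raw)
print(kind, size)  # bytes 9

kind, shown, size = describe(text)
print(kind, size)  # str 8
```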
>
># preferable
>if bytes(name) > 50:
>    send_http_headers()
>    display_page_begin()
>    display_error_msg('the name is too long')
>    display_form(name)
>    display_page_end()
>
># If I have a form with many input elements,
># I have to convert each to a byte string
># before I can see how many bytes make up the
># unicode string. That's very memory inefficient
># with large text fields - having to duplicate each
># one to get its size in bytes:

They'd be garbage collected unless you worked very hard to hang on to
them. How large is "large"?
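
If the temporary copy really were a concern, the UTF-8 length can in principle be computed from the code points alone. The following is a minimal sketch, assuming Python 3; `utf8_len` is a hypothetical name, not a builtin. It relies only on the fixed UTF-8 width ranges (1 byte below U+0080, 2 below U+0800, 3 below U+10000, 4 otherwise).

```python
def utf8_len(text: str) -> int:
    """Count the bytes text would occupy in UTF-8 without encoding it."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            # Lone surrogates are counted as 3 bytes here, although
            # str.encode("utf-8") would refuse to encode them.
            total += 3
        else:
            total += 4
    return total


sample = "example\u009d"
print(utf8_len(sample))             # 9
print(len(sample.encode("utf-8")))  # 9
```

This trades the short-lived encoded copy for a Python-level loop, so it is likely slower than simply encoding; whether it is ever worthwhile would need measuring on real input sizes.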

Sep 20 '06 #2
