How to print first (national) char from unicode string encoded in utf-8?

sniipe

Hi,

I have a problem with a unicode string in Pylons templates (Mako). I want
to print the first char of my string, which is encoded in UTF-8 and then
with urllib.quote(), for example the string 'Łukasz':

${urllib.unquote(c.user.firstName).encode('latin-1')[0:1]}

and I received this error:

<type 'exceptions.UnicodeDecodeError'>: 'utf8' codec can't decode byte
0xc5 in position 0: unexpected end of data

When I change [0:1] to [0:2], everything is OK. I think it is
because of unicode and the utf-8 encoding (2 bytes).
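The slicing behaviour described here can be reproduced outside the template. The sketch below is written for Python 3 (the thread itself uses Python 2), and the variable names are made up for illustration. It assumes the failing step is a UTF-8 decode of the sliced bytes, which matches the filters.decode.utf8 call visible in the traceback later in the thread.

```python
# Python 3 sketch; the thread's own code is Python 2.
raw = "Łukasz".encode("utf-8")   # b'\xc5\x81ukasz': 'Ł' occupies two bytes

one_byte = raw[0:1]    # only the first half of the two-byte sequence for 'Ł'
two_bytes = raw[0:2]   # the complete two-byte sequence

try:
    one_byte.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc        # the decoder sees a truncated sequence

first_char = two_bytes.decode("utf-8")

print(repr(error))
print(first_char)
```

Slicing by two bytes only works because this particular character happens to need two bytes in UTF-8; other characters need one, three or four.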

How can I resolve this problem?

Best regards

Sep 1 '08 #1

sniipe

On 1 Sep, 15:10, "Marco Bizzarri" <marco.bizza...@gmail.com> wrote:

First: you're talking about utf8 encoding, but you've written latin1
encoding. Even though I do not know Mako templates, there should be no
problem in your snippet of code, if encoding is latin1, at least for
what I can understand.

Do not assume utf8 is a two byte encoding; utf8 is a variable length
encoding. Indeed,

'a' encoded as utf8 is 'a' (one byte)

'à' encoded as utf8 is '\xc3\xa0' (two bytes).
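To make the variable-length point concrete, here is a small Python 3 sketch (the thread uses Python 2, where the same bytes result). The sample characters other than 'a' and 'à' are picked here for illustration and do not come from the thread.

```python
# Python 3 sketch: how many bytes UTF-8 needs for a few sample characters.
samples = ["a", "à", "Ł", "€", "\U00010348"]  # last one lies outside the BMP

lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    print(repr(ch), ch.encode("utf-8"), lengths[ch])
```

So a fixed-width byte slice cannot be relied on to return exactly one character.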

Can you explain what you're trying to accomplish (rather than how
you're trying to accomplish it)?

Regards
Marco

--
Marco Bizzarri
http://notenotturne.blogspot.com/
http://iliveinpisa.blogspot.com/

When I do ${urllib.unquote(c.user.firstName)} without encoding to
latin-1, I get different chars than I want: not Łukasz but Åukasz
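The garbled output described above looks like the usual symptom of UTF-8 bytes being read as Latin-1. A minimal Python 3 sketch of that effect follows; the thread uses Python 2's urllib.unquote, so urllib.parse.unquote with an explicit encoding argument is used here only to imitate the two interpretations, and whether Mako or Pylons does exactly this is an assumption.

```python
# Python 3 sketch: the same percent-escaped bytes read with two encodings.
from urllib.parse import quote, unquote

quoted = quote("Łukasz")                         # UTF-8 bytes, percent-escaped

as_utf8 = unquote(quoted, encoding="utf-8")      # bytes read as UTF-8
as_latin1 = unquote(quoted, encoding="latin-1")  # same bytes read as Latin-1

print(quoted)
print(as_utf8)
print(repr(as_latin1))  # 'Å' followed by the invisible control char U+0081
```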

Sep 1 '08 #2

Marco Bizzarri

On Mon, Sep 1, 2008 at 3:25 PM, <sn****@gmail.com> wrote:

> When I do ${urllib.unquote(c.user.firstName)} without encoding to
> latin-1 I got different chars than I will get: no Łukasz but Åukasz
--
http://mail.python.org/mailman/listinfo/python-list

That's crazy. "string".encode('latin1') gives you a latin1 encoded
string; latin1 is a single byte encoding, therefore taking the first
byte should be no problem.

Have you tried:

urllib.unquote(c.user.firstName)[0].encode('latin1') or

urllib.unquote(c.user.firstName)[0].encode('utf8')

I'm assuming here that urllib.unquote(c.user.firstName) returns an
encodable string (which I'm absolutely not sure of), but if it does, this
should take the first 'character'.
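One caveat about the latin1 variant can be checked directly: the character from the original example, 'Ł', has no Latin-1 code point. The Python 3 sketch below tries three codecs on it; iso-8859-2 is included only as a contrast and is not mentioned in the thread.

```python
# Python 3 sketch: which of these codecs can represent 'Ł' (U+0141)?
char = "Ł"
results = {}

for codec in ("latin-1", "iso-8859-2", "utf-8"):
    try:
        results[codec] = char.encode(codec)
    except UnicodeEncodeError:
        results[codec] = None   # the codec has no mapping for this character

for codec, encoded in results.items():
    print(codec, encoded)
```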

Regards
Marco
--
Marco Bizzarri
http://notenotturne.blogspot.com/
http://iliveinpisa.blogspot.com/

Sep 1 '08 #3

Mark Tolonen

The OP stated that the original string was "encoded in UTF-8 and
urllib.quote()", so after urllib.unquote the string is in UTF-8 format.
This must be decoded into a Unicode string before removing the first
character:

urllib.unquote(c.user.firstName).decode('utf-8')[0]

The next problem is that the character in the OP's example string 'Ł' is not
present in the latin-1 encoding, but using utf-8 encoding demonstrates that
the full two-byte UTF-8 encoded character is collected:

>>> import urllib
>>> name = urllib.quote(u'Łukasz'.encode('utf-8'))
>>> name
'%C5%81ukasz'
>>> urllib.unquote(name).decode('utf-8')[0].encode('utf-8')
'\xc5\x81'
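For readers on Python 3, the same round trip might look like the sketch below. It is a translation, not part of the original post: urllib.quote and urllib.unquote moved to urllib.parse, and unquote_to_bytes is used so that the decode step stays explicit.

```python
# Python 3 sketch of the round trip shown in the session above.
from urllib.parse import quote, unquote_to_bytes

name = quote("Łukasz".encode("utf-8"))   # percent-escape the UTF-8 bytes
raw = unquote_to_bytes(name)             # back to the UTF-8 byte string
text = raw.decode("utf-8")               # bytes -> text, one item per character
first = text[0]                          # first character, not first byte

print(name)
print(raw)
print(first, first.encode("utf-8"))
```

Indexing happens after decoding, so the result is a whole character regardless of how many bytes it needs.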

-Mark

Sep 2 '08 #4

sniipe

@Mark, when I tried urllib.unquote(c.user.firstName).decode('utf-8')[0].encode('utf-8'),
I received this message:

>> return render('/reports/create_report_step2.mako')

Module pylons.templating:344 in render
<< **cache_args)
   return pylons.buffet.render(template_name=template, fragment=fragment,
   format=format, namespace=kargs, **cache_args)
>> format=format, namespace=kargs, **cache_args)

Module pylons.templating:229 in render
<< log.debug("Rendering template %s with engine %s", full_path, engine_name)
   return engine_config['engine'].render(namespace, template=full_path,
   **options)
>> **options)

Module mako.ext.turbogears:49 in render
<< info.update(self.extra_vars_func())

   return template.render(**info)
>> return template.render(**info)

Module mako.template:114 in render
<< declared by this template's internal rendering method are
   also pulled from the given *args, **data members."""
   return runtime._render(self, self.callable_, args, data)

   def render_unicode(self, *args, **data):
>> return runtime._render(self, self.callable_, args, data)

Module mako.runtime:287 in _render
<< context = Context(buf, **data)
   context._with_template = template
   _render_context(template, callable_, context, *args, **_kwargs_for_callable(callable_, data))
   return context.pop_buffer().getvalue()
>> _render_context(template, callable_, context, *args, **_kwargs_for_callable(callable_, data))

Module mako.runtime:304 in _render_context
<< # if main render method, call from the base of the inheritance stack
   (inherit, lclcontext) = _populate_self_namespace(context, tmpl)
   _exec_template(inherit, lclcontext, args=args, kwargs=kwargs)
   else:
   # otherwise, call the actual rendering method specified
>> _exec_template(inherit, lclcontext, args=args, kwargs=kwargs)

Module mako.runtime:337 in _exec_template
<< error_template.render_context(context, error=error)
   else:
   callable_(context, *args, **kwargs)
>> callable_(context, *args, **kwargs)

Module _reports_create_report_step2_mako:57 in render_body
<< context.write(filters.decode.utf8(urllib.unquote(str(c.period.end))))
   context.write(u' + ')
   context.write(filters.decode.utf8(urllib.unquote(c.user.firstName).decode('utf-8')[0].encode('utf-8')))
   context.write(filters.decode.utf8(urllib.unquote(str(c.user.secondName)[0:1])))
   context.write(u'</h3>\r\n <input type="hidden" name="works[]" value="')
>> context.write(filters.decode.utf8(urllib.unquote(c.user.firstName).decode('utf-8')[0].encode('utf-8')))

Module encodings.utf_8:16 in decode
<< def decode(input, errors='strict'):
   return codecs.utf_8_decode(input, errors, True)

   class IncrementalEncoder(codecs.IncrementalEncoder):
>> return codecs.utf_8_decode(input, errors, True)

<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode
characters in position 0-1: ordinal not in range(128)

Sep 2 '08 #5

sniipe

OK, I resolved this problem with:

${urllib.unquote(str(c.user.firstName)).decode('utf-8')[0]}

Could anyone explain to me why this code works?
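The thread ends without an answer, so what follows is only one plausible reading. It assumes c.user.firstName is a unicode object holding the percent-escaped text, which the thread never confirms. Under that assumption, Python 2's urllib.unquote would hand back a unicode object whose first two code points are U+00C5 and U+0081, and calling .decode('utf-8') on a unicode object makes Python 2 encode it to ASCII first, which would explain the 'ascii' codec error at position 0-1. Wrapping the value in str() turns it into a byte string, so unquote yields UTF-8 bytes and the decode succeeds. The Python 3 sketch below imitates both paths; the explicit encode("ascii") stands in for Python 2's implicit conversion.

```python
# Python 3 sketch imitating the two Python 2 code paths (assumptions above).
from urllib.parse import unquote, unquote_to_bytes

quoted = "%C5%81ukasz"

# Path 1 (assumed): unquoting text yields code points U+00C5, U+0081, ...
as_text = unquote(quoted, encoding="latin-1")
try:
    as_text.encode("ascii")   # stand-in for Python 2's implicit ASCII encode
    failure = None
except UnicodeEncodeError as exc:
    failure = exc             # first two characters are outside ASCII

# Path 2: with str() applied first, unquoting yields bytes that decode cleanly
first = unquote_to_bytes(quoted).decode("utf-8")[0]

print(repr(failure))
print(first)
```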

Sep 2 '08 #6
