UTF8 to Unicode conversion

I only work in Perl occasionally, and have been searching for a
solution for a conversion, and everything I found seems much too
complex.

All I need to do is take a simple text file [that had been created
from a Perl script] and copy it, however some specific lines are in
fact in UTF8 as printed garbagy characters and they need to be
converted to Unicode, so that the new text file can be imported into a
desktop program and into some Word documents.

For the moment [if it makes it easier] I would be happy to get a
solution for most European languages, and could skip things like
Russian and Chinese till later.

Español - for example should convert to Español
Jul 19 '05 #1
Spamtrap wrote:
All I need to do is take a simple text file [that had been created
from a Perl script] and copy it, however some specific lines are in
fact in UTF8 as printed garbagy characters and they need to be
converted to Unicode,
Sorry, but this doesn't make sense. Do you know what the U in UTF stands
for? You already have Unicode!
so that the new text file can be imported into a
desktop program and into some Word documents.
When you mention Word I'm guessing that you are using some version of
Windows? Are you still running Windows 98 or so? I'm asking because any
somewhat newer Microsoft OS as well as Word can handle Unicode (and thus
UTF8) just fine. Actually Microsoft is one of the major proponents of
Unicode.
For the moment [if it makes it easier] I would be happy to get a
solution for most European languages, and could skip things like
Russian and Chinese till later.

Español - for example
If this is really what you see when opening the file in your program,
then it is far more likely that the program believes the file is in
ISO-8859-1 or Windows-1252 ("ANSI"). If the program assumed UTF16 or UTF32,
the text would be displayed vastly differently.
should convert to Español


But this is not encoded in Unicode (in whatever transfer format) but in
ISO-8859-1 as your header clearly says:
Content-Type: text/plain; charset=ISO-8859-1

While you seem to be quite confused, I nevertheless suggest looking at the
Text::Iconv module.
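
A minimal sketch of the kind of conversion meant here. Text::Iconv is a CPAN module and may not be installed, so this sketch swaps in `from_to` from the core Encode module instead; it is a stand-in, not the Text::Iconv interface. The helper name `utf8_to_latin1` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical helper: takes UTF-8 octets, returns ISO-8859-1 octets.
sub utf8_to_latin1 {
    my ($octets) = @_;
    # from_to() converts its first argument in place.
    from_to($octets, 'UTF-8', 'ISO-8859-1');
    return $octets;
}

my $utf8   = "Espa\xC3\xB1ol";        # "Español" as UTF-8 bytes
my $latin1 = utf8_to_latin1($utf8);   # "Español" as ISO-8859-1 bytes
printf "%d bytes in, %d bytes out\n", length($utf8), length($latin1);
```

The two-byte sequence C3 B1 collapses to the single byte F1, which is why the output is one byte shorter than the input.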

jue
Jul 19 '05 #2
Spamtrap wrote:
All I need to do is take a simple text file [that had been created
from a Perl script] and copy it, however some specific lines are in
fact in UTF8 as printed garbagy characters and they need to be
converted to Unicode, so that the new text file can be imported into a
desktop program and into some Word documents.
UTF8 *is* Unicode.
Some programs that deal with UTF16 or ISO-8859-1 need to be told that
the file is encoded in UTF8.
Español - for example should convert to Español


That's what happens when a file is in UTF8 but the program reading
the file thinks it is ISO-8859-1. You'll need to either mark the file
in some way so that programs recognize it as UTF8, or use an option
in the program to force it to process the input as UTF8.
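
The effect described above can be reproduced in a few lines. This is only an illustrative sketch using the core Encode module; the variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text    = "Espa\x{F1}ol";                 # the intended text, "Español"
my $bytes   = encode('UTF-8', $text);         # the bytes stored in the file
my $misread = decode('ISO-8859-1', $bytes);   # what an ISO-8859-1 reader sees
my $correct = decode('UTF-8', $bytes);        # what a UTF8-aware reader sees

binmode STDOUT, ':encoding(UTF-8)';
print "read as ISO-8859-1: $misread\n";
print "read as UTF8:       $correct\n";
```

The file itself is fine in both cases; only the reader's assumption about the encoding differs.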

-Joe
Jul 19 '05 #3
Spamtrap wrote:
I only work in Perl occasionally, and have been searching for a
solution for a conversion, and everything I found seems much too
complex.

All I need to do is take a simple text file [that had been created
from a Perl script] and copy it, however some specific lines are in
fact in UTF8 as printed garbagy characters and they need to be
converted to Unicode,
What do you mean by "converted to Unicode"?

Do you perhaps mean some other specific encoding of Unicode? If so
which one?
so that the new text file can be imported into a
desktop program and into some Word documents.


Ah, sounds like you may be using Microsoft products. You probably
want to convert utf8 into utf16 (can't recall if MS uses BE or LE but
any utf16 implementation is supposed to autodetect anyhow).
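
For what it's worth, a rough sketch of a utf8-to-utf16 conversion with the core Encode module. It assumes the target wants little-endian UTF16 with a leading byte order mark; that choice is an assumption to check against the importing program, and `utf8_to_utf16le` is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 octets in, UTF-16LE octets (with BOM) out.
sub utf8_to_utf16le {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    # U+FEFF written first becomes the byte order mark FF FE.
    return encode('UTF-16LE', "\x{FEFF}" . $chars);
}

my $utf16 = utf8_to_utf16le("Espa\xC3\xB1ol");
print unpack('H*', $utf16), "\n";   # hex dump of the result
```

Using `'UTF-16'` instead of `'UTF-16LE'` lets Encode add the mark itself, but it then writes big-endian.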

This has nothing to do with Perl, as such.

I just tried "convert utf8 utf16" in Google and found lots of stuff.

This newsgroup does not exist (see FAQ). Please do not start threads
here.
Jul 19 '05 #4
Ok let me try to redefine the problem.

I have a text file, [ in Windows 98], which by definition is in plain
256 character ASCII. When I view it I see Español - which I assumed
was originally UTF8 - but I want to see Español [which of course
could exist in ASCII, without even having to go to Unicode or anything
fancy] so the encoding is using the two characters ñ for the single
character ñ

The data from that text file is being imported into a database [this
part is not Perl programming]. When I display the data, it displays
Español not Español

Then a program will manipulate that database and create a Microsoft
Word document [or possibly an Adobe PDF document] and I assume the
text will continue to be incorrect. Therefore I want to use Perl to
fix that text data before I do the other processing.

I also have things like Субъе - which is supposed to be Russian
and judeţul which is Romanian.

It is possible I might have to maintain 2 copies of the strings in the
database tables, one as an ASCII close match for display purposes,
[since the database will not support UNICODE directly] and one as
actual UNICODE for passing into Word.
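
One possible way to derive the "ASCII close match" copy, sketched with the core Unicode::Normalize module: decompose accented letters, drop the combining marks, and replace whatever is still non-ASCII with "?". The name `ascii_approximation` is hypothetical, and the result is only an approximation (Cyrillic, for instance, has no ASCII equivalent here).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: returns a plain-ASCII approximation of $text.
sub ascii_approximation {
    my ($text) = @_;
    my $decomposed = NFD($text);          # "ñ" becomes "n" + combining tilde
    $decomposed =~ s/\p{Mn}//g;           # drop the combining marks
    $decomposed =~ s/[^\x00-\x7F]/?/g;    # anything still non-ASCII becomes "?"
    return $decomposed;
}

print ascii_approximation("Espa\x{F1}ol"), "\n";
```

The input must already be decoded character data, not raw UTF8 bytes.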

Jul 19 '05 #5
Spamtrap wrote:
I have a text file, [ in Windows 98], which by definition is in plain
256 character ASCII.
Impossible. ASCII by its very definition has only 128 characters (codes 0 to 127).
When I view it I see Español - which I assumed
was originally UTF8 -
Yep, this sounds about right.
but I want to see Español [which of course
could exist in ASCII,
No, it cannot because ASCII contains only English characters and does not
contain any extended characters.
without even having to go to Unicode or anything
fancy]
But UTF-8 which apparently is the current encoding of your text _is_ already
Unicode.
so the encoding is using the two characters ñ for the single
character ñ

The data from that text file is being imported into a database [this
part is not Perl programming]. When I display the data, it displays
Español not Español
That simply means one of two things:
- either the program you are using to display the data does not _know_ how
to handle UTF-8. If this is the case, then you should use a program that
actually understands UTF-8.
- or the program does not realize that the file is in UTF-8 and therefore
uses whatever default encoding is selected. In that case simply make the
program recognize the file as UTF-8 encoded, either by changing some option
in the program or by setting the byte order mark in the file or similar
means.
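
Setting the byte order mark, as mentioned in the second option above, amounts to putting the three bytes EF BB BF at the start of the file. A minimal sketch; `add_utf8_bom` is a made-up helper name, and whether a given program honours the mark has to be tested.

```perl
use strict;
use warnings;

my $UTF8_BOM = "\xEF\xBB\xBF";   # U+FEFF encoded as UTF-8

# Hypothetical helper: prepend the mark unless the data already has one.
sub add_utf8_bom {
    my ($octets) = @_;
    return $octets if index($octets, $UTF8_BOM) == 0;
    return $UTF8_BOM . $octets;
}

my $marked = add_utf8_bom("Espa\xC3\xB1ol\n");
printf "%d bytes with the mark\n", length($marked);
```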
Then a program will manipulate that database and create a Microsoft
Word document [or possibly an Adobe PDF document] and I assume the
text will continue to be incorrect. Therefore I want to use Perl to
fix that text data before I do the other processing.
See Text::Iconv if you really want to convert text back and forth.
I also have things like Субъе - which is supposed to be Russian
and judeţul which is Romanian.
Then you _really_ should keep your text as Unicode because Cyrillic
characters are not part of Windows-1252 or ISO-Latin-1. Which means you
cannot represent Russian text and Spanish text in the same file with
either of those encodings.
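
This can be checked mechanically. The sketch below asks the core Encode module whether a string can be encoded without loss; `fits_in` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Hypothetical helper: true if every character of $text exists in $encoding.
sub fits_in {
    my ($encoding, $text) = @_;
    my $copy = $text;   # encode() may modify its input when a check is given
    return eval { encode($encoding, $copy, FB_CROAK); 1 } ? 1 : 0;
}

my $spanish = "Espa\x{F1}ol";                           # "Español"
my $russian = "\x{421}\x{443}\x{431}\x{44A}\x{435}";    # "Субъе"

for my $enc ('ISO-8859-1', 'cp1252', 'UTF-8') {
    printf "%-10s Spanish: %d  Russian: %d\n",
        $enc, fits_in($enc, $spanish), fits_in($enc, $russian);
}
```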
It is possible I might have to maintain 2 copies of the strings in the
database tables, one as an ASCII close match for display purposes,
There are neither Cyrillic nor extended characters in ASCII.
[since the database will not support UNICODE directly] and one as
actual UNICODE for passing into Word.


Then change the database. This is 2004, not 1984. A database that today
cannot handle arbitrary international text is not worth its money, even if
it's free.

jue
Jul 19 '05 #6
Spamtrap wrote:
Ok let me try to redefine the problem.

I have a text file, [ in Windows 98], which by definition is in plain
256 character ASCII. When I view it I see Español - which I assumed
was originally UTF8 - but I want to see Español [which of course
could exist in ASCII, without even having to go to Unicode or anything
fancy] so the encoding is using the two characters ñ for the single
character ñ
ASCII is only 128 characters. Character codes 128 to 255 can be
1) ISO-8859-1 (the Latin-1 alphabet), for western European languages.
2) Some Microsoft CP (code page). There are many.
3) Special bit patterns used in the UTF8 encoding scheme.

For Español, all you need is a UTF8-to-ISO8859 conversion utility.
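
To tell the three cases listed above apart, one rough heuristic is to test whether the bytes decode as strict UTF8. The sketch below uses the core Encode module; `classify_bytes` and its return values are made up for illustration, and an "8-bit" result cannot say which code page was actually used.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns 'ascii', 'utf-8' or '8-bit'.
sub classify_bytes {
    my ($octets) = @_;
    return 'ascii' unless $octets =~ /[\x80-\xFF]/;   # no codes 128 to 255
    # Strict decoding dies on byte sequences that are not well-formed UTF-8.
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK); 1 };
    return $ok ? 'utf-8' : '8-bit';
}

print classify_bytes("Espa\xC3\xB1ol"), "\n";   # UTF-8 bytes
print classify_bytes("Espa\xF1ol"), "\n";       # ISO-8859-1 bytes
```

Short 8-bit strings can occasionally happen to be valid UTF8, so treat the answer as a hint.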
The data from that text file is being imported into a database [this
part is not Perl programming]. When I display the data, it displays
Español not Español
That means that whatever program you are using to display the data
does not understand UTF8. There are terminal emulators and command
consoles that do understand UTF8.
Then a program will manipulate that database and create a Microsoft
Word document [or possibly an Adobe PDF document] and I assume the
text will continue to be incorrect. Therefore I want to use Perl to
fix that text data before I do the other processing.
You could try playing around with
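# Read the input as UTF8; write it back out with CRLF line endings.
open IN,  '<:utf8', $input_file  or die "Cannot read $input_file: $!";
open OUT, '>:crlf', $output_file or die "Cannot write $output_file: $!";
print OUT <IN>;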
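
A slightly fuller sketch of the same idea, with the output encoding named explicitly instead of relying on defaults. The sub name `recode_file` and the choice of ISO-8859-1 as the target are assumptions for illustration; characters that do not exist in ISO-8859-1 (the Russian text, for example) will not survive this.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a UTF-8 text file to an ISO-8859-1 text file.
sub recode_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:encoding(UTF-8)',      $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(ISO-8859-1)', $out_path
        or die "Cannot write $out_path: $!";
    print {$out} $_ while <$in>;   # layers decode on read, encode on write
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Example call (file names are placeholders):
# recode_file('input_utf8.txt', 'output_latin1.txt');
```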
I also have things like Субъе - which is supposed to be Russian
and judeţul which is Romanian.
Russian characters simply cannot be displayed in ASCII or ISO-8859-1.
ISO-8859-5 has Cyrillic, but not western European accented characters.
Read http://czyborra.com/charsets/iso8859.html (or Google's cache).
It is possible I might have to maintain 2 copies of the strings in the
database tables, one as an ASCII close match for display purposes,
[since the database will not support UNICODE directly] and one as
actual UNICODE for passing into Word.


The major databases do support Unicode directly. Often it is as simple
as exporting the database to a flat file, defining a new database
with UTF8 enabled, and importing the data. You will have to ask the
DBA to perform this operation.
-Joe
Jul 19 '05 #7
