How to read UTF-8 text files?

I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Are there any ways to read the UTF-8 text files as simple and
convenient as the way that we read ANSI text files? Thanks a lot~~

Apr 25 '06 #1
You should look for functions that take a parameter letting you choose
the encoding. Maybe the STL includes that kind of function; I'm not
familiar with it. Try it yourself.

Apr 25 '06 #2
Zephyre posted:
> I have some UTF-8 text files written in Chinese to be read. Now the
> only method that I know to read text from it is to use fopen()
> function. Thus, I must read the contents byte by byte, change the UTF-8
> characters to Unicode, store the characters into wchar_t variables. But
> I think this method is too complex and isn't elegant at all.
>
> Are there any ways to read the UTF-8 text files as simple and
> convenient as the way that we read ANSI text files? Thanks a lot~~


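I was writing a program just recently to convert between the different
encoding schemes for Unicode. I used std::bitset to hold the values.
Look up "ifstream". It's easy to use, as follows:

#include <bitset>
#include <fstream>

std::ifstream in("blah.txt", std::ios::binary);

char byte;
in.get(byte); // read one raw byte; "in >> octet" would parse '0'/'1' text instead
std::bitset<8> octet(static_cast<unsigned char>(byte));

and then when you're writing:

std::ofstream out("blah.txt");

std::bitset<32> thirtytwo;

out << thirtytwo; // writes 32 '0'/'1' characters, not four raw bytes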
-Tomás

Apr 25 '06 #3
Zephyre wrote:
> I have some UTF-8 text files written in Chinese to be read. Now the
> only method that I know to read text from it is to use fopen()

fopen() is the C method.
In C++ we have iostreams.

> I must read the contents byte by byte

You don't have to.
Just read everything at once.

> change the UTF-8
> characters to Unicode

Unicode isn't a character encoding by itself, only a character set.
You probably mean UCS-2 or UCS-4. Since you're using Windows terminology
I suppose you mean UCS-2, which is lossy (it cannot represent characters
outside the Basic Multilingual Plane).

Anyway, you can simply work with UTF-8; there's no need to convert to
something else.

> Are there any ways to read the UTF-8 text files as simple and
> convenient as the way that we read ANSI text files? Thanks a lot~~

Simply read them as if they were "ANSI" (Windows main locale) text files.
The only thing that changes is that a character may be multiple bytes.
If you really care about that being handled correctly, use a set of
functions or classes dedicated to Unicode handling, like ICU or
Glib::ustring, which acts just like a std::string.
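
A minimal sketch of the "just read the bytes" approach, assuming the file holds valid UTF-8. The helper names read_file_bytes and count_code_points are made up for illustration; the second one shows the one place where multi-byte characters matter, namely counting characters instead of bytes.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read an entire file into a std::string as raw bytes (no decoding).
// Returns an empty string if the file cannot be opened.
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Count code points in UTF-8 text by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumption: the input is valid UTF-8; no validation is done here.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The std::string holds the UTF-8 bytes unchanged, so size() reports bytes while count_code_points reports characters.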
Apr 25 '06 #5

Zephyre wrote:
> I have some UTF-8 text files written in Chinese to be read. Now the
> only method that I know to read text from it is to use fopen()
> function. Thus, I must read the contents byte by byte, change the UTF-8
> characters to Unicode, store the characters into wchar_t variables. But
> I think this method is too complex and isn't elegant at all.


Yep, with C++ iostreams the only thing you need is a UTF-8 "codecvt
facet."
You might have one in your std:: library implementation, you could
write one, or you could buy one (there's one in the Core library from
Dinkumware).

HTH,
Michiel Salters
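
For readers with a C++11 or later standard library, here is a minimal sketch of the facet approach, assuming std::codecvt_utf8 from <codecvt> is available. That header was deprecated in C++17, so treat this as an illustration of the idea, not a recommendation. The function name read_first_line_utf8 is made up for this example.

```cpp
#include <codecvt>  // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Read the first line of a UTF-8 encoded file as wide characters.
// Returns an empty string if the file cannot be opened.
std::wstring read_first_line_utf8(const std::string& path) {
    std::wifstream in;
    // Imbue before opening so the facet is in place for the first read.
    // The locale takes ownership of the facet pointer.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);
    std::wstring line;
    std::getline(in, line);
    return line;
}
```

Where wchar_t is 16 bits wide (Windows), this facet can only produce characters from the Basic Multilingual Plane.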

Apr 25 '06 #6
Mi*************@tomtom.com wrote:
> Zephyre wrote:
>> I have some UTF-8 text files written in Chinese to be read. Now the
>> only method that I know to read text from it is to use fopen()
>> function. Thus, I must read the contents byte by byte, change the UTF-8
>> characters to Unicode, store the characters into wchar_t variables. But
>> I think this method is too complex and isn't elegant at all.
>
> Yep, with C++ iostreams the only thing you need is a UTF-8 "codecvt
> facet."
> You might have one in your std:: library implementation, you could
> write one,
> or you could buy one (There's one in the Core library from Dinkumware)


There's an unsupported one hidden away in Boost. You just need to do
something like this to get it:

#define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
#define BOOST_UTF8_END_NAMESPACE }
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>
#include <fstream>
#include <locale>

//...
std::wifstream ifs;
// The locale takes ownership of the facet, so there is no matching delete.
std::locale utf8loc(std::locale(),
                    new mynamespace::utf8_codecvt_facet());
ifs.imbue(utf8loc); // imbue before opening the file
ifs.open(...);
//...

Tom
Apr 25 '06 #7
Tom Widmer wrote:
> #define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
> #define BOOST_UTF8_END_NAMESPACE }
> #define BOOST_UTF8_DECL
> #include <boost/detail/utf8_codecvt_facet.hpp>
>
> //...
> std::wifstream ifs;
> std::locale utf8loc(std::locale(),
>                     new mynamespace::utf8_codecvt_facet());
> ifs.imbue(utf8loc);
> ifs.open(...);


For those of us lost in iostream-style locales...

...then what? What code will behave differently because this stream is
imbued? Must I imbue std::strings and std::stringstreams, also, to store
UTF-8 in them?

(And to the original poster: Is this stuff answering your question, or did
you need to do something else with your text besides reading its data?)

--
Phlip
http://www.greencheese.us/ZeekLand <-- NOT a blog!!!
Apr 25 '06 #8
Phlip wrote:
> Tom Widmer wrote:
>
>> #define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
>> #define BOOST_UTF8_END_NAMESPACE }
>> #define BOOST_UTF8_DECL
>> #include <boost/detail/utf8_codecvt_facet.hpp>
>>
>> //...
>> std::wifstream ifs;
>> std::locale utf8loc(std::locale(),
>>                     new mynamespace::utf8_codecvt_facet());
>> ifs.imbue(utf8loc);
>> ifs.open(...);
>
> For those of us lost in iostream-style locales...
>
> ...then what? What code will behave differently because this stream is
> imbued?


The wchar_t's that are read off the stream will be converted from the
UTF-8 multibyte characters. In effect, the input file is UTF-8, but this
gives you a "view" of the file as UCS-2 (on Windows at least).

E.g.

int i;
ifs >> i; //reads a number

std::wstring ws;
std::getline(ifs, ws);
//ws will correctly contain any international chars

> Must I imbue std::strings and std::stringstreams, also, to store UTF-8 in them?


Well, it only applies to converting between wchars and raw bytes, and
that operation is most commonly performed with file (and network) IO.
So, for standard streams, it only applies to file streams, since other
streams don't perform any code conversion (e.g. wide string streams just
hold the characters in memory as wide characters, whereas wide file
streams have to convert between wide characters and raw bytes, which is
where codecvt comes in).
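
To make the byte-to-character conversion concrete, here is a hand-written sketch of the decoding step that a UTF-8 codecvt facet performs at the file boundary. It is not the Boost or standard library implementation; the function name decode_utf8 is made up for this example, and it targets char32_t so that every code point fits in one element.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into UTF-32 code points.
// Rejects bad lead bytes, truncated sequences and bad continuation bytes.
// For brevity it does not reject overlong forms or surrogate code points.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Once the text is decoded, it lives in memory as plain code points, which is why string streams need no facet of their own.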

Tom
Apr 25 '06 #9
