How to read UTF-8 text files?

P: n/a
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Are there any ways to read the UTF-8 text files as simple and
convenient as the way that we read ANSI text files? Thanks a lot~~

Apr 25 '06 #1
8 Replies


P: n/a
You should look for functions that take a parameter letting you choose
the encoding.
Maybe the STL includes that kind of function; I'm not familiar with it.
Try it yourself.

Apr 25 '06 #2

P: n/a
Zephyre posted:
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Are there any ways to read the UTF-8 text files as simple and
convenient as the way that we read ANSI text files? Thanks a lot~~


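I was writing a program just recently to convert between the different
encoding schemes for Unicode. I used std::bitset to read and write the
values. Look up "ifstream". It's easy to use, as follows:

#include <bitset>
#include <fstream>

std::ifstream in("blah.txt");

std::bitset<8> octet;

// Note: operator>> for std::bitset parses the text characters '0' and '1';
// it does not read a raw byte. Use in.get() to read raw bytes.
in >> octet;
and then when you're writing:

std::ofstream out("blah.txt");

std::bitset<32> thirtytwo;

// Likewise, operator<< writes 32 '0'/'1' characters, not four raw bytes.
out << thirtytwo;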
-Tomás

Apr 25 '06 #3

P: n/a
Zephyre wrote :
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
fopen() is the C method.
In C++ we have iostreams.

I must read the contents byte by byte
You don't have to.
Just read everything at once.

change the UTF-8
characters to Unicode
Unicode isn't a character encoding by itself, only a character set.
You probably mean UCS-2 or UCS-4. Since you're using Windows terminology
I suppose you mean UCS-2, which is lossy: it cannot represent code points
outside the Basic Multilingual Plane.

Anyway, you can simply work with UTF-8; there is no need to convert to
something else.

Are there any ways to read the UTF-8 text files as simple and
convenient as the way that we read ANSI text files? Thanks a lot~~


Simply read them as if they were "ANSI" (Windows main locale) text files.
The only thing that changes is that a character may span multiple bytes.
If you really care about that being handled correctly, use a set of
functions or classes dedicated to Unicode handling, such as ICU or
Glib::ustring, which acts much like a std::string.
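
A minimal sketch of the "just read the bytes" approach described above, assuming the file is well-formed UTF-8. The helper names read_file_bytes and count_code_points are hypothetical, chosen only for illustration. It shows the one practical difference from single-byte text: the byte count and the character count no longer match.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file into a std::string without any conversion.
// Binary mode keeps the bytes exactly as they are stored on disk.
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Count code points in well-formed UTF-8 by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Small self-check: the two Chinese characters U+4E2D U+6587
// occupy six bytes in UTF-8.
inline void check_count() {
    const std::string s = "\xE4\xB8\xAD\xE6\x96\x87";
    assert(s.size() == 6);
    assert(count_code_points(s) == 2);
}
```

Searching, concatenating, and writing the string back out work unchanged on the raw bytes; only indexing by character and measuring length need this kind of care.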
Apr 25 '06 #5

P: n/a

Zephyre wrote:
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.


Yep, with C++ iostreams the only thing you need is a UTF-8 "codecvt
facet."
You might have one in your std:: library implementation, you could
write one, or you could buy one (there's one in the Core library from
Dinkumware).

HTH,
Michiel Salters
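
A sketch of the codecvt-facet approach using std::codecvt_utf8 from the <codecvt> header. This is an assumption beyond the thread: that header arrived with C++11, after this discussion, and has been deprecated since C++17, so treat it as one possible facet rather than the one meant above. The function name read_first_line_utf8 is hypothetical.

```cpp
#include <cassert>
#include <codecvt>   // std::codecvt_utf8 (C++11; deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Read the first line of a UTF-8 file as wide characters.
std::wstring read_first_line_utf8(const char* path) {
    assert(path != nullptr);
    std::wifstream in;
    // The locale takes ownership of the facet, so the raw `new` does not leak.
    // Imbue before opening so the conversion applies from the first byte.
    in.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    in.open(path);
    std::wstring line;
    std::getline(in, line);   // bytes are converted to wchar_t as they are read
    return line;
}
```

Where wchar_t is 16 bits wide (as on Windows), this facet yields UCS-2 and so cannot represent characters outside the Basic Multilingual Plane.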

Apr 25 '06 #6

P: n/a
Mi*************@tomtom.com wrote:
Zephyre wrote:
I have some UTF-8 text files written in Chinese to be read. Now the
only method that I know to read text from it is to use fopen()
function. Thus, I must read the contents byte by byte, change the UTF-8
characters to Unicode, store the characters into wchar_t variables. But
I think this method is too complex and isn't elegant at all.

Yep, with C++ iostreams the only thing you need is a UTF-8 "codecvt
facet."
You might have one in your std:: library implementation, you could
write one, or you could buy one (there's one in the Core library from
Dinkumware).


There's an unsupported one hidden away in Boost. You just need to do
something like this to get it:

#define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
#define BOOST_UTF8_END_NAMESPACE }
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>

//...
std::wifstream ifs;
std::locale utf8loc(std::locale(),
                    new mynamespace::utf8_codecvt_facet());
ifs.imbue(utf8loc);
ifs.open(...);
//...

Tom
Apr 25 '06 #7

P: n/a
Tom Widmer wrote:
#define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
#define BOOST_UTF8_END_NAMESPACE }
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>

//...
std::wifstream ifs;
std::locale utf8loc(std::locale(),
                    new mynamespace::utf8_codecvt_facet());
ifs.imbue(utf8loc);
ifs.open(...);


For those of us lost in iostream-style locales...

...then what? What code will behave differently because this stream is
imbued? Must I imbue std::strings and std::stringstreams, also, to store
UTF-8 in them?

(And to the original poster: Is this stuff answering your question, or did
you need to do something else with your text besides reading its data?)

--
Phlip
http://www.greencheese.us/ZeekLand <-- NOT a blog!!!
Apr 25 '06 #8

P: n/a
Phlip wrote:
Tom Widmer wrote:

#define BOOST_UTF8_BEGIN_NAMESPACE namespace mynamespace {
#define BOOST_UTF8_END_NAMESPACE }
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>

//...
std::wifstream ifs;
std::locale utf8loc(std::locale(),
                    new mynamespace::utf8_codecvt_facet());
ifs.imbue(utf8loc);
ifs.open(...);

For those of us lost in iostream-style locales...

...then what? What code will behave differently because this stream is
imbued?


The wchar_t's that are read off the stream will be converted from the
UTF-8 multibyte characters. In effect, the input file is UTF-8, but this
gives you a "view" of the file as UCS-2 (on Windows at least).

E.g.

int i;
ifs >> i; //reads a number

std::wstring ws;
std::getline(ifs, ws);
//ws will correctly contain any international chars

Must I imbue std::strings and std::stringstreams, also, to store UTF-8 in them?


Well, it only applies to converting between wchars and raw bytes, and
that operation is most commonly performed with file (and network) IO.
So, for standard streams, it only applies to file streams, since other
streams don't perform any code conversion (e.g. wide string streams just
hold the characters in memory as wide characters, whereas wide file
streams have to convert between wide characters and raw bytes, which is
where codecvt comes in).

Tom
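
To make the byte-to-character conversion concrete, here is a hand-written sketch of roughly what a UTF-8 codecvt facet does for each character on input. It is an illustration, not the Boost or Dinkumware implementation. The name decode_utf8 is hypothetical, the output is UTF-32 rather than wchar_t to avoid platform differences, and C++11 is assumed for char32_t.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points. Malformed or truncated
// sequences are replaced with U+FFFD. Overlong forms and surrogate
// code points are not rejected, so this is not a validating decoder.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(char32_t(0xFFFD)); ++i; continue; }  // stray byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }        // truncated
            unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }       // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}

// Small self-check: 'A' followed by U+4E2D and U+6587.
inline void check_decode() {
    const std::u32string r = decode_utf8("A\xE4\xB8\xAD\xE6\x96\x87");
    assert(r.size() == 3);
    assert(r[0] == U'A' && r[1] == 0x4E2D && r[2] == 0x6587);
}
```

A wide file stream imbued with a UTF-8 facet performs this kind of conversion inside its buffer, which is why a wide string stream, holding wide characters in memory already, never needs it.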
Apr 25 '06 #9
