unicode text file

Koulbak

I have some Unicode (UTF-8) text files. I _tried_ to write a simple
program that reads one of them and writes it to the standard output but...
of course it doesn't work. Is there an easy way to do it? Thanks, K.

This is my program.

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main(){
    ifstream infile ("in.txt");
    string s;
    while (infile >> s) {
        cout << s;
    }
}

Jul 23 '05 #1

Mike Wahler

"Koulbak" <tu********@gmail.com> wrote in message
news:42**********@news.bluewin.ch...

I have some Unicode (UTF-8) text files. I _tried_ to write a simple program
that reads one of them and writes it to the standard output but... of course
it doesn't work. Is there an easy way to do it? Thanks, K.

This is my program.

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main(){
    ifstream infile ("in.txt");

You should check here that the file was opened successfully
before attempting to read from it.

    string s;
    while (infile >> s) {
        cout << s;
    }
}
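
A minimal sketch of the check suggested above. The helper name open_or_report is made up for illustration and is not part of the original program; it only relies on the standard behaviour that a failed open leaves the stream in a failed state.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: tries to open `path` for reading and reports a
// failure on stderr. Returns true when the stream is ready for reading.
bool open_or_report(std::ifstream& in, const std::string& path) {
    in.open(path.c_str());
    if (!in) {  // failbit is set when the file could not be opened
        std::cerr << "cannot open " << path << '\n';
        return false;
    }
    return true;
}
```

In main() this would be used as: ifstream infile; if (!open_or_report(infile, "in.txt")) return 1;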

Try using 'wifstream' and 'wcout'.

-Mike

Jul 23 '05 #2

Koulbak

Mike Wahler wrote:
[read unicode text file]

int main(){
ifstream infile ("in.txt");

You should check here that the file was opened successfully
before attempting to read from it.

In the real program of course I do it, but in my post I put only the
essential part of the question.

string s;
while (infile >> s) {
cout << s;
}
}

Try using 'wifstream' and 'wcout'.

1 Tried, it doesn't compile.

error C2679: binary '>>' : no operator found which takes a right-hand
operand of type 'std::string' (or there is no acceptable conversion)

I also added wstring and it compiles, but it doesn't work correctly: it
prints a lot of garbage.

2 I thought that C++ made it possible to do this in exactly the
standard way (avoiding special constructs such as wcout), maybe by setting
some library option. Is that not true at all?

Thanks a lot, K.

Jul 23 '05 #3

Ioannis Vranos

Koulbak wrote:

1 Tried, it doesn't compile.

error C2679: binary '>>' : no operator found which takes a right-hand
operand of type 'std::string' (or there is no acceptable conversion)

You should use wstring. A wchar_t string literal is prefixed with L. For example:

wstring s = L"Some string";

I also added wstring and it compiles, but it doesn't work correctly: it
prints a lot of garbage.

2 I thought that C++ made it possible to do this in exactly the
standard way (avoiding special constructs such as wcout), maybe by setting
some library option. Is that not true at all?

These *are* standard facilities. All string facilities come with their wchar_t equivalents
(including the facilities of the C-subset).

--
Ioannis Vranos

http://www23.brinkster.com/noicys

Jul 23 '05 #4

Koulbak

Ioannis Vranos wrote:

You should use wstring. [...]

I added wstring; it doesn't work.

2 I thought that C++ made it possible to do this in exactly the
standard way (avoiding special constructs such as wcout), maybe by setting
some library option. Is that not true at all?

These *are* standard facilities. All string facilities come with their
wchar_t equivalents (including the facilities of the C-subset).

Sorry, I was not clear at all. I would like to avoid the implementation
details as much as possible. I don't want to use explicitly Unicode
functions, but simply tell the compiler or the library that my
character encoding is Unicode and then read a file exactly in the usual way.

I would like to avoid learning a new set of functions to read and
manipulate Unicode characters, Unicode strings and so on, if that is
possible.

Thanks, K.

Jul 23 '05 #5

Rapscallion

Koulbak wrote:

string s;
while (infile >> s) {
cout << s;
}
}

1 Tried, it doesn't compile.

error C2679: binary '>>' : no operator found which takes a right-hand
operand of type 'std::string' (or there is no acceptable conversion)

You have not included all the necessary header files, or have included the
wrong ones (or have the wrong files in your include path).

I also added wstring and it compiles, but it doesn't work correctly: it
prints a lot of garbage.

wstring is not appropriate for UTF-8.

R.C.

Jul 23 '05 #6

Koulbak

[....]

I also added wstring and it compiles, but it doesn't work correctly: it
prints a lot of garbage.

wstring is not appropriate for UTF-8.

OK, that's the problem. My encoding is UTF-8.

Any solution?
Thanks, K.

Jul 23 '05 #7

Rapscallion

Koulbak wrote:

wstring is not appropriate for UTF-8.

OK, that's the problem. My encoding is UTF-8.

Any solution?

Maybe I've been wrong. See e.g.
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www-106.ibm.com/developerwork.../l-linuni.html
or search for other 'UTF-8' resources.

Jul 23 '05 #8

Old Wolf

Koulbak wrote:

I have some Unicode (UTF-8) text files. I _tried_ to write a
simple program that reads one of them and writes it to the
standard output but... of course it doesn't work. Is there
an easy way to do it? Thanks, K.

This is my program.

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main(){
    ifstream infile ("in.txt");
    string s;
    while (infile >> s) {
        cout << s;
    }
}

istream >> string skips any leading whitespace and newlines, and then
reads one word (up to the next whitespace), so your output loses all
the spacing between words.
To do line-by-line reading, you would go:

while (getline(infile, s))
    cout << s << '\n';   // getline discards the newline, so write it back

Splitting on newlines is safe for UTF-8 files, because the newline byte
(0x0A) never occurs inside a multi-byte UTF-8 sequence. It would not be
safe for an encoding such as UTF-16.

To output the whole file at once:

cout << infile.rdbuf();

I'm assuming you want to output UTF-8 on stdout (Standard
C++ offers no facilities for converting UTF-8 to a stream
of wide characters). Can you clarify your intention?

The best thing to do (IMHO) would be to open the file in
binary mode, and also force std::cout into binary mode. (This
would require system-specific code.) Then no translation
will occur and it will work correctly.

If you can't force cout to binary, then it *might* work to
open the input in text mode too, and hope that the input
conversions match the output conversions!

Jul 23 '05 #9

Ioannis Vranos

Koulbak wrote:

Sorry, I was not clear at all. I would like to avoid the implementation
details as much as possible. I don't want to use explicitly Unicode
functions, but simply tell the compiler or the library that my
character encoding is Unicode and then read a file exactly in the usual way.

I would like to avoid learning a new set of functions to read and
manipulate Unicode characters, Unicode strings and so on, if that is
possible.

wchar_t can represent the largest character set supported by a system; char mainly
represents a byte and single-byte character sets. If you have to deal with various
character sets, you had better stick to wchar_t and the corresponding facilities for it
(which are the same as the plain char facilities, with an additional w in their name).

--
Ioannis Vranos

http://www23.brinkster.com/noicys

Jul 23 '05 #10

Ioannis Vranos

Rapscallion wrote:

wstring is not appropriate for UTF-8.

Why?
--
Ioannis Vranos

http://www23.brinkster.com/noicys

Jul 23 '05 #11

Ioannis Vranos

Old Wolf wrote:

The best thing to do (IMHO) would be to open the file in
binary mode, and also force std::cout into binary mode. (This
would require system-specific code.) Then no translation
will occur and it will work correctly.

If you can't force cout to binary, then it *might* work to
open the input in text mode too, and hope that the input
conversions match the output conversions!

What is wrong with the use of wcout?

--
Ioannis Vranos

http://www23.brinkster.com/noicys

Jul 23 '05 #12

Serge Skorokhodov (216716244)

>> The best thing to do (IMHO) would be to open the file in
>> binary mode, and also force std::cout into binary mode.
>> (This would require system-specific code.) Then no
>> translation will occur and it will work correctly.
>>
>> If you can't force cout to binary, then it *might* work to
>> open the input in text mode too, and hope that the input
>> conversions match the output conversions!
>
> What is wrong with the use of wcout?

UTF-8 is a stream of 1-byte chars, with characters beyond ASCII
coded as multi-byte sequences. I guess that you need to read such
a stream as a char or binary stream and then decode each line
to UTF-16 Unicode with an appropriate routine, say the
MultiByteToWideChar and WideCharToMultiByte functions on the Win32
platform. Other APIs exist on *nix platforms, in iconv etc.
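
To illustrate what "multi-byte sequences" means in practice, here is a portable, hand-written sketch that decodes UTF-8 bytes into Unicode code points. It is not one of the platform routines named above, and it produces code points rather than UTF-16. The function name decode_utf8 is invented for this example, and the validation is deliberately simplified.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decodes a UTF-8 byte string into Unicode code points.
// Throws std::runtime_error on truncated or malformed sequences.
// Simplification: overlong forms and surrogate values are not rejected.
std::vector<unsigned long> decode_utf8(const std::string& bytes) {
    std::vector<unsigned long> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // each continuation byte adds 6 bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

The input can come straight from getline on an ordinary ifstream, since the bytes of a line are handed over unchanged.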

--
Serge

Jul 23 '05 #13

phil_gg04

> I have some Unicode (UTF-8) text files. I _tried_ to write a simple
> program that reads one of them and writes it to the standard output
> but... of course it doesn't work

What character set do you want to use when writing to standard output?

If you want it to write using a character set other than the UTF-8 that
it read in, you need to do some conversion. You have to do this
explicitly. It will not happen automatically.

Assuming that your program is going to actually do something with the
text, rather than just reading it in and then writing it out again, you
need to decide what character set you want to use internally. I mostly
use UTF-8 internally and for input/output, so there is rarely any
conversion. I store this in chars. This is on Unix, and I'm in the
"western hemisphere". I understand that Windows programmers tend to
use UTF-16 quite often and that would also be sensible for non-European
languages. For that you should use wchars. You should not be using
ASCII for any new applications.

To actually perform the conversion you need something like the iconv
library. This is supported just about everywhere, but you'll want a
C++ wrapper for it to make it more palatable.
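
A very small wrapper might look like the sketch below. It assumes a POSIX-style iconv, as shipped with glibc. Some platforms declare the input pointer as const char** or need an extra library at link time. The function name convert_encoding is made up for this example.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: converts `input` from encoding `from` to encoding
// `to` and returns the converted bytes. Throws std::runtime_error on failure.
// Simplification: stateful target encodings would also need a final flush call.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to) {
    std::string out;
    if (input.empty()) return out;

    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string in = input;  // iconv wants a non-const input buffer
    char* src = &in[0];
    std::size_t src_left = in.size();
    char buf[256];

    while (src_left > 0) {
        char* dst = buf;
        std::size_t dst_left = sizeof buf;
        std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        int err = errno;  // save before anything else can change it
        out.append(buf, sizeof buf - dst_left);
        // E2BIG only means the output buffer filled up; loop and continue.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv conversion failed");
        }
    }
    iconv_close(cd);
    return out;
}
```

For example, convert_encoding(line, "UTF-8", "UTF-16LE") would turn a UTF-8 line into UTF-16 bytes, provided the installed iconv knows both encoding names.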

Regards, Phil.

Jul 23 '05 #14

Koulbak

ph*******@treefic.com wrote:

I have some Unicode (UTF-8) text files. I _tried_ to write a simple
program that reads one of them and writes it to the standard output
but... of course it doesn't work

What character set do you want to use when writing to standard output? [..]
If you want it to write using a character set other than the UTF-8 that
it read in, you need to do some conversion. You have to do this
explicitly. It will not happen automatically.

Thanks! I think I perfectly understood the problem.

My program was only an exercise, but the goal was to learn how to "set" the
library (?) to read Unicode (or possibly another encoding), manipulate
it using the string functionality of the standard library and then
write it back in a particular encoding to a file or to the standard output.

Assuming that your program is going to actually do
something with the text, rather than just reading
it in and then writing it out again, you need to
decide what character set you want to use internally.

Is it really necessary that I specify the internal encoding? At my level
(scholastic level) I have no performance problems, so if a default
encoding exists, that is fine for me.

So I would like to specify the input file encoding and the output file
encoding, then use my program, for example:

string s;
while (infile >> s) {
    if (s == "hello") {
        ;  // delete "hello" from input
    } else {
        cout << s;
    }
}

I don't want, if possible, to specify wstring, wcout and so on because
1 I don't want to change the program the day I need a different encoding
2 The program written without wstring, wcout etc. is more natural and
general and doesn't touch the implementation level

[Old Wolf wrote] I'm assuming you want to output UTF-8 on stdout (Standard
C++ offers no facilities for converting UTF-8 to a stream
of wide characters). Can you clarify your intention?

I hope it's clearer now.

Thanks to all for the help. K.

Jul 23 '05 #15
