Reading text file containing accented vowels

P: n/a
I'm trying to process a collection of text files, reading word by word. The
program hangs whenever it encounters a word with an accented letter
(like rôle or passé), i.e. something that's not a "char" with an ASCII code
in 0..127.

I've searched the ANSI C++ standard, the internet and various textbooks,
but can't see how to work around this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion
that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful
if someone could spare the time to enlighten me as to how to read in and
process these strings.

Thanks

Phil
Jul 22 '05 #1
8 Replies


P: n/a

"Phil Slater" <ph*********@amsjv.com> wrote in message
news:40********@baen1673807.greenlnk.net...
I'm trying to process a collection of text files, reading word by word. The
program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful if someone could spare the time to enlighten me as to how to read in and
process these strings.

Thanks

Phil

It would help greatly if we could see the code that's doing the reading now.
There's nothing special about "accented vowels" that would prevent reading
them, except that their values are not in the range of 0..127. If you read
unsigned char values (instead of char), then you can read anything in the
range 0..255. (Maybe that's the problem?) Using a stream and reading into
a string should work, and then you can parse each line word-by-word. But
again, I can't tell where your code *might* be stuck without seeing the
code.

-Howard

Jul 22 '05 #2

P: n/a
Phil Slater wrote:
I'm trying to process a collection of text files, reading word by word. The
program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion
that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful
if someone could spare the time to enlighten me as to how to read in and
process these strings.


It's possible that your accented characters are from the "extended ASCII"
part of the code table. Try reading the words using _unsigned_char_ type.

V
Jul 22 '05 #3

P: n/a

"Phil Slater" <ph*********@amsjv.com> wrote in message
news:40********@baen1673807.greenlnk.net...
I'm trying to process a collection of text files, reading word by word. The program run hangs whenever it encounters a word with an accented letter
(like rôle or passé) - ie something that's not a "char" with an ASCII code
in 0..127

I've searched the ANSI C++ standard, the internet and various text books,
but can't see how to workaround this one. I've tried wchar_t and wstring
without success.

But rather than spending lots of time on trial and error, I have a suspicion that there *must* be a straightforward way of doing such an obviously
necessary thing. Sorry to ask such a trivial question - but I'd be grateful if someone could spare the time to enlighten me as to how to read in and
process these strings.


I would guess that somewhere you are assuming that char values are
positive, for instance by using a char variable as the index of an array.
That is not necessarily true of non-ASCII characters, which can have negative
values when plain char is signed (this depends on your implementation). Casts
to unsigned char at appropriate places in your code might solve this.
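
A minimal sketch of the kind of cast meant here, assuming the text is in a single-byte encoding such as ISO-8859-1 (where á is the byte 0xE1). The helper name byte_histogram is hypothetical and only for illustration; it is not taken from the original poster's code, which was not shown:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Count how often each byte value occurs in a word.
// Hypothetical helper for illustration only.
std::vector<int> byte_histogram(const std::string& word)
{
    std::vector<int> counts(256, 0);
    for (std::size_t i = 0; i < word.size(); ++i)
    {
        // Plain char may be signed, so a byte such as 0xE1 can hold a
        // negative value. Converting to unsigned char first guarantees
        // an index in the range 0..255.
        unsigned char index = static_cast<unsigned char>(word[i]);
        ++counts[index];
    }
    return counts;
}
```

Using word[i] directly as the index would read outside the vector on implementations where plain char is signed. The same cast is the usual precaution before passing a char to the <cctype> classification functions.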

I would like to be more specific but you forgot to include any code at all
in your post.

john
Jul 22 '05 #4

P: n/a
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
string word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
.... ad infinitum

Looks like the stream goes into a fail state when it hits the á

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

typedef basic_string<unsigned char> ustring;
int main(void)
{
ustring word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?

Thanks for your help.

Phil
Jul 22 '05 #5

P: n/a

"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
string word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

It's not relevant to your problem, but your eof test is incorrect. eof() does
not tell you when you are at the end of the file. It is only reliably true
*after* you have attempted to read past the end of the file. In other words,
you should write

for (;;)
{
f >> word;
if (!f) // stop at end of file or on any other read failure
break;
cout << ...
}

or more simply

while (f >> word)
cout << ...
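
A minimal sketch of the second form, assuming the words can come from any std::istream. The helper name read_words is hypothetical and not from the original post:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collect whitespace-delimited words from any input stream.
// read_words is a hypothetical name used for illustration.
std::vector<std::string> read_words(std::istream& in)
{
    std::vector<std::string> words;
    std::string word;
    // The extraction itself is the loop condition: the body runs only
    // when a word was actually read, and the loop ends at end of file
    // or on any other stream failure, so it cannot spin forever.
    while (in >> word)
        words.push_back(word);
    return words;
}
```

Passing a std::ifstream opened on test.txt should work the same way, since ifstream derives from std::istream. Because the loop tests the whole stream state and not just eof(), a stream that enters a fail state ends the loop instead of printing empty words indefinitely.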

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL; that works fine for me.
Which compiler are you using?

typedef basic_string<unsigned char> ustring;
int main(void)
{
ustring word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}


That should simply not compile: the character type of the string is not the
same as the character type of the fstream.

Time to upgrade your compiler I think.
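
The mismatch arises because operator>> for strings is a template that expects the string and the stream to share one character type. A hedged sketch of one way around it, assuming the goal is simply to get at the bytes as unsigned values: read into an ordinary std::string and copy the bytes afterwards. The names byte_vector and to_bytes are hypothetical, and std::vector is used here instead of basic_string<unsigned char> because the latter depends on a char_traits<unsigned char> that an implementation is not required to provide:

```cpp
#include <string>
#include <vector>

typedef std::vector<unsigned char> byte_vector;

// Copy the bytes of a word into unsigned storage.
// to_bytes is a hypothetical name used for illustration.
byte_vector to_bytes(const std::string& word)
{
    // Each char is converted to unsigned char, so the byte 0xE1 stays
    // 0xE1 (225) whether or not plain char is signed.
    return byte_vector(word.begin(), word.end());
}
```

The word itself can still be read with f >> word into a std::string, which keeps the usual whitespace-delimited behaviour.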

john
Jul 22 '05 #6

P: n/a


John Harrison wrote:
"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
string word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

It's not relevant to your problem but your eof test is incorrect. eof does
not tell you when you are at the end of file. It is only reliably true
*after* you have attempted to read past the end of file. In other words you
should write

for (;;)
{
f >> word;
if (!f) // stop at end of file or on any other read failure
break;
cout << ...
}

or more simply

while (f >> word)
cout << ...


Thanks for that.
Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL, that works fine for me.


Yours reads accented characters into a basic_string<char>? So I guess it
must be storing the á as a negative number?

Which compiler are you using?

g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)

typedef basic_string<unsigned char> ustring;
int main(void)
{
ustring word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

That should simply not compile: the character type of the string is not the
same as the character type of the fstream.

Time to upgrade your compiler I think.


Which compiler are you using?

john


Jul 22 '05 #7

P: n/a
Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>

I think you have a broken implementation of the STL, that works fine for me.

Yours reads accented characters into a basic_string<char>? So I guess it
must be storing the á as a negative number?

Yes.

Which compiler are you using?


g++ (cygwin)...
gcc version 2.95.3-5 (cygwin special)


I've heard that gcc 2.95 has a poor implementation of the standard template
library (STL), and your experience seems to confirm it. In my last post I tried
with VC++ 7.1; I've just tried with gcc 3.3.1 and got the same result. Your
first program runs correctly, and your second doesn't compile. I really think
you are going to have to upgrade.

john
Jul 22 '05 #8

P: n/a

"Phil Slater" <ph*********@removethisbitamsjv.com> wrote in message
news:c8**********@hercules.btinternet.com...
Thanks Howard. I've produced a minimal code example to distil out the
problem.

To process a text file word-by-word, this kind of approach seemed
obvious (and very elementary):

int main(void)
{
string word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

Fine, but when I test it with, for example, the file "test.txt" as
(between the lines):
------------
bungle

Gruntfutter

Kárahnjúkar
------------
I get the output:
<bungle>
<Gruntfutter>
<K>
<>
<>
<>
<>
<>
... ad infinitum

Looks like the stream goes into a fail state when it hits the á

Now trying the same file with your suggestion of using unsigned char for
input, the program reads the whole file, but now it's not word-by-word -
the delimiter for input seems to have changed. The changed program is

typedef basic_string<unsigned char> ustring;
int main(void)
{
ustring word;
ifstream f("test.txt");
while(!f.eof())
{
f >> word;
cout << '<' << word << '>' << endl;
}
}

The output I get is:

<bungle

Gruntfutter

Kárahnjúkar>

The change to unsigned char has solved the basic problem of accented
letters, but leaves me wondering what exactly the behaviour of these
basic_string<unsigned char> things is. They obviously don't behave like
ordinary strings (if the extraction operator is anything to go by).

Any ideas?


That's correct. You're reading into a basic_string<unsigned char>, and nothing
says that such a read is delimited by anything.

So what do you do? I'd probably read in a line at a time into a string (and
then parse the string into words, if there can be more than one word on a
line). I think std::getline is the function for reading a line.

Also, don't use while (!f.eof()), use while (getline(whatever)). The eof()
function is not valid to check until *after* attempting a read.
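
The line-at-a-time approach could be sketched roughly as follows, assuming whitespace separates the words within a line. The helper name words_by_line is hypothetical and used only for illustration:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read a stream line by line, then split each line into words.
// words_by_line is a hypothetical name used for illustration.
std::vector<std::string> words_by_line(std::istream& in)
{
    std::vector<std::string> words;
    std::string line;
    // getline returns the stream, which tests false once no further
    // line could be read, so no separate eof() check is needed.
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        std::string word;
        // Blank lines simply yield no words.
        while (fields >> word)
            words.push_back(word);
    }
    return words;
}
```

A std::ifstream opened on the input file can be passed directly as the argument.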

-Howard


Jul 22 '05 #9
