Hi programmers,
I want to read a Unicode (UTF-8) text file created by Notepad line by line. I don't want to display the Unicode strings on the screen; I just want to read the strings and compare them.
This code reads an ANSI file line by line and compares the strings. What I want:
- Read test_ansi.txt line by line
- If the line = "b", print "YES!"
- Else print "NO!"
read_ansi_line_by_line.c
#include <stdio.h>
#include <string.h> /* strcmp */

int main()
{
    const char *inname = "test_ansi.txt";
    FILE *infile;
    char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
    int line_number; /* int, not char: a char counter overflows on long files */

    infile = fopen(inname, "r");
    if (!infile) {
        printf("\nfile '%s' not found\n", inname);
        return 0;
    }
    printf("\n%s\n\n", inname);

    line_number = 0;
    while (fgets(line_buffer, sizeof(line_buffer), infile)) {
        ++line_number;
        /* note that the newline is in the buffer */
        if (strcmp("b\n", line_buffer) == 0) {
            printf("%d: YES!\n", line_number);
        } else {
            printf("%d: NO!\n", line_number);
        }
    }
    printf("\n\nTotal: %d\n", line_number);
    fclose(infile);
    return 0;
}
test_ansi.txt
Compiling - gcc -o read_ansi_line_by_line read_ansi_line_by_line.c
Output -

test_ansi.txt

1: NO!
2: YES!
3: NO!


Total: 3
Now I need to read a Unicode (UTF-8) file created by Notepad. After more than 6 months I haven't found any good code or library in C that can read a file encoded in UTF-8! I don't know exactly why, but I think standard C doesn't support Unicode!
Reading a Unicode binary file is OK, but the problem is that the binary file must already have been created in binary mode. That means if we want to read a Unicode (UTF-8) file created by Notepad, we need to translate it from a UTF-8 file to a BINARY file!
This code writes a Unicode string to a binary file. NOTE: the C file is encoded in UTF-8 and compiled with GCC. What I want:
- Write the Unicode char "ب" to test_bin.dat
create_bin.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main()
{
    /* Data to be stored in file; the rest of the buffer is zero-filled */
    wchar_t line_buffer[BUFSIZ] = L"ب";
    /* Opening file for writing in binary mode */
    FILE *outfile = fopen("test_bin.dat", "wb");
    if (!outfile) {
        printf("\ncannot create 'test_bin.dat'\n");
        return 1;
    }
    /* Writing data to file: a fixed 13-byte record (the character plus zero padding) */
    fwrite(line_buffer, 1, 13, outfile);
    /* Closing File */
    fclose(outfile);

    return 0;
}
Compiling - gcc -o create_bin create_bin.c
Output
Now I want to read the binary file line by line and compare! What I want:
- Read test_bin.dat line by line
- If the line = "ب", print "YES!"
- Else print "NO!"
read_bin_line_by_line.c
#define UNICODE
#ifdef UNICODE
#define _UNICODE
#else
#define _MBCS
#endif

#include <stdio.h>
#include <wchar.h>

int main()
{
    const wchar_t *inname = L"test_bin.dat";
    FILE *infile;
    /* BUFSIZ is defined if you include stdio.h; zeroed so each record stays terminated */
    wchar_t line_buffer[BUFSIZ] = {0};

    infile = _wfopen(inname, L"rb"); /* _wfopen is Windows-specific (e.g. MinGW) */
    if (!infile) {
        wprintf(L"\nfile '%ls' not found\n", inname);
        return 0;
    }
    wprintf(L"\n%ls\n\n", inname);

    /* Reading data from file into temporary buffer, one 13-byte record at a time */
    while (fread(line_buffer, 1, 13, infile)) {
        /* the record holds the character followed by zero padding, no newline */
        if (wcscmp(L"ب", line_buffer) == 0) {
            wprintf(L"YES!\n");
        } else {
            wprintf(L"NO!\n");
        }
    }
    /* Closing File */
    fclose(infile);
    return 0;
}
Compiling - gcc -o read_bin_line_by_line read_bin_line_by_line.c
Output
THE PROBLEM
This method is VERY LONG and NOT POWERFUL (I'm a beginner in software engineering).
- Does anyone know how to read a Unicode file? (I know it's not easy!)
- Does anyone know how to convert a Unicode file to a binary file? (simple method)
- Does anyone know how to read a Unicode file in binary mode? (I'm not sure)
Thank you.
Hello,
A few things.
1. Unicode and UTF-8 are not the same (if I am not wrong). What Windows calls "Unicode" uses 2-byte code units (UTF-16). UTF-8, on the other hand, is a multibyte encoding that uses 1 to 4 bytes per character.
2. (Don't listen to me.) Looking for an easy way is not a good idea :)
Best Regards,
JOHNY
Instead of using fgets and strcmp, you are going to want to use the wide-character versions of those functions.
You will have to read the documentation of the OS/libraries you are using to find out what the wide-char variants are.
If you are using Windows, a quick search on MSDN should be helpful. You can also convert from one to the other.
@JOHNY
Yes, you are right. I wanted to edit the title to remove "unicode", but I have no permission ^_^
If you have a UTF-8 project and you want to read a UTF-8 file line by line, what is the easy way you use? =)
@RedSon
After 6 months of searching in books, MSDN, documentation, the Internet, forums... I never found a solution to read a UTF-8 file in C99! Can you help me please? =)
Oh wait, if you are using gcc then you are not on a windows machine, so that MSDN link is not going to do you any good. I don't know why you are even searching MSDN like you state in post #5.
That is why I suggested that you search your libraries and other documentation for wide string functions. Your header files that come with C99 should have something for that.
@RedSon
First, thank you. I have already read all the MSDN pages that talk about UTF-8 ^_^. I think I need to use MultiByteToWideChar() after reading a string from the UTF-8 file, but I don't know exactly how to use it!
@RedSon
Yes, I'm looking for a solution in C99 with GCC. I think I need to read the UTF-8 file in binary mode and convert UTF-8 to UTF-16, or maybe not, or some other way... I seriously need help =)
Like I said, you won't be able to use it, because you are not building a windows application using windows libraries.
You will need to find an appropriate library call in your headers.
This link is aimed at gcc users: The Unicode HOWTO. Note that you must use glibc-2.2 or later. Consider using libutf8 (version 0.7.3 or later).
I found a solution to my problem, and I want to share it with anyone interested in reading a UTF-8 file in C99. :)
#include <stdio.h>
#include <string.h>

void ReadUTF8(FILE* fp)
{
    unsigned char iobuf[255] = {0};
    while( fgets((char*)iobuf, sizeof(iobuf), fp) )
    {
        size_t len = strlen((char *)iobuf);
        // strip the trailing newline, except on a line that is only "\n"
        if(len > 1 && iobuf[len-1] == '\n')
            iobuf[len-1] = 0;
        len = strlen((char *)iobuf);
        // len is the number of bytes, not the number of characters
        printf("(%lu) \"%s\" ", (unsigned long)len, iobuf);
        // as written, this prints Yes only for an empty line
        if( iobuf[0] == '\n' )
            printf("Yes\n");
        else
            printf("No\n");
    }
}

void ReadUTF16BE(FILE* fp)
{
}

void ReadUTF16LE(FILE* fp)
{
}

int main()
{
    FILE* fp = fopen("test_utf8.txt", "r");
    if( fp != NULL)
    {
        // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation of the BOM
        // encoding
        unsigned char b[3] = {0};
        fread(b,1,2, fp);
        if( b[0] == 0xEF && b[1] == 0xBB)
        {
            fread(b,1,1,fp); // 0xBF
            ReadUTF8(fp);
        }
        else if( b[0] == 0xFE && b[1] == 0xFF)
        {
            ReadUTF16BE(fp);
        }
        else if( b[0] == 0xFF && b[1] == 0xFE)
        {
            // the UTF-16 little-endian BOM is FF FE
            ReadUTF16LE(fp);
        }
        else
        {
            // we don't know what kind of file it is, so assume it's standard
            // ASCII with no BOM encoding
            rewind(fp);
            ReadUTF8(fp);
        }
        fclose(fp); // only close a file that was actually opened
    }
    return 0;
}
I would not recommend anyone to use this code:
The ISO/ANSI C standard contains, in an amendment which was added in 1995, a "wide character" type `wchar_t', a set of functions like those found in <string.h> and <ctype.h> (declared in <wchar.h> and <wctype.h>, respectively), and a set of conversion functions between `char *' and `wchar_t *' (declared in <stdlib.h>).
READ THIS!!: http://www.faqs.org/docs/Linux-HOWTO...OWTO.html#toc6