How to read unicode (utf-8) / binary file line by line

Hi programmers,

I want to read a Unicode (UTF-8) text file created by Notepad line by line. I don't want to display the Unicode strings on the screen; I just want to read and compare them!

This code reads an ANSI file line by line and compares the strings.

What I want
  • Read test_ansi.txt line by line
  • If the line is "b", print "YES!"
  • Otherwise print "NO!"

read_ansi_line_by_line.c

    #include <stdio.h>
    #include <string.h> /* strcmp */

    int main(void)
    {
        const char *inname = "test_ansi.txt";
        FILE *infile;
        char line_buffer[BUFSIZ]; /* BUFSIZ is defined if you include stdio.h */
        int line_number = 0;      /* int rather than char, so long files do not overflow it */

        infile = fopen(inname, "r");
        if (!infile) {
            printf("\nfile '%s' not found\n", inname);
            return 1;
        }
        printf("\n%s\n\n", inname);

        while (fgets(line_buffer, sizeof(line_buffer), infile)) {
            ++line_number;
            /* note that the newline is in the buffer */
            if (strcmp("b\n", line_buffer) == 0) {
                printf("%d: YES!\n", line_number);
            } else {
                printf("%d: NO!\n", line_number);
            }
        }
        printf("\n\nTotal: %d\n", line_number);
        fclose(infile);
        return 0;
    }
test_ansi.txt

    a
    b
    c
Compiling

    gcc -o read_ansi_line_by_line read_ansi_line_by_line.c
Output

    test_ansi.txt

    1: NO!
    2: YES!
    3: NO!


    Total: 3
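
The comparison with "b\n" only matches when the line ends in exactly one '\n'. A last line without a newline, or a line that still carries a CR LF ending, would print "NO!" even though it contains "b". A minimal sketch, assuming two hypothetical helpers named strip_newline and line_equals (neither is part of the original post), removes the line ending first so that the plain string "b" can be compared:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: cut the string at the first '\r' or '\n'.
   Returns the length of what remains. */
size_t strip_newline(char *line)
{
    assert(line != NULL);
    size_t len = strcspn(line, "\r\n"); /* index of first CR or LF, or strlen(line) */
    line[len] = '\0';
    return len;
}

/* Hypothetical helper: 1 if the line equals `expected` once its
   line ending has been removed, 0 otherwise. Modifies `line`. */
int line_equals(char *line, const char *expected)
{
    strip_newline(line);
    return strcmp(line, expected) == 0;
}
```

With this, the loop body could call line_equals(line_buffer, "b") instead of comparing against "b\n".
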
Now I need to read a Unicode (UTF-8) file created by Notepad. After more than 6 months I have not found any good code or library in C that can read a file encoded in UTF-8! I don't know exactly why, but I think standard C doesn't support Unicode!

Reading a Unicode binary file is OK, but the problem is that the binary file must already have been created in binary mode! That means that if we want to read a Unicode (UTF-8) file created by Notepad, we need to translate it from a UTF-8 file to a BINARY file!

This code writes a Unicode string to a binary file. NOTE: the C file is encoded in UTF-8 and compiled with GCC.

What I want
  • Write the Unicode char "ب" to test_bin.dat

create_bin.c

    #define UNICODE
    #ifdef UNICODE
    #define _UNICODE
    #else
    #define _MBCS
    #endif

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Data to be stored in the file */
        wchar_t line_buffer[BUFSIZ] = L"ب";
        /* Open the file for writing in binary mode */
        FILE *outfile = fopen("test_bin.dat", "wb");
        if (!outfile) {
            return 1;
        }
        /* Write the string plus its terminating L'\0' as whole wchar_t units */
        fwrite(line_buffer, sizeof(wchar_t), wcslen(line_buffer) + 1, outfile);
        /* Close the file */
        fclose(outfile);

        return 0;
    }
Compiling

    gcc -o create_bin create_bin.c
Output

    create test_bin.dat
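
A file of raw wchar_t units is not the same thing as a UTF-8 text file: the size and byte order of wchar_t differ between platforms, while UTF-8 is a fixed byte sequence. As an alternative to the approach in this post (not the poster's method), the character can be encoded to UTF-8 by hand and written with fwrite. A minimal sketch, assuming a hypothetical helper named utf8_encode:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns the byte count, or 0 if cp
   is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* surrogates are not characters */
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For "ب" (U+0628) this yields the two bytes D8 A8, which could then be written with fwrite(out, 1, n, fp) to a file opened in "wb" mode.
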


Now I want to read the binary file line by line and compare!

What I want
  • Read test_bin.dat line by line
  • If the line is "ب", print "YES!"
  • Otherwise print "NO!"

read_bin_line_by_line.c

    #define UNICODE
    #ifdef UNICODE
    #define _UNICODE
    #else
    #define _MBCS
    #endif

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        const wchar_t *inname = L"test_bin.dat";
        FILE *infile;
        wchar_t line_buffer[BUFSIZ] = {0}; /* zeroed, so the data read is always terminated */

        infile = _wfopen(inname, L"rb"); /* _wfopen is Windows-specific (e.g. MinGW) */
        if (!infile) {
            wprintf(L"\nfile '%ls' not found\n", inname);
            return 1;
        }
        wprintf(L"\n%ls\n\n", inname);

        /* Read whole wchar_t units from the file into the buffer */
        while (fread(line_buffer, sizeof(wchar_t), BUFSIZ - 1, infile) > 0) {
            /* the file holds one terminated string, not newline-separated lines */
            if (wcscmp(L"ب", line_buffer) == 0) {
                wprintf(L"YES!\n");
            } else {
                wprintf(L"NO!\n");
            }
        }
        /* Close the file */
        fclose(infile);
        return 0;
    }
Compiling

    gcc -o read_bin_line_by_line read_bin_line_by_line.c
Output

    test_bin.dat

    YES!
THE PROBLEM

This method is VERY LONG and NOT POWERFUL (I'm a beginner in software engineering).

  • Does anyone know how to read a Unicode file? (I know it's not easy!)
  • Does anyone know how to convert a Unicode file to a binary file? (a simple method)
  • Does anyone know how to read a Unicode file in binary mode? (I'm not sure)

Thank you.
Jan 21 '10 #1

✓ answered by RedSon


johny10151981
1,059 1GB
Hello,
A few things.
1. Unicode and UTF-8 are not the same (if I am not wrong). "Unicode" in the Windows sense means 2-byte wide characters (UTF-16). UTF-8, on the other hand, is a multibyte encoding.

2. (Don't listen to me.) Looking for an easy way is not a good idea :)

Best Regards,
JOHNY
Jan 22 '10 #2
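
To make the distinction concrete: Unicode assigns a number (code point) to each character, and UTF-8 stores each code point in 1 to 4 bytes. The first byte of a sequence tells how long it is, and every following byte has the bit pattern 10xxxxxx. A minimal sketch, assuming two hypothetical helpers named utf8_seq_len and utf8_count that are not part of this thread:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
   with lead byte c, or 0 if c cannot start a sequence. */
size_t utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;               /* 0xxxxxxx: ASCII */
    if (c >= 0xC2 && c <= 0xDF) return 2; /* 110xxxxx */
    if (c >= 0xE0 && c <= 0xEF) return 3; /* 1110xxxx */
    if (c >= 0xF0 && c <= 0xF4) return 4; /* 11110xxx */
    return 0;                             /* continuation byte or invalid lead byte */
}

/* Hypothetical helper: count the code points in a NUL-terminated UTF-8
   string by counting every byte that is not a continuation byte.
   Assumes the input is well-formed UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For example, "ب" is the two bytes D8 A8 in UTF-8, so strlen reports 2 while utf8_count reports 1.
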
RedSon
5,000 Expert 4TB
Instead of using fgets and strcmp, you are going to want to use the wide-character versions of those functions.

You will have to read the documentation of the OS/libraries you are using to find out what the wide-char variants are.

If you are using Windows, a quick search on MSDN should be helpful. You can also convert from one to the other.
Jan 22 '10 #3
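
In standard C the wide-character counterparts live in <wchar.h>: fgetws corresponds to fgets, wcscmp to strcmp, and wcscspn to strcspn. A minimal sketch of the comparison step, assuming a hypothetical helper named wide_line_equals; how fgetws decodes the bytes of a file depends on the current locale, so only the comparison is shown here:

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: wide-character version of a strcmp-based line check.
   Cuts the line at the first L'\r' or L'\n', then compares with wcscmp.
   Returns 1 on a match, 0 otherwise. Modifies `line`. */
int wide_line_equals(wchar_t *line, const wchar_t *expected)
{
    assert(line != NULL && expected != NULL);
    line[wcscspn(line, L"\r\n")] = L'\0';
    return wcscmp(line, expected) == 0;
}
```

A line read with fgetws could be passed as the first argument, with L"\u0628" (the letter "ب") as the second.
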
@JOHNY

Yes, you are right. I wanted to edit the title to remove "unicode", but I have no permission ^_^
If you have a UTF-8 project and you want to read a UTF-8 file line by line, what is the easy way you use? =)
Jan 22 '10 #4
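
One common approach, sketched here under the assumption that the file really is UTF-8 and that only equality tests are needed, is to treat the text as ordinary char strings. In UTF-8 every byte of a multibyte sequence is 0x80 or above, so '\n' and '\0' never occur inside a character, and fgets and strcmp work unchanged. The helper name count_matching_lines and the macro BEH_UTF8 are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* UTF-8 bytes of U+0628 (the letter "ب"), written as escapes so the
   example does not depend on the encoding of the source file. */
#define BEH_UTF8 "\xD8\xA8"

/* Hypothetical helper: read lines from fp with fgets and count the lines
   equal to `expected`. A UTF-8 BOM at the start of the file and LF or
   CR LF line endings are ignored. */
int count_matching_lines(FILE *fp, const char *expected)
{
    char line[BUFSIZ];
    int first = 1, matches = 0;

    assert(fp != NULL && expected != NULL);
    while (fgets(line, sizeof line, fp)) {
        char *p = line;
        if (first && strncmp(p, "\xEF\xBB\xBF", 3) == 0)
            p += 3;                   /* skip the BOM that Notepad writes */
        first = 0;
        p[strcspn(p, "\r\n")] = '\0'; /* drop the line ending */
        if (strcmp(p, expected) == 0)
            ++matches;
    }
    return matches;
}
```

A caller would open the file with fopen and pass BEH_UTF8 as the string to look for. Lines longer than the buffer are split by fgets, which this sketch does not handle.
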
@RedSon

After 6 months of searching in books, MSDN, documentation, the Internet, and forums, I never found a solution for reading a UTF-8 file in C99! Can you help me, please? =)
Jan 22 '10 #5
RedSon
5,000 Expert 4TB
Did you read the Unicode and Character Set functions on MSDN?

http://msdn.microsoft.com/en-us/libr...85(VS.85).aspx
Jan 22 '10 #6
RedSon
5,000 Expert 4TB
Oh wait, if you are using gcc then you are not on a windows machine, so that MSDN link is not going to do you any good. I don't know why you are even searching MSDN like you state in post #5.

That is why I suggested that you search your libraries and other documentation for wide string functions. Your header files that come with C99 should have something for that.
Jan 22 '10 #7
@RedSon

First, thank you. I have already read all the MSDN pages that talk about UTF-8 ^_^, and I think I need to use MultiByteToWideChar() after reading a string from the UTF-8 file, but I don't know exactly how to use it!
Jan 22 '10 #8
@RedSon

Yes, I'm looking for a solution in C99 with GCC. I think I need to read the UTF-8 file in binary mode and convert UTF-8 to UTF-16, or maybe not, or maybe there is another way. I seriously need help =)
Jan 22 '10 #9
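
The conversion mentioned here can be done in portable C without any platform API. A minimal sketch, assuming a hypothetical function named utf8_to_utf16 that decodes each UTF-8 sequence to a code point and re-encodes it as one or two UTF-16 code units; it is an illustration, not a replacement for a tested library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to UTF-16
   code units. Returns the number of units written (not counting the
   terminating 0), or (size_t)-1 if the input is malformed or `out`
   (capacity `cap` units) is too small. */
size_t utf8_to_utf16(const char *src, uint16_t *out, size_t cap)
{
    /* smallest code point that legitimately needs 1, 2, 3 or 4 bytes */
    static const uint32_t min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;
    while (*s != 0) {
        uint32_t cp;
        int extra, i; /* extra = number of continuation bytes expected */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        ++s;
        for (i = 0; i < extra; ++i, ++s) {
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1;    /* truncated sequence */
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }
        /* reject overlong forms, surrogates and values above U+10FFFF */
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;
        if (cp < 0x10000) {
            if (n + 1 >= cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;
        } else {                      /* encode as a surrogate pair */
            if (n + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    out[n] = 0;
    return n;
}
```

The bytes D8 A8 ("ب") become the single unit 0x0628, which can then be compared unit by unit with another UTF-16 string.
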
RedSon
5,000 Expert 4TB
Like I said, you won't be able to use it, because you are not building a windows application using windows libraries.

You will need to find an appropriate library call in your headers.
Jan 22 '10 #10
donbock
2,425 Expert 2GB
This link is aimed at gcc users: The Unicode HOWTO. Note that you must use glibc-2.2 or later. Consider using libutf8 (version 0.7.3 or later).
Jan 22 '10 #11
I found a solution to my problem, and I want to share it with anyone interested in reading a UTF-8 file in C99. :)

    #include <stdio.h>
    #include <string.h>

    void ReadUTF8(FILE* fp)
    {
        unsigned char iobuf[255] = {0};
        while( fgets((char*)iobuf, sizeof(iobuf), fp) )
        {
            size_t len = strlen((char *)iobuf);
            /* strip the trailing newline, except on an empty line */
            if(len > 1 &&  iobuf[len-1] == '\n')
                iobuf[--len] = 0;
            printf("(%lu) \"%s\"  ", (unsigned long)len, iobuf);
            /* "Yes" marks an empty line, whose lone '\n' was kept above */
            if( iobuf[0] == '\n' )
                printf("Yes\n");
            else
                printf("No\n");
        }
    }

    void ReadUTF16BE(FILE* fp)
    {
        (void)fp; /* not implemented */
    }

    void ReadUTF16LE(FILE* fp)
    {
        (void)fp; /* not implemented */
    }

    int main(void)
    {
        FILE* fp = fopen("test_utf8.txt", "r");
        if( fp != NULL)
        {
            // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation of the BOM
            // encoding
            unsigned char b[3] = {0};
            fread(b,1,2, fp);
            if( b[0] == 0xEF && b[1] == 0xBB)
            {
                fread(b,1,1,fp); // 0xBF
                ReadUTF8(fp);
            }
            else if( b[0] == 0xFE && b[1] == 0xFF)
            {
                ReadUTF16BE(fp);
            }
            else if( b[0] == 0xFF && b[1] == 0xFE)
            {
                // the UTF-16 little-endian BOM is FF FE
                ReadUTF16LE(fp);
            }
            else
            {
                // we don't know what kind of file it is, so assume it is plain
                // ASCII or UTF-8 with no BOM
                rewind(fp);
                ReadUTF8(fp);
            }
            fclose(fp); // only close a file that was actually opened
        }

        return 0;
    }
Jan 25 '10 #12
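
The byte-order marks involved are EF BB BF for UTF-8, FE FF for UTF-16 big-endian, and FF FE for UTF-16 little-endian. The check can be separated from the file handling so that it is easy to test on its own. A minimal sketch, assuming a hypothetical function detect_bom and hypothetical enum names; it does not distinguish UTF-32, whose little-endian BOM also begins with FF FE:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the encodings this sketch can recognise. */
typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_BE, ENC_UTF16_LE } bom_encoding;

/* Hypothetical helper: identify a byte-order mark in the first n bytes
   of a file. Stores the BOM length in *bom_len (0 if there is none). */
bom_encoding detect_bom(const unsigned char *b, size_t n, size_t *bom_len)
{
    assert(b != NULL && bom_len != NULL);
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16_BE;
    }
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16_LE;
    }
    *bom_len = 0; /* no BOM: the caller has to guess, e.g. assume UTF-8 */
    return ENC_UNKNOWN;
}
```

A caller would fread the first three bytes of the file, call detect_bom, and then fseek to bom_len before reading the text.
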
RedSon
5,000 Expert 4TB
I would not recommend anyone to use this code:

The ISO/ANSI C standard contains, in an amendment which was added in 1995, a "wide character" type `wchar_t', a set of functions like those found in <string.h> and <ctype.h> (declared in <wchar.h> and <wctype.h>, respectively), and a set of conversion functions between `char *' and `wchar_t *' (declared in <stdlib.h>).

READ THIS!!: http://www.faqs.org/docs/Linux-HOWTO...OWTO.html#toc6
Jan 25 '10 #13
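
One of the <stdlib.h> conversion functions referred to here is mbstowcs, which converts a multibyte `char *` string to `wchar_t *` according to the current locale. A minimal sketch, assuming a hypothetical wrapper named to_wide. For UTF-8 input the program would first have to select a UTF-8 locale with setlocale, and whether one is available is an assumption about the environment; in the default "C" locale only ASCII is guaranteed to convert:

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper around mbstowcs. Converts the multibyte string mb
   into out (capacity cap wide characters, including the terminator).
   Returns the number of wide characters written, or (size_t)-1 if the
   input is invalid in the current locale or out is too small. */
size_t to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;
    assert(mb != NULL && out != NULL);
    if (cap == 0)
        return (size_t)-1;
    n = mbstowcs(out, mb, cap);
    if (n == (size_t)-1 || n == cap)
        return (size_t)-1; /* invalid sequence, or no room for the terminator */
    return n;
}
```

After a successful conversion the result can be compared with wcscmp against a wide literal such as L"\u0628".
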
