"Eric Lilja" <ericliljaNoSpam@yahoo.com> wrote in message
news:cvff9h$hii$1@news.island.liu.se...[color=blue]
>
> "Chris Croughton" wrote:[color=green]
>> On Tue, 22 Feb 2005 01:24:58 +0100, Eric Lilja
>> <ericliljaNoSpam@yahoo.com> wrote:
>>
>>> Thanks for your reply, Jerry. The file starts with 0xFF 0xFE, so that
>>> means UTF-16? I was thinking of opening it in binary mode, reading the
>>> first two bytes, then starting a loop that reads the file byte by byte
>>> and adds the first, the third, the fifth byte etc. to a std::string (or
>>> a std::vector of chars maybe). When the loop is done I should have the
>>> actual text of the file. Then I can look for the pattern I want and
>>> replace it as needed. Then I will open the file for writing (still in
>>> binary of course) and write it out as UTF-16. Sounds like this should
>>> work?
>>
>> It's more likely to be UCS-2 (UTF-16 is an extension to UCS-2 which
>> allows UCS-4 characters to be embedded in a UCS-2 stream). The Byte
>> Order Mark is defined to be 0xFEFF, with the character 0xFFFE defined as
>> invalid, so that the byte order (big/little endian) can be determined.
>> In your case the order must be LSB MSB, so you want all even numbered
>> bytes (assuming standard C array indices starting at zero), but you
>> ought to check for a portable implementation.
>>
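A minimal sketch of the byte-order check described above, assuming the first bytes of the file are already in memory. The names ByteOrder and detect_utf16_bom are made up for this example and are not from the thread.

```cpp
#include <string>

// Hypothetical names, for illustration only.
enum class ByteOrder { Unknown, LittleEndian, BigEndian };

// Look at the first two bytes of a buffer for a UCS-2/UTF-16 byte order mark.
// U+FEFF is stored as FF FE in little-endian order and FE FF in big-endian.
// Caveat: FF FE 00 00 is also the UTF-32 little-endian BOM; this sketch
// does not try to tell the two apart.
ByteOrder detect_utf16_bom(const std::string& bytes)
{
    if (bytes.size() < 2)
        return ByteOrder::Unknown;

    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);

    if (b0 == 0xFF && b1 == 0xFE)
        return ByteOrder::LittleEndian;   // LSB first: even indices hold the text
    if (b0 == 0xFE && b1 == 0xFF)
        return ByteOrder::BigEndian;      // MSB first: odd indices hold the text
    return ByteOrder::Unknown;            // no BOM found
}
```

With a result of LittleEndian the low byte of each character sits at the even indices, which matches the "LSB MSB" case above.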
>> You really should check that the other bytes are zero, as well, and give
>> some sort of error if not (it's a character not representable in a
>> normal string, unless you're on an implementation with 16 bit or more
>> bytes); at minimum I would either ignore such a character or convert it
>> to an error character ('?' for instance, like my mailer does).
>>
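The "convert it to an error character" option could look roughly like this. It is only a sketch: narrow_ucs2le is a hypothetical helper, it assumes the BOM has already been stripped and the data is little-endian, and it is stricter than the text above in that it also rejects 0x80-0xFF (drop the `low < 0x80` test to allow those).

```cpp
#include <cstddef>
#include <string>

// Narrow little-endian UCS-2 payload bytes (BOM already removed) to a
// single-byte string. Hypothetical helper, for illustration only.
std::string narrow_ucs2le(const std::string& payload, char error_char = '?')
{
    std::string out;
    out.reserve(payload.size() / 2);

    for (std::size_t i = 0; i + 1 < payload.size(); i += 2)
    {
        const unsigned char low  = static_cast<unsigned char>(payload[i]);
        const unsigned char high = static_cast<unsigned char>(payload[i + 1]);

        // Keep the character only if the high byte is zero and the low
        // byte is plain ASCII; otherwise substitute the error character.
        out += (high == 0 && low < 0x80) ? static_cast<char>(low) : error_char;
    }

    // An odd trailing byte is a truncated code unit; flag it as well.
    if (payload.size() % 2 != 0)
        out += error_char;

    return out;
}
```

Substituting instead of aborting keeps the rest of the text usable, at the cost of losing the original character when the file is written back.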
>> Or you can do all of your work in UCS-2 (or UCS-4), and thus preserve
>> any non-ASCII characters. This will be a bit slower as an
>> implementation, but on modern machines still faster than the I/O.
>>
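Doing the work in 16-bit units might look like the sketch below. It assumes a C++11 compiler (for std::u16string), little-endian input with the BOM already removed, and uses made-up function names; it is not code from the thread.

```cpp
#include <cstddef>
#include <string>

// Decode little-endian UCS-2/UTF-16 payload bytes into 16-bit code units.
std::u16string decode_utf16le(const std::string& payload)
{
    std::u16string units;
    units.reserve(payload.size() / 2);
    for (std::size_t i = 0; i + 1 < payload.size(); i += 2)
    {
        const unsigned low  = static_cast<unsigned char>(payload[i]);
        const unsigned high = static_cast<unsigned char>(payload[i + 1]);
        units.push_back(static_cast<char16_t>(low | (high << 8)));
    }
    return units;
}

// Encode code units back to little-endian bytes (no BOM is written here).
std::string encode_utf16le(const std::u16string& units)
{
    std::string bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units)
    {
        bytes.push_back(static_cast<char>(u & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((u >> 8) & 0xFF));  // then high byte
    }
    return bytes;
}

// Replace every occurrence of `from` with `to`, working on code units so
// that non-ASCII characters survive the round trip.
void replace_all(std::u16string& s, const std::u16string& from,
                 const std::u16string& to)
{
    if (from.empty())
        return;  // an empty pattern would match everywhere
    std::u16string::size_type pos = 0;
    while ((pos = s.find(from, pos)) != std::u16string::npos)
    {
        s.replace(pos, from.length(), to);
        pos += to.length();  // continue after the inserted text
    }
}
```

A typical use would be decode, replace_all(text, u"foobar", u"abcdef"), then encode and write the result after a fresh BOM.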
>> If you really want portability, look at interpreting UTF-32 (UCS-4),
>> UTF-8 and UTF-16 as well as UCS-2 (and plain old text), with both big-
>> and little-endian representations, and write a generic routine which
>> converts any of them to a string (note that a C++ string type can take
>> wide characters or longs as its element type). But for your case you
>> may only need to do one or two of the formats.
>>
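As a rough idea of what one branch of such a generic routine could look like, here is a sketch that decodes UTF-16 bytes of either byte order into 32-bit code points, including surrogate pairs. It assumes C++11, that the BOM has been removed and its byte order passed in as a flag, and that unpaired surrogates should become U+FFFD; the function name is invented. UTF-8 and UTF-32 would need their own branches.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 payload bytes (BOM removed) into code points.
// `big_endian` is the byte order as determined from the BOM.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::u32string utf16_to_code_points(const std::string& payload, bool big_endian)
{
    std::u32string out;

    // Assemble the 16-bit code unit that starts at byte index i.
    auto unit_at = [&](std::size_t i) -> char32_t {
        const char32_t a = static_cast<unsigned char>(payload[i]);
        const char32_t b = static_cast<unsigned char>(payload[i + 1]);
        return big_endian ? ((a << 8) | b) : ((b << 8) | a);
    };

    for (std::size_t i = 0; i + 1 < payload.size(); i += 2)
    {
        const char32_t u = unit_at(i);

        // A high surrogate followed by a low surrogate encodes one
        // code point above U+FFFF.
        if (u >= 0xD800 && u <= 0xDBFF && i + 3 < payload.size())
        {
            const char32_t lo = unit_at(i + 2);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                i += 2;  // skip the low surrogate as well
                continue;
            }
        }

        // Plain UCS-2 character, or a stray surrogate.
        out.push_back((u >= 0xD800 && u <= 0xDFFF) ? char32_t(0xFFFD) : u);
    }
    return out;
}
```

For a pure UCS-2 file the surrogate branch never fires, so the same routine covers both cases.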
>> For further reading, see:
>>
>> http://www.unicode.org/faq/
>>
>> (and its parent if you want to get into the spec.). Warning: if you're
>> like me, you can waste (er, spend) many happy hours reading the spec.
>> and forget to do the work <g>...
>>
>> Chris C[/color]
>
> Thanks for your replies, everyone. I wrote the following little test
> program, which I hope to get working for UCS-2 encoded files where all
> characters are representable in ASCII (i.e., the second byte of every
> character after the byte-order mark is \0). The program doesn't work as
> expected, however: read_file() reads the byte-order mark into the
> contents variable, so when I write the new file (where I have replaced
> some strings), I get the byte-order mark twice, the second time with
> padding. In a hex editor the output starts with FF FE FF 00 FE 00. I can
> easily work around it, but I want to know why read_file() is doing what
> it's doing.
>
> Here's the complete code:
> #include <cstdlib>
> #include <fstream>
> #include <iostream>
> #include <string>
>
> using std::cerr;
> using std::cout;
> using std::endl;
> using std::exit;
> using std::ifstream;
> using std::ios_base;
> using std::ofstream;
> using std::string;
>
> static string read_file(const char *);
> static void find_and_replace(string& s, const string&, const string&);
> static void write_file(const char *, const string&);
>
> static const char padding = '\0';
>
> int
> main()
> {
>     const string find_what = "foobar";
>     const string replace_with = "abcdef";
>
>     string contents = read_file("testfile.txt");
>
>     find_and_replace(contents, find_what, replace_with);
>
>     write_file("outfile.txt", contents);
>
>     return EXIT_SUCCESS;
> }
>
> static string
> read_file(const char *filename)
> {
>     ifstream file(filename, ios_base::binary);
>
>     if(!file)
>     {
>         cerr << "Error: Failed to open " << filename << endl;
>
>         exit(EXIT_FAILURE);
>     }
>
>     char c = '\0';
>     string contents;
>
>     file.read(&c, sizeof(c));
>     contents += c;
>     file.read(&c, sizeof(c));
>     contents += c;
>
>     if((unsigned char)contents[0] != 0xFF ||
>        (unsigned char)contents[1] != 0xFE)
>     {
>         cerr << "Error: The file doesn't appear to be a unicode-file."
>              << endl;
>
>         /* std::ifstreams destructor will close the file. */
>         exit(EXIT_FAILURE);
>     }
>
>     int count = 0;
>
>     while(file.read(&c, sizeof(c)))
>     {
>         if(!(count++ % 2))
>             contents.push_back(c);
>         else
>             if(c != padding) /* padding is a static global that equals \0 */
>             {
>                 cerr << "Error: Found a character that is too "
>                      << "big to fit into a single byte." << endl;
>
>                 /* std::ifstreams destructor will close the file. */
>                 exit(EXIT_FAILURE);
>             }
>     }
>
>     /* std::ifstreams destructor will close the file. */
>     return contents;
> }
>
> static void
> find_and_replace(string& s, const string& find_what,
>                  const string& replace_with)
> {
>     /* An empty pattern matches at every position, so there is nothing
>        sensible to replace. */
>     if(find_what.empty())
>         return;
>
>     string::size_type start = 0;
>     string::size_type offset = 0;
>     string::size_type occurrences = 0;
>
>     while((start = s.find(find_what, offset)) != string::npos)
>     {
>         s.replace(start, find_what.length(), replace_with);
>
>         /* Resume searching just past the inserted text. Otherwise a
>            replacement that itself contains find_what would be matched
>            again and the loop would never finish. */
>         offset = start + replace_with.length();
>
>         ++occurrences;
>     }
>
>     cout << "Replaced " << occurrences << " occurrences." << endl;
> }
>
> static void
> write_file(const char *filename, const string& contents)
> {
>     ofstream file(filename, ios_base::binary);
>
>     if(!file)
>     {
>         cerr << "Error: Failed to open " << filename << endl;
>
>         exit(EXIT_FAILURE);
>     }
>
>     /* Character literals avoid narrowing 0xFF/0xFE into a signed char. */
>     const char byte_order_mark[2] = { '\xFF', '\xFE' };
>
>     file.write(byte_order_mark, sizeof(byte_order_mark));
>
>     for(string::size_type i = 0; i < contents.length(); ++i)
>     {
>         file.write(&contents[i], sizeof(char));
>         file.write(&padding, sizeof(char));
>     }
> }
>
> Thanks for any replies
>
> / Eric
>[/color]
Lol, never mind! I see now that read_file() appends the two byte-order-mark
bytes to the contents variable before the loop starts, so they get written
back out as ordinary padded characters. I thought the read position was
being rewound somehow. Anyway, if you have any other comments on the code,
please share them.
/ Eric