473,386 Members | 2,050 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

utf8 in regexp (perl 5.8.1)

I have a file containing thousands of Spanish words, encoded AFAIK)
in UTF-8. I also have a perl script in UTF-8, which says (hope
pasting works):

#!/usr/bin/perl -w -CSD
#
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
use warnings;
use strict;
use utf8;

while (<>)
{
print if ( /ñ/ )
}

What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.

The script is intended to find all words containing that
letter. But it finds nothing. After wading through gallons
of text (man encoding, man utf8, man perlunicode, etc.),
I still had no reason to think it was wrong. But I added

use encoding "utf8";

and ran it again, getting only:

Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.

Have I found a bug in perl or is my ignorance just getting
the best of me?

Oh, yeah, I also tried a few things with 'binmode' that didn't
work either.

WWG
Jul 19 '05 #1
1 7848
Wes Groleau wrote:
[problems with]
use utf8;

while (<>)
{
print if ( /ñ/ )
}
I removed "use utf8" and it worked. So I think it's
a bug, especially since
use encoding "utf8";
caused
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.


--
Wes Groleau

A pessimist says the glass is half empty.

An optimist says the glass is half full.

An engineer says somebody made the glass
twice as big as it needed to be.
Jul 19 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
by: Spamtrap | last post by:
I only work in Perl occasionaly, and have been searching for a solution for a conversion, and everything I found seems much too complex. All I need to do is take a simple text file and copy...
1
by: Vlajko Knezic | last post by:
Not so sure what is going on here but is something to do with the way UTF8 is handled in Perl and/or LibXML The sctript below: - accepts a value from a form text field; - ...
1
by: ryang | last post by:
I am trying to understand how to work with Unicode in Perl. I have read the relevant man pages (perluniintro, perlunicode, etc.) and have written severl scripts to test/verifiy my understanding. ...
6
by: Oliver Spiesshofer | last post by:
Hi, I would like to replace all strings in a table with regexp: the strings contain the substring "-na", and I would like to replace the whole table field with the original content but without...
5
by: Damien | last post by:
Hi to all, I would like to get field values from a multi-line text. For instance : " Name: Mr Smith Address: Paris " I've tried to use (I'm a total regexp newbie, so I took it from the...
8
by: Dmitry Korolyov | last post by:
ASP.NET app using c# and framework version 1.1.4322.573 on a IIS 6.0 web server. A single-line asp:textbox control and regexp validator attached to it. ^\d+$ expression does match an empty...
7
by: arno | last post by:
Hi, I want to search a substring within a string : fonction (str, substr) { if (str.search(substr) != -1) { // do something } }
3
by: jgarrard | last post by:
Hi, I have an array of strings which are regular expressions in the PERL syntax (ie / / delimeters). I wish to create a RegExp in order to do some useful work, but am stuck for a way of...
6
by: Pat | last post by:
I have a regexp in Perl that converts the last digit of an ip address to '9'. This is a very particular case so I don't want to go off on a tangent of IP octets. ( my $s = $str ) =~...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.