473,785 Members | 2,738 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

utf8 in regexp (perl 5.8.1)

I have a file containing thousands of Spanish words, encoded AFAIK)
in UTF-8. I also have a perl script in UTF-8, which says (hope
pasting works):

#!/usr/bin/perl -w -CSD
#
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
use warnings;
use strict;
use utf8;

while (<>)
{
print if ( /ñ/ )
}

What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.

The script is intended to find all words containing that
letter. But it finds nothing. After wading through gallons
of text (man encoding, man utf8, man perlunicode, etc.),
I still had no reason to think it was wrong. But I added

use encoding "utf8";

and ran it again, getting only:

Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.

Have I found a bug in perl or is my ignorance just getting
the best of me?

Oh, yeah, I also tried a few things with 'binmode' that didn't
work either.

WWG
Jul 19 '05 #1
1 7871
Wes Groleau wrote:
[problems with]
use utf8;

while (<>)
{
print if ( /ñ/ )
}
I removed "use utf8" and it worked. So I think it's
a bug, especially since
use encoding "utf8";
caused
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.


--
Wes Groleau

A pessimist says the glass is half empty.

An optimist says the glass is half full.

An engineer says somebody made the glass
twice as big as it needed to be.
Jul 19 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
18332
by: Spamtrap | last post by:
I only work in Perl occasionaly, and have been searching for a solution for a conversion, and everything I found seems much too complex. All I need to do is take a simple text file and copy it, however some specific lines are in fact in UTF8 as printed garbagy characters and they need to be converted to Unicode, so that the new text file can be imported into a desktop program and into some Word documents. For the moment I would be...
1
27726
by: Vlajko Knezic | last post by:
Not so sure what is going on here but is something to do with the way UTF8 is handled in Perl and/or LibXML The sctript below: - accepts a value from a form text field; - builds XML document around it,
1
3975
by: ryang | last post by:
I am trying to understand how to work with Unicode in Perl. I have read the relevant man pages (perluniintro, perlunicode, etc.) and have written severl scripts to test/verifiy my understanding. However, I created a script that has unexpected output. The script is below and it contains some UTF-8 encoded characters which represent all five Spanish accented vowels plus the enye (n with a tilde over it) in upper and lower case. I hope...
6
10543
by: Oliver Spiesshofer | last post by:
Hi, I would like to replace all strings in a table with regexp: the strings contain the substring "-na", and I would like to replace the whole table field with the original content but without the substring "- na". I know already how to find the lines that contain the "-na" with LIKE and/or REGEXP, but i dont know how to replace it.
5
4157
by: Damien | last post by:
Hi to all, I would like to get field values from a multi-line text. For instance : " Name: Mr Smith Address: Paris " I've tried to use (I'm a total regexp newbie, so I took it from the doc):
8
2034
by: Dmitry Korolyov | last post by:
ASP.NET app using c# and framework version 1.1.4322.573 on a IIS 6.0 web server. A single-line asp:textbox control and regexp validator attached to it. ^\d+$ expression does match an empty string (when you don't enter any values) - this is wrong d+ expression does not match, for example "g24" string - this is also wrong www.regexplib.com test validator works fine for both cases, i.e. it is reporting "not match" for the...
7
1910
by: arno | last post by:
Hi, I want to search a substring within a string : fonction (str, substr) { if (str.search(substr) != -1) { // do something } }
3
1623
by: jgarrard | last post by:
Hi, I have an array of strings which are regular expressions in the PERL syntax (ie / / delimeters). I wish to create a RegExp in order to do some useful work, but am stuck for a way of getting these strings into the RegExp object. The RegExp constructor seems to want two parameters - the non / delimited expression, and the global modifiers.
6
2986
by: Pat | last post by:
I have a regexp in Perl that converts the last digit of an ip address to '9'. This is a very particular case so I don't want to go off on a tangent of IP octets. ( my $s = $str ) =~ s/((\d+\.){3})\d+/${1}9/ ; While I can do this in Python which accomplishes the same thing: ip = ip ip =+ '9'
0
9480
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10319
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10147
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10087
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
6737
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5380
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5511
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4046
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
2877
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.