Unicode pattern matching

P: 5
I have just delved into the world of Unicode versus Latin1. I've written a quick program (ignore some of the messy code; it's been a work in progress) that traps strings of text containing Unicode or, more aptly, traps only those lines that Perl cannot translate to Latin1. My goal was to trap each such string and then substitute the Unicode character(s) in it with something to my liking from the Latin1 character set. For example, when Perl tries to translate the right accent in Unicode, it says there is no direct translation; I want to translate that character to Latin1 character 39, the apostrophe.

The subroutine (created for ease of reading) is where that action takes place. What I want to do is substitute the Unicode character, using the hex value Perl gives me in its error message, with another character. However, Perl is not finding the hex value within the string, even though it is itself reporting that value, so no substitution happens. If I use the commented-out line, Perl puts question marks in place of the characters it does not recognize as having a literal translation to Latin1. Thanks for your time.

#use Encode;
use Encode 'from_to';
#use Encode qw (:fallbacks);

$infile = "\\hayter\\raw\\2007-00a\\Mod_extract.xml";
$outfile = "rich.txt";

print "Open IN ... ";
open (IN, "< $infile") || die "Could not open $infile: $!\n";
print "Done\n";
print "Open OUT ... ";
open (OUT, "> $outfile") || die "Could not open $outfile: $!\n";
print "Done\n";
print "Processing file ";
$data = <IN>;
@datalines = split (/\>/, $data);
close (IN);
open (XML, "> test.xml");
foreach $dataline (@datalines) {
    $line++;
    print XML "$dataline\>\n";
    eval {
        from_to ($dataline, "utf8", "iso-8859-1", 1);
    };
    if ($@) {
        $myerror = $@;
        $myerror =~ s/^.+\{(.+)\}.*$/$1/;
        $myerror =~ s/\s*$//;
        $myhex = $myerror;
        $myerror = "\\x\{" . hex($myerror) . "\}";
        $errors{$myerror}++;
        $errhex{$myerror} = $myhex;
        $errline{$myerror} = $line unless $errline{$myerror};
        &unicode_latin1_lax ($dataline); # Translate to something I want
    }
    print OUT "$dataline\>\n";
}
close (OUT);
close (XML);
print " Done\n\n";

if (%errors) {
    print "Printing errors ... ";
    open (ERR, "> unicode.err") || die "Could not open unicode.err: $!\n";
    foreach $error (sort keys %errors) {
        $decerr = $error;
        $decerr =~ s/\\//;
        $decerr =~ s/x//;
        $decerr =~ s/\{//;
        $decerr =~ s/\}//;
        print ERR "ERROR \($errors{$error}\): \"\\x\{$errhex{$error}\}\" does not map to iso-8859-1, DEC\. $decerr - EX\. line $errline{$error}\n";
    }
    close (ERR);
    print "Done\n\n";
}

sub unicode_latin1_lax {
    my ($dataline) = @_;
    $dataline =~ s/\x{2018}/\~/g; # Should translate unicode to what I want
    # from_to ($dataline, "utf8", "iso-8859-1", 0);
}

Jun 1 '07 #1
8 Replies


P: 5
Anyone? Can you refer me somewhere that might help?
Jun 2 '07 #2

KevinADC
Expert 2.5K+
P: 4,059
change this:

sub unicode_latin1_lax {
    my ($dataline) = @_;
    $dataline =~ s/\x{2018}/\~/g; # Should translate unicode to what I want
    # from_to ($dataline, "utf8", "iso-8859-1", 0);
}
to:

sub unicode_latin1_lax {
    my ($dataline) = @_;
    $dataline =~ s/\x{2018}/\~/g; # Should translate unicode to what I want
    # from_to ($dataline, "utf8", "iso-8859-1", 0);
    return($dataline);
}
and retry your script. By declaring $dataline with "my" in the subroutine, it's not visible outside the subroutine.
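A returned value only helps if the caller keeps it. The sketch below is a minimal illustration of that point, not code from the thread: the name replace_left_quote is a hypothetical stand-in for the subroutine, and the apostrophe is just an example replacement.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the thread's subroutine (assumed name).
sub replace_left_quote {
    my ($text) = @_;            # $text is a private copy of the argument
    $text =~ s/\x{2018}/'/g;    # only the copy is modified
    return $text;               # hand the modified copy back to the caller
}

my $dataline = "\x{2018}quoted";    # a character string containing U+2018

replace_left_quote($dataline);      # return value discarded: $dataline is unchanged
my $still_original = ord(substr($dataline, 0, 1)) == 0x2018 ? 1 : 0;

my $fixed = replace_left_quote($dataline);    # return value captured

print "original untouched: $still_original\n";
print "captured result: $fixed\n";
```

In the original loop this would mean writing $dataline = unicode_latin1_lax($dataline); rather than calling the subroutine in void context.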
Jun 2 '07 #3

P: 5
Thanks Kevin, amateur mistake, and I've made the correction. However, in the long run, that's not the problem. I've fixed that and re-run the script, and I'm still not changing the Unicode characters to something else. Any other thoughts?
Jun 4 '07 #4

KevinADC
Expert 2.5K+
P: 4,059
what is this regexp supposed to do?

$dataline =~ s/\x{2018}/\~/g;
Jun 4 '07 #5

P: 5
It's supposed to pattern match and find the hex value in the string. From the O'Reilly book...

*****************************
3rd Edition, Programming Perl, Page 164
*****************************
\x{LONGHEX}
\xHEX
A character number specified as one or two hex digits ([0-9a-fA-F]), as in \x1B. The one-digit form is usable only if the character following it is not a hex digit. If braces are used, you may use as many digits as you'd like, which may result in a Unicode character. For example, \x{262f} matches a Unicode YIN YANG.
*****************************
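One caveat that may matter here: \x{2018} in a pattern matches a single character with that code point, so it can only match a string that has already been decoded into characters. Data read from a file without an encoding layer is a string of bytes, where U+2018 appears as the three UTF-8 bytes E2 80 98. The sketch below shows the difference; it is a possible explanation of the failed substitution, not something confirmed in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+2018 as it sits in a UTF-8 file: three separate bytes.
my $bytes = "\xE2\x80\x98";

# After decoding, the same data is one character with code point 0x2018.
my $chars = decode('UTF-8', $bytes);

my $bytes_match = ($bytes =~ /\x{2018}/) ? 1 : 0;
my $chars_match = ($chars =~ /\x{2018}/) ? 1 : 0;

printf "bytes: length %d, match %d\n", length($bytes), $bytes_match;
printf "chars: length %d, match %d\n", length($chars), $chars_match;
```

Since from_to works on bytes in place, and fails before changing anything when the check argument is set, the string handed to the substitution is still undecoded bytes.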

The information within the braces is the hex data from Perl itself. I told Perl to translate the Unicode to Latin1; however, there are some characters that do not translate according to Perl, and when this happens Perl produces an error with the hex value of the Unicode character that will not translate. I wanted to take that information and do a substitution of my own, and that is how the statement you're asking about was created. So I'm trying to take that Unicode hex value, which resides in a text file, and just convert it to a Latin1 character of my choice. My understanding from the above info from the O'Reilly book is that I could use that to find and substitute that particular Unicode character.

Here's the information Perl is giving me directly when attempting to translate one of the lines from the text file....

"\x{2018}" does not map to iso-8859-1 at C:/Perl_58/site/lib/Encode.pm line 183.

And here is my edit of the output...
ERROR (578): "\x{2018}" does not map to iso-8859-1, DEC. 8216 - EX. line 123

Here, 578 is how many occurrences there are in the text file, \x{2018} is the value Perl gives, 8216 is the decimal equivalent of the hex value, and 123 is a line in the text that contains an example.
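For the goal described above, one approach is to split from_to into its two halves: decode the UTF-8 bytes into characters, substitute the characters that have no Latin1 equivalent, then encode to iso-8859-1. The following is a minimal sketch under assumptions: the name utf8_to_latin1_lax and the replacement table %latin1_for are hypothetical, and the table covers only four curly quotes as an example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical replacement table: unmappable character => Latin1 stand-in.
my %latin1_for = (
    "\x{2018}" => "'",    # LEFT SINGLE QUOTATION MARK
    "\x{2019}" => "'",    # RIGHT SINGLE QUOTATION MARK
    "\x{201C}" => '"',    # LEFT DOUBLE QUOTATION MARK
    "\x{201D}" => '"',    # RIGHT DOUBLE QUOTATION MARK
);

# Hypothetical helper: takes UTF-8 bytes, returns iso-8859-1 bytes.
sub utf8_to_latin1_lax {
    my ($octets) = @_;

    # Bytes -> characters, so that \x{...} in the pattern can match.
    my $text = decode('UTF-8', $octets);

    # Swap each listed character for its chosen stand-in.
    $text =~ s/([\x{2018}\x{2019}\x{201C}\x{201D}])/$latin1_for{$1}/g;

    # Characters -> Latin1 bytes; dies if an unlisted unmappable character remains.
    return encode('iso-8859-1', $text, Encode::FB_CROAK());
}

my $latin1 = utf8_to_latin1_lax("\xE2\x80\x98caf\xC3\xA9\xE2\x80\x99");
printf "%d bytes\n", length($latin1);
```

Keeping FB_CROAK on the final encode means any character missing from the table still raises the "does not map to iso-8859-1" error, so the table can be extended as new cases turn up.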
Jun 5 '07 #6

KevinADC
Expert 2.5K+
P: 4,059
sorry mate, I don't know.
Jun 5 '07 #7

P: 5
Anyone know another forum that might help? I'm having little success in getting help with this issue. Anything you can do is appreciated.
Jun 20 '07 #8

KevinADC
Expert 2.5K+
P: 4,059
try perlmonks:

www.perlmonks.com
Jun 20 '07 #9
