473,322 Members | 1,610 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,322 software developers and data experts.

Languid - Language Guesser

2
Hello,
I am trying to run Languid language guesser - see http://languid.cantbedone.org/

I am not a perl programmer, and I have no idea how to get this script running. I am running perl scripts on my computer but Im not sure how to run this one. Part of the main script is called Guess.pm and its code is shown below -

Can someone please explain how I run this script.
Thanks, Bob


package Language::Guess;

use strict;
use warnings;
require 5.008;
use Data::Dumper qw/Dumper/;
use Unicode::UCD 'charinfo';
use utf8;
use Encode qw/is_utf8 _utf8_on/;
use Unicode::Normalize; # don’t trust your input!
use File::Spec::Functions;
use Carp;

our $VERSION = '0.03';
our $MIN_LENGTH = 20;

=head1 NAME

Language::Guess

=head1 ABSTRACT

A statistical language guesser

=head1 SYNOPSIS

use Language::Guess;

my $guesser = Language::Guess->new( modeldir => '~/train' );

while (my $line = <> ) {
my $lang = $guesser->simple_guess($line);
print "Language was $lang\n\n";
}

=cut

#sub init {

our $MAX = 300;

our @BASIC_LATIN = qw/English cebuano hausa somali pig_latin klingon indonesian
hawaiian welsh latin swahili basque/;
our @EXOTIC_LATIN = qw/Czech Polish Croatian Romanian Slovak Slovene Turkish Hungarian
Azeri Lithuanian Estonian/;
our @ACCENTED_LATIN = (qw/Albanian catalan Spanish French German Dutch Italian Danish
Icelandic Norwegian Swedish Finnish Latvian Portuguese
/, @EXOTIC_LATIN);

our @ALL_LATIN = ( @BASIC_LATIN, @EXOTIC_LATIN, @ACCENTED_LATIN );

our @CYRILLIC = qw/Russian Ukrainian Belarussian Kazakh Uzbek Mongolian
Serbian Macedonian Bulgarian Kyrgyz/;
our @ARABIC = qw/Arabic Farsi Jawi Kurdish Pashto Sindhi Urdu/;
our @DEVANAGARI = qw/Bhojpuri Bihari Hindi Kashmiri Konkani Marathi Nepali
Sanskrit/;

our @SINGLETONS = qw/Armenian Hebrew Bengali Gurumkhi Greek Gujarati Oriya
Tamil Telugu Kannada Malayalam Sinhala Thai Lao Tibetan
Burmese Georgian Mongolian/;

#}

binmode STDOUT, 'utf8';
binmode STDERR, 'utf8';

sub new {
my ( $class, %params ) = @_;
croak "Must provide a model directory" unless exists $params{modeldir};
croak "Model directory does not exist" unless -d $params{modeldir};

my $self = bless { %params }, $class;
return $self;
}


sub guess {
my ( $self,$string ) = @_;
#warn $string;
_utf8_on($string);
$self->load_models() unless defined $self->{models};
my @runs = find_runs( $string );
#warn "Found ", scalar @runs, " runs\n";
#warn $runs[0][1];
my @langs;
my %scripts;
foreach my $run ( @runs ) {
$scripts{$run->[1]}++;
}

# returns arrayref of hashes in the form
# [ { name => NAME, score => SCORE }]

return $self->identify( $string, %scripts );

}

sub simple_guess {
my ( $self, $string ) = @_;
my $got = $self->guess($string);
#warn Dumper($got);
return $got->[0]{name};
}


sub load_models {
my ( $self ) = @_;

opendir my $dh, $self->{modeldir} or die "Unable to open dir:$!";
my %models;
while ( my $f = readdir $dh ) {
next unless $f =~ /\.train$/;
my ( $name ) = $f =~ m|(.*)\.|;
my $path = catfile( $self->{modeldir}, $f );
open my $fh, "<:utf8", $path or die "Failed to open file: $!";
my %model;
while ( my $line = <$fh> ) {
chomp $line;
my ( $k, $v) = $line =~ m|(.{3})\s+(.*)|;
next unless defined $k;
#warn "'$k' $v\n";
$model{$k} = $v;
}
$models{$name} = \%model;
}
$self->{models} = \%models;
}

=item find_runs STRING

This is unused for the moment; the subroutine finds runs of scripts in a string
and returns an array of them. Upgrades basic latin pieces to accented and
exotic latin if characters from those script blocks are found. This avoids
languages like Polish from being split into a thousand runs of two and three
basic latin characters, interspersed with accented.

=cut

sub find_runs {
my ( $raw ) = @_;

my @chars = split m//, $raw;

my $prev = '';
my @c;
my @runs;
my @run_types;
my $current_run = -1;

foreach my $c ( @chars ) {
my $is_alph = $c =~ /[[:alpha:]]/o;
my $inf = get_charinfo( $c );
if ( $is_alph and !( $inf->{block} eq $prev) ) {
$prev = $inf->{block};
@c = ();
$current_run++;
$run_types[$current_run] = $prev;
}
push @c, $c;
push @{ $runs[$current_run] }, $c if $current_run > -1;
}

my ( $newruns, $newtypes ) = reconcile_latin( \@runs, \@run_types );


my $counter =0;
my @result;
foreach my $r ( @$newruns ) {
push @result, [ $r, $newtypes->[$counter]];
$counter++;
}
return @result;
}

{ my %cache;
sub get_charinfo {
my ( $char ) = @_;
return $cache{$char} if exists $cache{$char};
my $inf = charinfo( ord( $char ));
$cache{$char} = $inf;
}
}


=item reconcile_latin STRING, ARREF

internal method, attempts to pick which level of weird diacriticalness
a latin string has. Consolidates runs into one string.

=cut

sub reconcile_latin {
my ( $runs, $types ) = @_;
my @types = @$types;
my (@new_runs, @new_types);
my $last_type = '';

my $upgrade;
$upgrade = 'Accented Latin' if has_supplemental_latin( @$types );
$upgrade = 'Exotic Latin' if has_extended_latin( @$types );
$upgrade = 'Superfreak Latin' if has_latin_extended_additional( @$types );

return ( $runs, $types ) unless $upgrade;
my $run_count = -1;
foreach my $r ( @$runs ) {
my $type = shift @types;
$type = $upgrade if $type =~ /Latin/;
$run_count++ unless $type eq $last_type;

push @{$new_runs[$run_count]}, @$r;
$new_types[$run_count] = $type;
$last_type = $type;
}
return ( \@new_runs, \@new_types );
}



sub has_extended_latin {
my ( @types ) = @_;
return scalar grep { /Latin Extended-A/ } @types;
}

sub has_supplemental_latin {
my ( @types ) = @_;
return scalar grep { /Latin-1 Supplement/ } @types;
}

sub has_latin_extended_additional {
my ( @types ) = @_;
return scalar grep { /Latin Extended Additional/ } @types;
}


sub identify {
my ( $self, $sample, %scripts ) = @_;
#warn "Incoming scripts are ", join ", ", keys %scripts;

return [{ name => 'too short', score => 1 }] if length($sample) < 3;
return [{ name => "Swedish Chef", score => 1}] if $sample =~ /bork bork bork/i;
return [{ name => "Pacman", score => 1}] if $sample =~ /waka waka waka/i;



# Check for Korean
if ( exists $scripts{'Hangul Syllables'} or
exists $scripts{'Hangul Jamo'} or
exists $scripts{'Hangul Compatibility Jamo'} or
exists $scripts{'Hangul'}) {
return [{ name =>'korean', score => 1 }];
}
if ( exists $scripts{'Greek and Coptic'} ){

return [{ name =>'greek', score => 1 }];
}

if ( exists $scripts{'Katakana'} or
exists $scripts{'Hiragana'} or
exists $scripts{'Katakana Phonetic Extensions'}) {
return [{ name =>'japanese', score => 1 }];
}


if ( exists $scripts{'CJK Unified Ideographs'} or
exists $scripts{'Bopomofo'} or
exists $scripts{'Bopomofo Extended'} or
exists $scripts{'KangXi Radicals'} or
exists $scripts{'Arabic Presentation Forms-A'} ) {
return [{ name => 'chinese', score => 1 }];
}

if ( exists $scripts{'Cyrillic'} ) {
return $self->check( $sample, @CYRILLIC );
}


if ( exists $scripts{'Arabic'} or
exists $scripts{'Arabic Presentation Forms-A'} or
exists $scripts{'Arabic Presentation Forms-B'}
){
return $self->check( $sample, @ARABIC );
}

if ( exists $scripts{'Devanagari'} ) {
return $self->check( $sample, @DEVANAGARI );
}


# Try languages with unique scripts
foreach my $s ( @SINGLETONS ) {
return [{ name => lc($s), score => 1 }] if exists $scripts{$s};
}

if ( exists $scripts{'Superfreak Latin'} ) {
return [{ name => 'vietnamese', score => 1 }];
}

if ( exists $scripts{'Exotic Latin'} ) {
return $self->check( $sample, @EXOTIC_LATIN );
}

if ( exists $scripts{'Accented Latin'} ) {
return $self->check( $sample, @ACCENTED_LATIN );
}


if ( exists $scripts{'Basic Latin'} ) {
return $self->check( $sample, @ALL_LATIN );
}

return [{ name => "unknown script: '".(join ", ", keys %scripts)."'", score => 1}];

}


sub check {
my ( $self, $raw, @langs ) = @_;
#return join ' ', @langs
#warn "Checking sample $sample", "\n";
#my $num_tri = length( $sample ) / 3;

my $sample = __normalize($raw);
return { name => 'too short', score => 1 } if length($sample) < $MIN_LENGTH;
my $mod = __make_model( $sample );
my $num_tri = scalar keys %{$mod};
my %scores;
foreach my $key ( @langs ) {
my $l = lc( $key );
#warn "Checking $l\n";
next unless exists $self->{models}{$l};
my $score = __distance( $mod, $self->{models}{$l} );
$scores{$l} = $score;
}
my @sorted = sort { $scores{$a} <=> $scores{$b} } keys %scores;
my @out;
$num_tri ||=1;
foreach my $s ( @sorted ) {
my $norm = $scores{$s}/$num_tri;
push @out, { name => $s , score => int($norm) };
}
return [splice ( @out, 0, 4 )];

if ( @sorted ) {
return splice ( @sorted, 0, 4 );
my @all;
my $firstscore = $scores{$sorted[0]};
while ( my $next = shift @sorted ) {
last unless $scores{$next} == $firstscore;
push @all, $next;
}
return join ',', @all;
}
return { name => 'unknown'. ( join ' ', @langs), score =>1 };
}


sub __distance {
my ( $m1, $m2 ) = @_;
my $dist =0;
foreach my $k ( keys %{$m1} ) {
$dist +=
( exists $m2->{$k} ?
abs( $m2->{$k} - $m1->{$k} ) :
$MAX
);
}
return $dist;
}
Sep 13 '06 #1
0 1980

Sign in to post your reply or Sign up for a free account.

Similar topics

0
by: Thiva Charanasri | last post by:
http://www.poweroflanguage.org Track: Computer Language 1st World Congress on the Power of Language: Theory, Practice and Performance Date: March 6 - 10, 2006 Bangkok, Thailand On this...
22
by: Michael Nahas | last post by:
Antti & all interested, The draft description of my language to replace C is available at: http://nahas.is-a-geek.com/~mike/MyC.pdf I am a long time C programmer (I read the old testament...
134
by: evolnet.regular | last post by:
I've been utilising C for lots of small and a few medium-sized personal projects over the course of the past decade, and I've realised lately just how little progress it's made since then. I've...
0
by: Thiva Charanasri | last post by:
http://www.poweroflanguage.org Track: Computer Language 1st World Congress on the Power of Language: Theory, Practice and Performance Date: March 6 - 10, 2006 Bangkok, Thailand On this...
2
by: rjack | last post by:
I'm using VS 2005 Beta 2. In VS 2003, the Page directive in an aspx page has Language and CodeBehind attributes. You can have the language be different than the code behind file language. For...
35
by: John Coleman | last post by:
Greetings, I have a rough classification of languages into 2 classes: Zen languages and tool languages. A tool language is a language that is, well, a *tool* for programming a computer. C is the...
23
by: Xah Lee | last post by:
The Concepts and Confusions of Pre-fix, In-fix, Post-fix and Fully Functional Notations Xah Lee, 2006-03-15 Let me summarize: The LISP notation, is a functional notation, and is not a...
669
by: Xah Lee | last post by:
in March, i posted a essay “What is Expressiveness in a Computer Language”, archived at: http://xahlee.org/perl-python/what_is_expresiveness.html I was informed then that there is a academic...
10
by: Immortalist | last post by:
Various aquisition devices that guide learning along particular pathways towards human biases. And as E.O. Wilson might say mental development appears to be genetically constrained. (1) Language...
0
by: DolphinDB | last post by:
Tired of spending countless mintues downsampling your data? Look no further! In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records to 61 million...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
1
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
1
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
1
by: Shćllîpôpď 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.