
Erroneous Text Extraction using HTML::Parser

Hello,
I am using HTML::Parser to extract text from HTML pages at
http://bbc.co.uk/urdu/

However, the encoding of the input text seems to change to some
unknown encoding in the output.

The program is given below. The HTML is in a string to keep the
example simple. The same problem appears with HTML in a file.

#################################################################
use HTML::Parser;

# set standard output to utf8
binmode(STDOUT, ":utf8");

# Create parser object
my $p = HTML::Parser->new( api_version => 3, text_h => [\&text, "text"] );

# parse UTF-8 encoded Arabic text
$p->parse( "<html> <body>
پاکستان </body> </html>");

sub text
{
    my ($txt) = @_;
    print $txt;
}
#################################################################

Also, I am unable to pinpoint the problem by looking at the
parser source code, because HTML/Parser.pm doesn't seem to contain any
code that does the real parsing work.

Thank You
Himanshu.
Jul 19 '05 #1
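
A likely cause, sketched here under assumptions rather than confirmed against the poster's setup: without "use utf8", the string literal reaches the parser as raw UTF-8 bytes, the parser hands those bytes back unchanged, and the ":utf8" layer on STDOUT then encodes each byte a second time. Decoding the input into Perl characters before parsing should avoid the double encoding. The names @chunks, $bytes and $html below are illustrative and not from the original post. On the second point of the question, the parsing itself is done in compiled XS/C code that ships with the distribution, which is why Parser.pm looks so thin.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser;

# Encode Perl characters as UTF-8 exactly once, on output.
binmode(STDOUT, ":encoding(UTF-8)");

# Collected text chunks (hypothetical helper variable, kept so the
# result can be inspected after parsing).
my @chunks;

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, "text" ],
);

# There is no "use utf8" in this script, so this literal is a string of
# raw UTF-8 bytes, much like data read from a file or socket without a
# decoding layer.
my $bytes = "<html> <body>\nپاکستان </body> </html>";

# Turn the bytes into Perl characters before handing them to the parser.
my $html = decode("UTF-8", $bytes);

$p->parse($html);
$p->eof;    # signal end of input so any buffered text is flushed

print @chunks;
```

For HTML read from a file, opening the handle with a "<:encoding(UTF-8)" layer should have the same effect as the explicit decode call. Alternatively, dropping the binmode line and passing bytes straight through would also avoid the double encoding, at the cost of working with undecoded strings.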
