decoding numeric HTML entities

Andreas Gohr

Hi all!

I need a way to decode numeric HTML entities (like &#220;) back to
their UTF-8 character to place them into a textarea. I tried the
following but it doesn't work in IE.

data = data.replace(/&#(\d+);/g,
function() {
return String.fromCharCode(RegExp.$1);
});

Has anyone a crossbrowser solution?

Regards

Jul 23 '05 #1

fox

Andreas Gohr wrote:

Hi all!

I need a way to decode numeric HTML entities (like &#220;) back to
their UTF-8 character to place them into a textarea. I tried the
following but it doesn't work in IE.

data = data.replace(/&#(\d+);/g,
function() {
return String.fromCharCode(RegExp.$1);
});

try:

function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
}
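
Put together, the suggestion above might look like the following sketch. The helper name decodeNumericEntities is made up for illustration, and the surrogate-pair branch is an addition not discussed in this thread: String.fromCharCode works on 16-bit code units, so entity values above 65535 would otherwise come out wrong.

```javascript
// Hypothetical helper; the name and the range checks are assumptions.
function decodeNumericEntities(text) {
    return text.replace(/&#(\d+);/g, function (wholeMatch, digits) {
        var code = +digits; // unary plus converts the captured string to a number
        if (code > 0x10FFFF) {
            return wholeMatch; // not a valid code point, leave the entity untouched
        }
        if (code > 0xFFFF) {
            // Code points above the BMP need a UTF-16 surrogate pair.
            code -= 0x10000;
            return String.fromCharCode(0xD800 + (code >> 10), 0xDC00 + (code & 0x3FF));
        }
        return String.fromCharCode(code);
    });
}

var decoded = decodeNumericEntities("&#220;ber"); // "Über"
```

This still depends on replace() accepting a function as its second argument, so it is only as portable as that feature.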

Jul 23 '05 #2

Andreas Gohr

It works! You saved my day :-)

Just for my understanding: What does the plus sign do? Is it a typo and
just happens to work or does it do some magic?

Thanks again
Andi

Jul 23 '05 #3

fox

Andreas Gohr wrote:

It works! You saved my day :-)

Just for my understanding: What does the plus sign do? Is it a typo and
just happens to work or does it do some magic?

The parenthetic match from the regex is *usually* interpreted as a
string value. In JavaScript, because data types are "flexible", using
the plus sign (a unary operator) removes any ambiguity as to the type of
the value passed -- if the characters are all digits, then a number will
be interpreted (otherwise, you'll receive a "NaN" result, so, in most
cases, care should be taken to make sure that the string will be all
digits). The behavior of this technique is different than using parseInt
which *will* return a numerical value if a string *starts* with digit
characters:

parseInt("123 Rue Morgue") => 123

[ comparatively:
+("123 Rue Morgue") => NaN
]
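
As a small sketch of that difference, the snippet below runs both conversions on the same input. The function name convertBothWays is made up for illustration, and the empty-string case is an extra example that is not taken from the post.

```javascript
// Hypothetical helper that applies both conversions to one string.
function convertBothWays(text) {
    return {
        unaryPlus: +text,           // the whole string must be a numeric literal
        parsed: parseInt(text, 10)  // stops at the first character that is not a digit
    };
}

var entity = convertBothWays("220");
// entity.unaryPlus is 220, entity.parsed is 220

var street = convertBothWays("123 Rue Morgue");
// street.unaryPlus is NaN, street.parsed is 123

var empty = convertBothWays("");
// empty.unaryPlus is 0, empty.parsed is NaN
```

For the regex capture in this thread the two agree, because \d+ guarantees a string of digits only.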

Logically, it would seem that using a unary operator (+) is a faster
conversion than parseInt (which examines every character to test whether
a digit or not) -- I've never benchmarked it.

Realize also, that in using the unary operator, it *MUST* appear
immediately adjacent to the value it is being applied to:

var aNumAsString = "23";
var asum = 15 + +aNumAsString;
^no space here

otherwise, it will be interpreted as a concatenation or addition
operator (or a syntax error, as in 1 + + 3)

'+' is one of the most "overloaded" operators in the language.

Jul 23 '05 #4

Michael Winter

On 11/06/2005 01:14, fox wrote:

[snip]

the parenthetic match from the regex is *usually* interpreted as a
string value.

It should always be a string value as regular expressions only operate
on strings (other types are converted, first).

[snip]
The behavior of this technique is different than using parseInt
which *will* return a numerical value if a string *starts* with digit
characters:

parseInt("123 Rue Morgue") => 123

Though in most cases, the parseInt function should be used with its
radix argument:

parseInt('123 Rue Morgue', 10)
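
A short sketch of why the radix matters. The variable names are illustrative only; the note about leading zeros describes older engines and is an assumption about whichever browser is in use.

```javascript
var hexDefault = parseInt("0x1F");        // no radix: the 0x prefix selects base 16
var hexAsDecimal = parseInt("0x1F", 10);  // radix 10: parsing stops at the "x"
var binary = parseInt("101", 2);          // digits read as base 2
// hexDefault is 31, hexAsDecimal is 0, binary is 5

// With an explicit radix a leading zero is harmless. Without one,
// some older engines treated "010" as octal and returned 8.
var leadingZero = parseInt("010", 10);
// leadingZero is 10
```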

[snip]
Logically, it would seem that using a unary operator (+) is a faster
conversion than parseInt (which examines every character to test whether
a digit or not)

Both examine characters, but unary plus neither includes a function
call, nor does it have such a complicated algorithm.

Realize also, that in using the unary operator, it *MUST* appear
immediately adjacent to the value it is being applied to:

Nonsense.

[snip]

Mike

--
Michael Winter
Replace ".invalid" with ".uk" to reply by e-mail.

Jul 23 '05 #5

Michael Winter

On 11/06/2005 00:04, fox wrote:

Andreas Gohr wrote:

[snip]

data = data.replace(/&#(\d+);/g,
function() {
return String.fromCharCode(RegExp.$1);
});

try:

function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
}

Or

function() {
return String.fromCharCode(arguments[1]);
}

Has anyone a crossbrowser solution?

Despite what you might think from this thread, there isn't one really.
The String.prototype.replace method is broken or inadequate in some
browsers (including earlier IE versions). The behaviour of the replace
method can be examined and reimplemented in script, but I haven't done
that as yet.
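
One way to find out whether a given browser handles a function argument is to test for it before relying on it. This is only a sketch: the function name replaceAcceptsFunction is made up, and it assumes that an implementation without the feature converts the function to a string and inserts that text instead.

```javascript
// Hypothetical feature test for replace() with a function argument.
function replaceAcceptsFunction() {
    var result = "a".replace(/a/, function () {
        return "b";
    });
    // A conforming implementation calls the function and yields "b".
    // Anything else means the function was not executed as a callback.
    return result === "b";
}

var canUseCallback = replaceAcceptsFunction();
```

If the test returns false, the script can fall back to a loop that does not depend on the callback form.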

Mike

--
Michael Winter
Replace ".invalid" with ".uk" to reply by e-mail.

Jul 23 '05 #6

Andreas Gohr

Thank you all for your help and clarification of the + operator :-)

About the crossbrowser replace solution: I'm happy as it works in all
browsers needed. It's used with XMLHttpRequest so it concerns modern
browsers only anyway (Firefox, Opera, IE and Konqueror tested so far).

For the curious: it now powers the Spellchecker in DokuWiki
http://wiki.splitbrain.org/wiki:spell_checker

Regards
Andi

Jul 23 '05 #7

Dr John Stockton

JRS: In article <d8**********@news.datasync.com>, dated Fri, 10 Jun
2005 19:14:06, seen in news:comp.lang.javascript, fox
<sp******@fxmahoney.com> posted :

In JavaScript, because data types are "flexible", using
the plus sign (a unary operator) removes any ambiguity as to the type of
the value passed -- if the characters are all digits, then a number will
be interpreted (otherwise, you'll receive a "NaN" result, so, in most
cases, care should be taken to make sure that the string will be all
digits).

The characters can also be well-placed
space tab + - e E x X a A b B c C d D e E f F
though the last 12 are there digits too.
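
A few concrete cases, as a sketch. The sample strings and the helper name allConvertAsExpected are chosen for illustration and are not from the post.

```javascript
// Strings that unary plus accepts, paired with the expected number.
var accepted = [
    [" 42\t", 42],   // surrounding whitespace is ignored
    ["+3", 3],       // leading sign
    ["-7", -7],
    ["1e3", 1000],   // exponent
    ["0x1F", 31],    // hexadecimal prefix and digits
    ["0XAB", 171]
];

// Strings where the same characters are badly placed; each gives NaN.
var rejected = ["4 2", "1e", "0x", "12px"];

// Hypothetical checker: true when every sample behaves as listed.
function allConvertAsExpected() {
    for (var i = 0; i < accepted.length; i++) {
        if (+accepted[i][0] !== accepted[i][1]) {
            return false;
        }
    }
    for (var j = 0; j < rejected.length; j++) {
        if (!isNaN(+rejected[j])) {
            return false;
        }
    }
    return true;
}
```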

Logically, it would seem that using a unary operator (+) is a faster
conversion than parseInt (which examines every character to test whether
a digit or not) -- I've never benchmarked it.

Surely the unary operator will fail on reaching the first unacceptable
character, whereas parseInt will succeed at the same point. However, it
is the mechanism for looking up what to do, rather than the doing of it,
which may take more time.

Realize also, that in using the unary operator, it *MUST* appear
immediately adjacent to the value it is being applied to:

var aNumAsString = "23";
var asum = 15 + +aNumAsString;
^no space here

otherwise, it will be interpreted as a concatenation or addition
operator (or a syntax error, as in 1 + + 3)

Untrue. 1 + + 3 is 4, as is 1 + - - + 3, at least for me.

However, ISTM that one cannot have two contiguous instances of the same
one of + - but 1+-+-+-+3 happily gives -2. Note that +"+3" is OK.
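
These cases can be checked directly. The sketch below uses a made-up helper, parses, which relies on the Function constructor to report whether a piece of source text is accepted; the likely reason two adjacent identical signs fail is that they are read as the single ++ or -- token.

```javascript
var spaced = 1 + + 3;       // binary plus followed by unary plus
var mixed = 1 + - - + 3;    // the two minus signs cancel out
var chain = 1+-+-+-+3;      // three negations leave -3, so the sum is -2
var signedString = +"+3";   // a sign inside the string is fine
// spaced is 4, mixed is 4, chain is -2, signedString is 3

// Hypothetical helper: true when the source text is accepted by the engine.
function parses(source) {
    try {
        new Function(source);
        return true;
    } catch (e) {
        return false;
    }
}

var adjacentPlus = parses("return 1 ++ 3;");    // ++ is read as one token
var separatedPlus = parses("return 1 + + 3;");
// adjacentPlus is false, separatedPlus is true
```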

--
© John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 IE 4 ©
<URL:http://www.jibbering.com/faq/> JL/RC: FAQ of news:comp.lang.javascript
<URL:http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
<URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.

Jul 23 '05 #8

fox

Dr John Stockton wrote:

Realize also, that in using the unary operator, it *MUST* appear
immediately adjacent to the value it is being applied to:

var aNumAsString = "23";
var asum = 15 + +aNumAsString;
^no space here

otherwise, it will be interpreted as a concatenation or addition
operator (or a syntax error, as in 1 + + 3)

Untrue. 1 + + 3 is 4, as is 1 + - - + 3, at least for me.

However, ISTM that one cannot have two contiguous instances of the same
one of + - but 1+-+-+-+3 happily gives -2. Note that +"+3" is OK.

It appears that I have carried my stricter "C upbringing" into
JavaScript... I apologize for the misstatement (w/r/t JavaScript, that
is). When you know more than a few languages, sometimes the lines are
blurred.

Jul 23 '05 #9

Andreas Gohr

Hi again!

I declared success too early :-(

This works everywhere - except in Safari:

data = data.replace(/&#(\d+);/g,
function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
});

Safari treats this as a simple string instead of executing the function
:-( It works in Konqueror (I thought they both use the same renderer!?)
Has anyone an idea how to get the above working in Safari? Or any other
solution for converting numerical entities back to UTF-8?

Regards
Andi

Jul 23 '05 #10

fox

Andreas Gohr wrote:

Hi again!

I declared success too early :-(

This works everywhere - except in Safari:

data = data.replace(/&#(\d+);/g,
function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
});

Safari treats this as a simple string instead of executing the function
:-( It works in Konqueror (I thought they both use the same renderer!?)
Has anyone an idea how to get the above working in Safari? Or any other
solution for converting numerical entities back to UTF-8?

I submitted this "bug" to Apple -- but ECMA does not "require" the
variation (they just say the implementation MAY supply a function
argument...)

Here is another "brute force" method of converting:

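var matches = data.match(/&#\d+;?/g) || []; // match() returns null when nothing matches

for (var i = 0; i < matches.length; i++)
{
    // keep only the digits and convert them to a number explicitly
    var replacement = String.fromCharCode(+matches[i].replace(/\D/g, ""));

    // no 'g' flag: each pass replaces the first entity still left in data
    data = data.replace(/&#\d+;?/, replacement);
}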

I used the '?' on the semi-colon because the semi-colon is optional in
HTML coding (in most browser implementations). You don't need the 'g' on
the replace regex because you're stepping through each match in order.

I did this in a hurry -- hopefully you will not have any *more*
cross-browser issues.
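
The same idea can also be written as a single pass with RegExp exec, which avoids the function argument entirely. This is a sketch with a made-up function name, not a tested cross-browser solution. Building the result separately means decoded text is never scanned again, so input such as "&#38;#65;" becomes the literal text "&#65;" and is not decoded a second time.

```javascript
// Hypothetical helper; decodes numeric entities without a replace() callback.
function decodeEntitiesWithoutCallback(text) {
    var pattern = /&#(\d+);?/g;  // semicolon optional, as in the loop above
    var result = "";
    var lastIndex = 0;
    var match;

    while ((match = pattern.exec(text)) !== null) {
        // copy the plain text between the previous entity and this one
        result += text.slice(lastIndex, match.index);
        // append the decoded character
        result += String.fromCharCode(+match[1]);
        lastIndex = pattern.lastIndex;
    }

    // copy whatever follows the last entity
    return result + text.slice(lastIndex);
}

var greeting = decodeEntitiesWithoutCallback("&#72;i &#38;#65; x&#33");
// greeting is "Hi &#65; x!"
```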

Jul 23 '05 #11
