
Splitting string into word array - regular expression

Hi,
What regex do I need to split a string, using JavaScript's split
method, into an array of words?
Splitting according to whitespace only is not enough; I need to split
according to whitespace, comma, hyphen, etc.
Is there a regex that does the trick?
Thanks, Anat.

May 25 '06 #1
7 Replies


Anat wrote:
> Hi,
> What regex do I need to split a string, using JavaScript's split
> method, into an array of words?

Of course, that depends on how you define a word.

> Splitting according to whitespace only is not enough; I need to split
> according to whitespace, comma, hyphen, etc.
> Is there a regex that does the trick?

To split at one or more non-word characters (basically any character
other than an ASCII letter, digit, or underscore):

var words = string.split(/\W+/);
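
A minimal sketch of what that split returns in practice. The sample sentence is made up for illustration and is not from the thread. Note that \W treats the hyphen and the trailing "!" as separators, and that a separator at the very end of the string leaves an empty string in the result.

```javascript
// Hypothetical sample text, chosen to show the edge cases.
var sentence = "Hello world, this is a well-known day!";

// Split at every run of characters outside [A-Za-z0-9_].
var words = sentence.split(/\W+/);

// The trailing "!" is a separator, so the last element is "".
// Drop empty strings if they are not wanted.
var nonEmpty = words.filter(function (w) {
  return w.length > 0;
});

console.log(words);
console.log(nonEmpty);
```

Whether the hyphenated "well-known" should count as one word or two is exactly the "how you define a word" question.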

--
Rob
Group FAQ: <URL:http://www.jibbering.com/faq/>
May 25 '06 #2

RobG wrote:
> Anat wrote:
>> Hi,
>> What regex do I need to split a string, using JavaScript's split
>> method, into an array of words?
>
> Of course, that depends on how you define a word.
>
>> Splitting according to whitespace only is not enough; I need to split
>> according to whitespace, comma, hyphen, etc.
>> Is there a regex that does the trick?
>
> To split at one or more non-word characters (basically any character
> other than an ASCII letter, digit, or underscore):
>
> var words = string.split(/\W+/);

Not all browsers will tolerate regular expressions in split(); it may be
safer to replace all non-word characters with a space, then split on that:

var newString = string.replace(/\W+/g,' ');
var words = newString.split(' ');

For the OP to consider...
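
As a rough check of this suggestion, the sketch below runs both variants on a made-up input (the sample text and variable names are assumptions for illustration). Because every run of non-word characters collapses to exactly one space, the two approaches should produce the same array.

```javascript
// Hypothetical sample text.
var text = "alpha, beta - gamma";

// Variant 1: split directly on a regular expression.
var viaSplit = text.split(/\W+/);

// Variant 2: normalise separators to one space, then split on a plain string.
var viaReplace = text.replace(/\W+/g, " ").split(" ");

console.log(viaSplit);
console.log(viaReplace);
```

Both variants share the same definition of a word, so any limitation of \W applies to both.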

--
Zif
May 25 '06 #3

Thanks guys,
But actually, when I come to think of it, it's not a good solution for
what I'm trying to do.
I want to take a given string, and make certain words hyperlinks.
For example:
"Hello world, this is a wonderful day!"
I'd like the words world, wonderful and day to be hyperlinks, therefore
after my manipulation it should be:
"Hello <a href=...>world</a>, this is a <a href=...>wonderful</a> <a
href=...>day</a>!"
Using the split method is not good, because the whitespace, commas and
other punctuation marks are gone.
Instead of displaying
"Hello <a href=...>world</a>, this is a <a href=...>wonderful</a> <a
href=...>day</a>!"
I will display
"Hello <a href=...>world</a> this is a <a href=...>wonderful</a> <a
href=...>day</a>"
(note that the comma and exclamation mark are gone).
Any ideas on how I can locate words, replace them but not lose
punctuation marks on the way?
Thanks again!!!
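
One possible way to keep the punctuation while still using split, sketched here under assumptions: when the separator regex contains capturing parentheses, ECMAScript Edition 3 specifies that the captured separators are included in the result array (some older engines reportedly did not follow this). The function name linkify, the linkWords table and the "#" href are placeholders invented for this example.

```javascript
// Words to turn into links; a hypothetical lookup table.
var linkWords = { world: true, wonderful: true, day: true };

function linkify(text) {
  // Capturing parentheses make split() return the separators too,
  // so the array alternates: word, separator, word, separator, ...
  var parts = text.split(/(\W+)/);
  for (var i = 0; i < parts.length; i += 2) {
    // Even indexes hold words, odd indexes hold separators.
    if (Object.prototype.hasOwnProperty.call(linkWords, parts[i])) {
      parts[i] = '<a href="#">' + parts[i] + "</a>";
    }
  }
  // Joining with "" restores the original punctuation and spacing.
  return parts.join("");
}

var linked = linkify("Hello world, this is a wonderful day!");
console.log(linked);
```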

May 25 '06 #4

Zifud <zi*@yahoo.com> writes:
> Not all browsers will tolerate regular expressions in split(),

Can you mention one that doesn't that is more recent than Netscape 3?
I can see that both IE 4 and Netscape 4.80 do support it.

/L
--
Lasse Reichstein Nielsen - lr*@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
'Faith without judgement merely degrades the spirit divine.'
May 25 '06 #5

RobG wrote:
> Anat wrote:
>> What regex do I need to split a string, using JavaScript's split
>> method, into an array of words?
>
> Of course, that depends on how you define a word.

Exactly.

>> Splitting according to whitespace only is not enough; I need to split
>> according to whitespace, comma, hyphen, etc.
>> Is there a regex that does the trick?
>
> To split at one or more non-word characters (basically any character
> other than an ASCII letter, digit, or underscore):
>
> var words = string.split(/\W+/);

Therefore, one seldom wants that (considering non-ASCII letters, which are
word characters in Unicode but still match \W), and probably the OP does not.
They are looking for character classes instead:

var s = [
"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do",
"eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim",
"ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut",
"aliquip ex ea commodo consequat. Duis aute irure dolor in",
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla",
"pariatur. Excepteur sint occaecat cupidatat non proident, sunt in",
"culpa qui officia deserunt mollit anim id est laborum."
].join(" ");

window.alert(s);

// "etc." not included
var words = s.split(/[\s,-]+/);

window.alert(words.join(" | "));
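
For comparison, a sketch of a Unicode-aware split. It assumes an engine that supports Unicode property escapes (\p{...} with the u flag, added in ECMAScript 2018), which is well beyond what was available when this thread was written. The sample text is made up; it is written with \u escapes so the result does not depend on the file encoding.

```javascript
// "Überlandstraße, Main-Street!" written with escapes for Ü and ß.
var text = "\u00DCberlandstra\u00DFe, Main-Street!";

// Split at every run of characters that are neither letters (\p{L})
// nor digits (\p{N}) in any script, then drop empty strings.
var words = text.split(/[^\p{L}\p{N}]+/u).filter(function (w) {
  return w.length > 0;
});

console.log(words.join(" | "));
```

Unlike the explicit character class above, this treats every punctuation mark as a separator, so it is a different definition of "word" and not a drop-in replacement.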
PointedEars
May 26 '06 #6

Zifud wrote:
> RobG wrote:
>> Anat wrote:
>>> Splitting according to whitespace only is not enough; I need to split
>>> according to whitespace, comma, hyphen, etc.
>>> Is there a regex that does the trick?
>>
>> To split at one or more non-word characters (basically any character
>> other than an ASCII letter, digit, or underscore):
>>
>> var words = string.split(/\W+/);
>
> Not all browsers will tolerate regular expressions in split(),

The RegExp object and regular expression literals were introduced with
JavaScript 1.2 (NN 4.0, June 1997) and JScript 3.0 (IE 4.0, October 1997).

Since then, the ECMA WG has produced two more editions of ECMAScript, where
Edition 3 (December 1999, March 2000) (finally) formally specified that
feature. No scriptable user agent can survive in the mid-term without
supporting it nowadays.

I'd say your information is /slightly/ outdated.

> it may be safer to replace all non-word characters with a space then split
> on that:

Unlikely.

> var newString = string.replace(/\W+/g,' ');

That does not recognize "Überlandstraße" as one word ...

> var words = newString.split(' ');

... and makes ["", "berlandstra", "e"] out of it.

> For the OP to consider...

... and to reject.
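
The difference can be reproduced with a short sketch. The sample text is an assumption for illustration (written with \u escapes so the file encoding does not matter); the two regular expressions are the ones discussed in this thread.

```javascript
// "Überlandstraße, Main-Street" with escapes for Ü and ß.
var text = "\u00DCberlandstra\u00DFe, Main-Street";

// \W matches Ü and ß, so the word is cut at both letters
// and a leading empty string appears.
var byNonWord = text.replace(/\W+/g, " ").split(" ");

// An explicit separator class leaves non-ASCII letters alone.
var byClass = text.split(/[\s,-]+/);

console.log(byNonWord.join(" | "));
console.log(byClass.join(" | "));
```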
PointedEars
--
But he had not that supreme gift of the artist, the knowledge of
when to stop.
-- Sherlock Holmes in Sir Arthur Conan Doyle's
"The Adventure of the Norwood Builder"
May 26 '06 #7

Anat wrote:
> I want to take a given string, and make certain words hyperlinks.
> For example:
> "Hello world, this is a wonderful day!"
> I'd like the words world, wonderful and day to be hyperlinks, therefore
> after my manipulation it should be:
> "Hello <a href=...>world</a>, this is a <a href=...>wonderful</a> <a
> href=...>day</a>!"
> Using the split method is not good, because the whitespace, commas and
> other punctuation marks are gone.
> [...]
> Any ideas on how I can locate words, replace them but not lose
> punctuation marks on the way?

From your use of the `a' element, I assume this is for `innerHTML'.
Please note that this property is proprietary, and its behavior is
both implementation-dependent and context-dependent.

You could use \b of course, but that will get you in trouble with
words containing non-ASCII characters. Therefore:

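var s = ...innerHTML;
// Caveat: the trailing delimiter ($3) is consumed by the match, so a listed
// word that directly follows another one ("wonderful day") is skipped.
s = s.replace(
  /(^|[\s-])(world|wonderful|day)([\s,;.?!-]|$)/g,
  '$1<a href="http://en.wikipedia.org/wiki/$2">$2<\/a>$3');
...innerHTML = s;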

Or with positive lookahead (requires JavaScript 1.5, JScript 5.5,
ECMAScript Ed. 3 [1]):

...
s = s.replace(
/([\s-]|^)(world|wonderful|day)(?=([\s,;.?!-]|$))/g,
'$1<a href="http://en.wikipedia.org/wiki/$2">$2<\/a>');
...

(Use those character classes, unless you want to code all UCS
[non-]word characters as compactly defined in the XML grammar.)
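
A sketch comparing the two variants on a plain string rather than on innerHTML. The function names linkConsuming and linkLookahead are invented for this example; the regular expressions and the Wikipedia URL pattern are the ones from the post. Since the first variant consumes the delimiter after each match, the space between "wonderful" and "day" is no longer available as the leading delimiter for "day"; the lookahead variant does not consume it.

```javascript
// Variant 1: the trailing delimiter is part of the match ($3).
function linkConsuming(text) {
  return text.replace(
    /(^|[\s-])(world|wonderful|day)([\s,;.?!-]|$)/g,
    '$1<a href="http://en.wikipedia.org/wiki/$2">$2</a>$3');
}

// Variant 2: the trailing delimiter is only checked, not consumed.
function linkLookahead(text) {
  return text.replace(
    /(^|[\s-])(world|wonderful|day)(?=[\s,;.?!-]|$)/g,
    '$1<a href="http://en.wikipedia.org/wiki/$2">$2</a>');
}

var sample = "Hello world, this is a wonderful day!";
var consumed = linkConsuming(sample);
var lookahead = linkLookahead(sample);

console.log(consumed);  // "day" stays unlinked here
console.log(lookahead); // all three words are linked
```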

I remember having suggested a probably more sophisticated replacement
approach a few months ago, which also points out the difficulties
with general replacing. Search the (Google Groups) archives for "IBM
replace author:PointedEars" or so.

When implementing this, you should additionally take into account that too
many hyperlinks in continuous text can make that text hard to read.

> Thanks again!!!

You are welcome. But please get your Exclamation Mark key repaired.
PointedEars
___________
[1] <URL:http://pointedears.de/es-matrix>
--
Indiana Jones: The Name of God. Jehovah.
Professor Henry Jones: But in the Latin alphabet,
"Jehovah" begins with an "I".
Indiana Jones: J-...
May 26 '06 #8
