Bytes IT Community

String.Split needs an enhancement to ignore empty fields

We need an additional function in the String class. We need the ability
to suppress empty fields, so that we can more effectively parse. Right
now, multiple whitespace characters create multiple empty strings in the
resulting string array.
Nov 16 '05 #1
19 Replies


If String.Split doesn't fit your needs, you have to write your own split
method, which isn't very complicated. String.Split is designed to meet
the most common application needs.

--
cody

Freeware Tools, Games and Humour
http://www.deutronium.de.vu || http://www.deutronium.tk
Nov 16 '05 #2

cody wrote:
> If String.Split doesn't fit your needs, you have to write your own split
> method, which isn't very complicated. String.Split is designed to meet
> the most common application needs.

Which is what I have done. But parsing strings of data with multiple
whitespace characters between fields *is* a very common operation. So I
am disagreeing with the part about "meeting the most common application
needs."

Anyway, I just sent it out in case somebody thought "oh, yea, that would
be a good idea."

David Logan
Nov 16 '05 #3

Have you considered using regular expressions (REGEX) to split the string? I have used it to accomplish what you describe.

See System.Text.RegularExpressions

Nov 16 '05 #4

Yes, I have considered it, but I prefer not to use a very expensive
regex for an otherwise simple split. String.Split is perfect save the
fact that in something like:
"abc def ghi jkl mnop"

I get an array of 80 elements instead of 5.
I prefer to save regex for parsing strings when:

1) You don't know what you're going to get next
(in a loop of string processing), or
2) There are various optional pieces in a string
that may or may not occur.

In these instances, simple splitting and checking results is already
pretty expensive, so using regex isn't a stretch.

David Logan
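
To make the behaviour described above concrete, here is a minimal sketch of what String.Split does with runs of spaces. The input is a made-up stand-in (three spaces, then two spaces, between the fields), not the exact string from this post.

```csharp
using System;

class SplitEmptyFieldsDemo
{
    static void Main()
    {
        // Hypothetical input: "abc", three spaces, "def", two spaces, "ghi".
        string data = "abc" + new string(' ', 3) + "def" + new string(' ', 2) + "ghi";

        // Each space is treated as a separate delimiter, so every extra
        // space inside a run contributes one empty string to the result.
        string[] fields = data.Split(' ');

        Console.WriteLine(fields.Length); // 6: "abc", "", "", "def", "", "ghi"
        foreach (string field in fields)
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

Five delimiters always give six pieces, so the longer the runs of spaces, the more empty entries end up in the array.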

Nov 16 '05 #5

>> We need an additional function in the String class. We need the ability
>> to suppress empty fields, so that we can more effectively parse. Right
>> now, multiple whitespace characters create multiple empty strings in the
>> resulting string array.
>
> Which is what I have done. But parsing strings of data with multiple
> whitespace characters between fields *is* a very common operation. So I
> am disagreeing with the part about "meeting the most common application
> needs."
>
> Anyway, I just sent it out in case somebody thought "oh, yea, that would
> be a good idea."


In that case Regex.Split is your friend.
Use @"\s+" as the separator pattern in your case (IIRC), so that a run of
whitespace counts as a single delimiter.

--
cody

Freeware Tools, Games and Humour
http://www.deutronium.de.vu || http://www.deutronium.tk
Nov 16 '05 #6

I have to agree with David on this one. Every time I looked at String.Split
to do simple splitting, I gave up on it because of all the extra empty
strings.

Philippe
Nov 16 '05 #7

David,
In addition to the other comments:

There are three Split functions in .NET:

Use Microsoft.VisualBasic.Strings.Split if you need to split a string based
on a specific word (string). It is the Split function from VB6.

Use System.String.Split if you need to split a string based on a collection
of specific characters. Each individual character is its own delimiter.

Use System.Text.RegularExpressions.Regex.Split to split based
on matching patterns.

In your example I would use Regex.Split, unless it was proven via profiling
to be a performance problem in the routine you are using (remember the 80-20
rule).
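
A rough side-by-side of the three functions listed above may help. This is only a sketch: the sample strings are invented, and it assumes the project references the Microsoft.VisualBasic assembly so that Strings.Split is available from C#.

```csharp
using System;
using System.Text.RegularExpressions;
using Microsoft.VisualBasic; // assumption: the Microsoft.VisualBasic assembly is referenced

class ThreeSplitsDemo
{
    static void Main()
    {
        // 1) Strings.Split: the whole string "<->" is the delimiter.
        string[] byWord = Strings.Split("alpha<->beta<->gamma", "<->");

        // 2) String.Split: each listed character is a delimiter on its own.
        string[] byChars = "alpha,beta;gamma".Split(new char[] { ',', ';' });

        // 3) Regex.Split: the delimiter is whatever the pattern matches,
        //    here a comma or semicolon with optional surrounding whitespace.
        string[] byPattern = Regex.Split("alpha , beta;gamma", @"\s*[,;]\s*");

        Console.WriteLine(string.Join("|", byWord));
        Console.WriteLine(string.Join("|", byChars));
        Console.WriteLine(string.Join("|", byPattern));
    }
}
```

All three calls should yield the same three fields; the difference is only in how the delimiter is described.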

Hope this helps
Jay
Nov 16 '05 #8

I was unaware of the .VisualBasic. namespace routines.

Performance may or may not be a problem depending upon which packets I
would need to parse in this manner. I just try to avoid regex in general
unless I need its flexibility.

What is the "80/20" rule?

David Logan

Nov 16 '05 #9

David,
> Performance may or may not be a problem depending upon which packets I
> would need to parse in this manner. I just try to avoid regex in general
> unless I need its flexibility.

Generally, if I am going to be reusing the same Regex, I apply the
RegexOptions.Compiled option and keep the Regex itself in a static member.

> What is the "80/20" rule?

I've heard several variations of it; basically, 80% of the time is spent in
20% of the code.

Basically, I write "correct" code first rather than worry about how well it
will perform. I only go back and optimize routines once those routines have
proven to be a performance problem. By "correct" I primarily mean OOP, plus
using the tools available, such as Regex, to solve a problem if those tools
fit the requirement. Of course, "correct" is subjective.
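
The "compiled Regex in a static member" approach mentioned above could look roughly like this. The class and method names are made up for illustration, and trimming the input first is an added assumption, not something stated in this post.

```csharp
using System;
using System.Text.RegularExpressions;

class FieldSplitter
{
    // Built once and shared by every call. RegexOptions.Compiled trades a
    // higher one-time construction cost for faster matching afterwards.
    private static readonly Regex Whitespace =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Hypothetical helper name, chosen for this illustration.
    public static string[] SplitFields(string line)
    {
        // Trim first so leading or trailing whitespace does not produce
        // an empty first or last element.
        return Whitespace.Split(line.Trim());
    }

    static void Main()
    {
        string[] fields = SplitFields("  abc \t def   ghi ");
        Console.WriteLine(fields.Length);            // 3
        Console.WriteLine(string.Join("|", fields)); // abc|def|ghi
    }
}
```

Whether the Compiled option pays off depends on how often the pattern is reused, so it is worth measuring before relying on it.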

Hope this helps
Jay
Nov 16 '05 #10

David,
It appears that Whidbey (VS.NET 2005) or Longhorn will have an option on
String.Split to omit empty entries.

http://longhorn.msdn.microsoft.com/l.../m/split2.aspx

A fourth option may be to use String.Split and then "remove" the array entries
that are blank (rather than parsing the string yourself).
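
For reference, the option anticipated above did ship: from .NET Framework 2.0 onward, String.Split has overloads that accept a StringSplitOptions value. A minimal sketch, using an invented input string:

```csharp
using System;

class RemoveEmptyEntriesDemo
{
    static void Main()
    {
        // Hypothetical input: "abc", three spaces, "def", two spaces, "ghi".
        string data = "abc" + new string(' ', 3) + "def" + new string(' ', 2) + "ghi";

        // StringSplitOptions.RemoveEmptyEntries drops the zero-length
        // strings that consecutive delimiters would otherwise produce.
        string[] fields = data.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine(fields.Length);            // 3
        Console.WriteLine(string.Join("|", fields)); // abc|def|ghi
    }
}
```

On a framework version without that overload, filtering the result of a plain String.Split, as suggested in this post, gives the same fields.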

Hope this helps
Jay

Nov 16 '05 #11

I am currently using a homegrown method:

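    // Requires: using System.Collections;
    protected String[] SplitNoEmpty(String data)
    {
        // Collect only the non-empty fields produced by a single-space split.
        ArrayList fieldarray = new ArrayList();
        foreach (string field in data.Split(' '))
        {
            if (field.Length > 0)
                fieldarray.Add(field);
        }

        // Copy the collected fields into a string array.
        String[] ret = new String[fieldarray.Count];
        for (int x = 0; x < fieldarray.Count; x++)
            ret[x] = (String)fieldarray[x];
        return ret;
    }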

I mentioned it mainly because splitting strings over multiple whitespace
is such a common operation I think it would be worthwhile to consider
implementing in the common libraries.

David Logan

Nov 16 '05 #12

"David Logan" <dj******@comcast.net> wrote:
> ArrayList fieldarray = new ArrayList();
> [...]
> String[] ret = new String[fieldarray.Count];
> for(int x=0;x<fieldarray.Count;x++)
> ret[x]=(String)fieldarray[x];
> return ret;

FYI, there's a more concise way of doing that:

    return (string[]) fieldarray.ToArray(typeof(string));

> splitting strings over multiple whitespace is such
> a common operation I think it would be worthwhile
> to consider implementing in the common libraries.


I agree. An extra bool parameter to String.Split, indicating whether
to omit zero-length strings from the resulting array, wouldn't hurt.

P.
Nov 16 '05 #13

Hi,

"David Logan" <dj******@comcast.net> wrote in message
news:dtTDc.166670$3x.58747@attbi_s54...
> We need an additional function in the String class. We need the ability
> to suppress empty fields, so that we can more effectively parse. Right
> now, multiple whitespace characters create multiple empty strings in the
> resulting string array.


string [] fields = Regex.Split (strInput, "\\s+");

Why bother writing it yourself if it can be done this easily? There is
nothing wrong with regex.

I don't like the argument that it shouldn't be used in simple cases; for one
thing, with a pattern this simple you needn't worry about writing an
inefficient one.

HTH
greetings
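
One caveat worth sketching for the one-liner above: if the input starts with whitespace, Regex.Split still returns an empty first element. The input string below is invented for illustration, and the Trim() call is a suggested workaround rather than part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexSplitDemo
{
    static void Main()
    {
        // Hypothetical input with leading whitespace and mixed separators.
        string strInput = "  abc \t def   ghi";

        // A leading run of whitespace matches the pattern at position 0,
        // so the first element of the result is an empty string.
        string[] raw = Regex.Split(strInput, "\\s+");
        Console.WriteLine(raw.Length); // 4: "", "abc", "def", "ghi"

        // Trimming first avoids the empty element.
        string[] fields = Regex.Split(strInput.Trim(), "\\s+");
        Console.WriteLine(fields.Length); // 3: "abc", "def", "ghi"
    }
}
```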
Nov 16 '05 #14

I completely agree that, for now, Regex is the best solution for most of us.

I wrote a test that split David's string 10,000 times. The String.Split method took 0.143 seconds while Regex took 1.104 seconds. Regex is almost an order of magnitude slower (about 7.7 times); however, it is a good solution.

Unless your application performance constraints are very strict, I would use Regex.
Nov 16 '05 #15

If you had used a precompiled Regex instead of calling the static
Regex.Split() every time, which compiles the Regex again on each call, I
suspect that Regex.Split() would have been even faster than String.Split().

--
cody

Freeware Tools, Games and Humour
http://www.deutronium.de.vu || http://www.deutronium.tk
Nov 16 '05 #16

That *is* using a compiled Regex instance, and not the static Split method.
Nov 16 '05 #17

Bill O'Neill wrote:
> I completely agree that, for now, Regex is the best solution for most of us.
>
> I wrote a test that split David's string 10,000 times. The string.split method took 0.143 seconds while Regex took 1.104 seconds. Regex is almost an order of magnitude slower; however, it is a good solution.

That's exactly why I reserve regex for cases where it's really useful.

> Unless your application performance constraints are very strict, I would use Regex.


Why use a very inefficient method when there is a perfectly good and
efficient one? And it *could* be supported by the library. :)

David Logan
Nov 16 '05 #18

David,
>> I wrote a test that split David's string 10,000 times.
>> The string.split method took 0.143 seconds while
>> Regex took 1.104 seconds. Regex is almost an order
>> of magnitude slower; however, it is a good solution.
>
> That's exactly why I reserve regex to cases where it's really useful.

Yes, the Regex took almost 10 times longer; however, what happens when the
Regex is only 1% or even .01% of the total cost of your routine? Is it
really worth worrying about?

By routine I mean what you do with the array after splitting it, for example
placing the values into a DataTable. If the cost of using the DataTable is
significantly more than the cost of the Regex, is it really worth worrying
about avoiding the Regex?

My concern with coding around it is how much memory pressure (work for the
GC) you are creating to avoid the time spent in the Regex. Are you simply
robbing Peter to pay Paul?

So I would not avoid the Regex simply because Regex is slow. I would use the
Regex because it is quicker to code and it's a good fit for this problem.
Once the Regex was proven to be too high a cost in the routine, via profiling
(the CLR Profiler, for example), I would take the extra time to code a
quicker solution...

Granted, if we get the String.Split ignore-empties option in Whidbey, that
option would be the better fit in Whidbey...

For info on the 80/20 rule & optimizing only the 20% see Martin Fowler's
article "Yet Another Optimization Article" at
http://martinfowler.com/ieeeSoftware...timization.pdf

For a list of Martin's articles see:

http://martinfowler.com/articles.html

Info on the CLR Profiler:
http://msdn.microsoft.com/library/de...nethowto13.asp

http://msdn.microsoft.com/library/de...anagedapps.asp
Hope this helps
Jay
Nov 16 '05 #19

David,
Looking at this closer the expression you use makes a huge difference!

For example, BMermuys's statement:

string [] fields = Regex.Split (strInput, "\\s+");

is slower than

string [] fields = Regex.Split (strInput, " +");

String.Split:   0.105655048337149
SplitNoEmpty:   0.168633723001108
regex(" +"):    0.286259287144036
regex("\\s+"):  0.713445703294692

If you know your string only has a space as a delimiter, then the Regex time
is a little under 2x the SplitNoEmpty routine (about 1.7x); however, if you
can have any white space character (\s is shorthand for
[\f\n\r\t\v\x85\p{Z}]) as a delimiter, then the time is about 4x SplitNoEmpty,
or about 7x plain String.Split...

Times are in seconds based on QueryPerformanceCounter &
QueryPerformanceFrequency, using a loop of 10,000 iterations. I compiled the
Regex outside the loop.
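
A comparison along these lines could be sketched as follows. This is not the harness used for the figures above: it uses System.Diagnostics.Stopwatch instead of calling QueryPerformanceCounter directly, the input string is invented, the List-based SplitNoEmpty is a re-implementation of the routine from earlier in the thread, and using RegexOptions.Compiled is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.RegularExpressions;

class SplitTiming
{
    const int Iterations = 10000;

    // Same idea as the SplitNoEmpty routine discussed in the thread,
    // written here with a generic list.
    static string[] SplitNoEmpty(string data)
    {
        List<string> fields = new List<string>();
        foreach (string field in data.Split(' '))
        {
            if (field.Length > 0) fields.Add(field);
        }
        return fields.ToArray();
    }

    // Runs the given split Iterations times and returns elapsed seconds.
    static double Time(Func<string, string[]> split, string data)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            split(data);
        }
        watch.Stop();
        return watch.Elapsed.TotalSeconds;
    }

    static void Main()
    {
        // Hypothetical test input with long runs of spaces between fields.
        string data = "abc" + new string(' ', 20) + "def" + new string(' ', 20) + "ghi";

        // Regex instances are constructed once, outside the timed loops.
        Regex spaces = new Regex(" +", RegexOptions.Compiled);
        Regex whitespace = new Regex(@"\s+", RegexOptions.Compiled);

        Console.WriteLine("String.Split: " + Time(s => s.Split(' '), data));
        Console.WriteLine("SplitNoEmpty: " + Time(SplitNoEmpty, data));
        Console.WriteLine("regex( +):    " + Time(s => spaces.Split(s), data));
        Console.WriteLine(@"regex(\s+):   " + Time(s => whitespace.Split(s), data));
    }
}
```

Absolute numbers will differ by machine and runtime, so only the ratios between the four lines are meaningful.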

Hope this helps
Jay
Nov 16 '05 #20