473,883 Members | 2,607 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

performing a whitespace-insensitive query

Hi,

I am trying to concoct a query that will join rows on the basis of a
whitespace insensitive comparison. For instance if one row has the value
'a<space>b' and another has the value 'a<space><space >b' I want them to be
considered equal (but not the same as 'ab')

I am happy to do some T-SQL if that helps. Also I have a full-text index on
the column in question, but note that I am comparing rows against each
other, not to a fixed string.

This is for a one-off job, so if there is no obvious way to do it on the
database, I will just bcp out the data and knock up some perl or something
to do it. All other things being equal I would rather do it on the database
though.

Many thanks

Andy
Jul 20 '05 #1
11 4141

"Andy Fish" <aj****@blueyon der.co.uk> wrote in message
news:QT******** *************@n ews-text.cableinet. net...
Hi,

I am trying to concoct a query that will join rows on the basis of a
whitespace insensitive comparison. For instance if one row has the value
'a<space>b' and another has the value 'a<space><space >b' I want them to be
considered equal (but not the same as 'ab')

I am happy to do some T-SQL if that helps. Also I have a full-text index on the column in question, but note that I am comparing rows against each
other, not to a fixed string.

This is for a one-off job, so if there is no obvious way to do it on the
database, I will just bcp out the data and knock up some perl or something
to do it. All other things being equal I would rather do it on the database though.

Many thanks

Andy


If you want to it in the database, and assuming that you have SQL2000, then
one possibility is to write a user-defined function to iterate over a string
and reduce all whitespace characters to a single space, using the standard
string functions. If you only have spaces (ie no tabs or newlines), then
even some nested REPLACE() functions might work.

Another way would be to write a function which instantiates the Regex COM
object using the sp_OA% procedures, and does the same thing with a regular
expression. But both these approaches would be very slow (unless perhaps you
used your UDF as a computed column and indexed it), and the second one
requires sysadmin permissions.

I suspect, though, that your BCP and Perl approach would probably be the
quickest and easiest solution.

Simon
Jul 20 '05 #2
SELECT *
FROM A
INNER JOIN B
ON Replace(
Replace(
Replace(
Replace(
Replace(
Replace(A.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')
= Replace(
Replace(
Replace(
Replace(
Replace(
Replace(B.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')
Hope this helps,
Gert-Jan

Andy Fish wrote:

Hi,

I am trying to concoct a query that will join rows on the basis of a
whitespace insensitive comparison. For instance if one row has the value
'a<space>b' and another has the value 'a<space><space >b' I want them to be
considered equal (but not the same as 'ab')

I am happy to do some T-SQL if that helps. Also I have a full-text index on
the column in question, but note that I am comparing rows against each
other, not to a fixed string.

This is for a one-off job, so if there is no obvious way to do it on the
database, I will just bcp out the data and knock up some perl or something
to do it. All other things being equal I would rather do it on the database
though.

Many thanks

Andy

Jul 20 '05 #3
Hmmm, again I posted to fast. Replace all X with space in the solution
below and it should work...

Gert-Jan
Gert-Jan Strik wrote:

SELECT *
FROM A
INNER JOIN B
ON Replace(
Replace(
Replace(
Replace(
Replace(
Replace(A.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')
= Replace(
Replace(
Replace(
Replace(
Replace(
Replace(B.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')

Hope this helps,
Gert-Jan

Andy Fish wrote:

Hi,

I am trying to concoct a query that will join rows on the basis of a
whitespace insensitive comparison. For instance if one row has the value
'a<space>b' and another has the value 'a<space><space >b' I want them to be
considered equal (but not the same as 'ab')

I am happy to do some T-SQL if that helps. Also I have a full-text index on
the column in question, but note that I am comparing rows against each
other, not to a fixed string.

This is for a one-off job, so if there is no obvious way to do it on the
database, I will just bcp out the data and knock up some perl or something
to do it. All other things being equal I would rather do it on the database
though.

Many thanks

Andy

Jul 20 '05 #4
Thanks Gert-Jan, this definitely looks like the right way to do it.

unfortunately I neglected to mention that I have 3/4 million rows in the
table. I left the query running overnight for 9 hours (it used 100%cpu on my
athlon 1.5 ghz) but I had to cancel it. However, it did return 22,500
correct results (I have no idea what proportion this represents).

I think it must have thrown the query optimization out because I can't
beleive it really takes that long. I think I will create temporary columns
with the whitespace stripped (using your function) and then join on those
instead.

after seeing your post I was fascinated by the magic numbers used. I found a
previous thread from 2002 where there was a lot of discussion and you said
that they were an educated guess but verified empirically. The problem of
finding the optimal numbers has lodged itself into my brain now and won't
let go, so I will let you know if I get anywhere with it.

Andy

"Gert-Jan Strik" <so***@toomuchs pamalready.nl> wrote in message
news:40******** *******@toomuch spamalready.nl. ..
Hmmm, again I posted to fast. Replace all X with space in the solution
below and it should work...

Gert-Jan
Gert-Jan Strik wrote:

SELECT *
FROM A
INNER JOIN B
ON Replace(
Replace(
Replace(
Replace(
Replace(
Replace(A.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')
= Replace(
Replace(
Replace(
Replace(
Replace(
Replace(B.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')

Hope this helps,
Gert-Jan

Andy Fish wrote:

Hi,

I am trying to concoct a query that will join rows on the basis of a
whitespace insensitive comparison. For instance if one row has the value 'a<space>b' and another has the value 'a<space><space >b' I want them to be considered equal (but not the same as 'ab')

I am happy to do some T-SQL if that helps. Also I have a full-text index on the column in question, but note that I am comparing rows against each
other, not to a fixed string.

This is for a one-off job, so if there is no obvious way to do it on the database, I will just bcp out the data and knock up some perl or something to do it. All other things being equal I would rather do it on the database though.

Many thanks

Andy

Jul 20 '05 #5
OK, I think I have figured out the algorithm to produce the magic numbers in
the formula.

say 10 is a magic number (i.e. it's one of the ones in the list of replicate
functions), then doing a replace of 10 spaces with 1 will reduce anything up
to 108 spaces down to 18 or less. obviously it will also reduce anything up
to 109 spaces with 19 or less, but it just naturally seems that the 108->18
is giving us the best "leverage" from the number 10.

so in general, putting 'n' in the formula reduces n(n+1)-2 spaces to 2(n-1)

In the above example the next term above 10 would need to reduce to X spaces
to 108 spaces or less

solving 2(n-1)=108 gives us n=55, so 55 would reduce anything up to 3078
spaces down to 108 or less

in the general case, say the term after n is m, then

n(n+1)-2 = 2(m-1)

or

m = 1/2 (n)(n+1)

this generates the sequence 2, 3, 6, 21, 231, 26796, 359026206,
644499084768903 00, 207689535133977 000000000000000 0000. I'm sure there is
some rounding in the last couple of terms (I only used excel).

Anyone with high school math will notice that the formula for m is the sum
of all numbers adding up to m, which does make sense although I'm not sure I
could prove why. one analagy is that if you took a triangle of billiard
balls (start with 3) then spread them out so that these form the bottom row
of a bigger next triangle, then spread all those out to form the bottom row
of the next triangle and so on.

Anyway I hope that's of interest to someone. Oh who am I kidding, it was
such a buzz figuring it out I don't care if anyone ever reads this.

Andy
"Gert-Jan Strik" <so***@toomuchs pamalready.nl> wrote in message
news:40******** *******@toomuch spamalready.nl. ..
Hmmm, again I posted to fast. Replace all X with space in the solution
below and it should work...

Gert-Jan
Gert-Jan Strik wrote:

SELECT *
FROM A
INNER JOIN B
ON Replace(
Replace(
Replace(
Replace(
Replace(
Replace(A.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')
= Replace(
Replace(
Replace(
Replace(
Replace(
Replace(B.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')

Hope this helps,
Gert-Jan

Andy Fish wrote:

Hi,

I am trying to concoct a query that will join rows on the basis of a
whitespace insensitive comparison. For instance if one row has the value 'a<space>b' and another has the value 'a<space><space >b' I want them to be considered equal (but not the same as 'ab')

I am happy to do some T-SQL if that helps. Also I have a full-text index on the column in question, but note that I am comparing rows against each
other, not to a fixed string.

This is for a one-off job, so if there is no obvious way to do it on the database, I will just bcp out the data and knock up some perl or something to do it. All other things being equal I would rather do it on the database though.

Many thanks

Andy

Jul 20 '05 #6
Consider: literals.

Here are some solutions
with maximium literal length of 19:
[19, 18, [6,7], 3, 2, 2]
[19, 17, [6,7], 4, 3, 2]
[19, 18, [6,7], 4, 3, 2]

Bye,
Delbert Glass
Jul 20 '05 #7
In case you are interested, I just gave sort of an explanation a few
days ago. See
http://groups.google.com/groups?hl=e...spamalready.nl

Gert-Jan
Andy Fish wrote:

OK, I think I have figured out the algorithm to produce the magic numbers in
the formula.

say 10 is a magic number (i.e. it's one of the ones in the list of replicate
functions), then doing a replace of 10 spaces with 1 will reduce anything up
to 108 spaces down to 18 or less. obviously it will also reduce anything up
to 109 spaces with 19 or less, but it just naturally seems that the 108->18
is giving us the best "leverage" from the number 10.

so in general, putting 'n' in the formula reduces n(n+1)-2 spaces to 2(n-1)

In the above example the next term above 10 would need to reduce to X spaces
to 108 spaces or less

solving 2(n-1)=108 gives us n=55, so 55 would reduce anything up to 3078
spaces down to 108 or less

in the general case, say the term after n is m, then

n(n+1)-2 = 2(m-1)

or

m = 1/2 (n)(n+1)

this generates the sequence 2, 3, 6, 21, 231, 26796, 359026206,
644499084768903 00, 207689535133977 000000000000000 0000. I'm sure there is
some rounding in the last couple of terms (I only used excel).

Anyone with high school math will notice that the formula for m is the sum
of all numbers adding up to m, which does make sense although I'm not sure I
could prove why. one analagy is that if you took a triangle of billiard
balls (start with 3) then spread them out so that these form the bottom row
of a bigger next triangle, then spread all those out to form the bottom row
of the next triangle and so on.

Anyway I hope that's of interest to someone. Oh who am I kidding, it was
such a buzz figuring it out I don't care if anyone ever reads this.

Andy

"Gert-Jan Strik" <so***@toomuchs pamalready.nl> wrote in message
news:40******** *******@toomuch spamalready.nl. ..
Hmmm, again I posted to fast. Replace all X with space in the solution
below and it should work...

Gert-Jan
Gert-Jan Strik wrote:

SELECT *
FROM A
INNER JOIN B
ON Replace(
Replace(
Replace(
Replace(
Replace(
Replace(A.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')
= Replace(
Replace(
Replace(
Replace(
Replace(
Replace(B.col, replicate('X',1 32),'X')
, replicate('X', 21),'X')
, replicate('X', 6),'X')
, replicate('X', 3),'X')
, replicate('X', 2),'X')
, replicate('X', 2),'X')

Hope this helps,
Gert-Jan

Andy Fish wrote:
>
> Hi,
>
> I am trying to concoct a query that will join rows on the basis of a
> whitespace insensitive comparison. For instance if one row has the value > 'a<space>b' and another has the value 'a<space><space >b' I want them to be > considered equal (but not the same as 'ab')
>
> I am happy to do some T-SQL if that helps. Also I have a full-text index on > the column in question, but note that I am comparing rows against each
> other, not to a fixed string.
>
> This is for a one-off job, so if there is no obvious way to do it on the > database, I will just bcp out the data and knock up some perl or something > to do it. All other things being equal I would rather do it on the database > though.
>
> Many thanks
>
> Andy

Jul 20 '05 #8
Here are (the?) 24 solutions
with fewer then 7 stages
and lengths less than 20.

Solution TotalLength
-------------------- -----------
[19, 17, 6, 3, 2, 2] 49
[19, 17, 6, 3, 3, 2] 50
[19, 17, 6, 4, 2, 2] 50
[19, 17, 6, 4, 3, 2] 51
[19, 17, 7, 3, 2, 2] 50
[19, 17, 7, 3, 3, 2] 51
[19, 17, 7, 4, 2, 2] 51
[19, 17, 7, 4, 3, 2] 52
[19, 18, 6, 3, 2, 2] 50
[19, 18, 6, 3, 3, 2] 51
[19, 18, 6, 4, 2, 2] 51
[19, 18, 6, 4, 3, 2] 52
[19, 18, 7, 3, 2, 2] 51
[19, 18, 7, 3, 3, 2] 52
[19, 18, 7, 4, 2, 2] 52
[19, 18, 7, 4, 3, 2] 53
[19, 19, 6, 3, 2, 2] 51
[19, 19, 6, 3, 3, 2] 52
[19, 19, 6, 4, 2, 2] 52
[19, 19, 6, 4, 3, 2] 53
[19, 19, 7, 3, 2, 2] 52
[19, 19, 7, 3, 3, 2] 53
[19, 19, 7, 4, 2, 2] 53
[19, 19, 7, 4, 3, 2] 54

Bye,
Delbert Glass
Jul 20 '05 #9
I have run just a few tests with solution (1) [19, 17, 6, 3, 2, 2] and
(2) [132, 21, 6, 3, 2, 2] on SQL7.0, and the second solution uses
significantly less CPU.

On a testset with values 'a'+space(2)+'a ' .. 'a'+space(7998) +'a', the
results were:
solution 1: approx. 3780 ms
solution 2: approx. 2750 ms

On a testset with more small values and fewer large values, the results
show similar differences: 4280 vs 3110 ms and 953 vs 719 ms

Gert-Jan
Delbert Glass wrote:

Here are (the?) 24 solutions
with fewer then 7 stages
and lengths less than 20.

Solution TotalLength
-------------------- -----------
[19, 17, 6, 3, 2, 2] 49
[19, 17, 6, 3, 3, 2] 50
[19, 17, 6, 4, 2, 2] 50
[19, 17, 6, 4, 3, 2] 51
[19, 17, 7, 3, 2, 2] 50
[19, 17, 7, 3, 3, 2] 51
[19, 17, 7, 4, 2, 2] 51
[19, 17, 7, 4, 3, 2] 52
[19, 18, 6, 3, 2, 2] 50
[19, 18, 6, 3, 3, 2] 51
[19, 18, 6, 4, 2, 2] 51
[19, 18, 6, 4, 3, 2] 52
[19, 18, 7, 3, 2, 2] 51
[19, 18, 7, 3, 3, 2] 52
[19, 18, 7, 4, 2, 2] 52
[19, 18, 7, 4, 3, 2] 53
[19, 19, 6, 3, 2, 2] 51
[19, 19, 6, 3, 3, 2] 52
[19, 19, 6, 4, 2, 2] 52
[19, 19, 6, 4, 3, 2] 53
[19, 19, 7, 3, 2, 2] 52
[19, 19, 7, 3, 3, 2] 53
[19, 19, 7, 4, 2, 2] 53
[19, 19, 7, 4, 3, 2] 54

Bye,
Delbert Glass

Jul 20 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
90031
by: Thomas Mlynarczyk | last post by:
Which is the simplest way to remove all whitespace from a string? Is there a simpler method than a regex replace? Or how can I tell a regex pattern to ignore all whitespace in my subject string? There is a global modifier to ignore all spaces in the pattern, but I couldn't find one for ignoring spaces in the subject string. Do I really have to either do a preg_replace to remove all whitespace or stick a lot of \s* into my search pattern? ...
2
2170
by: Wolfgang Jeltsch | last post by:
Hello, it is often convenient to insert whitespace into an XML document in order to format it nicely. For example, take this snippet of a notional DocBook XML document: <para> This is a longer paragraph. With <wordasword>longer</wordasword> I mean that it contains more than one sentence.
2
2031
by: Carlitos | last post by:
Hi there, A class in Xerces J-API (Java) called TextImpl contains a property that returns whether the text is ignorable whitespace (http://xml.apache.org/xerces-j/apiDocs/org/apache/xerces/dom/TextImpl.html#isIgnorableWhitespace()). ; I guess when they refer to "ignorable whitespace" in Java we may interpret that as an "insignificant whitespace" in .NET. Am I correct to say that? So, I need to manually convert some Java code to C#,...
2
1955
by: Carlitos | last post by:
Hi there, A class in Xerces J-API (Java) called TextImpl contains a property that returns whether the text is ignorable whitespace (http://xml.apache.org/xerces-j/apiDocs/org/apache/xerces/dom/TextImpl.html#isIgnorableWhitespace()). I guess when they refer to "ignorable whitespace" in Java we may interpret that as an "insignificant whitespace" in .NET. Am I correct to say that? So, I need to manually convert some Java code to C#,...
0
2265
by: Shan Plourde | last post by:
Hi everyone, I have been using various regular expressions with the ASP.NET RegularExpressionValidator for quite some time. In general it works very well. One of the common regex's that I use follows: ValidationExpression = "^\d{0,3}(\.\d{0,4})?$" The purpose of this one is to validate that numeric values input follow the syntax 999.9999. This works well. But, one thing that I have never tested previously (which has now been uncovered...
3
2630
by: David Pratt | last post by:
Hi. I am splitting a string on a non whitespace character. One or more whitespace characters can be returned as items in the list. I do not want the items in the list that are only whitespace (can be one or more characters of whitespace) and plan to use string.strip on those items that are not only whitespace (to remove any whitespace from front or back of items). What kind of efficient test can I use to obtain only list items returned...
56
3598
by: infidel | last post by:
Where are they-who-hate-us-for-our-whitespace? Are "they" really that stupid/petty? Are "they" really out there at all? "They" almost sound like a mythical caste of tasteless heathens that "we" have invented. It just sounds like so much trivial nitpickery that it's hard to believe it's as common as we've come to believe.
9
2563
by: amattie | last post by:
Does anyone have any idea on how I can strip the extra whitespace in the XML that shows up when I receive a response from an ASP.NET 2.0 webservice? This has been discussed before, but no one has ever come up with a good answer to what seems like such a common question. ...
5
3915
by: John Gordon | last post by:
My XSLT files have many occurrences of this general pattern: <a> <xsl:attribute name="href"> <xsl:value-of select="xyz" /> </xsl:attribute> </a> When I execute an XSL transform, the resulting HTML looks like this:
13
27981
by: Chaim Krause | last post by:
I am unable to figure out why the first two statements work as I expect them to and the next two do not. Namely, the first two spit the sentence into its component words, while the latter two return the whole sentence entact. import string from string import whitespace mytext = "The quick brown fox jumped over the lazy dog.\n" print mytext.split()
0
9786
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
11125
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10734
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10836
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10407
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9568
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7962
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
7114
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
3
3230
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.