By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
438,226 Members | 1,027 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 438,226 IT Pros & Developers. It's quick & easy.

Aggregating text

P: n/a
I've got a whole load of data that I've been logging to record
different search terms used on a website. This all resides in a
varchar(50) column. Contents will be something like:

toaster
toasters
toasters with croissant heater
toaster with bagel warmer
croissant
lawnmower
lawnmwer
lawn mower
rake
garden implements
gardening implements
tools
garden tools
volkswagen parts
volkswagen sump
volkswagen tools

What I'd like to do is group the different terms together into groups
where the words have some similarity.

Couple of ways I can think of to do this:
1) Recursively: look at each word in the first search term and then see
what other search terms are like that word. Then do the same for the
next word, etc.
Problems
- connectives like 'with' 'and' 'or' etc will cause a problem (too many
false positives), but they can easily be excluded
- I'd have to write something procedural, and I'm keen not to unless I
*really* have to.

2) Something like:
SELECT a.search_term, b.search_term
FROM search_terms a, search_terms b
WHERE a.search_term LIKE b.search_term;[*]
Now LIKE doesn't behave the way I want it to, but is there something
similar? Or has anyone built something to do this previously? Any
thoughts on trying to bluff it with something like SOUNDEX ?
[*] I'd hope to get something along the lines of:

toaster toaster
toaster toasters
toaster toaster with croissant heater
toaster toaster with bagel warmer

Nov 12 '05 #1
Share this Question
Share on Google+
5 Replies


P: n/a
Are you sure you're going about this the right way? Wouldn't it make more
sense to build a list of things you *have* rather than things users have
searched for? After all, if you have a hardware store and don't envision
expanding into other lines of products, wouldn't you want to ignore the fact
that some users searched for watermelons or wedding gowns?

Rhino

"James Foreman" <ja***********@abcmail.co.uk> wrote in message
news:11**********************@f14g2000cwb.googlegr oups.com...
I've got a whole load of data that I've been logging to record
different search terms used on a website. This all resides in a
varchar(50) column. Contents will be something like:

toaster
toasters
toasters with croissant heater
toaster with bagel warmer
croissant
lawnmower
lawnmwer
lawn mower
rake
garden implements
gardening implements
tools
garden tools
volkswagen parts
volkswagen sump
volkswagen tools

What I'd like to do is group the different terms together into groups
where the words have some similarity.

Couple of ways I can think of to do this:
1) Recursively: look at each word in the first search term and then see
what other search terms are like that word. Then do the same for the
next word, etc.
Problems
- connectives like 'with' 'and' 'or' etc will cause a problem (too many
false positives), but they can easily be excluded
- I'd have to write something procedural, and I'm keen not to unless I
*really* have to.

2) Something like:
SELECT a.search_term, b.search_term
FROM search_terms a, search_terms b
WHERE a.search_term LIKE b.search_term;
[*]
Now LIKE doesn't behave the way I want it to, but is there something
similar? Or has anyone built something to do this previously? Any
thoughts on trying to bluff it with something like SOUNDEX ?

[*] I'd hope to get something along the lines of:

toaster toaster
toaster toasters
toaster toaster with croissant heater
toaster toaster with bagel warmer

Nov 12 '05 #2

P: n/a
Valid point if you're assuming we won't be expanding our product range.
However, if you want to work out what people want that you haven't
got, or (and this is the primary motivation here) want to seed your web
pages with search terms to improve relevancy when they're indexed by
search engines / add words to bid lists, you'll still want to explore
the searches used.

Eg suppose we only ever bidded on the search term "toaster" on Google
Adwords; unless we started analysing what searches people were using,
we might not realise that "4-slice toaster" was being used as a search
term, and fail to bid for it.

Nov 12 '05 #3

P: n/a
My idea is:
1. Separate text into words
2. Apply SOUNDEX to each separated words
3. Group by SOUNDEX

But, there is a shortcoming in this way. How to separate words.
In my following example, I have separated word by one or two blanks, so
"lawn mower" is recognized as two words.("lawn" and "mower").
I have no good idea to resolve this problem. Do you have?

------------------------------ Commands Entered
------------------------------
WITH Recurse (seq, id, nextpos, word, soundex, rest) AS (
SELECT 0
, id
, 0
, CAST(NULL AS VARCHAR(50))
, CAST(NULL AS CHAR(4))
, CAST(REPLACE(text,' ',' ') AS VARCHAR(50))
FROM Foreman
UNION ALL
SELECT seq
, id
, nextpos
, SUBSTR(rest,1,nextpos-1)
, SOUNDEX(SUBSTR(rest,1,nextpos-1))
, SUBSTR(rest||' ',nextpos+1)
FROM (SELECT seq + 1
, id
, POSSTR(rest||' ',' ')
, rest
FROM Recurse pre
WHERE seq < 100
AND rest <> ''
) new(seq, id, nextpos, rest)
)
SELECT
CAST(MIN(word) OVER(PARTITION BY soundex) AS CHAR(12)) word
, text
, r.id
FROM Recurse R
, Foreman F
WHERE R.id = F.id
AND seq > 0
ORDER BY soundex, r.id;
------------------------------------------------------------------------------

WORD TEXT ID
------------ -------------------------------------------------- ------
bagel toaster with bagel warmer 4
croissant toasters with croissant heater 3
croissant croissant 5
garden garden implements 10
garden gardening implements 11
garden garden tools 13
heater toasters with croissant heater 3
implements garden implements 10
implements gardening implements 11
lawn lawn mower 8
lawnmower lawnmower 6
lawnmower lawnmwer 7
mower lawn mower 8
parts volkswagen parts 14
rake rake 9
sump volkswagen sump 15
toaster toaster 1
toaster toasters 2
toaster toasters with croissant heater 3
toaster toaster with bagel warmer 4
tools tools 12
tools garden tools 13
tools volkswagen tools 16
volkswagen volkswagen parts 14
volkswagen volkswagen sump 15
volkswagen volkswagen tools 16
with toasters with croissant heater 3
with toaster with bagel warmer 4
warmer toaster with bagel warmer 4

29 record(s) selected.

Nov 12 '05 #4

P: n/a
Thanks. That looks interesting.

Wondering now about using LOCATE to identify words that are inside
different strings.

Eg:

SELECT a.name, b.name
FROM james_f.search_collation a, james_f.search_collation b
WHERE

LOCATE(LEFT(a.name, POSSTR(a.name, ' ')) , b.name) != 0

AND a.name != b.name;

gives everything where the first word in the first search phrase is
somewhere in the second phrase.

And thus:

toaster toasters
toaster toasters with croissant heater
toaster toaster with bagel warmer
....

but although this will give
tools garden tools
it won't give
garden tools tools

Looks like that recursive method you've got there will get around this
- will give that a try.

Nov 12 '05 #5

P: n/a
To put "lawn mower" into same group as "lawnmower", I thought following
logic.
(This was just come in my mind, so this may be insufficient or include
some mistake.)
1. Separete text into words.
2. Apply SOUNDEX to each separatd words.
3. Apply SOUNDEX to sequence of two words.
4. UNION 2. and 3.
5. Group 4. by SOUNDEX.
6. Eliminate unnecessary rows.

WITH Recurse (seq, no, id, nextpos, word, soundex, rest) AS (
SELECT 0
, 0
, id
, 0
, CAST(NULL AS VARCHAR(50))
, CAST(NULL AS CHAR(4))
, CAST(REPLACE(text,' ',' ') AS VARCHAR(50))
FROM Foreman
UNION ALL
SELECT seq
, 1
, id
, nextpos
, SUBSTR(rest,1,nextpos-1)
, SOUNDEX(SUBSTR(rest,1,nextpos-1))
, SUBSTR(rest||' ',nextpos+1)
FROM (SELECT seq + 1
, id
, POSSTR(rest||' ',' ')
, rest
FROM Recurse pre
WHERE seq < 100
AND rest <> ''
) new(seq, id, nextpos, rest)
)
,
SepWord (no, id, word, soundex) AS (
SELECT 2
, f.id
, f.word || s.word
, SOUNDEX(f.word || ' ' || s.word)
FROM Recurse F
, Recurse S
WHERE f.id = s.id
AND f.seq + 1 = s.seq
AND f.seq > 0
)
SELECT
CAST(MIN(word) OVER(PARTITION BY soundex) AS CHAR(12)) word
, CAST(text AS CHAR(40)) text
, r.id
, MIN(r.id) OVER(PARTITION BY soundex) min_id
FROM((SELECT no, id, word, soundex
FROM Recurse
WHERE seq > 0
)
UNION
SELECT * FROM SepWord
) AS R
, Foreman F
WHERE R.id = F.id
AND ( no = 1
OR no = 2
AND EXISTS
(SELECT *
FROM Recurse S
WHERE s.id <> r.id
AND s.soundex = r.soundex
AND s.no = 1
)
AND NOT EXISTS
(SELECT *
FROM Recurse S
WHERE s.id = r.id
AND s.soundex = r.soundex
AND s.no = 1
)
)
ORDER BY min_id, soundex, r.id;
------------------------------------------------------------------------------

WORD TEXT ID MIN_ID
------------ ---------------------------------------- ------ ------
toaster toaster 1 1
toaster toasters 2 1
toaster toasters with croissant heater 3 1
toaster toaster with bagel warmer 4 1
croissant toasters with croissant heater 3 3
croissant croissant 5 3
heater toasters with croissant heater 3 3
with toasters with croissant heater 3 3
with toaster with bagel warmer 4 3
bagel toaster with bagel warmer 4 4
warmer toaster with bagel warmer 4 4
lawnmower lawnmower 6 6
lawnmower lawnmwer 7 6
lawnmower lawn mower 8 6
lawn lawn mower 8 8
mower lawn mower 8 8
rake rake 9 9
garden garden implements 10 10
garden gardening implements 11 10
garden garden tools 13 10
implements garden implements 10 10
implements gardening implements 11 10
tools tools 12 12
tools garden tools 13 12
tools volkswagen tools 16 12
parts volkswagen parts 14 14
volkswagen volkswagen parts 14 14
volkswagen volkswagen sump 15 14
volkswagen volkswagen tools 16 14
sump volkswagen sump 15 15

30 record(s) selected.

-- DDL and Data --

CREATE TABLE Foreman
(id SMALLINT NOT NULL
,text VARCHAR(50) NOT NULL
)
INSERT INTO Foreman
VALUES
( 1, 'toaster')
,( 2, 'toasters')
,( 3, 'toasters with croissant heater')
,( 4, 'toaster with bagel warmer')
,( 5, 'croissant')
,( 6, 'lawnmower')
,( 7, 'lawnmwer')
,( 8, 'lawn mower')
,( 9, 'rake')
,(10, 'garden implements')
,(11, 'gardening implements')
,(12, 'tools')
,(13, 'garden tools')
,(14, 'volkswagen parts')
,(15, 'volkswagen sump')
,(16, 'volkswagen tools')

Nov 12 '05 #6

This discussion thread is closed

Replies have been disabled for this discussion.