473,480 Members | 1,874 Online
Bytes | Software Development & Data Engineering Community
Create Post

Home Posts Topics Members FAQ

Aggregating text

I've got a whole load of data that I've been logging to record
different search terms used on a website. This all resides in a
varchar(50) column. Contents will be something like:

toaster
toasters
toasters with croissant heater
toaster with bagel warmer
croissant
lawnmower
lawnmwer
lawn mower
rake
garden implements
gardening implements
tools
garden tools
volkswagen parts
volkswagen sump
volkswagen tools

What I'd like to do is group the different terms together into groups
where the words have some similarity.

Couple of ways I can think of to do this:
1) Recursively: look at each word in the first search term and then see
what other search terms are like that word. Then do the same for the
next word, etc.
Problems
- connectives like 'with' 'and' 'or' etc will cause a problem (too many
false positives), but they can easily be excluded
- I'd have to write something procedural, and I'm keen not to unless I
*really* have to.

2) Something like:
SELECT a.search_term, b.search_term
FROM search_terms a, search_terms b
WHERE a.search_term LIKE b.search_term;[*]
Now LIKE doesn't behave the way I want it to, but is there something
similar? Or has anyone built something to do this previously? Any
thoughts on trying to bluff it with something like SOUNDEX ?
[*] I'd hope to get something along the lines of:

toaster toaster
toaster toasters
toaster toaster with croissant heater
toaster toaster with bagel warmer

Nov 12 '05 #1
5 1517
Are you sure you're going about this the right way? Wouldn't it make more
sense to build a list of things you *have* rather than things users have
searched for? After all, if you have a hardware store and don't envision
expanding into other lines of products, wouldn't you want to ignore the fact
that some users searched for watermelons or wedding gowns?

Rhino

"James Foreman" <ja***********@abcmail.co.uk> wrote in message
news:11**********************@f14g2000cwb.googlegr oups.com...
I've got a whole load of data that I've been logging to record
different search terms used on a website. This all resides in a
varchar(50) column. Contents will be something like:

toaster
toasters
toasters with croissant heater
toaster with bagel warmer
croissant
lawnmower
lawnmwer
lawn mower
rake
garden implements
gardening implements
tools
garden tools
volkswagen parts
volkswagen sump
volkswagen tools

What I'd like to do is group the different terms together into groups
where the words have some similarity.

Couple of ways I can think of to do this:
1) Recursively: look at each word in the first search term and then see
what other search terms are like that word. Then do the same for the
next word, etc.
Problems
- connectives like 'with' 'and' 'or' etc will cause a problem (too many
false positives), but they can easily be excluded
- I'd have to write something procedural, and I'm keen not to unless I
*really* have to.

2) Something like:
SELECT a.search_term, b.search_term
FROM search_terms a, search_terms b
WHERE a.search_term LIKE b.search_term;
[*]
Now LIKE doesn't behave the way I want it to, but is there something
similar? Or has anyone built something to do this previously? Any
thoughts on trying to bluff it with something like SOUNDEX ?

[*] I'd hope to get something along the lines of:

toaster toaster
toaster toasters
toaster toaster with croissant heater
toaster toaster with bagel warmer

Nov 12 '05 #2
Valid point if you're assuming we won't be expanding our product range.
However, if you want to work out what people want that you haven't
got, or (and this is the primary motivation here) want to seed your web
pages with search terms to improve relevancy when they're indexed by
search engines / add words to bid lists, you'll still want to explore
the searches used.

Eg suppose we only ever bidded on the search term "toaster" on Google
Adwords; unless we started analysing what searches people were using,
we might not realise that "4-slice toaster" was being used as a search
term, and fail to bid for it.

Nov 12 '05 #3
My idea is:
1. Separate text into words
2. Apply SOUNDEX to each separated words
3. Group by SOUNDEX

But, there is a shortcoming in this way. How to separate words.
In my following example, I have separated word by one or two blanks, so
"lawn mower" is recognized as two words.("lawn" and "mower").
I have no good idea to resolve this problem. Do you have?

------------------------------ Commands Entered
------------------------------
WITH Recurse (seq, id, nextpos, word, soundex, rest) AS (
SELECT 0
, id
, 0
, CAST(NULL AS VARCHAR(50))
, CAST(NULL AS CHAR(4))
, CAST(REPLACE(text,' ',' ') AS VARCHAR(50))
FROM Foreman
UNION ALL
SELECT seq
, id
, nextpos
, SUBSTR(rest,1,nextpos-1)
, SOUNDEX(SUBSTR(rest,1,nextpos-1))
, SUBSTR(rest||' ',nextpos+1)
FROM (SELECT seq + 1
, id
, POSSTR(rest||' ',' ')
, rest
FROM Recurse pre
WHERE seq < 100
AND rest <> ''
) new(seq, id, nextpos, rest)
)
SELECT
CAST(MIN(word) OVER(PARTITION BY soundex) AS CHAR(12)) word
, text
, r.id
FROM Recurse R
, Foreman F
WHERE R.id = F.id
AND seq > 0
ORDER BY soundex, r.id;
------------------------------------------------------------------------------

WORD TEXT ID
------------ -------------------------------------------------- ------
bagel toaster with bagel warmer 4
croissant toasters with croissant heater 3
croissant croissant 5
garden garden implements 10
garden gardening implements 11
garden garden tools 13
heater toasters with croissant heater 3
implements garden implements 10
implements gardening implements 11
lawn lawn mower 8
lawnmower lawnmower 6
lawnmower lawnmwer 7
mower lawn mower 8
parts volkswagen parts 14
rake rake 9
sump volkswagen sump 15
toaster toaster 1
toaster toasters 2
toaster toasters with croissant heater 3
toaster toaster with bagel warmer 4
tools tools 12
tools garden tools 13
tools volkswagen tools 16
volkswagen volkswagen parts 14
volkswagen volkswagen sump 15
volkswagen volkswagen tools 16
with toasters with croissant heater 3
with toaster with bagel warmer 4
warmer toaster with bagel warmer 4

29 record(s) selected.

Nov 12 '05 #4
Thanks. That looks interesting.

Wondering now about using LOCATE to identify words that are inside
different strings.

Eg:

SELECT a.name, b.name
FROM james_f.search_collation a, james_f.search_collation b
WHERE

LOCATE(LEFT(a.name, POSSTR(a.name, ' ')) , b.name) != 0

AND a.name != b.name;

gives everything where the first word in the first search phrase is
somewhere in the second phrase.

And thus:

toaster toasters
toaster toasters with croissant heater
toaster toaster with bagel warmer
....

but although this will give
tools garden tools
it won't give
garden tools tools

Looks like that recursive method you've got there will get around this
- will give that a try.

Nov 12 '05 #5
To put "lawn mower" into same group as "lawnmower", I thought following
logic.
(This was just come in my mind, so this may be insufficient or include
some mistake.)
1. Separete text into words.
2. Apply SOUNDEX to each separatd words.
3. Apply SOUNDEX to sequence of two words.
4. UNION 2. and 3.
5. Group 4. by SOUNDEX.
6. Eliminate unnecessary rows.

WITH Recurse (seq, no, id, nextpos, word, soundex, rest) AS (
SELECT 0
, 0
, id
, 0
, CAST(NULL AS VARCHAR(50))
, CAST(NULL AS CHAR(4))
, CAST(REPLACE(text,' ',' ') AS VARCHAR(50))
FROM Foreman
UNION ALL
SELECT seq
, 1
, id
, nextpos
, SUBSTR(rest,1,nextpos-1)
, SOUNDEX(SUBSTR(rest,1,nextpos-1))
, SUBSTR(rest||' ',nextpos+1)
FROM (SELECT seq + 1
, id
, POSSTR(rest||' ',' ')
, rest
FROM Recurse pre
WHERE seq < 100
AND rest <> ''
) new(seq, id, nextpos, rest)
)
,
SepWord (no, id, word, soundex) AS (
SELECT 2
, f.id
, f.word || s.word
, SOUNDEX(f.word || ' ' || s.word)
FROM Recurse F
, Recurse S
WHERE f.id = s.id
AND f.seq + 1 = s.seq
AND f.seq > 0
)
SELECT
CAST(MIN(word) OVER(PARTITION BY soundex) AS CHAR(12)) word
, CAST(text AS CHAR(40)) text
, r.id
, MIN(r.id) OVER(PARTITION BY soundex) min_id
FROM((SELECT no, id, word, soundex
FROM Recurse
WHERE seq > 0
)
UNION
SELECT * FROM SepWord
) AS R
, Foreman F
WHERE R.id = F.id
AND ( no = 1
OR no = 2
AND EXISTS
(SELECT *
FROM Recurse S
WHERE s.id <> r.id
AND s.soundex = r.soundex
AND s.no = 1
)
AND NOT EXISTS
(SELECT *
FROM Recurse S
WHERE s.id = r.id
AND s.soundex = r.soundex
AND s.no = 1
)
)
ORDER BY min_id, soundex, r.id;
------------------------------------------------------------------------------

WORD TEXT ID MIN_ID
------------ ---------------------------------------- ------ ------
toaster toaster 1 1
toaster toasters 2 1
toaster toasters with croissant heater 3 1
toaster toaster with bagel warmer 4 1
croissant toasters with croissant heater 3 3
croissant croissant 5 3
heater toasters with croissant heater 3 3
with toasters with croissant heater 3 3
with toaster with bagel warmer 4 3
bagel toaster with bagel warmer 4 4
warmer toaster with bagel warmer 4 4
lawnmower lawnmower 6 6
lawnmower lawnmwer 7 6
lawnmower lawn mower 8 6
lawn lawn mower 8 8
mower lawn mower 8 8
rake rake 9 9
garden garden implements 10 10
garden gardening implements 11 10
garden garden tools 13 10
implements garden implements 10 10
implements gardening implements 11 10
tools tools 12 12
tools garden tools 13 12
tools volkswagen tools 16 12
parts volkswagen parts 14 14
volkswagen volkswagen parts 14 14
volkswagen volkswagen sump 15 14
volkswagen volkswagen tools 16 14
sump volkswagen sump 15 15

30 record(s) selected.

-- DDL and Data --

CREATE TABLE Foreman
(id SMALLINT NOT NULL
,text VARCHAR(50) NOT NULL
)
INSERT INTO Foreman
VALUES
( 1, 'toaster')
,( 2, 'toasters')
,( 3, 'toasters with croissant heater')
,( 4, 'toaster with bagel warmer')
,( 5, 'croissant')
,( 6, 'lawnmower')
,( 7, 'lawnmwer')
,( 8, 'lawn mower')
,( 9, 'rake')
,(10, 'garden implements')
,(11, 'gardening implements')
,(12, 'tools')
,(13, 'garden tools')
,(14, 'volkswagen parts')
,(15, 'volkswagen sump')
,(16, 'volkswagen tools')

Nov 12 '05 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
22991
by: Xerxes | last post by:
Hi, I need help in setting up a page where the text wraps around an image. Right now, I am using table, with text in one <td> and the image in the adjacent <td>. The problem is when the text is...
2
7555
by: Macsicarr | last post by:
Hi All Wonder if you could help me. I have created a CMS system that allows the user to enter text and pic 'tags' for their own About us page, eg text.... text.... text.... text.......
4
2582
by: Arif Çimen | last post by:
Hi to everybody, I have chnged a button text in design mode. But After compiling and executing the program the text of the button do not change to new value. Any Ideas? Thaks for helps.
3
2551
by: jweinberg1975 | last post by:
I would like for users to be able to select from a small number of options that come from a little drop down menu which then closes. .....
3
2146
by: bbepristis | last post by:
Hey all I have this code that reads from one text file writes to another unless im on a certian line then it writes the new data however it only seems to do about 40 lines then quits and I cant...
3
3961
by: acecraig100 | last post by:
I am fairly new to Javascript. I have a form that users fill out to enter an animal to exhibit at a fair. Because we have no way of knowing, how many animals a user may enter, I created a table...
3
2565
by: jonniethecodeprince | last post by:
Hi all, I have trouble getting an array of data stored in a separate javascript file i.e. a file called books.js into a table of data for a .xhtml file. There are 50 Records in this file....
7
1697
by: FiveFootUnder | last post by:
Hi. Ow. I have a bruised forehead from banging my head against a brick wall and would really appreciate some help here. As long as it's not a suggestion that I shouldn't be trying this!! I'm...
3
1486
by: Piotr Sobolewski | last post by:
Hello, are there any aggregating functions in Access that can concatenate (glue together) text?
0
7046
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
7088
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
6956
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
1
4783
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
2997
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The...
0
2986
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
1300
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
1
563
muto222
php
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
183
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.