473,414 Members | 1,623 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,414 software developers and data experts.

SP to parse string adding synonimes from table

inspireuk
The easiest way for me to explain what I'm trying to do is show an example.

I have the following tables:

WORDS
WordID Word
1 1,000,000
2 a million
3 one million

WORDLINKS
WordID WordLinkID
1 2
1 3
2 1
2 3
3 1
3 2

I would then pass the following sentance into the SP
a million times a day

What the SP needs to do is parse the string and find all the phrases within that match the words in the WORDS table. It would then take the WordID and do a lookup in the WORDLINKS table to find all the possible alternatives/synonimes.

The result would be in the following form:
{a million|1,000,000|one million} times a day

the matched phrase has been wrapped in {} with each matched alternatives/synonimes seperated by a pipe |.

there would be multiple phrases in each input string and possibly nested words
eg in the example above the phrase 'one million', 'one' could have an alternative of '1' in which case the output would be:
{a million|1,000,000|{one|1} million} times a day

i hope this makes sense. any help would really be appreciated.

Thanks

Josh
Aug 9 '07 #1
4 1954
azimmer
200 Expert 100+
It's going to be a long one... (But a promise is a promise.) What makes it a bit more difficult is the nested replacements (whose proper handling I'm still not 100% sure of):
  • have it greedy or non-greedy (if you have replacements both for 'one million' and 'one' which one has precedence)?
  • If any replacement would result in an expanded string that could be further replaced, what to do? (I.e.: replacements: '1'->'one', 'one million'->'a million', what should the output be for '1 million'? And if replacements are '1'->'one', 'one million'->'1,000,000', what should the output be for '1 million'?)
The code below is greedy and processes nested replacements only to the length of the replaced term and not including the term being replaced (see examples at the end). It has two functions because the SQL Server wouldn't take the default value (don't know why) and I've had no more time to play around with it. You only need to use the one without the underscore.

So here goes the code:
Expand|Select|Wrap|Line Numbers
  1. create function _syno (@instr as varchar(1000), @term_replace as bit = 0)
  2. returns varchar(1000)
  3. as
  4. begin
  5. declare @pos as int
  6. declare @current_str as varchar(1000)
  7. declare @syno_term as varchar(1000)
  8. declare @replaced_length as int
  9. declare @outstr as varchar(1000)
  10. set @outstr=''
  11.  
  12. set @pos=1
  13. while @pos<=len(@instr)
  14. begin
  15.     -- try from current position on
  16.     set @current_str=right(@instr,len(@instr)-@pos+1)
  17.     declare current_replacements cursor local for
  18.         -- this is tricky
  19.         select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
  20.         from words w, wordlinks l, words w2
  21.         where w.wordid=l.wordlinkid
  22.         and (l.wordid=w2.wordid)
  23.         and charindex(w2.word,@current_str)=1
  24.         and (@term_replace=0 or (w2.word<>@current_str))
  25.         order by len(w2.word) desc
  26.     open current_replacements
  27.     fetch next from current_replacements into @syno_term, @replaced_length
  28.     if @@fetch_status = 0    -- there are replacements
  29.     begin
  30.         set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
  31.         -- iterate on current set of replacements (recursivity)
  32.         while @@fetch_status = 0
  33.         begin
  34.             set @syno_term=dbo._syno(@syno_term, 1)
  35.             set @outstr=@outstr+'|'+@syno_term
  36.             fetch next from current_replacements into @syno_term, @replaced_length
  37.         end
  38.         set @outstr=@outstr+'}'
  39.         set @pos=@pos+@replaced_length
  40.     end
  41.     else
  42.     begin
  43.         set @outstr=@outstr+left(@current_str,1)
  44.         set @pos=@pos+1
  45.     end
  46.     deallocate current_replacements
  47. end
  48. return @outstr 
  49. end
  50. GO
  51.  
  52. create function syno (@instr as varchar(1000))
  53. returns varchar(1000)
  54. as
  55. begin
  56. return dbo._syno(@instr,0)
  57. end
  58. GO
  59.  
Note: I found the easiest way to get the code itself (without line numbers) is to "Reply" to the message and copy-paste whatever is between '[ CODE]' and '[ /CODE]'.

My WORDS and WORDLINKS tables are as follows:
Expand|Select|Wrap|Line Numbers
  1. WordID      Word                                               
  2. ----------- -------------------
  3. 1           1,000,000
  4. 2           a million
  5. 3           one million
  6. 4           one
  7. 5           1
  8.  
  9. WordID      WordLinkID  
  10. ----------- ----------- 
  11. 1           2
  12. 1           3
  13. 2           1
  14. 2           3
  15. 3           1
  16. 3           2
  17. 4           5
  18.  
Examples and usage:
Expand|Select|Wrap|Line Numbers
  1. print dbo.syno('I hope you run it a million times')
  2.  
results: I hope you run it {a million|1,000,000|{one|1} million} times
Expand|Select|Wrap|Line Numbers
  1. print dbo.syno('I hope you run it one million times')
  2.  
results: I hope you run it {one million|1,000,000|a million} times
(note how it will replace 'one million', and will process the 'one' in the 'one million' because that's the whole term to be replaced in the first place)
Expand|Select|Wrap|Line Numbers
  1. print dbo.syno('I hope someone runs it one million times')
  2.  
results: I hope some{one|1} runs it {one million|1,000,000|a million} times

Hope it helps. I don't say it's impossible to refine it (especially in the handling of nested replacements) but I believe it's a good start.

P.S.: It, indeed, would have been a rather steep learning curve from a very basic level...
Aug 10 '07 #2
This is ammmaazzing! I can't thank you enough :)

However :) just a couple more things if you have the time. Would it be possible only to match full words and with regards to the nested replacements i think i may just programmically recall the sp for small parts if needed so no recursion needed (sorry).

the precedence i think should be for the longer phrase.

let me know if there is anything else.

Josh
Aug 10 '07 #3
so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

Expand|Select|Wrap|Line Numbers
  1. ALTER PROCEDURE ParseString
  2.     @instr as varchar(max)
  3. AS
  4.  
  5.     declare @pos as int
  6.     declare @current_str as varchar(max)
  7.     declare @syno_term as varchar(max)
  8.     declare @replaced_length as int
  9.     declare @outstr as varchar(max)
  10.  
  11.     set @outstr = ''
  12.     set @pos=1
  13.     while @pos<=len(@instr)
  14.     begin
  15.         -- try from current position on
  16.         set @current_str=right(@instr,len(@instr)-@pos+1)
  17.         declare current_replacements cursor local for
  18.             -- this is tricky
  19.             select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
  20.             from words w, LinkedWords l, words w2
  21.             where w.wordid=l.LinkedWordID
  22.             and (l.wordid=w2.wordid)
  23.             and charindex(w2.word,@current_str)=1
  24.             and right(left(@current_str, len(w2.word)+1),1) = ' '
  25.             order by len(w2.word) desc
  26.         open current_replacements
  27.         fetch next from current_replacements into @syno_term, @replaced_length
  28.         if @@fetch_status = 0    -- there are replacements
  29.         begin
  30.             set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
  31.             -- iterate on current set of replacements (recursivity)
  32.             while @@fetch_status = 0
  33.             begin
  34.                 --set @syno_term=ammuser._syno(@syno_term, 1)
  35.                 set @outstr=@outstr+'|'+@syno_term
  36.                 fetch next from current_replacements into @syno_term, @replaced_length
  37.             end
  38.             set @outstr=@outstr+'} '
  39.             set @pos=@pos+@replaced_length+1
  40.         end
  41.         else
  42.         begin
  43.             if charindex(' ',@current_str) = 0
  44.                 begin
  45.                     set @outstr=@outstr+@current_str
  46.                     set @pos=@pos+len(@current_str)                
  47.                 end
  48.             else
  49.                 begin
  50.                     set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
  51.                     set @pos=@pos+charindex(' ',@current_str)
  52.                 end
  53.         end
  54.         deallocate current_replacements
  55.     end
  56.     SELECT @outstr
  57.  
Aug 13 '07 #4
azimmer
200 Expert 100+
so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

Expand|Select|Wrap|Line Numbers
  1. ALTER PROCEDURE ParseString
  2.     @instr as varchar(max)
  3. AS
  4.  
  5.     declare @pos as int
  6.     declare @current_str as varchar(max)
  7.     declare @syno_term as varchar(max)
  8.     declare @replaced_length as int
  9.     declare @outstr as varchar(max)
  10.  
  11.     set @outstr = ''
  12.     set @pos=1
  13.     while @pos<=len(@instr)
  14.     begin
  15.         -- try from current position on
  16.         set @current_str=right(@instr,len(@instr)-@pos+1)
  17.         declare current_replacements cursor local for
  18.             -- this is tricky
  19.             select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
  20.             from words w, LinkedWords l, words w2
  21.             where w.wordid=l.LinkedWordID
  22.             and (l.wordid=w2.wordid)
  23.             and charindex(w2.word,@current_str)=1
  24.             and right(left(@current_str, len(w2.word)+1),1) = ' '
  25.             order by len(w2.word) desc
  26.         open current_replacements
  27.         fetch next from current_replacements into @syno_term, @replaced_length
  28.         if @@fetch_status = 0    -- there are replacements
  29.         begin
  30.             set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
  31.             -- iterate on current set of replacements (recursivity)
  32.             while @@fetch_status = 0
  33.             begin
  34.                 --set @syno_term=ammuser._syno(@syno_term, 1)
  35.                 set @outstr=@outstr+'|'+@syno_term
  36.                 fetch next from current_replacements into @syno_term, @replaced_length
  37.             end
  38.             set @outstr=@outstr+'} '
  39.             set @pos=@pos+@replaced_length+1
  40.         end
  41.         else
  42.         begin
  43.             if charindex(' ',@current_str) = 0
  44.                 begin
  45.                     set @outstr=@outstr+@current_str
  46.                     set @pos=@pos+len(@current_str)                
  47.                 end
  48.             else
  49.                 begin
  50.                     set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
  51.                     set @pos=@pos+charindex(' ',@current_str)
  52.                 end
  53.         end
  54.         deallocate current_replacements
  55.     end
  56.     SELECT @outstr
  57.  
Is good that you have the ambition and time to play around with it as I have very little time at the moment. Still I've been thinking about whole word replacements. One of the tough issues is that you need to have a word terminator at both ends of the term to be replaced - and word terminators come in many forms (not only space, but period, comma, semicolon, dash, new line, tab, etc as well - perhaps whatever is not a letter (but then what about '@' and digits?).) It should somehow be taken care of in the cursor select to make sure no false replacements are signaled (an that's not easy). Alternatively it may be possible to modify the select so that it sometimes gives false possibilities and then check and disregard them (but that's ain't easy either).

Slowly we're venturing into the realms of grep and awk - implemented in a SQL Server SP :-) (Just kidding; we're miles away.)
Aug 14 '07 #5

Sign in to post your reply or Sign up for a free account.

Similar topics

2
by: N | last post by:
Hi, I would like to parse out each value that is seperated by a comma in a field and use that value to join to another table. What would be the easiest way to do so without having to write a...
22
by: Ram Laxman | last post by:
Hi all, I have a text file which have data in CSV format. "empno","phonenumber","wardnumber" 12345,2234353,1000202 12326,2243653,1000098 Iam a beginner of C/C++ programming. I don't know how to...
24
by: | last post by:
Hi, I need to read a big CSV file, where different fields should be converted to different types, such as int, double, datetime, SqlMoney, etc. I have an array, which describes the fields and...
0
by: Steve Jorgensen | last post by:
I'm wondering if there's an approach to writing consistent code to read/write XML data in arbitrary order that I'm simply missing. It seems to be easy getting stuff -out- of a DOM via XPath, but...
11
by: hoopsho | last post by:
Hi Everyone, I am trying to write a program that does a few things very fast and with efficient use of memory... a) I need to parse a space-delimited file that is really large, upwards fo a...
5
by: Doug Bell | last post by:
Hi, I have some data in the format of yyyymmdd (Int32) that I want to display, in a datagrid, formatted in International Date Format dd MM yyyy. Any suggestions on an efficient approach? ...
5
by: BMeyer | last post by:
I have been losing my mind trying to parse an XML document (with nested child elements, not all of which appear in each parent node) into a DataGrid object. What I want to do is "flatten" the XML...
1
by: Dan Somdahl | last post by:
Hi, I am new to ASP but have what should be a fairly simple task that I can't figure out. I need to parse a string from a single, semi-colon delimited, 60 character field (el_text) in a recordset...
3
by: Leighya | last post by:
Im currently working on this xml file but when i load it to Mozilla, i got an error "Error Loading Stylesheet: Xpath parse failure: invalid variable name" It loads on IE properly. Only with the...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.