473,788 Members | 3,068 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

SP to parse string adding synonimes from table

inspireuk
12 New Member
The easiest way for me to explain what I'm trying to do is show an example.

I have the following tables:

WORDS
WordID Word
1 1,000,000
2 a million
3 one million

WORDLINKS
WordID WordLinkID
1 2
1 3
2 1
2 3
3 1
3 2

I would then pass the following sentance into the SP
a million times a day

What the SP needs to do is parse the string and find all the phrases within that match the words in the WORDS table. It would then take the WordID and do a lookup in the WORDLINKS table to find all the possible alternatives/synonimes.

The result would be in the following form:
{a million|1,000,0 00|one million} times a day

the matched phrase has been wrapped in {} with each matched alternatives/synonimes seperated by a pipe |.

there would be multiple phrases in each input string and possibly nested words
eg in the example above the phrase 'one million', 'one' could have an alternative of '1' in which case the output would be:
{a million|1,000,0 00|{one|1} million} times a day

i hope this makes sense. any help would really be appreciated.

Thanks

Josh
Aug 9 '07 #1
4 1965
azimmer
200 Recognized Expert New Member
It's going to be a long one... (But a promise is a promise.) What makes it a bit more difficult is the nested replacements (whose proper handling I'm still not 100% sure of):
  • have it greedy or non-greedy (if you have replacements both for 'one million' and 'one' which one has precedence)?
  • If any replacement would result in an expanded string that could be further replaced, what to do? (I.e.: replacements: '1'->'one', 'one million'->'a million', what should the output be for '1 million'? And if replacements are '1'->'one', 'one million'->'1,000,000', what should the output be for '1 million'?)
The code below is greedy and processes nested replacements only to the length of the replaced term and not including the term being replaced (see examples at the end). It has two functions because the SQL Server wouldn't take the default value (don't know why) and I've had no more time to play around with it. You only need to use the one without the underscore.

So here goes the code:
Expand|Select|Wrap|Line Numbers
  1. create function _syno (@instr as varchar(1000), @term_replace as bit = 0)
  2. returns varchar(1000)
  3. as
  4. begin
  5. declare @pos as int
  6. declare @current_str as varchar(1000)
  7. declare @syno_term as varchar(1000)
  8. declare @replaced_length as int
  9. declare @outstr as varchar(1000)
  10. set @outstr=''
  11.  
  12. set @pos=1
  13. while @pos<=len(@instr)
  14. begin
  15.     -- try from current position on
  16.     set @current_str=right(@instr,len(@instr)-@pos+1)
  17.     declare current_replacements cursor local for
  18.         -- this is tricky
  19.         select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
  20.         from words w, wordlinks l, words w2
  21.         where w.wordid=l.wordlinkid
  22.         and (l.wordid=w2.wordid)
  23.         and charindex(w2.word,@current_str)=1
  24.         and (@term_replace=0 or (w2.word<>@current_str))
  25.         order by len(w2.word) desc
  26.     open current_replacements
  27.     fetch next from current_replacements into @syno_term, @replaced_length
  28.     if @@fetch_status = 0    -- there are replacements
  29.     begin
  30.         set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
  31.         -- iterate on current set of replacements (recursivity)
  32.         while @@fetch_status = 0
  33.         begin
  34.             set @syno_term=dbo._syno(@syno_term, 1)
  35.             set @outstr=@outstr+'|'+@syno_term
  36.             fetch next from current_replacements into @syno_term, @replaced_length
  37.         end
  38.         set @outstr=@outstr+'}'
  39.         set @pos=@pos+@replaced_length
  40.     end
  41.     else
  42.     begin
  43.         set @outstr=@outstr+left(@current_str,1)
  44.         set @pos=@pos+1
  45.     end
  46.     deallocate current_replacements
  47. end
  48. return @outstr 
  49. end
  50. GO
  51.  
  52. create function syno (@instr as varchar(1000))
  53. returns varchar(1000)
  54. as
  55. begin
  56. return dbo._syno(@instr,0)
  57. end
  58. GO
  59.  
Note: I found the easiest way to get the code itself (without line numbers) is to "Reply" to the message and copy-paste whatever is between '[ CODE]' and '[ /CODE]'.

My WORDS and WORDLINKS tables are as follows:
Expand|Select|Wrap|Line Numbers
  1. WordID      Word                                               
  2. ----------- -------------------
  3. 1           1,000,000
  4. 2           a million
  5. 3           one million
  6. 4           one
  7. 5           1
  8.  
  9. WordID      WordLinkID  
  10. ----------- ----------- 
  11. 1           2
  12. 1           3
  13. 2           1
  14. 2           3
  15. 3           1
  16. 3           2
  17. 4           5
  18.  
Examples and usage:
Expand|Select|Wrap|Line Numbers
  1. print dbo.syno('I hope you run it a million times')
  2.  
results: I hope you run it {a million|1,000,0 00|{one|1} million} times
Expand|Select|Wrap|Line Numbers
  1. print dbo.syno('I hope you run it one million times')
  2.  
results: I hope you run it {one million|1,000,0 00|a million} times
(note how it will replace 'one million', and will process the 'one' in the 'one million' because that's the whole term to be replaced in the first place)
Expand|Select|Wrap|Line Numbers
  1. print dbo.syno('I hope someone runs it one million times')
  2.  
results: I hope some{one|1} runs it {one million|1,000,0 00|a million} times

Hope it helps. I don't say it's impossible to refine it (especially in the handling of nested replacements) but I believe it's a good start.

P.S.: It, indeed, would have been a rather steep learning curve from a very basic level...
Aug 10 '07 #2
inspireuk
12 New Member
This is ammmaazzing! I can't thank you enough :)

However :) just a couple more things if you have the time. Would it be possible only to match full words and with regards to the nested replacements i think i may just programmically recall the sp for small parts if needed so no recursion needed (sorry).

the precedence i think should be for the longer phrase.

let me know if there is anything else.

Josh
Aug 10 '07 #3
inspireuk
12 New Member
so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

Expand|Select|Wrap|Line Numbers
  1. ALTER PROCEDURE ParseString
  2.     @instr as varchar(max)
  3. AS
  4.  
  5.     declare @pos as int
  6.     declare @current_str as varchar(max)
  7.     declare @syno_term as varchar(max)
  8.     declare @replaced_length as int
  9.     declare @outstr as varchar(max)
  10.  
  11.     set @outstr = ''
  12.     set @pos=1
  13.     while @pos<=len(@instr)
  14.     begin
  15.         -- try from current position on
  16.         set @current_str=right(@instr,len(@instr)-@pos+1)
  17.         declare current_replacements cursor local for
  18.             -- this is tricky
  19.             select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
  20.             from words w, LinkedWords l, words w2
  21.             where w.wordid=l.LinkedWordID
  22.             and (l.wordid=w2.wordid)
  23.             and charindex(w2.word,@current_str)=1
  24.             and right(left(@current_str, len(w2.word)+1),1) = ' '
  25.             order by len(w2.word) desc
  26.         open current_replacements
  27.         fetch next from current_replacements into @syno_term, @replaced_length
  28.         if @@fetch_status = 0    -- there are replacements
  29.         begin
  30.             set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
  31.             -- iterate on current set of replacements (recursivity)
  32.             while @@fetch_status = 0
  33.             begin
  34.                 --set @syno_term=ammuser._syno(@syno_term, 1)
  35.                 set @outstr=@outstr+'|'+@syno_term
  36.                 fetch next from current_replacements into @syno_term, @replaced_length
  37.             end
  38.             set @outstr=@outstr+'} '
  39.             set @pos=@pos+@replaced_length+1
  40.         end
  41.         else
  42.         begin
  43.             if charindex(' ',@current_str) = 0
  44.                 begin
  45.                     set @outstr=@outstr+@current_str
  46.                     set @pos=@pos+len(@current_str)                
  47.                 end
  48.             else
  49.                 begin
  50.                     set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
  51.                     set @pos=@pos+charindex(' ',@current_str)
  52.                 end
  53.         end
  54.         deallocate current_replacements
  55.     end
  56.     SELECT @outstr
  57.  
Aug 13 '07 #4
azimmer
200 Recognized Expert New Member
so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time.

Expand|Select|Wrap|Line Numbers
  1. ALTER PROCEDURE ParseString
  2.     @instr as varchar(max)
  3. AS
  4.  
  5.     declare @pos as int
  6.     declare @current_str as varchar(max)
  7.     declare @syno_term as varchar(max)
  8.     declare @replaced_length as int
  9.     declare @outstr as varchar(max)
  10.  
  11.     set @outstr = ''
  12.     set @pos=1
  13.     while @pos<=len(@instr)
  14.     begin
  15.         -- try from current position on
  16.         set @current_str=right(@instr,len(@instr)-@pos+1)
  17.         declare current_replacements cursor local for
  18.             -- this is tricky
  19.             select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
  20.             from words w, LinkedWords l, words w2
  21.             where w.wordid=l.LinkedWordID
  22.             and (l.wordid=w2.wordid)
  23.             and charindex(w2.word,@current_str)=1
  24.             and right(left(@current_str, len(w2.word)+1),1) = ' '
  25.             order by len(w2.word) desc
  26.         open current_replacements
  27.         fetch next from current_replacements into @syno_term, @replaced_length
  28.         if @@fetch_status = 0    -- there are replacements
  29.         begin
  30.             set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
  31.             -- iterate on current set of replacements (recursivity)
  32.             while @@fetch_status = 0
  33.             begin
  34.                 --set @syno_term=ammuser._syno(@syno_term, 1)
  35.                 set @outstr=@outstr+'|'+@syno_term
  36.                 fetch next from current_replacements into @syno_term, @replaced_length
  37.             end
  38.             set @outstr=@outstr+'} '
  39.             set @pos=@pos+@replaced_length+1
  40.         end
  41.         else
  42.         begin
  43.             if charindex(' ',@current_str) = 0
  44.                 begin
  45.                     set @outstr=@outstr+@current_str
  46.                     set @pos=@pos+len(@current_str)                
  47.                 end
  48.             else
  49.                 begin
  50.                     set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
  51.                     set @pos=@pos+charindex(' ',@current_str)
  52.                 end
  53.         end
  54.         deallocate current_replacements
  55.     end
  56.     SELECT @outstr
  57.  
Is good that you have the ambition and time to play around with it as I have very little time at the moment. Still I've been thinking about whole word replacements. One of the tough issues is that you need to have a word terminator at both ends of the term to be replaced - and word terminators come in many forms (not only space, but period, comma, semicolon, dash, new line, tab, etc as well - perhaps whatever is not a letter (but then what about '@' and digits?).) It should somehow be taken care of in the cursor select to make sure no false replacements are signaled (an that's not easy). Alternatively it may be possible to modify the select so that it sometimes gives false possibilities and then check and disregard them (but that's ain't easy either).

Slowly we're venturing into the realms of grep and awk - implemented in a SQL Server SP :-) (Just kidding; we're miles away.)
Aug 14 '07 #5

Sign in to post your reply or Sign up for a free account.

Similar topics

2
22044
by: N | last post by:
Hi, I would like to parse out each value that is seperated by a comma in a field and use that value to join to another table. What would be the easiest way to do so without having to write a function or routine ? EX. Table AAA COL1 COL2
22
872
by: Ram Laxman | last post by:
Hi all, I have a text file which have data in CSV format. "empno","phonenumber","wardnumber" 12345,2234353,1000202 12326,2243653,1000098 Iam a beginner of C/C++ programming. I don't know how to tokenize the comma separated values.I used strtok function reading line by line using fgets.but it gives some weird behavior.It doesnot stripout the "" fully.Could any body have sample code for the same so that it will be helfful for my...
24
3176
by: | last post by:
Hi, I need to read a big CSV file, where different fields should be converted to different types, such as int, double, datetime, SqlMoney, etc. I have an array, which describes the fields and their types. I would like to somehow store a reference to parsing operations in this array (such as Int32.Parse, Double.Parse, SqlMoney.Parse, etc), so I can invoke the appropriate one without writing a long switch.
0
1105
by: Steve Jorgensen | last post by:
I'm wondering if there's an approach to writing consistent code to read/write XML data in arbitrary order that I'm simply missing. It seems to be easy getting stuff -out- of a DOM via XPath, but it's much tougher building a DOM document in arbitrary order. Yes - I can get the parent context element first, using XPath, but then I build custom wrappers and helpers to simplify the building and adding fragments in the correct namespace,...
11
3621
by: hoopsho | last post by:
Hi Everyone, I am trying to write a program that does a few things very fast and with efficient use of memory... a) I need to parse a space-delimited file that is really large, upwards fo a million lines. b) I need to store the contents into a unique hash. c) I need to then sort the data on a specific field. d) I need to pull out certain fields and report them to the user.
5
2251
by: Doug Bell | last post by:
Hi, I have some data in the format of yyyymmdd (Int32) that I want to display, in a datagrid, formatted in International Date Format dd MM yyyy. Any suggestions on an efficient approach? Thanks, Doug
5
2986
by: BMeyer | last post by:
I have been losing my mind trying to parse an XML document (with nested child elements, not all of which appear in each parent node) into a DataGrid object. What I want to do is "flatten" the XML document into a text document with a single row for each parent node (that has all of the values from all of the child nodes for that row) The DataView within VS 2005 IDE displays my 15 or so child tables - and knows that some parent rows...
1
2713
by: Dan Somdahl | last post by:
Hi, I am new to ASP but have what should be a fairly simple task that I can't figure out. I need to parse a string from a single, semi-colon delimited, 60 character field (el_text) in a recordset and display the results in a table on a webpage (ASP) I can retrieve the recordset from the database and display the field data results in rows of a table but have the entire 60 character string in one cell. I need to break that string apart...
3
6929
by: Leighya | last post by:
Im currently working on this xml file but when i load it to Mozilla, i got an error "Error Loading Stylesheet: Xpath parse failure: invalid variable name" It loads on IE properly. Only with the mozilla browser i get an error. Would someone help me pls. Here is the code: <?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="html"...
0
10363
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10172
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10110
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9964
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8993
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
6749
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5398
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5535
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
2
3670
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.