The easiest way for me to explain what I'm trying to do is show an example.
I have the following tables:
WORDS
WordID Word
1 1,000,000
2 a million
3 one million
WORDLINKS
WordID WordLinkID
1 2
1 3
2 1
2 3
3 1
3 2
I would then pass the following sentance into the SP
a million times a day
What the SP needs to do is parse the string and find all the phrases within that match the words in the WORDS table. It would then take the WordID and do a lookup in the WORDLINKS table to find all the possible alternatives/synonimes.
The result would be in the following form:
{a million|1,000,000|one million} times a day
the matched phrase has been wrapped in {} with each matched alternatives/synonimes seperated by a pipe |.
there would be multiple phrases in each input string and possibly nested words
eg in the example above the phrase 'one million', 'one' could have an alternative of '1' in which case the output would be:
{a million|1,000,000|{one|1} million} times a day
i hope this makes sense. any help would really be appreciated.
Thanks
Josh
4 1954
It's going to be a long one... (But a promise is a promise.) What makes it a bit more difficult is the nested replacements (whose proper handling I'm still not 100% sure of): - have it greedy or non-greedy (if you have replacements both for 'one million' and 'one' which one has precedence)?
- If any replacement would result in an expanded string that could be further replaced, what to do? (I.e.: replacements: '1'->'one', 'one million'->'a million', what should the output be for '1 million'? And if replacements are '1'->'one', 'one million'->'1,000,000', what should the output be for '1 million'?)
The code below is greedy and processes nested replacements only to the length of the replaced term and not including the term being replaced (see examples at the end). It has two functions because the SQL Server wouldn't take the default value (don't know why) and I've had no more time to play around with it. You only need to use the one without the underscore.
So here goes the code: -
create function _syno (@instr as varchar(1000), @term_replace as bit = 0)
-
returns varchar(1000)
-
as
-
begin
-
declare @pos as int
-
declare @current_str as varchar(1000)
-
declare @syno_term as varchar(1000)
-
declare @replaced_length as int
-
declare @outstr as varchar(1000)
-
set @outstr=''
-
-
set @pos=1
-
while @pos<=len(@instr)
-
begin
-
-- try from current position on
-
set @current_str=right(@instr,len(@instr)-@pos+1)
-
declare current_replacements cursor local for
-
-- this is tricky
-
select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
-
from words w, wordlinks l, words w2
-
where w.wordid=l.wordlinkid
-
and (l.wordid=w2.wordid)
-
and charindex(w2.word,@current_str)=1
-
and (@term_replace=0 or (w2.word<>@current_str))
-
order by len(w2.word) desc
-
open current_replacements
-
fetch next from current_replacements into @syno_term, @replaced_length
-
if @@fetch_status = 0 -- there are replacements
-
begin
-
set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
-
-- iterate on current set of replacements (recursivity)
-
while @@fetch_status = 0
-
begin
-
set @syno_term=dbo._syno(@syno_term, 1)
-
set @outstr=@outstr+'|'+@syno_term
-
fetch next from current_replacements into @syno_term, @replaced_length
-
end
-
set @outstr=@outstr+'}'
-
set @pos=@pos+@replaced_length
-
end
-
else
-
begin
-
set @outstr=@outstr+left(@current_str,1)
-
set @pos=@pos+1
-
end
-
deallocate current_replacements
-
end
-
return @outstr
-
end
-
GO
-
-
create function syno (@instr as varchar(1000))
-
returns varchar(1000)
-
as
-
begin
-
return dbo._syno(@instr,0)
-
end
-
GO
-
Note: I found the easiest way to get the code itself (without line numbers) is to "Reply" to the message and copy-paste whatever is between '[ CODE]' and '[ /CODE]'.
My WORDS and WORDLINKS tables are as follows: -
WordID Word
-
----------- -------------------
-
1 1,000,000
-
2 a million
-
3 one million
-
4 one
-
5 1
-
-
WordID WordLinkID
-
----------- -----------
-
1 2
-
1 3
-
2 1
-
2 3
-
3 1
-
3 2
-
4 5
-
Examples and usage: -
print dbo.syno('I hope you run it a million times')
-
results: I hope you run it {a million|1,000,000|{one|1} million} times -
print dbo.syno('I hope you run it one million times')
-
results: I hope you run it {one million|1,000,000|a million} times
(note how it will replace 'one million', and will process the 'one' in the 'one million' because that's the whole term to be replaced in the first place) -
print dbo.syno('I hope someone runs it one million times')
-
results: I hope some{one|1} runs it {one million|1,000,000|a million} times
Hope it helps. I don't say it's impossible to refine it (especially in the handling of nested replacements) but I believe it's a good start.
P.S.: It, indeed, would have been a rather steep learning curve from a very basic level...
This is ammmaazzing! I can't thank you enough :)
However :) just a couple more things if you have the time. Would it be possible only to match full words and with regards to the nested replacements i think i may just programmically recall the sp for small parts if needed so no recursion needed (sorry).
the precedence i think should be for the longer phrase.
let me know if there is anything else.
Josh
so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time. -
ALTER PROCEDURE ParseString
-
@instr as varchar(max)
-
AS
-
-
declare @pos as int
-
declare @current_str as varchar(max)
-
declare @syno_term as varchar(max)
-
declare @replaced_length as int
-
declare @outstr as varchar(max)
-
-
set @outstr = ''
-
set @pos=1
-
while @pos<=len(@instr)
-
begin
-
-- try from current position on
-
set @current_str=right(@instr,len(@instr)-@pos+1)
-
declare current_replacements cursor local for
-
-- this is tricky
-
select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
-
from words w, LinkedWords l, words w2
-
where w.wordid=l.LinkedWordID
-
and (l.wordid=w2.wordid)
-
and charindex(w2.word,@current_str)=1
-
and right(left(@current_str, len(w2.word)+1),1) = ' '
-
order by len(w2.word) desc
-
open current_replacements
-
fetch next from current_replacements into @syno_term, @replaced_length
-
if @@fetch_status = 0 -- there are replacements
-
begin
-
set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
-
-- iterate on current set of replacements (recursivity)
-
while @@fetch_status = 0
-
begin
-
--set @syno_term=ammuser._syno(@syno_term, 1)
-
set @outstr=@outstr+'|'+@syno_term
-
fetch next from current_replacements into @syno_term, @replaced_length
-
end
-
set @outstr=@outstr+'} '
-
set @pos=@pos+@replaced_length+1
-
end
-
else
-
begin
-
if charindex(' ',@current_str) = 0
-
begin
-
set @outstr=@outstr+@current_str
-
set @pos=@pos+len(@current_str)
-
end
-
else
-
begin
-
set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
-
set @pos=@pos+charindex(' ',@current_str)
-
end
-
end
-
deallocate current_replacements
-
end
-
SELECT @outstr
-
so i've been playing with what you gave me. i've changed it to a stored procedure as i'm no longer doing recursion. if you get a chance can you look at the code below - it does seem to be working but i'm not sure whether it is written the best way. Thanks again for all your time. -
ALTER PROCEDURE ParseString
-
@instr as varchar(max)
-
AS
-
-
declare @pos as int
-
declare @current_str as varchar(max)
-
declare @syno_term as varchar(max)
-
declare @replaced_length as int
-
declare @outstr as varchar(max)
-
-
set @outstr = ''
-
set @pos=1
-
while @pos<=len(@instr)
-
begin
-
-- try from current position on
-
set @current_str=right(@instr,len(@instr)-@pos+1)
-
declare current_replacements cursor local for
-
-- this is tricky
-
select distinct top 1 with ties w.word [syno_term], len(w2.word) [replaced_length]
-
from words w, LinkedWords l, words w2
-
where w.wordid=l.LinkedWordID
-
and (l.wordid=w2.wordid)
-
and charindex(w2.word,@current_str)=1
-
and right(left(@current_str, len(w2.word)+1),1) = ' '
-
order by len(w2.word) desc
-
open current_replacements
-
fetch next from current_replacements into @syno_term, @replaced_length
-
if @@fetch_status = 0 -- there are replacements
-
begin
-
set @outstr=@outstr+'{'+left(@current_str,@replaced_length)
-
-- iterate on current set of replacements (recursivity)
-
while @@fetch_status = 0
-
begin
-
--set @syno_term=ammuser._syno(@syno_term, 1)
-
set @outstr=@outstr+'|'+@syno_term
-
fetch next from current_replacements into @syno_term, @replaced_length
-
end
-
set @outstr=@outstr+'} '
-
set @pos=@pos+@replaced_length+1
-
end
-
else
-
begin
-
if charindex(' ',@current_str) = 0
-
begin
-
set @outstr=@outstr+@current_str
-
set @pos=@pos+len(@current_str)
-
end
-
else
-
begin
-
set @outstr=@outstr+left(@current_str,charindex(' ',@current_str))
-
set @pos=@pos+charindex(' ',@current_str)
-
end
-
end
-
deallocate current_replacements
-
end
-
SELECT @outstr
-
Is good that you have the ambition and time to play around with it as I have very little time at the moment. Still I've been thinking about whole word replacements. One of the tough issues is that you need to have a word terminator at both ends of the term to be replaced - and word terminators come in many forms (not only space, but period, comma, semicolon, dash, new line, tab, etc as well - perhaps whatever is not a letter (but then what about '@' and digits?).) It should somehow be taken care of in the cursor select to make sure no false replacements are signaled (an that's not easy). Alternatively it may be possible to modify the select so that it sometimes gives false possibilities and then check and disregard them (but that's ain't easy either).
Slowly we're venturing into the realms of grep and awk - implemented in a SQL Server SP :-) (Just kidding; we're miles away.)
Sign in to post your reply or Sign up for a free account.
Similar topics
by: N |
last post by:
Hi,
I would like to parse out each value that is seperated
by a comma in a field and use that value to join to another table.
What would be the easiest way to do so without having to
write a...
|
by: Ram Laxman |
last post by:
Hi all,
I have a text file which have data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
Iam a beginner of C/C++ programming.
I don't know how to...
|
by: |
last post by:
Hi,
I need to read a big CSV file, where different fields should be converted to
different types,
such as int, double, datetime, SqlMoney, etc.
I have an array, which describes the fields and...
|
by: Steve Jorgensen |
last post by:
I'm wondering if there's an approach to writing consistent code to read/write
XML data in arbitrary order that I'm simply missing.
It seems to be easy getting stuff -out- of a DOM via XPath, but...
|
by: hoopsho |
last post by:
Hi Everyone,
I am trying to write a program that does a few things very fast
and with efficient use of memory...
a) I need to parse a space-delimited file that is really large,
upwards fo a...
|
by: Doug Bell |
last post by:
Hi,
I have some data in the format of yyyymmdd (Int32) that I want to display,
in a datagrid, formatted in International Date Format dd MM yyyy.
Any suggestions on an efficient approach?
...
|
by: BMeyer |
last post by:
I have been losing my mind trying to parse an XML document (with nested child
elements, not all of which appear in each parent node) into a DataGrid
object. What I want to do is "flatten" the XML...
|
by: Dan Somdahl |
last post by:
Hi, I am new to ASP but have what should be a fairly simple task that I
can't figure out.
I need to parse a string from a single, semi-colon delimited, 60 character
field (el_text) in a recordset...
|
by: Leighya |
last post by:
Im currently working on this xml file but when i load it to Mozilla, i got an error
"Error Loading Stylesheet: Xpath parse failure: invalid variable name"
It loads on IE properly. Only with the...
|
by: emmanuelkatto |
last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud.
Please let me know.
Thanks!
Emmanuel
|
by: BarryA |
last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
|
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...
| |