
Need for Speed

I have a process that I want to speed up. It was first written in
Microsoft Access and then converted to VB.NET. I would like to hear
some suggestions on how to speed it up. The process matches names
using a Soundex function. It reads through a table of about 20,000
records and matches it against a table of about 9,000 records. That is
all done in code. The tables are stored in SQL Server 2005.

Soundex is a phonetic algorithm that generates a code from a string.
The first step is to generate the Soundex value for the 20,000
records. The matching table of 9,000 records already has its Soundex
value calculated. Like I said, this part is done in code. It runs for
about 6 minutes.
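For reference, the routine is roughly of this shape. This is only a
simplified VB.NET sketch (it ignores the H/W adjacency rule, so edge
cases may differ from the function we actually use):

' Simplified Soundex: first letter plus three digits.
Public Function Soundex(ByVal name As String) As String
    If String.IsNullOrEmpty(name) Then Return String.Empty

    ' Digit for each letter A..Z ("0" means the letter is dropped).
    Const codes As String = "01230120022455012623010202"
    Dim result As New System.Text.StringBuilder()
    Dim lastCode As Char = " "c

    For Each ch As Char In name.ToUpperInvariant()
        If ch < "A"c OrElse ch > "Z"c Then Continue For
        Dim code As Char = codes.Chars(AscW(ch) - AscW("A"c))
        If result.Length = 0 Then
            result.Append(ch)        ' keep the first letter as-is
        ElseIf code <> "0"c AndAlso code <> lastCode Then
            result.Append(code)      ' append the digit, collapsing repeats
        End If
        lastCode = code
        If result.Length = 4 Then Exit For
    Next

    Return result.ToString().PadRight(4, "0"c)
End Function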

Based on the Soundex values, it generates about 300,000 matches. For
each of those 300,000 matches, the two names then have to be compared
to get a percent match. We have tried to write a stored proc to handle
this, but it is really slow. The 300,000 records are first written to
a table, and then each of the 300,000 gets its percent match
calculated. This takes anywhere from 10 to 15 minutes.

Any suggestions on how to speed this up?

Nov 20 '06 #1
"Paul" <pw****@hotmail .comwrote in news:1164037715 .384036.201420
@e3g2000cwe.goo glegroups.com:
I have a process that I want to speed up. It first was written in
Microsoft Access then converted to VB.NET. I would like to hear some
suggestions on how to speed it up. The process is to match names using
a Soundex function. It reads through a table of about 20,000 records
and matches it with a table of about 9,000 records. That is all done
in code. The tables are stored in SQL Server 2005.
Why aren't you using SQL Server's Soundex or Full-Text Search capabilities?
Based off of the Soundex values, it generates about 300,000 matches.
Each of these 300,000 matches then has to compare the names to get a
percent match. We have tried to write a stored proc to handle this,
but it is really slow. The 300,000 records are first being written to
a table and then each of the 300,000 get their percent match
calculated. This takes anywhere from 10 to 15 minutes.
Why is your function returning 300,000 matches? Perhaps your filters aren't
specific enough?
Any suggestions on how to speed this up?
Anything wrong with SQL Server's Soundex function?

http://msdn2.microsoft.com/en-us/library/ms187384.aspx
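For example, the whole candidate-generation step can be pushed into a
single set-based query. This is only a sketch; the table and column
names (SourceNames, TargetNames, Name) are invented:

' Sketch only. Requires Imports System.Data.SqlClient at the top of the file.
Public Sub MatchWithSqlSoundex(ByVal connectionString As String)
    Dim sql As String = _
        "SELECT s.SourceId, t.TargetId, s.Name AS SourceName, t.Name AS TargetName " & _
        "FROM dbo.SourceNames AS s " & _
        "JOIN dbo.TargetNames AS t ON SOUNDEX(s.Name) = SOUNDEX(t.Name)"

    Using cn As New SqlConnection(connectionString)
        Using cmd As New SqlCommand(sql, cn)
            cn.Open()
            Using reader As SqlDataReader = cmd.ExecuteReader()
                While reader.Read()
                    ' Score or store each candidate pair here.
                End While
            End Using
        End Using
    End Using
End Sub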
Nov 20 '06 #2
Yes, I thought about using that, but that is not really where the time
problem is. Plus, I do a lot of other massaging of the data before the
Soundex values are calculated. The real time problems are when the
300,000 records are first INSERTed and then the same 300,000 are
processed to calculate the percent match.

We have two ways of doing this. The first was through code and the
second was through a stored proc. Both take about the same amount of
time.

Nov 20 '06 #3
Assuming it doesn't change unless the data stored therein
changes, store the Soundex numbers for the 20,000 records
instead of calculating them on the fly.
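For example (a sketch with invented table and column names), SQL Server
2005 can keep that stored value in sync automatically: SOUNDEX is
deterministic, so it can live in a persisted computed column, which can
also be indexed for the join. Run this kind of one-time schema change
from Management Studio or from code:

' Requires Imports System.Data.SqlClient; cn must already be open.
Public Sub AddPersistedSoundexColumn(ByVal cn As SqlConnection)
    ' Invented names; SOUNDEX is deterministic, so PERSISTED is allowed.
    Dim ddl As String = _
        "ALTER TABLE dbo.SourceNames " & _
        "ADD SoundexCode AS SOUNDEX(LastName) PERSISTED"
    Using cmd As New SqlCommand(ddl, cn)
        cmd.ExecuteNonQuery()
    End Using
End Sub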

How are you comparing the names? Are they identical?
Could you do an inner join to pick up the ones that
match and compare the Soundex values at the same time?

If you can't do the compares at the same time, which
gives you a smaller sample? Can you start with that
and then apply the other compare? Do they have to be
done in a specific order?

If you have 20,000 records and 9,000 records and matching
them gives you 300,000 records, you have a serious Cartesian
join going on. Is there any way you can reduce that?

Not knowing your data or your data structures, those are
my first ideas.

Robin S.
Nov 20 '06 #4
Thanks for the reply Robin. Below are my answers...

RobinS wrote:
> Assuming it doesn't change unless the data stored therein
> changes, store the Soundex numbers for the 20,000 records
> instead of calculating them on the fly.

I will consider that. It would save time, but it would require some
reengineering, since we do not store the Soundex value for future use.
We use it and then discard it.

> How are you comparing the names? Are they identical?
> Could you do an inner join to pick up the ones that
> match and compare the Soundex values at the same time?

99.99% of the names (if not 100%) will not match exactly, so that will
not help.

> If you can't do the compares at the same time, which
> gives you a smaller sample? Can you start with that
> and then apply the other compare? Do they have to be
> done in a specific order?

I'm not sure what you are getting at here, but the matches are not
done in any order.

> If you have 20,000 records and 9,000 records and matching
> them gives you 300,000 records, you have a serious Cartesian
> join going on. Is there any way you can reduce that?

What you said is correct, and that is my problem. The 300,000 records
are what I get when the join is done between the 20,000 Soundex values
and the 9,000 Soundex values. There really is no way to get around the
300,000 records; they are valid matches according to the Soundex
values. Then I calculate the percent match between the names. We need
to record the highest percent match for the pairs that fall below 80%
and store all matches that are 80% and above.
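The percent-match calculation itself is along these lines. This VB.NET
sketch uses an edit-distance ratio, which is only an assumption; the
actual formula we use may differ:

' Assumed formula: 100 * (1 - edit distance / length of the longer name).
Public Function PercentMatch(ByVal a As String, ByVal b As String) As Double
    If a.Length = 0 AndAlso b.Length = 0 Then Return 100.0

    ' Classic dynamic-programming Levenshtein edit distance.
    Dim d(a.Length, b.Length) As Integer
    For i As Integer = 0 To a.Length
        d(i, 0) = i
    Next
    For j As Integer = 0 To b.Length
        d(0, j) = j
    Next

    For i As Integer = 1 To a.Length
        For j As Integer = 1 To b.Length
            Dim cost As Integer = 1
            If a.Chars(i - 1) = b.Chars(j - 1) Then cost = 0
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), _
                               d(i - 1, j - 1) + cost)
        Next
    Next

    Dim maxLen As Integer = Math.Max(a.Length, b.Length)
    Return 100.0 * (1.0 - d(a.Length, b.Length) / CDbl(maxLen))
End Function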
> Not knowing your data or your data structures, those are
> my first ideas.
>
> Robin S.
Nov 20 '06 #5
"Paul" <pw****@hotmail .comwrote in news:1164045599 .255769.59430
@h54g2000cwb.go oglegroups.com:
Yes, I thought about using that, but that is not really where the time
problem is. Plus, I do a lot other massaging of the data prior to when
the Soundex values are calculated. The real time problems are when the
300,000 are first INSERTed and then the same 300,000 are processed to
calculate the percent match.
Inserting 300,000 records is ALOT of data. How are you doing this right
now? Perhaps take a look at SQL DTS or SQL Integration Services.

Or even better - massage the data during the load process... so that you
have clean data to work with : )
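From .NET 2.0 code, SqlBulkCopy is another way to get those rows in
quickly, streaming everything to the server instead of issuing 300,000
individual INSERTs. A sketch only, with an invented destination table:

' Requires Imports System.Data and Imports System.Data.SqlClient.
' The DataTable's columns must line up with the destination table.
Public Sub BulkInsertMatches(ByVal connectionString As String, ByVal matches As DataTable)
    Using bulk As New SqlBulkCopy(connectionString)
        bulk.DestinationTableName = "dbo.CandidateMatches"
        bulk.BatchSize = 10000       ' send rows to the server in chunks
        bulk.BulkCopyTimeout = 0     ' no timeout for a long-running load
        bulk.WriteToServer(matches)
    End Using
End Sub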
Nov 20 '06 #6

Seems like the bottleneck is the I/O for the 300,000 records, which is
actually not that much for SQL Server. I once wrote a program that
inserted millions of records into a SQL database. It was a 12 GB MySQL
dump file that I needed to import as fast as possible, because it was
production data coming from a third party.

I tried everything. ODBC to a separate MySQL DB took 14 hours to
complete. DTS wasn't possible, as the file contained SQL statements
(DDL and inserts in MySQL native format)... not possible... long
story... :-)

In the end I converted the data in memory and created separate SQL
INSERT statements on the fly. I saved these statements in batches of
1,000 inserts, separated by semicolons ( ; ), in a StringBuilder and
executed the SQL on a command object. The whole program now took
minutes to complete.

So my advice: batch your I/O in this way and you will see a huge
performance benefit.

Regards

Michel Posseth
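In VB.NET that looks roughly like this. It is only a sketch: the batch
size of 1,000, the connection handling and the shape of the INSERT text
are placeholders, and parameterised commands or bulk copy would be
safer than raw string concatenation:

' Requires Imports System.Data.SqlClient, System.Text and
' System.Collections.Generic. "rows" already holds individual INSERT
' statements, e.g. "INSERT dbo.Matches (SourceId, TargetId) VALUES (1, 2)".
' cn must already be open.
Public Sub InsertInBatches(ByVal cn As SqlConnection, ByVal rows As List(Of String))
    Dim sb As New StringBuilder()
    Dim countInBatch As Integer = 0

    For Each insertStatement As String In rows
        sb.Append(insertStatement).Append(";"c)
        countInBatch += 1

        If countInBatch = 1000 Then
            Using cmd As New SqlCommand(sb.ToString(), cn)
                cmd.ExecuteNonQuery()   ' one round trip per 1,000 inserts
            End Using
            sb.Length = 0
            countInBatch = 0
        End If
    Next

    If sb.Length > 0 Then   ' flush the final partial batch
        Using cmd As New SqlCommand(sb.ToString(), cn)
            cmd.ExecuteNonQuery()
        End Using
    End If
End Sub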
"Paul" <pw****@hotmail .comschreef in bericht
news:11******** **************@ f16g2000cwb.goo glegroups.com.. .
Thanks for the reply Robin. Below are my answers...

RobinS wrote:
>Assuming it doesn't change unless the data stored therein
changes, store the Soundex numbers for the 20,000 records
instead of calculating it on the fly.

I will consider that. It would save time, but require more of a
reengineering since we do not store the Soundex value for future use.
We use it and then discard it.
>>
How are you comparing the names? Are they identical?
Could you do an inner join to pick up the ones that
match and compare the Soundex values at the same time?

99.99% of the names (or 100%) will not match exactly. So that will not
help.
>>
If you can't do the compares at the same time, which
gives you a smaller sample? Can you start with that
and then apply the other compare? Do they have to be
done in a specific order?

I'm not sure what you are getting at here. But the matches are not
done in any order.
>>
If you have 20,000 records and 9,000 records and matching
them gives you 300,000 records, you have a serious cartesian
join going on. Is there any way you can reduce that?

What you said is correct and that is my problem. The 300,000 records
are what I get when the join is done between the 20,000 Soundex values
and the 9,000 Soundex values. There really is no way to get around the
300,000 records. The 300,000 records are valid matches according to
the Soundex values. Then I calculate the percent match between the
names. We need to record the highest percent match for those that
match below 80% and store all matches that are 80% and above.
>>
Not knowing your data or your data structures, those are
my first ideas.

Robin S.

Nov 26 '06 #7
Michel,

It took me a while to see that what you wrote is, in a way, the advice
I wanted to give as well.

Get the data using a DataReader.
Process it in memory on a separate client (or even on the server).
Write it back using Command.ExecuteNonQuery with an INSERT command.

Be aware that SQL code, even in a stored procedure, is not compiled
code (although it seems it can reuse some things on the fly).

Is it the same?

Cor
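As a sketch of that pattern (invented table and column names, plus a
Soundex helper like the one earlier in the thread): load the 9,000
reference rows into a dictionary keyed on the Soundex code, then stream
the 20,000 source rows once with a DataReader and do the candidate
lookups in memory.

' Requires Imports System.Data.SqlClient and System.Collections.Generic.
' cn must already be open; Soundex() is the helper sketched earlier.
Public Sub MatchInMemory(ByVal cn As SqlConnection)
    Dim byCode As New Dictionary(Of String, List(Of String))()

    Using cmd As New SqlCommand("SELECT SoundexCode, Name FROM dbo.TargetNames", cn)
        Using r As SqlDataReader = cmd.ExecuteReader()
            While r.Read()
                Dim code As String = r.GetString(0)
                If Not byCode.ContainsKey(code) Then byCode(code) = New List(Of String)()
                byCode(code).Add(r.GetString(1))
            End While
        End Using
    End Using

    Using cmd As New SqlCommand("SELECT Name FROM dbo.SourceNames", cn)
        Using r As SqlDataReader = cmd.ExecuteReader()
            While r.Read()
                Dim name As String = r.GetString(0)
                Dim candidates As List(Of String) = Nothing
                If byCode.TryGetValue(Soundex(name), candidates) Then
                    ' Compute the percent match for each candidate here and
                    ' queue the results for a batched write-back.
                End If
            End While
        End Using
    End Using
End Sub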

"Michel Posseth [MCP]" <MS**@posseth.c omschreef in bericht
news:ur******** **********@TK2M SFTNGP03.phx.gb l...
>
Seems like the Bottle neck is the IO of the 300.000 records
wich is actually not so much for SQL server , i wrote ones a proggy that
inserted millions of records in a sql database this was 12 GB mysql dump
file i neede to do this import as fast as possible as it was production
data coming from a third party .
I tried everyhting , ODBC to a seperate mysql DB took 14 hours to complete
, DTS wasn`t possible as the file contained SQL statements ( DDL , and
inserts in MYSQL native format ) ....and ....... not possible
........Long story .......:-)

In the end i converted the data in memory , and created sperate SQL
insert startements on the fly , i saved these statements in batches of
1000 inserts seperated by a ; ( dot comma ; ) in a string builder and
ececuted the SQL on a command object , the hole proggy took now minutes to
complete.

so my advise batch your IO in this way and you will see a hughe
performance benefit
regards

Michel Posseth
"Paul" <pw****@hotmail .comschreef in bericht
news:11******** **************@ f16g2000cwb.goo glegroups.com.. .
>Thanks for the reply Robin. Below are my answers...

RobinS wrote:
>>Assuming it doesn't change unless the data stored therein
changes, store the Soundex numbers for the 20,000 records
instead of calculating it on the fly.

I will consider that. It would save time, but require more of a
reengineerin g since we do not store the Soundex value for future use.
We use it and then discard it.
>>>
How are you comparing the names? Are they identical?
Could you do an inner join to pick up the ones that
match and compare the Soundex values at the same time?

99.99% of the names (or 100%) will not match exactly. So that will not
help.
>>>
If you can't do the compares at the same time, which
gives you a smaller sample? Can you start with that
and then apply the other compare? Do they have to be
done in a specific order?

I'm not sure what you are getting at here. But the matches are not
done in any order.
>>>
If you have 20,000 records and 9,000 records and matching
them gives you 300,000 records, you have a serious cartesian
join going on. Is there any way you can reduce that?

What you said is correct and that is my problem. The 300,000 records
are what I get when the join is done between the 20,000 Soundex values
and the 9,000 Soundex values. There really is no way to get around the
300,000 records. The 300,000 records are valid matches according to
the Soundex values. Then I calculate the percent match between the
names. We need to record the highest percent match for those that
match below 80% and store all matches that are 80% and above.
>>>
Not knowing your data or your data structures, those are
my first ideas.

Robin S.


Nov 26 '06 #8
not really where the problem is?

are you still using MDB?

MDB is for fucking retards. Lose the training wheels, jackass.

-Aaron
SQL DBA / Architect
Nov 26 '06 #9
Aaron, it's like trying to educate pork. Let the silly fucker go and
make a fool of himself.

The Grand Master
Nov 27 '06 #10
