Cleaning data - performance issue

I have developed a data cleaner that extracts some data from a database,
cleans it of illegal/unwanted data and writes it to a CSV file for later
insertion into a SQL Server 2000 database. My problem is that it performs
like an old, lame man :o(

The method is:

public static StringBuilder RemoveChars(StringBuilder dataToClean_, string[] illegalChars_)
{
    // only try to remove chars if there is data to clean
    if (dataToClean_.Length > 0)
    {
        foreach (string s in illegalChars_)
        {
            MatchCollection reg = Regex.Matches(s, @"\\u([0-9A-F]{4})");
            for (int i = 0; i < reg.Count; i++)
            {
                dataToClean_.Replace((char)int.Parse(reg[i].Groups[1].Value, NumberStyles.HexNumber), ' ');
            }
        }
    }
    return dataToClean_;
}

The illegal chars are defined in a config file and are used as a
string array. They are defined in the config file as

\u0000;
\u0009;
\u000A;

The config file is read using Enterprise Library.

The problem is not that the method takes some time to run. The
problem is that the time per call increases the more often the method
is called, which suggests to me that there is some string-handling
issue that I have not taken care of.

So my question is roughly:

How do I most efficiently clean a string of unwanted chars?
Should I work on the individual bytes instead of using a
StringBuilder? The system creates roughly 1 GB of CSV files
in its largest job, so we really need to be able to clean
this amount of data as efficiently as possible.

Any help will be greatly appreciated.

:o)

--
Jesper Stocholm
http://stocholm.dk
Apr 2 '06 #1
It strikes me that the principal problem here is that you're parsing
the illegal characters every time you call the method. You should have
one method which converts the list of illegal characters from a string
array into a char array, and then you can reuse that char array each
time you call the method.

I expect that will be *much* faster than using regular expressions on
each iteration.

One of the key things to spot here is that you're doing the same work
every time the method is called - you're matching the same strings with
the same regular expressions each time. Any time you're looking for
performance gains and you find yourself duplicating effort, that's
somewhere to start.
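
A minimal sketch of the one-off conversion described above, assuming the
config entries look like the "\u0009;" examples in the question. The class
and method names (IllegalCharParser, Parse) are hypothetical and not from
the original code. The regular expression is the same one as in the
original method, but it now runs once per config entry instead of once per
cleaned string.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class IllegalCharParser
{
    // Hypothetical helper: call once (e.g. at startup) and keep the result.
    public static char[] Parse(string[] illegalChars)
    {
        List<char> chars = new List<char>();
        foreach (string s in illegalChars)
        {
            // Same pattern as the original method: a literal \u followed by
            // four uppercase hex digits.
            foreach (Match m in Regex.Matches(s, @"\\u([0-9A-F]{4})"))
            {
                int code = int.Parse(m.Groups[1].Value,
                                     NumberStyles.HexNumber,
                                     CultureInfo.InvariantCulture);
                chars.Add((char)code);
            }
        }
        return chars.ToArray();
    }
}
```

The resulting char array can be stored in a field and handed to the
cleaning method on every call, so no parsing happens in the hot path.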

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Apr 2 '06 #2
Another thing here: an internal Regex object is being created on
every iteration of the loop. The OP would see much better performance by
creating one Regex instance and setting the options on it to pre-compile
the expression (RegexOptions.Compiled). The performance would probably
increase dramatically.
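
A rough sketch of that suggestion, assuming the pattern from the original
method. The names EscapePattern, UnicodeEscape and HexCodes are
hypothetical. Whether RegexOptions.Compiled pays off for inputs this small
would need measuring, since it makes construction slower in exchange for
faster matching.

```csharp
using System.Text.RegularExpressions;

public static class EscapePattern
{
    // Constructed once for the lifetime of the process and then reused.
    private static readonly Regex UnicodeEscape =
        new Regex(@"\\u([0-9A-F]{4})", RegexOptions.Compiled);

    // Hypothetical helper: returns the four hex digits of each \uXXXX
    // escape found in s, in order of appearance.
    public static string[] HexCodes(string s)
    {
        MatchCollection matches = UnicodeEscape.Matches(s);
        string[] codes = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            codes[i] = matches[i].Groups[1].Value;
        }
        return codes;
    }
}
```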

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com
Apr 2 '06 #3
Hi guys,

I did as suggested and moved the parsing of chars outside the method itself,
and I now pass the result as a parameter. I have also dropped the check on
StringBuilder.Length, since it was basically not needed.

The method now works as intended and memory consumption is moderate.
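
The revised method itself was not posted. A sketch of what it might look
like, assuming the parsed characters arrive as a char array (the class name
DataCleaner and the char[] parameter type are assumptions):

```csharp
using System.Text;

public static class DataCleaner
{
    // Assumed revised signature: the caller supplies already-parsed chars.
    public static StringBuilder RemoveChars(StringBuilder dataToClean_, char[] illegalChars_)
    {
        // No Length check: Replace on an empty StringBuilder changes nothing.
        foreach (char c in illegalChars_)
        {
            dataToClean_.Replace(c, ' ');
        }
        return dataToClean_;
    }
}
```

Each call now does only one in-place Replace pass per illegal character,
with no regular expressions and no number parsing.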

Thanks for your input.

:o)

--
Jesper Stocholm
http://stocholm.dk
Where do you buy candy, cola or cigarettes online?
Send me the link via http://ekiosk.dk
Apr 4 '06 #4
