
Large text file import: MVP question

Avi
Hi

I need to read a large set of text files (9 GB+ each) into a database
table, splitting each record based on fixed-width field lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi
May 7 '06 #1

MVP question... what is MVP about it?

As you do not provide enough information, it is impossible to give you the
answer you probably expect.

Useful things to know are:

how is the data delivered? (are conversions necessary?)
what is the target database
what do you currently use
etc., etc.

To give you an idea of what is possible:

A year ago I had the task of importing MySQL dump files into MS SQL Server 2000.
These files were roughly 7 GB each and contained the MySQL dialect, so using
DTS was not an option: each file held a few million data rows and then a block
of DDL SQL in the MySQL dialect, and so on.

So I wrote a simple tool in VB.NET 2005 containing a file reader, a
StringBuilder and a SQL command object. I read the file line by line and first
collected all the DDL SQL, converted that dialect to MS T-SQL and executed it
on the command object (so I then had all the required tables). Then I read the
file again and pulled in all the data blocks (INSERT statements, translated on
the fly), executing them in batches of 1000 rows at a time.

The whole import took 3 minutes; my colleague set up a MySQL server and
connected with ODBC, and the same operation took 4 hours :-)

I hope this gives you some ideas.
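
A minimal sketch of that batched, two-pass approach (second pass only): the
file name, connection string and the TranslateDialect helper below are
placeholders for illustration, not the original tool's code.

Imports System.IO
Imports System.Text
Imports System.Data.SqlClient

Module BatchImport
    Sub Main()
        Using cn As New SqlConnection("Server=.;Database=Target;Integrated Security=SSPI")
            cn.Open()
            Dim cmd As SqlCommand = cn.CreateCommand()
            Dim batch As New StringBuilder()
            Dim rows As Integer = 0

            ' Read the dump line by line and collect translated INSERT statements.
            Using rd As New StreamReader("dump.sql")
                Dim line As String = rd.ReadLine()
                Do While line IsNot Nothing
                    If line.StartsWith("INSERT", StringComparison.OrdinalIgnoreCase) Then
                        batch.AppendLine(TranslateDialect(line))
                        rows += 1
                        If rows = 1000 Then
                            ' One round-trip per 1000 rows instead of one per row.
                            cmd.CommandText = batch.ToString()
                            cmd.ExecuteNonQuery()
                            batch.Length = 0
                            rows = 0
                        End If
                    End If
                    line = rd.ReadLine()
                Loop
            End Using

            If rows > 0 Then
                cmd.CommandText = batch.ToString()
                cmd.ExecuteNonQuery()
            End If
        End Using
    End Sub

    Private Function TranslateDialect(ByVal sql As String) As String
        ' Placeholder for the MySQL-to-T-SQL rewriting (backticks, escaping, types).
        Return sql.Replace("`", "")
    End Function
End Module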

regards

Michel Posseth [MCP]
"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi

May 7 '06 #2
Avi,

For this kind of question you could just as well ask a plumber for the
answer: water cannot flow through a pipe any faster than it flows through the
narrowest part of the pipe.

The same goes for data. So in my opinion you should always concentrate on
that narrowest part; any other optimisation is a waste of time.

That the workstation software is too slow is, in my experience, seldom the
case, unless you have old workstations (P3 < 600 MHz).

That the network traffic is slow is a fair chance, and it depends entirely
on the topology of the network and its throughput; with big chunks you usually
get into trouble sooner.

That the server is the place in trouble is where I would look first, and I
would try to push as much work as possible to the client.

The last part is exactly as Michel wrote it; that is how I think about it too.

This is of course assuming it is not Citrix or another Terminal Server setup.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi

May 7 '06 #3
Avi
With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit:

I am not interested in a simple answer - been there, done that: SQL BCP,
DTS, file/stream readers and writers, splits, substrings, Mids - the whole
nine yards.

I am testing this on a quad Xeon with 4 GB of RAM and a dual fibre channel SAN
array - not some small home PC. Network issues have been accounted for (and are
not part of the baseline improvement that I am trying to find).

The issue is simply this: what is the fastest way to read in a fixed-width
text file and split it apart into its constituent fields? Don't worry about the
return trip to the DB; that's irrelevant for now. So far the benchmark has been
OK - but it's not brilliant. The disk I/O is well short of the hardware's
capabilities, and CPU and memory are barely in use. That means either our code
is crap or the .NET capability is crap.

We have simplified our code down to a simple, basic test. So now I'm focusing
on the underlying .NET solutions.

I'm looking for real performance answers. If you don't have them - fine,
I'll look elsewhere. I just thought a high-end question would require a
high-end programmer with high-end experience.

Hope this clears it up. If not, perhaps you can recommend a plumber who
might know something....... :)

Avi
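
For reference, the bare "read and chop" step in .NET usually starts from a
single StreamReader with a large buffer and one Substring per field, along the
lines of the sketch below (the field widths and file name are invented for
illustration, not taken from this thread):

Imports System.IO
Imports System.Text

Module FixedWidthTest
    Sub Main()
        ' Hypothetical layout: 10-char id, 30-char name, 8-char date.
        Dim widths() As Integer = {10, 30, 8}
        Dim count As Long = 0

        ' 1 MB read buffer; a bigger buffer means fewer trips to the disk.
        Using rd As New StreamReader("bigfile.txt", Encoding.ASCII, False, 1 << 20)
            Dim line As String = rd.ReadLine()
            Do While line IsNot Nothing
                Dim fields(widths.Length - 1) As String
                Dim pos As Integer = 0
                For i As Integer = 0 To widths.Length - 1
                    fields(i) = line.Substring(pos, widths(i))
                    pos += widths(i)
                Next
                count += 1
                line = rd.ReadLine()
            Loop
        End Using

        Console.WriteLine("Rows parsed: " & count.ToString())
    End Sub
End Module

If the disk still sits mostly idle with something this simple, the next things
worth measuring are a raw FileStream read with a large block size and, once the
database is back in the picture, a bulk-load API instead of row-by-row inserts.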
"Cor Ligthert [MVP]" wrote:
Avi,

For this kind of question you can better ask a plumber to give you the
answer, water cannot go faster throug a pipe than it goes through the
smallest part of the pipe.

The same is with data. So in my opinion you should always concentrate on
that smallest part. Any other optimisation is waste of time.

That the workstation software is to slow, is in my idea seldom or you should
have old workstations P3 < 600Mhz.

That the newwork trafic is slow is no bad chance and is completely dependend
from the typologic of the network and its throughput speed mostly you are
with big chunks faster in trouble.

That the server is the place which is in trouble. That is the place were I
would look first, I would try to bring as much work as possible to the
client.

Where the last part as Michael wrote is exactly as I think about it.

This course assuming that it is not Citrix or other Terminal Server.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi


May 7 '06 #4
Avi
Thanks for the response - but DTS is not an option.

"Spam Catcher" wrote:
"Avi" <Av*@discussions.microsoft.com> wrote in
news:20**********************************@microsoft.com:
There are several ways to complete this, but I am wondering if anyone
has insight into the FASTEST way to read in the data and chop it up
ahead of posting it into the DB.


You can do this via SQL BCP, Bulk Inserts or DTS.

We have a project in which we build a dynamic DTS package via VB.NET to do
imports. The advantage of this is that the DTS graph can be dynamically
built to handle a variety of file layouts. DTS can also handle special
characters during import - something Bulk Inserts doesn't do. With BCP
you'll need to create a format file which is a bit of a hassle if your file
import specifications change a lot.
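
For context on that format-file remark: a non-XML BCP format file for a
fixed-width layout is one header line with the bcp version, one line with the
field count, and then one line per field. The column names and widths below are
invented for illustration:

8.0
3
1   SQLCHAR   0   10   ""       1   CustomerID     SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   30   ""       2   CustomerName   SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   8    "\r\n"   3   OrderDate      SQL_Latin1_General_CP1_CI_AS

The columns are: field order, host data type, prefix length, field length,
terminator, target column order, target column name and collation. Every layout
change means editing this file, which is the hassle being referred to.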

May 7 '06 #5
Avi
Read my response to Cor. This is very MVP.
"Michel Posseth [MCP]" wrote:

MVP Question ,,, what is MVP about it ????

As you do not provide enough information it is impossiible to give you the
answer you probably expect .

nice things to know are :

how is the data delivered ? ( are conversions necessary)
what is the target database
what do you currently use
etc etc etc

To give you an idea what is possible :

I had one year ago the task to import MYSQL dump files to MS SQL server 2000
these files were + - 7 GB big and contained the MYSQL dialect
so using DTS was not an option as i had a few million data rows and then a
block of DDL SQL with the MYSQL dialect etc etc etc

so i wrote a simple tool in VB.Net 2005 containing a file reader a
stringbuilder and a SQL command object , i read the file line by line and
first searched all DDL SQL , i converted this SQL Dialect to MS T-SQL and
execuited it on the command object ( so now i had all the required tables )
, now i read the file again
and read in all data blocks ( Insert statements, that were translated on
the fly ) i executed these lines in batches of 1000 rows in a time
the hole import took 3 minutes ,,, my collegue setup a MYSQL server and
connected with ODBC the same operation took 4 Hours :-)
so i hope to have given you some ideas

regards

Michel Posseth [MCP]
"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi


May 7 '06 #6
Avi,

And still the answer should be simple. Simple solutions give fast-processing
applications.

One thing you have to keep in mind: although something may compile down to a
single IL instruction, it can still mean thousands of iterations underneath.
The 8086 processor family and its descendants are not so advanced that they can
do these things in one cycle, so behind your code there is a lot of looping.
The methods from the Microsoft.VisualBasic namespace therefore cost more
throughput time than their counterparts in the base .NET namespaces. Normally
that is no problem - humans cannot see milliseconds.

The second thing you have to keep in mind, since you are talking about
splitting, is that the normal string is not mutable. This means that every
change - with Mid, Split, ToLower, or whatever - will result in a new string.
Overdoing this can of course cost you a lot of time.
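
To illustrate that point with a made-up record layout (the field names and
offsets below are only for illustration), every call in the chain allocates a
fresh string, so keeping the per-row string work to one slice per field is one
of the few easy wins:

Module StringCostDemo
    Sub Main()
        Dim line As String = "0000012345" & "John Smith".PadRight(30) & "20060507"

        ' Every Substring/Trim/ToLower call creates a brand-new string object.
        Dim id As String = line.Substring(0, 10).Trim().ToLower()   ' 3 strings allocated
        Dim name As String = line.Substring(10, 30).Trim()          ' 2 strings allocated

        ' Leaner: slice once per field and postpone the cosmetic work
        ' (trimming, casing) until after the load, or do it in the database.
        Dim rawId As String = line.Substring(0, 10)                 ' 1 string allocated

        Console.WriteLine(id & " / " & name & " / " & rawId)
    End Sub
End Module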

But giving you any reasonable answer without seeing even a piece of your
current code is, in my opinion, impossible. If it were possible, there would
not have been so many possibilities.

By the way, you have reduced your chance of an answer very much by writing
that this is an MVP question. Somebody who is dedicated to your problem but who
is not an MVP has probably already said.............. two words mostly said
with the middle finger in the air.

I hope this helps something,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:EE**********************************@microsof t.com...
With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit:

I am not interested in a simple asnwer - been there, done that : SQL BCP/
DTS, filest/stream readers/writers splits, sub strings , mids - the whole
nine yards.

I am testing this on a quad xeon with 4gb ram, dual fibre channel san
array
- not some small, home PC. Network issues have been accounted for (and are
not part of the baseline improvement that I am trying to find).

The issue is simply this: what is the fastest way to read in a fixed width
text file and split it apart by its constituent fields. don't worry about
the
return to the DB, thats irrevelant for now. So far the benchmark has been
ok
- but its not brilliant. the disk I/O is well short of the capabilities.
cpu
and memory are barely in use. That means either our code is crap or the
.net
capabilitiy is crap.

We have simplified our code to a simple basic test. So, now I'm focusing
on
the underlying .net solutions.

I'm looking for real performance answers. If you don't have them - fine,
I'll look elsewhere. I just thought a high end question would require a
high
end programmer with high end experience.

Hope this clears it up. If not, perhaps you can recommend a plumber who
might know something....... :)

Avi
"Cor Ligthert [MVP]" wrote:
Avi,

For this kind of question you can better ask a plumber to give you the
answer, water cannot go faster throug a pipe than it goes through the
smallest part of the pipe.

The same is with data. So in my opinion you should always concentrate on
that smallest part. Any other optimisation is waste of time.

That the workstation software is to slow, is in my idea seldom or you
should
have old workstations P3 < 600Mhz.

That the newwork trafic is slow is no bad chance and is completely
dependend
from the typologic of the network and its throughput speed mostly you are
with big chunks faster in trouble.

That the server is the place which is in trouble. That is the place were
I
would look first, I would try to bring as much work as possible to the
client.

Where the last part as Michael wrote is exactly as I think about it.

This course assuming that it is not Citrix or other Terminal Server.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
> Hi
>
> I need to read in a large set of text files (9GB+ each) into a database
> table based on fixed width lengths.
>
> There are several ways to complete this, but I am wondering if anyone
> has
> insight into the FASTEST way to read in the data and chop it up ahead
> of
> posting it into the DB.
>
> So far, things work, but they are far slower than expected.
>
> Do I need to consider a VC++ app to get a large improvement here?
>
> thx
> Avi


May 8 '06 #7
Read my response to Cor. This is very MVP.

No it isn't - I have never seen an MVP signature on a post by Francesco
Balena, for instance :-)

Having read Cor's answer, I can only fully concur with him.

By the way, I believe I have earned my stripes as a professional programmer
(I have written programs for thousands of users throughout Europe). In my
previous job I worked at an automotive data provider (the market leader in
their business), where I was not only a lead developer but also the SQL Server
(2000, 2005) database administrator.

One of my responsibilities was optimizing data conversions (imports) of
externally delivered data (MySQL, flat text files, etc.).

I worked only with PowerEdges with at least 8 GB of memory, in RAID 10
(and no, I did not have just one of them; we had, I believe, 25 of those beasts
in our server room, besides the other servers the company had).

So, having finished beating my chest, I can tell you that nobody - MVP, MCP,
MCAD, MCSD, or Steve and/or even Bill himself -
can answer your question correctly unless you provide some detailed info,

and you did not do that ... so .....

regards

Michel Posseth
just a software developer

"Avi" wrote:
Read my response to Cor. This is very MVP.
"Michel Posseth [MCP]" wrote:

MVP Question ,,, what is MVP about it ????

As you do not provide enough information it is impossiible to give you the
answer you probably expect .

nice things to know are :

how is the data delivered ? ( are conversions necessary)
what is the target database
what do you currently use
etc etc etc

To give you an idea what is possible :

I had one year ago the task to import MYSQL dump files to MS SQL server 2000
these files were + - 7 GB big and contained the MYSQL dialect
so using DTS was not an option as i had a few million data rows and then a
block of DDL SQL with the MYSQL dialect etc etc etc

so i wrote a simple tool in VB.Net 2005 containing a file reader a
stringbuilder and a SQL command object , i read the file line by line and
first searched all DDL SQL , i converted this SQL Dialect to MS T-SQL and
execuited it on the command object ( so now i had all the required tables )
, now i read the file again
and read in all data blocks ( Insert statements, that were translated on
the fly ) i executed these lines in batches of 1000 rows in a time
the hole import took 3 minutes ,,, my collegue setup a MYSQL server and
connected with ODBC the same operation took 4 Hours :-)
so i hope to have given you some ideas

regards

Michel Posseth [MCP]
"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi


May 8 '06 #8
Avi
Thanks for the incoherent response..... I guess if I look beyond the
defensive language, I'll find something useful.... Oh. I can't.

In many other posts to other newsgroups, if I have not posted enough info, I
simply get asked for it. I have never before been so accosted for asking a
question the wrong way. I thought that was the idea of newsgroups in the first
place.

Time to get off your high perch. I'm not perfect. And I'll bet you aren't
either - MVP or not.

"Cor Ligthert [MVP]" wrote:
Avi,

And still the answer should be simple. Simple solutions give fast processing
applications.

One thing you have to keep in mind. Although there can be one ILS
instruction than still can this mean thousand times of looping. The 8086
processor famillie and its descendants are not that advanced that they can
do things in one cycle so behind your code is a lot of looping or whatever.
All methods from the Visual Basic Namespace are therefore using more
througput time than from the basic Net namespace. Normally no problem,
humans cannot see milliseconds.

The second thing you have to keep in mind because that you are saying
splitting that the normal string is not mutable. This means that every
change, with mid, split, tolower, or whatever will result in a new string.
However overdoing this can of course cost you a lot of time.

But giving you any reasonable answer withouth seeing even a piece of your
current code is in my opinion impossible. If it was, then there would not
have been so many possibilities.

By the way, you have selected the chance on an answer very much by writting
that this is an MVP question. Probably somebody who is dedicated to your
problem however not a MVP has
said.............................................. ........... Two words
mostly said with the middle finger in the air.

I hope this helps something,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:EE**********************************@microsof t.com...
With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit:

I am not interested in a simple asnwer - been there, done that : SQL BCP/
DTS, filest/stream readers/writers splits, sub strings , mids - the whole
nine yards.

I am testing this on a quad xeon with 4gb ram, dual fibre channel san
array
- not some small, home PC. Network issues have been accounted for (and are
not part of the baseline improvement that I am trying to find).

The issue is simply this: what is the fastest way to read in a fixed width
text file and split it apart by its constituent fields. don't worry about
the
return to the DB, thats irrevelant for now. So far the benchmark has been
ok
- but its not brilliant. the disk I/O is well short of the capabilities.
cpu
and memory are barely in use. That means either our code is crap or the
.net
capabilitiy is crap.

We have simplified our code to a simple basic test. So, now I'm focusing
on
the underlying .net solutions.

I'm looking for real performance answers. If you don't have them - fine,
I'll look elsewhere. I just thought a high end question would require a
high
end programmer with high end experience.

Hope this clears it up. If not, perhaps you can recommend a plumber who
might know something....... :)

Avi
"Cor Ligthert [MVP]" wrote:
Avi,

For this kind of question you can better ask a plumber to give you the
answer, water cannot go faster throug a pipe than it goes through the
smallest part of the pipe.

The same is with data. So in my opinion you should always concentrate on
that smallest part. Any other optimisation is waste of time.

That the workstation software is to slow, is in my idea seldom or you
should
have old workstations P3 < 600Mhz.

That the newwork trafic is slow is no bad chance and is completely
dependend
from the typologic of the network and its throughput speed mostly you are
with big chunks faster in trouble.

That the server is the place which is in trouble. That is the place were
I
would look first, I would try to bring as much work as possible to the
client.

Where the last part as Michael wrote is exactly as I think about it.

This course assuming that it is not Citrix or other Terminal Server.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
> Hi
>
> I need to read in a large set of text files (9GB+ each) into a database
> table based on fixed width lengths.
>
> There are several ways to complete this, but I am wondering if anyone
> has
> insight into the FASTEST way to read in the data and chop it up ahead
> of
> posting it into the DB.
>
> So far, things work, but they are far slower than expected.
>
> Do I need to consider a VC++ app to get a large improvement here?
>
> thx
> Avi


May 8 '06 #9
Email me off list (ad***@kjmsolutions.com). I may have a solution for you;
however, it will require a DLL.

Kelly

--
Get powerful web, database, application, and email hosting with KJM
Solutions
http://www.kjmsolutions.com

"Avi" <Av*@discussions.microsoft.com> wrote in message
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi

May 8 '06 #10

I don't know what type of database you are working with, but the fastest
methods are typically the native database bulk operations, as Spam Catcher
mentioned. If you process a 9 GB file line by line it's going to take a while
regardless of what programmatic method you use.
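
On the .NET 2.0 side the native bulk path is exposed as SqlBulkCopy, so one
way to benchmark it (sketched below with invented column names and widths, not
code from this thread) is to parse the fixed-width lines into a DataTable in
chunks and hand each chunk to SqlBulkCopy instead of issuing row-by-row
INSERTs:

Imports System.IO
Imports System.Data
Imports System.Data.SqlClient

Module BulkLoad
    Sub Main()
        Dim table As New DataTable()
        table.Columns.Add("CustomerID", GetType(String))
        table.Columns.Add("CustomerName", GetType(String))
        table.Columns.Add("OrderDate", GetType(String))

        Using cn As New SqlConnection("Server=.;Database=Target;Integrated Security=SSPI")
            cn.Open()
            Using bulk As New SqlBulkCopy(cn)
                bulk.DestinationTableName = "dbo.Orders"
                bulk.BatchSize = 10000

                Using rd As New StreamReader("bigfile.txt")
                    Dim line As String = rd.ReadLine()
                    Do While line IsNot Nothing
                        ' Hypothetical fixed-width layout: 10 + 30 + 8 characters.
                        table.Rows.Add(line.Substring(0, 10), _
                                       line.Substring(10, 30), _
                                       line.Substring(40, 8))
                        If table.Rows.Count = 50000 Then
                            bulk.WriteToServer(table)   ' push one chunk to the server
                            table.Clear()
                        End If
                        line = rd.ReadLine()
                    Loop
                End Using

                If table.Rows.Count > 0 Then
                    bulk.WriteToServer(table)
                End If
            End Using
        End Using
    End Sub
End Module

Wrapping the parser in a custom IDataReader and calling WriteToServer(reader)
avoids buffering rows at all, but the DataTable version is the simplest thing
to measure first.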
Paul
~~~~
Microsoft MVP (Visual Basic)
May 8 '06 #11
