
Large text file import: MVP question

Avi
Hi

I need to read a large set of text files (9 GB+ each) into a database
table, splitting each record based on fixed-width field lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi
May 7 '06 #1

MVP question... what is MVP about it?

As you do not provide enough information, it is impossible to give you the
answer you probably expect.

Useful things to know are:

how is the data delivered? (are conversions necessary?)
what is the target database
what do you currently use
etc., etc.

To give you an idea of what is possible:

A year ago I had the task of importing MySQL dump files into MS SQL Server 2000.
These files were roughly 7 GB each and contained the MySQL dialect, so using
DTS was not an option: each file held a few million data rows and then a block
of DDL SQL in the MySQL dialect, and so on.

So I wrote a simple tool in VB.NET 2005 containing a file reader, a
StringBuilder and a SQL command object. I read the file line by line and first
collected all the DDL SQL, converted that dialect to MS T-SQL and executed it
on the command object (so I then had all the required tables). Then I read the
file again and pulled in all the data blocks (INSERT statements, translated on
the fly), executing them in batches of 1000 rows at a time.

The whole import took 3 minutes; my colleague set up a MySQL server and
connected with ODBC, and the same operation took 4 hours :-)

I hope this gives you some ideas.
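
A minimal sketch of that batched, two-pass approach (second pass only): the
file name, connection string and the TranslateDialect helper below are
placeholders for illustration, not the original tool's code.

Imports System.IO
Imports System.Text
Imports System.Data.SqlClient

Module BatchImport
    Sub Main()
        Using cn As New SqlConnection("Server=.;Database=Target;Integrated Security=SSPI")
            cn.Open()
            Dim cmd As SqlCommand = cn.CreateCommand()
            Dim batch As New StringBuilder()
            Dim rows As Integer = 0

            ' Read the dump line by line and collect translated INSERT statements.
            Using rd As New StreamReader("dump.sql")
                Dim line As String = rd.ReadLine()
                Do While line IsNot Nothing
                    If line.StartsWith("INSERT", StringComparison.OrdinalIgnoreCase) Then
                        batch.AppendLine(TranslateDialect(line))
                        rows += 1
                        If rows = 1000 Then
                            ' One round-trip per 1000 rows instead of one per row.
                            cmd.CommandText = batch.ToString()
                            cmd.ExecuteNonQuery()
                            batch.Length = 0
                            rows = 0
                        End If
                    End If
                    line = rd.ReadLine()
                Loop
            End Using

            If rows > 0 Then
                cmd.CommandText = batch.ToString()
                cmd.ExecuteNonQuery()
            End If
        End Using
    End Sub

    Private Function TranslateDialect(ByVal sql As String) As String
        ' Placeholder for the MySQL-to-T-SQL rewriting (backticks, escaping, types).
        Return sql.Replace("`", "")
    End Function
End Module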

regards

Michel Posseth [MCP]
"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi

May 7 '06 #2
Avi,

For this kind of question you could just as well ask a plumber for the
answer: water cannot flow through a pipe any faster than it flows through the
narrowest part of the pipe.

The same goes for data. So in my opinion you should always concentrate on
that narrowest part; any other optimisation is a waste of time.

That the workstation software is too slow is, in my experience, seldom the
case, unless you have old workstations (P3 < 600 MHz).

That the network traffic is slow is a fair chance, and it depends entirely
on the topology of the network and its throughput; with big chunks you usually
get into trouble sooner.

That the server is the place in trouble is where I would look first, and I
would try to push as much work as possible to the client.

The last part is exactly as Michel wrote it; that is how I think about it too.

This is of course assuming it is not Citrix or another Terminal Server setup.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi

May 7 '06 #3
Avi
With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit:

I am not interested in a simple answer - been there, done that: SQL BCP,
DTS, file/stream readers and writers, splits, substrings, Mids - the whole
nine yards.

I am testing this on a quad Xeon with 4 GB of RAM and a dual fibre channel SAN
array - not some small home PC. Network issues have been accounted for (and are
not part of the baseline improvement that I am trying to find).

The issue is simply this: what is the fastest way to read in a fixed-width
text file and split it apart into its constituent fields? Don't worry about the
return trip to the DB; that's irrelevant for now. So far the benchmark has been
OK - but it's not brilliant. The disk I/O is well short of the hardware's
capabilities, and CPU and memory are barely in use. That means either our code
is crap or the .NET capability is crap.

We have simplified our code down to a simple, basic test. So now I'm focusing
on the underlying .NET solutions.

I'm looking for real performance answers. If you don't have them - fine,
I'll look elsewhere. I just thought a high-end question would require a
high-end programmer with high-end experience.

Hope this clears it up. If not, perhaps you can recommend a plumber who
might know something....... :)

Avi
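
For reference, the bare "read and chop" step in .NET usually starts from a
single StreamReader with a large buffer and one Substring per field, along the
lines of the sketch below (the field widths and file name are invented for
illustration, not taken from this thread):

Imports System.IO
Imports System.Text

Module FixedWidthTest
    Sub Main()
        ' Hypothetical layout: 10-char id, 30-char name, 8-char date.
        Dim widths() As Integer = {10, 30, 8}
        Dim count As Long = 0

        ' 1 MB read buffer; a bigger buffer means fewer trips to the disk.
        Using rd As New StreamReader("bigfile.txt", Encoding.ASCII, False, 1 << 20)
            Dim line As String = rd.ReadLine()
            Do While line IsNot Nothing
                Dim fields(widths.Length - 1) As String
                Dim pos As Integer = 0
                For i As Integer = 0 To widths.Length - 1
                    fields(i) = line.Substring(pos, widths(i))
                    pos += widths(i)
                Next
                count += 1
                line = rd.ReadLine()
            Loop
        End Using

        Console.WriteLine("Rows parsed: " & count.ToString())
    End Sub
End Module

If the disk still sits mostly idle with something this simple, the next things
worth measuring are a raw FileStream read with a large block size and, once the
database is back in the picture, a bulk-load API instead of row-by-row inserts.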
"Cor Ligthert [MVP]" wrote:
Avi,

For this kind of question you can better ask a plumber to give you the
answer, water cannot go faster throug a pipe than it goes through the
smallest part of the pipe.

The same is with data. So in my opinion you should always concentrate on
that smallest part. Any other optimisation is waste of time.

That the workstation software is to slow, is in my idea seldom or you should
have old workstations P3 < 600Mhz.

That the newwork trafic is slow is no bad chance and is completely dependend
from the typologic of the network and its throughput speed mostly you are
with big chunks faster in trouble.

That the server is the place which is in trouble. That is the place were I
would look first, I would try to bring as much work as possible to the
client.

Where the last part as Michael wrote is exactly as I think about it.

This course assuming that it is not Citrix or other Terminal Server.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi


May 7 '06 #4
Avi
Thanks for the response - but DTS is not an option.

"Spam Catcher" wrote:
"Avi" <Av*@discussions.microsoft.com> wrote in
news:20**********************************@microsoft.com:
There are several ways to complete this, but I am wondering if anyone
has insight into the FASTEST way to read in the data and chop it up
ahead of posting it into the DB.


You can do this via SQL BCP, Bulk Inserts or DTS.

We have a project in which we build a dynamic DTS package via VB.NET to do
imports. The advantage of this is that the DTS graph can be dynamically
built to handle a variety of file layouts. DTS can also handle special
characters during import - something Bulk Inserts doesn't do. With BCP
you'll need to create a format file which is a bit of a hassle if your file
import specifications change a lot.
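
For context on that format-file remark: a non-XML BCP format file for a
fixed-width layout is one header line with the bcp version, one line with the
field count, and then one line per field. The column names and widths below are
invented for illustration:

8.0
3
1   SQLCHAR   0   10   ""       1   CustomerID     SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   30   ""       2   CustomerName   SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   8    "\r\n"   3   OrderDate      SQL_Latin1_General_CP1_CI_AS

The columns are: field order, host data type, prefix length, field length,
terminator, target column order, target column name and collation. Every layout
change means editing this file, which is the hassle being referred to.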

May 7 '06 #5
Avi
Read my response to Cor. This is very MVP.
"Michel Posseth [MCP]" wrote:

MVP Question ,,, what is MVP about it ????

As you do not provide enough information it is impossiible to give you the
answer you probably expect .

nice things to know are :

how is the data delivered ? ( are conversions necessary)
what is the target database
what do you currently use
etc etc etc

To give you an idea what is possible :

I had one year ago the task to import MYSQL dump files to MS SQL server 2000
these files were + - 7 GB big and contained the MYSQL dialect
so using DTS was not an option as i had a few million data rows and then a
block of DDL SQL with the MYSQL dialect etc etc etc

so i wrote a simple tool in VB.Net 2005 containing a file reader a
stringbuilder and a SQL command object , i read the file line by line and
first searched all DDL SQL , i converted this SQL Dialect to MS T-SQL and
execuited it on the command object ( so now i had all the required tables )
, now i read the file again
and read in all data blocks ( Insert statements, that were translated on
the fly ) i executed these lines in batches of 1000 rows in a time
the hole import took 3 minutes ,,, my collegue setup a MYSQL server and
connected with ODBC the same operation took 4 Hours :-)
so i hope to have given you some ideas

regards

Michel Posseth [MCP]
"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi


May 7 '06 #6
Avi,

And still the answer should be simple. Simple solutions give fast-processing
applications.

One thing you have to keep in mind: although something may compile down to a
single IL instruction, it can still mean thousands of iterations underneath.
The 8086 processor family and its descendants are not so advanced that they can
do these things in one cycle, so behind your code there is a lot of looping.
The methods from the Microsoft.VisualBasic namespace therefore cost more
throughput time than their counterparts in the base .NET namespaces. Normally
that is no problem - humans cannot see milliseconds.

The second thing you have to keep in mind, since you are talking about
splitting, is that the normal string is not mutable. This means that every
change - with Mid, Split, ToLower, or whatever - will result in a new string.
Overdoing this can of course cost you a lot of time.
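
To illustrate that point with a made-up record layout (the field names and
offsets below are only for illustration), every call in the chain allocates a
fresh string, so keeping the per-row string work to one slice per field is one
of the few easy wins:

Module StringCostDemo
    Sub Main()
        Dim line As String = "0000012345" & "John Smith".PadRight(30) & "20060507"

        ' Every Substring/Trim/ToLower call creates a brand-new string object.
        Dim id As String = line.Substring(0, 10).Trim().ToLower()   ' 3 strings allocated
        Dim name As String = line.Substring(10, 30).Trim()          ' 2 strings allocated

        ' Leaner: slice once per field and postpone the cosmetic work
        ' (trimming, casing) until after the load, or do it in the database.
        Dim rawId As String = line.Substring(0, 10)                 ' 1 string allocated

        Console.WriteLine(id & " / " & name & " / " & rawId)
    End Sub
End Module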

But giving you any reasonable answer without seeing even a piece of your
current code is, in my opinion, impossible. If it were possible, there would
not have been so many possibilities.

By the way, you have reduced your chance of an answer very much by writing
that this is an MVP question. Somebody who is dedicated to your problem but who
is not an MVP has probably already said.............. two words mostly said
with the middle finger in the air.

I hope this helps something,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:EE**********************************@microsof t.com...
With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit:

I am not interested in a simple asnwer - been there, done that : SQL BCP/
DTS, filest/stream readers/writers splits, sub strings , mids - the whole
nine yards.

I am testing this on a quad xeon with 4gb ram, dual fibre channel san
array
- not some small, home PC. Network issues have been accounted for (and are
not part of the baseline improvement that I am trying to find).

The issue is simply this: what is the fastest way to read in a fixed width
text file and split it apart by its constituent fields. don't worry about
the
return to the DB, thats irrevelant for now. So far the benchmark has been
ok
- but its not brilliant. the disk I/O is well short of the capabilities.
cpu
and memory are barely in use. That means either our code is crap or the
.net
capabilitiy is crap.

We have simplified our code to a simple basic test. So, now I'm focusing
on
the underlying .net solutions.

I'm looking for real performance answers. If you don't have them - fine,
I'll look elsewhere. I just thought a high end question would require a
high
end programmer with high end experience.

Hope this clears it up. If not, perhaps you can recommend a plumber who
might know something....... :)

Avi
"Cor Ligthert [MVP]" wrote:
Avi,

For this kind of question you can better ask a plumber to give you the
answer, water cannot go faster throug a pipe than it goes through the
smallest part of the pipe.

The same is with data. So in my opinion you should always concentrate on
that smallest part. Any other optimisation is waste of time.

That the workstation software is to slow, is in my idea seldom or you
should
have old workstations P3 < 600Mhz.

That the newwork trafic is slow is no bad chance and is completely
dependend
from the typologic of the network and its throughput speed mostly you are
with big chunks faster in trouble.

That the server is the place which is in trouble. That is the place were
I
would look first, I would try to bring as much work as possible to the
client.

Where the last part as Michael wrote is exactly as I think about it.

This course assuming that it is not Citrix or other Terminal Server.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
> Hi
>
> I need to read in a large set of text files (9GB+ each) into a database
> table based on fixed width lengths.
>
> There are several ways to complete this, but I am wondering if anyone
> has
> insight into the FASTEST way to read in the data and chop it up ahead
> of
> posting it into the DB.
>
> So far, things work, but they are far slower than expected.
>
> Do I need to consider a VC++ app to get a large improvement here?
>
> thx
> Avi


May 8 '06 #7
Read my response to Cor. This is very MVP.

No it isn't - I have never seen an MVP signature on a post by Francesco
Balena, for instance :-)

Having read Cor's answer, I can only fully concur with him.

By the way, I believe I have earned my stripes as a professional programmer
(I have written programs for thousands of users throughout Europe). In my
previous job I worked at an automotive data provider (the market leader in
their business), where I was not only a lead developer but also the SQL Server
(2000, 2005) database administrator.

One of my responsibilities was optimizing data conversions (imports) of
externally delivered data (MySQL, flat text files, etc.).

I worked only with PowerEdges with at least 8 GB of memory, in RAID 10
(and no, I did not have just one of them; we had, I believe, 25 of those beasts
in our server room, besides the other servers the company had).

So, having finished beating my chest, I can tell you that nobody - MVP, MCP,
MCAD, MCSD, or Steve and/or even Bill himself -
can answer your question correctly unless you provide some detailed info,

and you did not do that ... so .....

regards

Michel Posseth
just a software developer

"Avi" wrote:
Read my response to Cor. This is very MVP.
"Michel Posseth [MCP]" wrote:

MVP Question ,,, what is MVP about it ????

As you do not provide enough information it is impossiible to give you the
answer you probably expect .

nice things to know are :

how is the data delivered ? ( are conversions necessary)
what is the target database
what do you currently use
etc etc etc

To give you an idea what is possible :

I had one year ago the task to import MYSQL dump files to MS SQL server 2000
these files were + - 7 GB big and contained the MYSQL dialect
so using DTS was not an option as i had a few million data rows and then a
block of DDL SQL with the MYSQL dialect etc etc etc

so i wrote a simple tool in VB.Net 2005 containing a file reader a
stringbuilder and a SQL command object , i read the file line by line and
first searched all DDL SQL , i converted this SQL Dialect to MS T-SQL and
execuited it on the command object ( so now i had all the required tables )
, now i read the file again
and read in all data blocks ( Insert statements, that were translated on
the fly ) i executed these lines in batches of 1000 rows in a time
the hole import took 3 minutes ,,, my collegue setup a MYSQL server and
connected with ODBC the same operation took 4 Hours :-)
so i hope to have given you some ideas

regards

Michel Posseth [MCP]
"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi


May 8 '06 #8
Avi
Thanks for the incoherent response..... I guess if I look beyond the
defensive language, I'll find something useful.... Oh. I can't.

In many other posts to other newsgroups, if I have not posted enough info, I
simply get asked for it. I have never before been so accosted for asking a
question the wrong way. I thought that was the idea of newsgroups in the first
place.

Time to get off your high perch. I'm not perfect. And I'll bet you aren't
either - MVP or not.

"Cor Ligthert [MVP]" wrote:
Avi,

And still the answer should be simple. Simple solutions give fast processing
applications.

One thing you have to keep in mind. Although there can be one ILS
instruction than still can this mean thousand times of looping. The 8086
processor famillie and its descendants are not that advanced that they can
do things in one cycle so behind your code is a lot of looping or whatever.
All methods from the Visual Basic Namespace are therefore using more
througput time than from the basic Net namespace. Normally no problem,
humans cannot see milliseconds.

The second thing you have to keep in mind because that you are saying
splitting that the normal string is not mutable. This means that every
change, with mid, split, tolower, or whatever will result in a new string.
However overdoing this can of course cost you a lot of time.

But giving you any reasonable answer withouth seeing even a piece of your
current code is in my opinion impossible. If it was, then there would not
have been so many possibilities.

By the way, you have selected the chance on an answer very much by writting
that this is an MVP question. Probably somebody who is dedicated to your
problem however not a MVP has
said.............................................. ........... Two words
mostly said with the middle finger in the air.

I hope this helps something,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:EE**********************************@microsof t.com...
With all due respect to each of the respondents who have assumed that I am
asking a simple question, let me be more explicit:

I am not interested in a simple asnwer - been there, done that : SQL BCP/
DTS, filest/stream readers/writers splits, sub strings , mids - the whole
nine yards.

I am testing this on a quad xeon with 4gb ram, dual fibre channel san
array
- not some small, home PC. Network issues have been accounted for (and are
not part of the baseline improvement that I am trying to find).

The issue is simply this: what is the fastest way to read in a fixed width
text file and split it apart by its constituent fields. don't worry about
the
return to the DB, thats irrevelant for now. So far the benchmark has been
ok
- but its not brilliant. the disk I/O is well short of the capabilities.
cpu
and memory are barely in use. That means either our code is crap or the
.net
capabilitiy is crap.

We have simplified our code to a simple basic test. So, now I'm focusing
on
the underlying .net solutions.

I'm looking for real performance answers. If you don't have them - fine,
I'll look elsewhere. I just thought a high end question would require a
high
end programmer with high end experience.

Hope this clears it up. If not, perhaps you can recommend a plumber who
might know something....... :)

Avi
"Cor Ligthert [MVP]" wrote:
Avi,

For this kind of question you can better ask a plumber to give you the
answer, water cannot go faster throug a pipe than it goes through the
smallest part of the pipe.

The same is with data. So in my opinion you should always concentrate on
that smallest part. Any other optimisation is waste of time.

That the workstation software is to slow, is in my idea seldom or you
should
have old workstations P3 < 600Mhz.

That the newwork trafic is slow is no bad chance and is completely
dependend
from the typologic of the network and its throughput speed mostly you are
with big chunks faster in trouble.

That the server is the place which is in trouble. That is the place were
I
would look first, I would try to bring as much work as possible to the
client.

Where the last part as Michael wrote is exactly as I think about it.

This course assuming that it is not Citrix or other Terminal Server.

Just my thought,

Cor

"Avi" <Av*@discussions.microsoft.com> schreef in bericht
news:20**********************************@microsof t.com...
> Hi
>
> I need to read in a large set of text files (9GB+ each) into a database
> table based on fixed width lengths.
>
> There are several ways to complete this, but I am wondering if anyone
> has
> insight into the FASTEST way to read in the data and chop it up ahead
> of
> posting it into the DB.
>
> So far, things work, but they are far slower than expected.
>
> Do I need to consider a VC++ app to get a large improvement here?
>
> thx
> Avi


May 8 '06 #9
Email me off list (ad***@kjmsolutions.com). I may have a solution for you;
however, it will require a DLL.

Kelly

--
Get powerful web, database, application, and email hosting with KJM
Solutions
http://www.kjmsolutions.com

"Avi" <Av*@discussions.microsoft.com> wrote in message
news:20**********************************@microsof t.com...
Hi

I need to read in a large set of text files (9GB+ each) into a database
table based on fixed width lengths.

There are several ways to complete this, but I am wondering if anyone has
insight into the FASTEST way to read in the data and chop it up ahead of
posting it into the DB.

So far, things work, but they are far slower than expected.

Do I need to consider a VC++ app to get a large improvement here?

thx
Avi

May 8 '06 #10

I don't know what type of database you are working with, but the fastest
methods are typically the native database bulk operations, as Spam Catcher
mentioned. If you process a 9 GB file line by line it's going to take a while
regardless of what programmatic method you use.
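
On the .NET 2.0 side the native bulk path is exposed as SqlBulkCopy, so one
way to benchmark it (sketched below with invented column names and widths, not
code from this thread) is to parse the fixed-width lines into a DataTable in
chunks and hand each chunk to SqlBulkCopy instead of issuing row-by-row
INSERTs:

Imports System.IO
Imports System.Data
Imports System.Data.SqlClient

Module BulkLoad
    Sub Main()
        Dim table As New DataTable()
        table.Columns.Add("CustomerID", GetType(String))
        table.Columns.Add("CustomerName", GetType(String))
        table.Columns.Add("OrderDate", GetType(String))

        Using cn As New SqlConnection("Server=.;Database=Target;Integrated Security=SSPI")
            cn.Open()
            Using bulk As New SqlBulkCopy(cn)
                bulk.DestinationTableName = "dbo.Orders"
                bulk.BatchSize = 10000

                Using rd As New StreamReader("bigfile.txt")
                    Dim line As String = rd.ReadLine()
                    Do While line IsNot Nothing
                        ' Hypothetical fixed-width layout: 10 + 30 + 8 characters.
                        table.Rows.Add(line.Substring(0, 10), _
                                       line.Substring(10, 30), _
                                       line.Substring(40, 8))
                        If table.Rows.Count = 50000 Then
                            bulk.WriteToServer(table)   ' push one chunk to the server
                            table.Clear()
                        End If
                        line = rd.ReadLine()
                    Loop
                End Using

                If table.Rows.Count > 0 Then
                    bulk.WriteToServer(table)
                End If
            End Using
        End Using
    End Sub
End Module

Wrapping the parser in a custom IDataReader and calling WriteToServer(reader)
avoids buffering rows at all, but the DataTable version is the simplest thing
to measure first.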
Paul
~~~~
Microsoft MVP (Visual Basic)
May 8 '06 #11
