Parser for CSV files

Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza
Nov 16 '05 #1
You mean something like String.Split() on a comma?

--Saurabh

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza

Nov 16 '05 #2
No, I'm getting a CSV file from a client and I need to open it up and do some
checks on it before handing it over to another programme. I'm looking for
something similar to get some ideas on how to parse a CSV file in the most
efficient way.

Thanks for your quick help.

"saurabh" <sa*****@nagpurcity.net> wrote in message
news:OL**************@TK2MSFTNGP12.phx.gbl...
you mean something like String.split() on a comma ??

--Saurabh

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza


Nov 16 '05 #3
There are loads of CSV parsers out there:

http://www.google.co.uk/search?sourc...sv+parser+c%23

Good luck.

"Ali-R" <Al**@microsft.com> wrote in message
news:u0**************@TK2MSFTNGP12.phx.gbl...
No,I'm getting a CSV file from a client and I need to open it up and do
some checks on it it before handing it over to another programme.I'm
looking for something similar to get some ideas how to pars a CSV file in
the most efficient way.

Thanks for your quick help.

"saurabh" <sa*****@nagpurcity.net> wrote in message
news:OL**************@TK2MSFTNGP12.phx.gbl...
you mean something like String.split() on a comma ??

--Saurabh

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza



Nov 16 '05 #4
Nothing built-in for this. The two solutions most commonly given to this
question are to use String.Split() (too simplistic for general use; it doesn't
handle field values that are quoted and have a comma within them, for
instance) ... or, use ADO.NET to read/write a CSV file via, say, a generic
text ODBC driver or the Excel driver. The ADO.NET solution has the downside
of configuration issues to iron out, and I believe a 254-field limit in some
cases. Particularly if deployment will be on many arbitrary machines, you
have to worry about whether your chosen driver is installed, what version it
is, etc. It does make life more complicated and finicky.
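For illustration, a rough sketch of the ADO.NET route mentioned above, using the
Jet text driver via OleDb, might look like the following. The folder and file
names are placeholders: the folder acts as the "database" and each CSV file in
it is exposed as a table. Treat this as one possible shape, not a drop-in
solution.

using System;
using System.Data;
using System.Data.OleDb;

class CsvViaJet
{
    static void Main()
    {
        // The Data Source is the *folder* containing the CSV; HDR=Yes means
        // the first row is treated as a header record.
        string connStr =
            "Provider=Microsoft.Jet.OLEDB.4.0;" +
            @"Data Source=C:\data;" +
            "Extended Properties=\"text;HDR=Yes;FMT=Delimited\"";

        using (OleDbConnection conn = new OleDbConnection(connStr))
        {
            OleDbDataAdapter adapter =
                new OleDbDataAdapter("SELECT * FROM [clients.csv]", conn);
            DataTable table = new DataTable();
            adapter.Fill(table);   // reads the whole CSV into a DataTable

            foreach (DataRow row in table.Rows)
            {
                Console.WriteLine(String.Join(" | ", row.ItemArray));
            }
        }
    }
}

As noted, this depends on the right driver being installed on the target
machine, which is exactly the deployment concern raised above.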

Since I read and write a variety of CSV and CSV-like formats all the time
(pipe-delimited, tab-delimited, space-delimited, and various bizarre formats
such as field label....value, one field to a line, and even scraping of text
reports), I've found it much more efficient and trouble free to write my own
classes for this purpose. For example I have a DelimitedTextFile() class
whose constructor takes a file name (or a FileInfo instance), that has a
ReadRecord() method that returns an arraylist of record values, and a
WriteRecord() method that takes an arraylist of values and writes them out.
And a FieldInfo property consisting of an arraylist of FieldInfo structs
that defines the field name/data type of each field. All of this works very
fast and has proven to be highly adaptable to anything that gets thrown at
it.
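A minimal, quote-aware sketch of that kind of class might look like this. The
class name and the ArrayList-based API follow the description above and are
assumptions, not the actual (nondisclosed) code; it also deliberately ignores
records that span multiple lines inside quoted fields.

using System;
using System.Collections;
using System.IO;
using System.Text;

public class DelimitedTextFile : IDisposable
{
    private StreamReader reader;
    private char delimiter;

    public DelimitedTextFile(string path) : this(path, ',') { }

    public DelimitedTextFile(string path, char delimiter)
    {
        this.reader = new StreamReader(path);
        this.delimiter = delimiter;
    }

    // Returns the next record as an ArrayList of string field values,
    // or null at end of file. Quoted fields may contain the delimiter,
    // and a doubled quote ("") inside a quoted field is an embedded quote.
    public ArrayList ReadRecord()
    {
        string line = reader.ReadLine();
        if (line == null) return null;

        ArrayList fields = new ArrayList();
        StringBuilder current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');   // escaped quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;      // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;           // opening quote
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());   // end of field
                current.Length = 0;
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());            // last field on the line
        return fields;
    }

    public void Dispose()
    {
        reader.Close();
    }
}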

--Bob

"Ali-R" <Al**@microsft.com> wrote in message
news:u0**************@TK2MSFTNGP12.phx.gbl...
No,I'm getting a CSV file from a client and I need to open it up and do
some checks on it it before handing it over to another programme.I'm
looking for something similar to get some ideas how to pars a CSV file in
the most efficient way.

Thanks for your quick help.

"saurabh" <sa*****@nagpurcity.net> wrote in message
news:OL**************@TK2MSFTNGP12.phx.gbl...
you mean something like String.split() on a comma ??

--Saurabh

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza

Nov 16 '05 #5
I ended up writing a CSV parser based upon the Microsoft.Jet.OLEDB.4.0
driver, which I use all the time! It's not hard, and you can then build
the validator(s) as a separate component.

Kirsty

Nov 16 '05 #6
Thanks for your help, Dan.
"Dan Bass" <Not Listed> wrote in message
news:eW**************@TK2MSFTNGP14.phx.gbl...
There's loads of CSV parsers out there:

http://www.google.co.uk/search?sourc...sv+parser+c%23

Good luck.

"Ali-R" <Al**@microsft.com> wrote in message
news:u0**************@TK2MSFTNGP12.phx.gbl...
No,I'm getting a CSV file from a client and I need to open it up and do
some checks on it it before handing it over to another programme.I'm
looking for something similar to get some ideas how to pars a CSV file in
the most efficient way.

Thanks for your quick help.

"saurabh" <sa*****@nagpurcity.net> wrote in message
news:OL**************@TK2MSFTNGP12.phx.gbl...
you mean something like String.split() on a comma ??

--Saurabh

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza



Nov 16 '05 #7
Excellent, very nice idea. Here are some questions in regard to your
solution:

1) You said: "use ADO.NET to read/write a CSV file via, say, a generic text
ODBC driver or the Excel driver." What do you mean by a generic text ODBC
driver or the Excel driver?

2) I didn't understand the usage of "a FieldInfo property consisting of an
arraylist of FieldInfo structs that defines the field name/data type of each
field."

3) I need to validate each record value against some complicated rules; what
have you done about that?

4) Can you send me that example by any chance?
My email is Fa**********@yahoo.com (remove Fake from the beginning of my
userid :-)

Thanks very much for your nice help.

Ali


"Bob Grommes" <bo*@bobgrommes.com> wrote in message
news:un**************@TK2MSFTNGP15.phx.gbl... Nothing built-in for this. The two solutions most commonly given to this
question is to use String.Spit() (too simplistic for general use; it
doesn't handle field values that are quoted and have a comma within them,
for instance) ... or, use ADO.NET to read/write a CSV file via, say, a
generic text ODBC driver or the Excel driver. The ADO.NET solution has
the downside of configuration issues to iron out, and I believe a 254
field limit in some cases. Particularly if deployment will be on many
arbitrary machines, you have to worry about whether your chosen drive is
installed, what version it is, etc. It does make life more complicated
and finicky.

Since I read and write a variety of CSV and CSV-like formats all the time
(pipe-delimited, tab-delimited, space-delimited, and various bizarre
formats such as field label....value, one field to a line, and even
scraping of text reports), I've found it much more efficient and trouble
free to write my own classes for this purpose. For example I have a
DelimitedTextFile() class whose constructor takes a file name (or a
FileInfo instance), that has a ReadRecord() method that returns an
arraylist of record values, and a WriteRecord() method that takes an
arraylist of values and writes them out. And a FieldInfo property
consisting of an arraylist of FieldInfo structs that defines the field
name/data type of each field. All of this works very fast and has proven
to be highly adaptable to anything that gets thrown at it.

--Bob

"Ali-R" <Al**@microsft.com> wrote in message
news:u0**************@TK2MSFTNGP12.phx.gbl...
No,I'm getting a CSV file from a client and I need to open it up and do
some checks on it it before handing it over to another programme.I'm
looking for something similar to get some ideas how to pars a CSV file in
the most efficient way.

Thanks for your quick help.

"saurabh" <sa*****@nagpurcity.net> wrote in message
news:OL**************@TK2MSFTNGP12.phx.gbl...
you mean something like String.split() on a comma ??

--Saurabh

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza


Nov 16 '05 #8
Hi,

Take a look at www.opennetcf.org; they provide one free with source code.
I'm using it and it works great.

cheers,

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza

Nov 16 '05 #9
Where is it? I can't find it there.

Thanks for your help.
"Ignacio Machin ( .NET/ C# MVP )" <ignacio.machin AT dot.state.fl.us> wrote
in message news:%2******************@TK2MSFTNGP10.phx.gbl...
Hi,

Take a look at www.opennetcf.org they provide one free with source code,
I'm using it and works great

cheers,

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza


Nov 16 '05 #10
Ignacio, are you using the Visual Basic sample showing how to use the
CsvDataAdapter to read and write a table of data to a CSV text file?

Thanks
"Ignacio Machin ( .NET/ C# MVP )" <ignacio.machin AT dot.state.fl.us> wrote
in message news:%2******************@TK2MSFTNGP10.phx.gbl...
Hi,

Take a look at www.opennetcf.org they provide one free with source code,
I'm using it and works great

cheers,

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza


Nov 16 '05 #11
Here's a deep link:

http://www.opennetcf.org/SourceBrows...DataAdapter.cs

--Bob

"Ali-R" <Al**@microsft.com> wrote in message
news:Of**************@tk2msftngp13.phx.gbl...
Where is it ? I can't find it there.

Thanks for your help.
"Ignacio Machin ( .NET/ C# MVP )" <ignacio.machin AT dot.state.fl.us>
wrote in message news:%2******************@TK2MSFTNGP10.phx.gbl...
Hi,

Take a look at www.opennetcf.org they provide one free with source code,
I'm using it and works great

cheers,

--
Ignacio Machin,
ignacio.machin AT dot.state.fl.us
Florida Department Of Transportation

"Ali-R" <Al**@microsft.com> wrote in message
news:eE**************@TK2MSFTNGP10.phx.gbl...
Hi,

Is there a parser which parses CSV files?

Thanks for your help.
Reza



Nov 16 '05 #12
"Ali-R" <Al**@microsft.com> wrote in message
news:ul*************@TK2MSFTNGP14.phx.gbl...
1) You said: "use ADO.NET to read/write a CSV file via, say, a generic text
ODBC driver or the Excel driver." What do you mean by a generic text ODBC
driver or the Excel driver?


Microsoft operating systems used to come with "desktop ODBC drivers" amongst
which was one called (if I recall correctly) Generic Text or just the "text
driver", for reading CSVs and fixed field length text files as if they were
database tables. There is also an ODBC driver that reads Excel files
including CSVs. I think it's part of the JET driver. I've not personally
used either one but they would work like any other ADO.NET data source, via
the ODBC provider.
2) I didn't understand the usage of "a FieldInfo property consisting of an
arraylist of FieldInfo structs that defines the field name/data type of each
field."

Unless you want all of your fields to be strings and don't care to handle /
validate field name header records that are part of some CSV files, you need
some mechanism to define field names, field order, field data types, and
perhaps width and basic formatting instructions. You just create a struct
or class that defines these items.
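As a concrete illustration, a hypothetical FieldInfo struct along those lines
might look like this; the member names here are assumptions, not the actual
definition.

using System;

public struct FieldInfo
{
    public string Name;      // field/column name, for header checks and debugging
    public Type DataType;    // e.g. typeof(string), typeof(int), typeof(DateTime)
    public int Width;        // optional fixed width; 0 if not applicable
    public string Format;    // optional parsing/formatting hint, e.g. "MM/dd/yyyy"

    public FieldInfo(string name, Type dataType, int width, string format)
    {
        Name = name;
        DataType = dataType;
        Width = width;
        Format = format;
    }
}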
3) I need to validate each record value against some complicated rules; what
have you done about that?
Whatever is necessary. That is up to the client program. A basic loop
looks something like:
using (f = new DelimitedTextFile("somefile.csv")) {
// code here to set up the FieldInfo arraylist
ArrayList fieldData = new ArrayList();

while (fieldData = f.ReadRecord() != null) {
// Do whatever you need to with the current record, which is in
fieldData.
}

}
4) Can you send me that example by any chance?
My email is Fa**********@yahoo.com (remove Fake from the beginning of my
userid :-)


Sorry, the code was written under nondisclosure and I am not permitted to
release the source code. But it isn't rocket science to put something like
this together and it's well worth the effort.

--Bob
Nov 16 '05 #13
I have been using a very good one from
http://www.codeproject.com/useritems/CsvReader.asp

This is free and extremely fast, particularly when handling large CSV
files.

Nov 16 '05 #14
Bob, thanks very much for your nice replies.

You said: "Unless you want all of your fields to be strings and don't care to
handle / validate field name header records that are part of some CSV files,
you need some mechanism to define field names, field order, field data types,
and perhaps width and basic formatting instructions. You just create a struct
or class that defines these items."

I don't think I need to be worried about field names, but the data type of the
field does matter (e.g., sometimes it should be only a number, with no letters
allowed), and field order is really important to me, so I have to use FieldInfo.
I'm going to implement it this weekend; if I have small questions, can I
email them to you using the email address you have provided?

Thanks again
"Bob Grommes" <bo*@bobgrommes.com> wrote in message
news:eX**************@tk2msftngp13.phx.gbl... "Ali-R" <Al**@microsft.com> wrote in message
news:ul*************@TK2MSFTNGP14.phx.gbl...
1) you said :
use ADO.NET to read/write a CSV file via, say, a generic text ODBC driver
or the Excel driver

What do yuo mean by a generic text ODBC driver or the Excel driver?


Microsoft operating systems used to come with "desktop ODBC drivers"
amongst which was one called (if I recall correctly) Generic Text or just
the "text driver", for reading CSVs and fixed field length text files as
if they were database tables. There is also an ODBC driver that reads
Excel files including CSVs. I think it's part of the JET driver. I've
not personally used either one but they would work like any other ADO.NET
data source, via the ODBC provider.
2)I didn't underestand what's the usage of
And a FieldInfo property consisting of an arraylist of FieldInfo structs
that defines the field name/data type of each field.


Unless you want all of your fields to be strings and don't care to handle
/ validate field name header records that are part of some CSV files, you
need some mechanism to define field names, field order, field data types,
and perhaps width and basic formatting instructions. You just create a
struct or class that defines these items.
3)I need to validate each record value against some complicated rules
,what have you done about that?


Whatever is necessary. That is up to the client program. A basic loop
looks something like:
using (f = new DelimitedTextFile("somefile.csv")) {
// code here to set up the FieldInfo arraylist
ArrayList fieldData = new ArrayList();

while (fieldData = f.ReadRecord() != null) {
// Do whatever you need to with the current record, which is in
fieldData.
}

}
4) Can you send me that example by any chance?
my email is Fa**********@yahoo.com (remove Fake from the beggining of my
userid:-)


Sorry, the code was written under nondisclosure and I am not permitted to
release the source code. But it isn't rocket science to put something
like this together and it's well worth the effort.

--Bob

Nov 16 '05 #15
Field names can become important when you have a header record in the CSV
and need to check that the header record does not change from what's
expected, since change in field order is significant, and change in field
names *may* be significant. It has been my experience that data from
outside sources can change at any time without notice, no matter whether the
provider of the data guarantees they will or will not. So it pays to check
that you're getting what you expect before you go and put the wrong data
into the wrong fields or something.

In any case I like to name all the fields just for debugging purposes, even
when there is no header record. Field names can also be useful for
populating default data grid headers and the like, in the UI, though I've
never actually needed to do it. Finally you could create indexers or
methods that would return field values based on field name rather than just
field index. Again, something I haven't actually needed to do, but it could
be handy at times.

My FieldInfo concept is not magical, your parsing / reading / writing class
has to use that info to do the appropriate transformations and enforce rules
or raise errors; it's just a container for some limited schema information.
There are probably lots of ways you could structure it, this is just the way
I chose.

One thing I have found is that although I built in support for converting
fields to and from specific .NET data types as part of reading and writing
CSVs, in practice I haven't used that nearly as much as I thought I would,
and tend to just define all fields as strings. For the work I do anyway, it
seems better to let client applications of these text file I/O classes be
responsible for whatever scrubbing and conversion they need to do. If you
wanted to be brutally simplistic you could just design a class that treats
all fields as strings and is just responsible for correctly separating out
each field string in order, and doesn't care about names, data types, or
anything else but perhaps how many fields to expect. The decision here is a
matter of where you want to place the responsibility for data integrity
and how you want to handle data that doesn't conform to expectations.

You can always drop me a note. I can't guarantee I'll have time to respond,
at least right away, but I will try.

Best,

--Bob

"Ali-R" <Al**@microsft.com> wrote in message
news:OV**************@TK2MSFTNGP09.phx.gbl...
Bob ,thanks very much for your nice replies
Unless you want all of your fields to be strings and don't care to handle
/ validate field name header records that are part of some CSV files, you
need some mechanism to define field names, field order, field data types,
and perhaps width and basic formatting instructions. You just create a
struct or class that defines these items.

I don't think I need to be worried about fieldNames,but datatype of the
field (e.g ,some times it should be only a number and no letters allowed)
,field order is really important to me ,so I have to use FieldInfo.
I'm going to implement it this weekend ,if I had small questions ,can I
email it to you using the email address you have provided?

Thanks again

Nov 16 '05 #16
Thanks for your suggestions.

"Bob Grommes" <bo*@bobgrommes.com> wrote in message
news:ON**************@TK2MSFTNGP09.phx.gbl...
Field names can become important when you have a header record in the CSV
and need to check that the header record does not change from what's
expected, since change in field order is significant, and change in field
names *may* be significant. It has been my experience that data from
outside sources can change at any time without notice, no matter whether
the provider of the data guarantees they will or will not. So it pays to
check that you're getting what you expect before you go and put the wrong
data into the wrong fields or something.

In any case I like to name all the fields just for debugging purposes,
even when there is no header record. Field names can also be useful for
populating default data grid headers and the like, in the UI, though I've
never actually needed to do it. Finally you could create indexers or
methods that would return field values based on field name rather than
just field index. Again, something I haven't actually needed to do, but
it could be handy at times.

My FieldInfo concept is not magical, your parsing / reading / writing
class has to use that info to do the appropriate transformations and
enforce rules or raise errors; it's just a container for some limited
schema information. There are probably lots of ways you could structure
it, this is just the way I chose.

One thing I have found is that although I built in support for converting
fields to and from specific .NET data types as part of reading and writing
CSVs, in practice I haven't used that nearly as much as I thought I would,
and tend to just define all fields as strings. For the work I do anyway,
it seems better to let client applications of these text file I/O classes
be responsible for whatever scrubbing and conversion they need to do. If
you wanted to be brutally simplistic you could just design a class that
treats all fields as strings and is just responsible for correctly
separating out each field string in order, and doesn't care about names,
data types, or anything else but perhaps how many fields to expect. The
decision here is a matter of where you want to replace the responsibility
for data integrity and how you want to handle data that doesn't conform to
expectations.

You can always drop me a note. I can't guarantee I'll have time to
respond, at least right away, but I will try.

Best,

--Bob

"Ali-R" <Al**@microsft.com> wrote in message
news:OV**************@TK2MSFTNGP09.phx.gbl...
Bob ,thanks very much for your nice replies
Unless you want all of your fields to be strings and don't care to
handle / validate field name header records that are part of some CSV
files, you need some mechanism to define field names, field order, field
data types, and perhaps width and basic formatting instructions. You
just create a struct or class that defines these items.

I don't think I need to be worried about fieldNames,but datatype of the
field (e.g ,some times it should be only a number and no letters
allowed) ,field order is really important to me ,so I have to use
FieldInfo.
I'm going to implement it this weekend ,if I had small questions ,can I
email it to you using the email address you have provided?

Thanks again


Nov 16 '05 #17
Bob,

I forgot to ask: what's your usual strategy for transforming CSV files
into the DB?
Here is what I'm going to do:
=======================================================================
I'm receiving some CSV files from our clients and they are supposed to be
written to the database, considering the following issues:
1) The CSV files should be thoroughly validated, each field against a lot of
business rules.
2) Everything must be logged:
   I) Invalid field values
   II) Errors and warnings that happen while pumping data from the CSV into
the DB
   III) All other errors and exceptions
I chose to have a Windows Service which monitors a folder for CSV files and
picks them up as soon as they are created, and after validation (parsing +
validating) executes a DTS package (whatever happens in DTS, the Windows
Service gets notified). There is also a log component in the Windows
Service which logs everything.
I just wanted to know if anybody has experience with this; I'm looking for the
best strategy, actually.
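As an aside, a minimal sketch of the folder-monitoring piece might look like
the following, assuming a FileSystemWatcher inside the service; the drop-folder
path and handler names are placeholders, not part of the design described
above.

using System;
using System.IO;

public class CsvDropFolderWatcher
{
    private FileSystemWatcher watcher;

    public void Start()
    {
        // Watch the drop folder for newly created CSV files.
        watcher = new FileSystemWatcher(@"C:\incoming", "*.csv");
        watcher.Created += new FileSystemEventHandler(OnCsvCreated);
        watcher.EnableRaisingEvents = true;
    }

    private void OnCsvCreated(object sender, FileSystemEventArgs e)
    {
        // The file may still be locked by whatever is copying it in,
        // so a real service would wait/retry before opening it.
        Console.WriteLine("New CSV dropped: " + e.FullPath);
        // parse -> validate -> log -> hand off to the DTS package here
    }

    public void Stop()
    {
        if (watcher != null)
        {
            watcher.EnableRaisingEvents = false;
            watcher.Dispose();
        }
    }
}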
Thanks
Ali-R

"Bob Grommes" <bo*@bobgrommes.com> wrote in message
news:ON**************@TK2MSFTNGP09.phx.gbl...
Field names can become important when you have a header record in the CSV
and need to check that the header record does not change from what's
expected, since change in field order is significant, and change in field
names *may* be significant. It has been my experience that data from
outside sources can change at any time without notice, no matter whether
the provider of the data guarantees they will or will not. So it pays to
check that you're getting what you expect before you go and put the wrong
data into the wrong fields or something.

In any case I like to name all the fields just for debugging purposes,
even when there is no header record. Field names can also be useful for
populating default data grid headers and the like, in the UI, though I've
never actually needed to do it. Finally you could create indexers or
methods that would return field values based on field name rather than
just field index. Again, something I haven't actually needed to do, but
it could be handy at times.

My FieldInfo concept is not magical, your parsing / reading / writing
class has to use that info to do the appropriate transformations and
enforce rules or raise errors; it's just a container for some limited
schema information. There are probably lots of ways you could structure
it, this is just the way I chose.

One thing I have found is that although I built in support for converting
fields to and from specific .NET data types as part of reading and writing
CSVs, in practice I haven't used that nearly as much as I thought I would,
and tend to just define all fields as strings. For the work I do anyway,
it seems better to let client applications of these text file I/O classes
be responsible for whatever scrubbing and conversion they need to do. If
you wanted to be brutally simplistic you could just design a class that
treats all fields as strings and is just responsible for correctly
separating out each field string in order, and doesn't care about names,
data types, or anything else but perhaps how many fields to expect. The
decision here is a matter of where you want to replace the responsibility
for data integrity and how you want to handle data that doesn't conform to
expectations.

You can always drop me a note. I can't guarantee I'll have time to
respond, at least right away, but I will try.

Best,

--Bob

"Ali-R" <Al**@microsft.com> wrote in message
news:OV**************@TK2MSFTNGP09.phx.gbl...
Bob ,thanks very much for your nice replies
Unless you want all of your fields to be strings and don't care to
handle / validate field name header records that are part of some CSV
files, you need some mechanism to define field names, field order, field
data types, and perhaps width and basic formatting instructions. You
just create a struct or class that defines these items.

I don't think I need to be worried about fieldNames,but datatype of the
field (e.g ,some times it should be only a number and no letters
allowed) ,field order is really important to me ,so I have to use
FieldInfo.
I'm going to implement it this weekend ,if I had small questions ,can I
email it to you using the email address you have provided?

Thanks again


Nov 16 '05 #18
In the case where I'm getting a CSV file from another source, I
normally read in the CSV file, validate it as I go, and write good
records back out to another CSV or delimited file. That way I limit the
number of errors thrown by my DTS package, which should only really be
worried about getting data into the database as fast as possible, not
about parsing errors, etc. How big are these files you're
processing?

Nov 16 '05 #19
There are lots of records in each CSV, with 33 fields in each record. My
problems are as follows:

1) How to design the validator for the best performance?

2) What kind of application should host the validator and other components
(Windows Service, console application, ...)?

3) Another problem is that they zip the CSV files and put them in our database,
so I probably need another DTS package to unzip the data and write it
somewhere as a CSV file.

I have some questions about yours:

1) Is your application multi-threaded or not?

2) Do you process CSV files in memory, or do you put them on the file system
and just pass the path to the CSV to your DTS package?

3) Some errors (like a record that already exists in the database) can't be
validated until you actually insert the records into the database; how would
you deal with them? (I have to log those errors too.)

Thanks for your help

"shriop" <sh****@hotmail.com> wrote in message
news:11**********************@l41g2000cwc.googlegr oups.com...
In the case where I'm getting a csv file from another source, I
normally read in the csv file, validate it as I go, and write good
records back out to another csv or delimited file. That way I limit the
number of errors thrown by my DTS package, which should only really be
worry about getting data into the database as fast as possible and not
worrying about parsing errors, etc. How big are these files you're
processing?

Nov 16 '05 #20
1) How to design the validator for the best performance?

For best performance, don't design your own; go find a product already
made and tuned. It will save you hours of work, hours of support, and
will probably handle more issues than you can even think of.

2) What kind of application should host the validator and other components
(Windows Service, console application, ...)?

You said you're just planning on monitoring a directory, right? If so,
I'd say use a Windows Service.

3) Another problem is that they zip the CSV files and put them in our
database; I probably need another DTS package to unzip the data and write it
somewhere as a CSV file.

I'm not sure why you'd use DTS to unzip files. I'd just use C# products
to do that.
http://www.icsharpcode.net/OpenSourc...b/Default.aspx
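For what it's worth, a rough sketch of that unzip step using the SharpZipLib
library (which the link above appears to point to) might look like this; the
paths are placeholders, and FastZip is only one of the ways that library
exposes extraction.

using ICSharpCode.SharpZipLib.Zip;

class UnzipDrop
{
    static void Main()
    {
        FastZip fastZip = new FastZip();
        fastZip.ExtractZip(@"C:\incoming\clients.zip",   // zip file from the client
                           @"C:\incoming\extracted",     // target folder for the CSVs
                           null);                        // null filter = extract everything
    }
}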
1) Is your application multi-threaded or not?
My applications that have done this type of work have actually been web
page file uploads, so they've of course been multithreaded because of
the way web pages run.
2) Do you process CSV files in memory, or do you put them on the file system and just pass the path to the CSV to your DTS package?
I would not recommend loading everything into memory. Again, I wouldn't
be using DTS to validate, I'd be using a CSV parser product to read
over the file line by line. It's scalable, efficient, and fast.
3) Some errors (like a record that already exists in the database) can't be validated until you actually insert the records into the database; how would you deal with them? (I have to log those errors too.)


Most people I know that have these types of uploads do not load records
directly into the table that they'll end up at. They load data into a
working table first. Then, you can run queries against the working
table and the end table if you want to log duplicate records for
instance. Then, when they're all done, they move the records from the
working table into the live table.
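A rough sketch of that working-table approach, assuming hypothetical
WorkingRecords / LiveRecords / ImportErrors tables keyed on RecordId (the table
and column names are made up for illustration):

using System.Data.SqlClient;

class WorkingTableLoader
{
    static void Main()
    {
        string connStr = "Server=.;Database=Imports;Integrated Security=true;";

        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();

            // Log duplicates: rows already present in the live table.
            SqlCommand logDupes = new SqlCommand(
                "INSERT INTO ImportErrors (RecordId, Message) " +
                "SELECT w.RecordId, 'Record already exists' " +
                "FROM WorkingRecords w " +
                "WHERE EXISTS (SELECT 1 FROM LiveRecords l WHERE l.RecordId = w.RecordId)",
                conn);
            logDupes.ExecuteNonQuery();

            // Move only the non-duplicate rows into the live table.
            SqlCommand moveRows = new SqlCommand(
                "INSERT INTO LiveRecords (RecordId, Field1, Field2) " +
                "SELECT w.RecordId, w.Field1, w.Field2 " +
                "FROM WorkingRecords w " +
                "WHERE NOT EXISTS (SELECT 1 FROM LiveRecords l WHERE l.RecordId = w.RecordId)",
                conn);
            moveRows.ExecuteNonQuery();
        }
    }
}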

Nov 16 '05 #21
"ALI-R" <Ra*@Alirezaei.com> wrote in message
news:u8**************@TK2MSFTNGP14.phx.gbl...
There are lots of records in each CSV, with 33 fields in each record. My
problems are as follows:

1) How to design the validator for the best performance?
That is a very broad question and a completely separate issue from parsing
the CSV. There is no one "correct" answer. However, you will be amazed at
how much detailed parsing and checking a finished C# application can crank
out in the blink of an eye. Parsing and rewriting CSVs with thousands of
records takes just seconds including program load time. I don't even bother
to do a release build, it's not worth the effort to maintain two builds (of
course all of my stuff is for internal release to one machine).
2) What kind of application should host the validator and other components
(Windows Service, console application, ...)?
Depends on your needs. Generally a console app is my starting point for
anything like this. My apps tend to stay as console applications, some of
which run via a scheduler. That works fine for me since my apps all run on
dedicated worker machines. With very little extra work you can turn a
debugged console app into a Windows Service, or call the mainline processing
from a WindowsForms application (although you may want to feed all your
console output through a central routine so that you can modify it to
properly update the UI on the UI thread, if you think you will be talking to
a WinForms app to show detailed progress messages; otherwise, any
Console.WriteLine() calls are just ignored in a normal Windows app since no
console is present at runtime).
3) Another problem is that they zip the CSV files and put them in our
database; I probably need another DTS package to unzip the data and write it
somewhere as a CSV file.
If the data is zipped and put into a database field, you can just grab the
zip file image as a byte[] from your C# app and then feed it to a zip
library. There are several free or inexpensive zip libraries for .NET.
I have some questions about yours:

1) Is your application multi-threaded or not?
No, it scans a drop folder and sequentially processes whatever files it
finds there. A scheduler runs it every 5 minutes. This may or may not be
suitable to your needs, it happens to work for my particular scenario.

Even in .NET, where threads are much simpler, they greatly complicate
testing and debugging and I use multi threading only when the application
really demands it, which is rarely. After all, running a console app on a
dedicated box, or a background Windows Service, amounts to a dedicated
thread anyway.
2) Do you process CSV files in memory, or do you put them on the file system
and just pass the path to the CSV to your DTS package?
Most of the time I process one record at a time -- one in, one out. My
objective is to convert a variety of source file formats (including CSV)
into a standard CSV format with fields in a particular order. From there it
goes into our database (temp table) for scrubbing and parsing and error
logging, and a certain amount of human intervention as well (our source data
is very poorly validated and some human judgment calls are required). Only
then does it go into the live database.

I have one particular special case where I load the entire file into memory
because of a complex record merge that is a necessary part of the parsing /
conversion process; it's analogous to loading an entire XML document and
selecting data from different parts of the document and rearranging it.
This uses string interning to keep memory consumption manageable, but that
slows things way down. It works however and is still acceptably fast. But
this is a really unusual special case and I doubt very much that I would
encounter a need like this once in a decade of work.
3) Some errors (like a record that already exists in the database) can't be
validated until you actually insert the records into the database; how would
you deal with them? (I have to log those errors too.)


I think I pretty much covered that question in my last answer above.

Let me just say, I get the sense that you are trying to bite off too much of
this problem at once. Don't look at it monolithically. Break it down into
smaller problems:

1) Parse the source CSV
2) Put the parsed data into a temp table in the DB
3) Scrub / validate / log
4) Commit fully validated records to the DB

Write the low level routines (such as parsing the CSV) for the most part
without regard to what kind of app they will go into or what thread they
will execute in. Then assemble the pieces as appropriate. You may well end
up with two or more cooperating applications. For example I have one
console app for converting source data into a *standard* CSV format; a
scheduler to run that; another WinForms app takes the standard CSV and does
the scrubbing / validating and then finally exports the scrubbed data as a
CSV which is then bulk loaded into the DB by another console app. The
database then has an ASP.NET front end for general management and querying,
and the queries feed into a report server that queues up reporting jobs and
distributes them (Windows Service).

The right tools for each task -- of course "right" is relative, and the
above decisions were the result of unique requirements of a particular
situation.

--Bob
Nov 16 '05 #22
Hi Bob,
Thanks for monitoring this thread.
Yes, you're right, I'm kind of complicating it ;-)
You said:

1) Parse the source CSV
2) Put the parsed data into a temp table in the DB
3) Scrub / validate / log
4) Commit fully validated records to the DB

I can't do step 2 because they want it to be validated through the
code; that's why I am looking for a way to validate each line and, as some
people suggest, make another CSV file which doesn't have those problematic
records in it, and then through a DTS package the CSV is transformed to the
table. I searched the internet but there isn't even one example (for instance a
class) which validates a line in a CSV file. Most people have written only
the parser, which is kind of useless without a validator (in my case). If you
know of somewhere a validator class has been presented, so I can get a
general idea how to write my validator class and validate each field value
against some sort of rules, I would appreciate it if you pointed me to it. My
lines and rules are like this:

"C1","C2","C3","C4"

C1: could be 1, 2, 3, 4 (1, 2, 3, 4 is actually defined in a lookup table in my
database)
C2: is a date and its format should be MM/DD/YYYY
C3: is an integer which should be in the range of an integer
C4: shouldn't be null

I have different options like using regex or an XML schema, but the problem is
the lookup tables (like country codes, marital status, ...) whose values are
used to validate some fields.
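A minimal sketch of a per-record validator for that four-field layout,
assuming the fields arrive as strings, might look like this; the class and
member names are made up, and the C1 values would really be loaded from the
database lookup table rather than passed in as a collection.

using System;
using System.Collections;
using System.Globalization;

public class RecordValidator
{
    private Hashtable validC1 = new Hashtable();

    public RecordValidator(ICollection c1LookupValues)
    {
        // In practice these values would be read from the lookup table.
        foreach (object v in c1LookupValues)
            validC1[v.ToString()] = true;
    }

    // Returns a list of human-readable error messages; empty means the record is valid.
    public ArrayList Validate(ArrayList fields, int lineNumber)
    {
        ArrayList errors = new ArrayList();
        if (fields.Count != 4)
        {
            errors.Add(String.Format("Line {0}: expected 4 fields, got {1}",
                                     lineNumber, fields.Count));
            return errors;
        }

        string c1 = (string)fields[0];
        string c2 = (string)fields[1];
        string c3 = (string)fields[2];
        string c4 = (string)fields[3];

        if (!validC1.ContainsKey(c1))
            errors.Add(String.Format("Line {0}: C1 '{1}' is not in the lookup table",
                                     lineNumber, c1));

        DateTime parsedDate;
        if (!DateTime.TryParseExact(c2, "MM/dd/yyyy", CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out parsedDate))
            errors.Add(String.Format("Line {0}: C2 '{1}' is not a valid MM/DD/YYYY date",
                                     lineNumber, c2));

        int parsedInt;
        if (!Int32.TryParse(c3, out parsedInt))
            errors.Add(String.Format("Line {0}: C3 '{1}' is not a valid integer",
                                     lineNumber, c3));

        if (c4 == null || c4.Trim().Length == 0)
            errors.Add(String.Format("Line {0}: C4 must not be empty", lineNumber));

        return errors;
    }
}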

Thanks again for your complete and nice answers.
Ali-R

"Bob Grommes" <bo*@bobgrommes.com> wrote in message
news:%2****************@TK2MSFTNGP15.phx.gbl...
"ALI-R" <Ra*@Alirezaei.com> wrote in message
news:u8**************@TK2MSFTNGP14.phx.gbl...
There are lots of records in each CSV with 33 fields in each record.My
problem are as follows

1)How to design the validator for the best performance


That is a very broad question and a completely separate issue from parsing
the CSV. There is no one "correct" answer. However, you will be amazed
at how much detailed parsing and checking a finished C# application can
crank out in the blink of an eye. Parsing and rewriting CSVs with
thousands of records takes just seconds including program load time. I
don't even bother to do a release build, it's not worth the effort to
maintain two builds (of course all of my stuff is for internal release to
one machine).
2)what kind of application to use to host validator and other componenets
(Windows Service,Consol Application,....)


Depends on your needs. Generally a console app is my starting point for
anything like this. My apps tend to stay as console applications, some of
which run via a scheduler. That works fine for me since my apps all run
on dedicated worker machines. With very little extra work you can turn a
debugged console app into a Windows Service, or call the mainline
processing from a WindowsForms application (although you may want to feed
all your console output through a central routine so that you can modify
it to properly update the UI on the UI thread, if you think you will be
talking to a WinForms app to show detailed progress messages; otherwise,
and Console.WriteLine() calls are just ignored in a normal Windows app
since no console is present at runtime).
3) Another problem is that they zip the CSV files and put it in our
database ,I probably need another DTS package to unzip the data from DTS
and Write it somewhere as a CSV file


If the data is zipped and put into a database field, you can just grab the
zip file image as a byte[] from your C# app and then feed it to a zip
library. There are several free or inexpensive zip libraries for .NET.
I have some questions about yours:

1) Is your application multi-threaded or not?


No, it scans a drop folder and sequentially processes whatever files it
finds there. A scheduler runs it every 5 minutes. This may or may not be
suitable to your needs, it happens to work for my particular scenario.

Even in .NET, where threads are much simpler, they greatly complicate
testing and debugging and I use multi threading only when the application
really demands it, which is rarely. After all, running a console app on a
dedicated box, or a background Windows Service, amounts to a dedicated
thread anyway.
2) Do yuo process CSV Files in memory or you put them on File system and
you just pass the path to the CSV to your DTS package.


Most of the time I process one record at a time -- one in, one out. My
objective is to convert a variety of source file formats (including CSV)
into a standard CSV format with fields in a particular order. From there
it goes into our database (temp table) for scrubbing and parsing and error
logging, and a certain amount of human intervention as well (our source
data is very poorly validated and some human judgment calls are required).
Only then does it go into the live database.

I have one particular special case where I load the entire file into
memory because of a complex record merge that is necessary part of the
parsing / conversion process; it's analagous to loading an entire XML
document and selecting data from different parts of the document and
rearranging it. This uses string interning to keep memory consumption
manageable, but that slows things way down. It works however and is still
acceptably fast. But this is a really unusual special case and I doubt
very much that I would encounter a need like this once in a decade of
work.
3)Some errors like (record already exists in Database) are kind a errors
which can't be validate till you actually insert them into the database
,how would deal with them (I have to log those errors too)


I think I pretty much covered that question in my last answer above.

Let me just say, I get the sense that you are trying to bite off too much
of this problem at once. Don't look at it monolithically. Break it down
into smaller problems:

1) Parse the source CSV
2) Put the parsed data into a temp table in the DB
3) Scrub / validate / log
4) Commit fully validated records to the DB

Write the low level routines (such as parsing the CSV) for the most part
without regard to what kind of app they will go into or what thread they
will execute in. Then assemble the pieces as appropriate. You may well
end up with two or more cooperating applications. For example I have one
console app for converting source data into a *standard* CSV format; a
scheduler to run that; another WinForms app takes the standard CSV and
does the scrubbing / validating and then finally exports the scrubbed data
as a CSV which is then bulk loaded into the DB by another console app.
The database then has an ASP.NET front end for general management and
querying, and the queries feed into a report server that queues up
reporting jobs and distributes them (Windows Service).

The right tools for each task -- of course "right" is relative, and the
above decisions were the result of unique requirements of a particular
situation.

--Bob

Nov 16 '05 #23
shriop,

You said: "For best performance, don't design your own, go find a product
already made and tuned. It will save you hours of work, hours of support, and
will probably handle more issues than you can even think of."

Do you know any kind of validator from which I can get some ideas?

Thanks for your help
Ali-R

"shriop" <sh****@hotmail.com> wrote in message
news:11**********************@f14g2000cwb.googlegr oups.com...
1)How to design the validator for the best performance


For best performance, don't design your own, go find a product already
made and tuned. It will save you hours of work, hours of support, and
will probably handle more issues than you can even think of.
2)what kind of application to use to host validator and other

componenets
(Windows Service,Consol Application,....)


You said you're just planning on monitoring a directory right? If so,
I'd say use a windows service.
3) Another problem is that they zip the CSV files and put it in our

database
,I probably need another DTS package to unzip the data from DTS and

Write it
somewhere as a CSV file


I'm not sure why you'd use DTS to unzip files. I'd just use C# products
to do that.
http://www.icsharpcode.net/OpenSourc...b/Default.aspx
1) Is your application multi-threaded or not?


My applications that have done this type of work have actually been web
page file uploads, so they've of course been multithreaded because of
the way web pages run.
2) Do yuo process CSV Files in memory or you put them on File system

and you
just pass the path to the CSV to your DTS package.


I would not recommend loading everything into memory. Again, I wouldn't
be using DTS to validate, I'd be using a CSV parser product to read
over the file line by line. It's scalable, efficient, and fast.
3)Some errors like (record already exists in Database) are kind a

errors
which can't be validate till you actually insert them into the

database ,how
would deal with them (I have to log those errors too)


Most people I know that have these types of uploads do not load records
directly into the table that they'll end up at. They load data into a
working table first. Then, you can run queries against the working
table and the end table if you want to log duplicate records for
instance. Then, when they're all done, they move the records from the
working table into the live table.

Nov 16 '05 #24
I can't really recommend a validator for you. I'm normally more
interested in how to parse these files using the fastest parser than in
how to validate them. When I have had to validate them, I've either
used regexes or simple hardcoded string parsing. Regexes are the
most flexible, but they're more meant to validate large amounts of
data at once, not simple small string bits that have already been
parsed. I doubt you're going to find any samples out there where
someone is doing all of this well. I would suggest just breaking up the
task into separate pieces and finding the best approach for each piece.

Nov 16 '05 #25
Shirop,

Yes,you are right .I will do that
Thanks for your help

"shriop" <sh****@hotmail.com> wrote in message
news:11**********************@z14g2000cwz.googlegr oups.com...
I can't really recommend a validator for you. I'm normally more
interested in how to parse these files using the fastest parser than
how to validate them. When I have had to validate them, I've either
used regex's or used simple hardcoded string parsing. Regex's are the
most flexible, but they're more meant to validate on large amounts of
data at once, not on simple small string bits that have already been
parsed. I doubt you're going to find any samples out there where
someone is doing all of this well. I would suggest just breaking up the
task into seperate pieces, and finding the best approach for each piece.

Nov 16 '05 #26
