Bytes IT Community

importing db as text files

What's the big deal with importing text files? I have a 70 MB file to import
and it's been one problem after another. I used the copy command and it appears
that it's just not possible. I finally massaged the file into a .sql file and
ran that using \i db.sql, but that failed too because I overlooked the ' in names
like D'Adario. The other problem I encountered was that a numeric field had
to have data in it; pg would not fall back to the column's default value. So
instead of massaging all the data again I decided to change the data type for
that column. This is my first experience with PostgreSQL and I'm wondering if
I should expect similar pain as I go further into this? So far it's been very
painful trying to do what I thought would be easy, and what I think should be
easy.

PostgreSQL 7.3.4 on Red Hat Linux 9
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 11 '05 #1
12 Replies

On Wed, Aug 13, 2003 at 14:14:20 -0700,
expect <ex****@ihubbell.com> wrote:

> What's the big deal with importing text files? I have a 70 MB file to import
> and it's been one problem after another. I used the copy command and it appears
> that it's just not possible. I finally massaged the file into a .sql file and

That is unlikely, but without more details it is hard to say what you need
to do.

> ran that using \i db.sql but that failed too because I overlooked ' in names
> like D'Adario. The other problem I encountered was that a numeric field had
> to have data in it, pg would not default to the default value. So instead of

You can use the keyword default in insert statements to get a default value.

> massaging all the data again I decided to change the data type for that column.
> This is my first experience with postgresql and I'm wondering if I should expect
> to encounter similar pain as I go further into this? So far it's been very
> painful trying to do what I thought would be easy and what I think should be
> easy.


The impression I get is that you expect postgres to make a best guess when
presented with ambiguous data. That is a very dangerous thing to do. I would
much prefer that ambiguous data be rejected, so that I can make sure that what
I think the value is matches what the database thinks it is.
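A minimal sketch of the default-keyword approach Bruno mentions (the table and column names here are invented for illustration):

```sql
create table members (
    name  text,
    score int default 5
);

-- Explicit DEFAULT keyword in the values list:
insert into members values ('Bob Cobb', default);

-- Or omit the column via an explicit column list; it also gets the default:
insert into members (name) values ('Elma');
```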


Nov 11 '05 #2

Hi expect,

The best way, in my opinion, is to use the copy table command. That way Pg
will actually 'massage' the data (string escapes and all) for you.

If you have complex files, it's best to use a combination of sed/awk to
pre-massage the field delimiters and import the data into a temporary table.
In most instances it is best to pipe the data to the psql command using copy
table from stdin, as copy from file requires that you are the process owner.

Always use a temporary import table and perform validation/further conversion
in that table (IMO).

eg (from one of my earlier posts)

cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g"
| psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"

The first sed will replace all instances of "," with a tab character
(I achieve a tab in my shell scripts with a line like TAB=$(printf "\t")).

The second sed will remove the start and end " because the first only checks
for ",".

The third sed will remove the Windows carriage returns (if any - depends on
the platform the file was generated on).

Note here that my file format never contains any " in a field, so I can safely
run the second sed. If your import file contains these then you will need to
change the regex or use awk.
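The three sed passes above can be demonstrated on a single sample line (the psql step is omitted so this sketch runs standalone):

```shell
# Demonstration of the three sed passes on one sample line.
TAB=$(printf "\t")

printf '"191922C","Bob Cobb","D'\''Adario"\r\n' |
  sed "s/\",\"/$TAB/g" |  # 1: turn "," into a tab
  sed "s/\"//g" |         # 2: strip the remaining leading/trailing quotes
  sed "s/.$//"            # 3: drop the trailing Windows carriage return
```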

Regards,

Jason


Nov 11 '05 #3

On Thu, 14 Aug 2003 07:34:55 +1000
Jason Godden <ja*********@optushome.com.au> wrote:
> Best way in my opinion is to use the copy table command. This way Pg will
> actually 'massage' the data (string escapes and all) for you.

I guess we're of the same opinion. I did use the copy table command.
I believe the problem is that pg is unable to use the default value when a
value is not present.



Nov 11 '05 #4

On Thu, 14 Aug 2003 01:52 pm, you wrote:
> On Thu, 14 Aug 2003 07:34:55 +1000
> Jason Godden <ja*********@optushome.com.au> wrote:
> > Best way in my opinion is to use the copy table command. This way Pg
> > will actually 'massage' the data (string escapes and all) for you.
>
> I guess we're of the same opinion. I did use the copy table command.
> I believe the problem is that pg is unable to use the default value when a
> value is not present.


PG is behaving correctly IMO:

create table data (
    data1 int4 not null,
    data2 int4,
    data3 int4 not null default 10
);

insert into data values (2, null, default);        -- OK
insert into data values (null, 2, default);        -- fails: data1 is not null
insert into data values (2, null);                 -- fails? missing field?
insert into data (data1, data2) values (2, null);  -- OK: data3 = default with
                                                   -- explicit field nomination

copy from... essentially becomes (although it does something a bit different
behind the scenes):

insert into data (data1, data2, data3) values (x, y, z);

If data3 is specified not null default 10 and you have a line in your import
file like:

2	2	\N    (\N is the default null marker, but you can nominate what it is)

it is translated thus:

insert into data (data1, data2, data3) values (2, 2, null);

This will fail because you are explicitly saying put null in a not null field.

So use an intermediary table without not null constraints with copy from...
then use a query:

insert into realtable (data1, data2, data3)
select data1,
       coalesce(data2, 10),  -- default can't appear in a select list;
                             -- repeat the column's default expression here
       data3
from data
where data1 is not null;     -- because data1 has no option to be null or a
                             -- default value etc...

Problem solved... I'd be curious as to how many people actually import their
data STRAIGHT into their production tables without integrity checking. Of
course, if you massage and manage it externally, such as the method Greg uses,
then you're in business too - either way I believe the pg copy syntax is
correct and makes sense. PG copy CAN'T guess whether you intend null or intend
to skip the field so that the default pops in there - null and default are two
very different things; in fact null can stand for infinitely many different
things...

Unless someone changes the insert behaviour in a later release you will have
to come up with a 'massaged' way (Greg/Perl, intermediary tables and pl/pgsql
functions). But then think about this:

insert into data values (1, 2, null)

which by the proposed new null behaviour suddenly becomes:

insert into data values (1, 2, default)    (say default = 10)

but in reality the user simply mucked up on data entry, didn't actually mean
default and meant to insert 70 instead - they don't get a warning about it
and your data integrity is screwed. Not only that, the db isn't paying strict
attention to the intended SQL syntax, and constraint management is moved to
the client - PG (and any other decent database, ie Oracle) is far above the
MySQL cram-it-in-even-if-it-breaks-all-your-business-rules-in-your-client way
of doing things!

Default should be used in an explicit sense IMHO:

insert into data (data1, data2) values (1, 2)  -- now data3 becomes default
                                               -- and all is good

VERY different to: insert into data values (1, 2, null);


Nov 11 '05 #5

On Wed, Aug 13, 2003 at 20:45:55 -0700,
expect <ex****@ihubbell.com> wrote:
> On Wed, 13 Aug 2003 16:30:04 -0500
> Bruno Wolff III <br***@wolff.to> wrote:
> > On Wed, Aug 13, 2003 at 14:14:20 -0700,
> > expect <ex****@ihubbell.com> wrote:
> > > What's the big deal with importing text files? I have a 70 MB file to import
> > > and it's been one problem after another. I used the copy command and it appears
> > > that it's just not possible. I finally massaged the file into a .sql file and
> >
> > That is unlikely, but without more details it is hard to say what you need
> > to do.
>
> What's unlikely? That it didn't work? But it didn't work.

That it isn't possible to load your data using the copy command.
You may need to do some transformation before sending it to copy.
Using copy is probably going to give you the fastest load time.

> From what I've read it should use the default value when there is no value.
> Is that not true? Is this a known issue?

Where did you read that? What do you mean by 'no value'?

The syntax for the insert command is given in the manual:
INSERT INTO table [ ( column [, ...] ) ]
    { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) | query }

> No you've come to the wrong impression. I believe that the problem lies in the
> fact that pg will not default to the default value when no value is present.

And how do you think 'no value' is represented? If you use default to
represent 'no value' then things will work as you expect for insert
statements. When you are using copy, defaults can't be used on a row
by row basis (but unlisted columns will get default values) and you will
need to include the value that you want in each row.
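A sketch of the "unlisted columns will get default values" behaviour (table and column names invented; assumes the column-list form of COPY):

```sql
create table members (
    id    text,
    name  text,
    score int default 5
);

-- The file supplies only id and name; score is unlisted in the COPY
-- column list, so every loaded row gets its default of 5.
copy members (id, name) from '/usr/tmp/db' with delimiter ',';
```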


Nov 11 '05 #6

On Thu, 14 Aug 2003 08:45:20 -0500
Bruno Wolff III <br***@wolff.to> wrote:
> Where did you read that? What do you mean by 'no value'?

The data is really very simple. Here's a pseudo-example of 2 rows of data:

191922C,Bob Cobb,D'Obbalina Sr.,312 Elm Street,Yountville,CA,94599,5,062001,082009
339111C,Elma Thelma,Velma,98 Oak Lane,St. Louis,MO,63119-2065,,,


Nov 11 '05 #7

On Thu, Aug 14, 2003 at 09:08:16 -0700,
expect <ex****@ihubbell.com> wrote:
> 191922C,Bob Cobb,D'Obbalina Sr.,312 Elm Street,Yountville,CA,94599,5,062001,082009
> 339111C,Elma Thelma,Velma,98 Oak Lane,St. Louis,MO,63119-2065,,,

That should be handled by copy pretty easily. If the columns after what
appears to be the zip code are some type of number rather than a string,
then the empty string is going to be an invalid value.

If you want the empty string to be interpreted as the default and you don't
have any NULL values in your input, then the simplest thing to do is set
the code for NULLs to be the empty string. You can then run queries after
the data is in the database to change the NULLs to the appropriate default.
If you are running the 7.4 beta you can use the DEFAULT keyword in the update;
otherwise you will need to duplicate the default expression in the update
commands. You will want to run update once for each column that can have
NULL values, using IS NULL in the WHERE clause.
If these columns have a NOT NULL constraint, you may want to use a temporary
table to load the data and then copy it over (with a single insert statement)
after it has been cleaned up.

Note that it isn't obvious what empty strings should map to for numbers.
NULL and 0 make about as much sense as using the default value.
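Put together, that temporary-table route might look like this sketch (the table names and the stand-in default of 5 are invented; before 7.4 each column's real default expression has to be repeated by hand):

```sql
-- Load into a constraint-free staging table, treating '' as NULL:
create table staging (id text, name text, score int);
copy staging from '/usr/tmp/db' with delimiter ',' null '';

-- One update per nullable column, repeating its default expression:
update staging set score = 5 where score is null;

-- A single insert moves the cleaned rows into the real table:
insert into members select id, name, score from staging;
drop table staging;
```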


Nov 11 '05 #8

On Thu, 14 Aug 2003 12:46:07 -0500
Bruno Wolff III <br***@wolff.to> wrote:
> That should be handled by copy pretty easily. If the columns after what
> appears to be the zip code are some type of number rather than a string,
> then the empty string is going to be an invalid value.
>
> If you want the empty string to be interpreted as the default and you don't
> have any NULL values in your input, then the simplest thing to do is set
> the code for NULLs to be the empty string.

I'm new to this so everything has to be explicit; I don't want to make any
leaps with what anyone is saying here.

copy db from '/usr/tmp/db' with delimiter ',' null '';

Is this the command that you're recommending?

> You can then do queries after the data is in the database to change the
> NULLs to the appropriate default. If you are running the 7.4 beta you can
> use the DEFAULT keyword in the update, otherwise you will need to duplicate
> the default expression in the update commands. You will want to run update
> once for each column that can have NULL values using IS NULL in the WHERE
> clause. If these columns have a NOT NULL constraint, you may want to use a
> temporary table to load the data and then copy it over (with a single
> insert statement) after it has been cleaned up.
>
> Note that it isn't obvious what empty strings should map to for numbers.
> NULL and 0 make about as much sense as using the default value.

Well I'm new here but it seems to me they should map to the default value
for that column. Why wouldn't they?




Nov 11 '05 #9

On Fri, 15 Aug 2003 00:32:01 -0700 (PDT)
Stephan Szabo <ss****@megazone.bigpanda.com> wrote:
> On Thu, 14 Aug 2003, expect wrote:
> > On Thu, 14 Aug 2003 12:46:07 -0500
> > Bruno Wolff III <br***@wolff.to> wrote:
> > > Note that it isn't obvious what empty strings should map to for numbers.
> > > NULL and 0 make about as much sense as using the default value.
> >
> > Well I'm new here but it seems to me they should map to the default value
> > for that column. Why wouldn't they?
>
> One problem with doing that is that it's inconsistent.

Please elaborate. How is it inconsistent, exactly?

> Given
>
> create table test(
>     a text default 'abc',
>     b int default 5
> );
>
> copy test from stdin with delimiter ',';
> ,
> \.
>
> What would you expect the values of the row in test
> to be?

Oh a test....

Does the \. end the STDIN input?

Where's the null option? Don't you mean:

copy test from stdin with delimiter ',' null '';

In this case I would expect the row to have:

 a   | b
-----+---
 abc | 5

Is this too much to expect?





Nov 11 '05 #10

Whilst I agree with what Bruno said regarding adding a default option to the
copy syntax, I think that the basic principles Stephan and I outlined in how
copy treats your import are correct and the developers did the right thing.

Now if the devs, you or I want to add a default option to copy in the future
that's all good, but for the moment you will have to use some pre- or
post-process to get the exact effect you require. It's not that hard...
literally one line of piped commands on the command line where you specify
the defaults, or a post-process that does the final import into your
production tables from your data import table.

Remember NULL != DEFAULT. These are VERY different concepts, and I believe
enough examples of how this behaviour can lead to confusion have been shown.
With copy it is also important to remember that it is not part of the SQL
standard and PG doesn't use SQL statements to do the import but rather a
low-level C operation. If you need the behaviour you're referring to, sed/awk
the data and generate BEGIN... INSERT... COMMIT... statements and pipe them
to PG. It's not that difficult and I'll give you some suggestions with it if
you want.

I'm not trying to be difficult with my view of this, but there are always
other ways (sometimes one-liners) that can achieve the behaviour you're after.

Rgds,

Jason
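One sketch of the sed/awk-to-INSERT idea (the table name "members" and the column layout are made up for illustration; empty fields become the DEFAULT keyword, and embedded single quotes are doubled so names like D'Adario survive quoting):

```shell
# Generate a transaction of INSERT statements from CSV on stdin.
printf "191922C,Bob Cobb,D'Adario\n339111C,Elma,\n" |
awk -F',' -v q="'" '
BEGIN { print "BEGIN;" }
{
    out = ""
    for (i = 1; i <= NF; i++) {
        v = $i
        gsub(q, q q, v)                      # double embedded single quotes
        f = (v == "") ? "DEFAULT" : (q v q)  # empty field -> DEFAULT keyword
        sep = (i > 1) ? "," : ""
        out = out sep f
    }
    print "INSERT INTO members VALUES (" out ");"
}
END { print "COMMIT;" }'
```

Piping the output to psql would run it; because the statements sit inside BEGIN/COMMIT, one bad row aborts the whole load rather than leaving a partial import.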

Nov 11 '05 #11

On Fri, 15 Aug 2003, expect wrote:
> On Fri, 15 Aug 2003 00:32:01 -0700 (PDT)
> Stephan Szabo <ss****@megazone.bigpanda.com> wrote:
> > One problem with doing that is that it's inconsistent.
>
> Please elaborate. How is it inconsistent, exactly?

See my comments below.

> > Given
> >
> > create table test(
> >     a text default 'abc',
> >     b int default 5
> > );
> >
> > copy test from stdin with delimiter ',';
> > ,
> > \.
> >
> > What would you expect the values of the row in test
> > to be?
>
> Oh a test....
>
> Does the \. end the STDIN input?
>
> Where's the null option? Don't you mean:
>
> copy test from stdin with delimiter ',' null '';

No, I'm saying without a null specifier.

> In this case I would expect the row to have:
>
>  a   | b
> -----+---
>  abc | 5
>
> Is this too much to expect?

Without a null specifier of '', empty string is a valid value for a.
Why would it get the default, and how would you insert an empty string?
Should it treat b differently because '' isn't valid for that type? That'd
be inconsistent.

With a null specifier of '', empty string is valid for both and means
NULL. Otherwise, how would you specify a null when you have that
null specifier?

What you probably really want is another specifier that inserts the
default for a column, so you could say something like:

copy test from stdin with delimiter ',' default '';


Nov 11 '05 #12

AFAICT:

NULL == NULL
'' == empty string
default == column default

Seems pretty clear, and about as intuitive as it gets. Trying to use an empty
string for anything else just means it's not available to represent itself;
any other DB using anything else is complicating things.

Now, for exports/dumps that don't use the ' char to delineate the start and
end of a column, i.e.

454wetoit,Four score and seven years ago,default,,

what would the last three values be? Empty strings would be my guess.

'454wetoit','Four score and seven years ago','default','',''

It seems like quoted column values would be necessary to distinguish the
constants NULL and default from strings that contain those two text sequences.

Nov 11 '05 #13

This discussion thread is closed

Replies have been disabled for this discussion.