468,727 Members | 1,427 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 468,727 developers. It's quick & easy.

imputing missing values in MySQL?

Hello,
I'm not very experienced in SQL and I need some advice. I have a
comma separarated values file containing around 20 million records and about
20 fields. There are many missing values (blank). I am only directly
interested in 2 fields (1 & 4), but need to use two other fields (2 & 3) for
imputing values for Field 1 (if missing). If Field 4 is missing I must
delete / ignore the record. The (Python) types are,

Field1 (string) Field2 (string) Field3 (string) Field4 (integer)
For imputation:

If Field 1 is blank I substitute a value from another record which matches
on Field 2 and partially matches (all but first 4 characters) on Field 3.
Failing this I want to randomly select a value for Field 1 by choosing
randomly from the (Field 1) values of those (complete) records with matching
Field 2. Otherwise the record is ignored / deleted.

I tried doing this in pure Python, but ran into memory problems. If I can
manage to construct a suitable MySQL table I can pull out the relevant
records and get on with my analysis. But (due to my limited knowledge of
SQL) I find that I'm still (I think) doing far too much in Python.

e.g. I'm parsing each record individually and inserting into one of two
tables (one for records with values for Field 1 and one for records with
missing values for Field 1). Then I'm going to iterate over all the records
in the second table, impute, and add the resulting record to the first
table. Somehow I think there must be a better way. Any ideas? TIA.

Duncan
Jul 19 '05 #1
2 3153
Duncan - if you think about using another toolset -
I would recommend Visual FoxPro and odbc to your mySql db.

You can parse the csv file into vfp, then use either a remote view or
straight sql pass through to handle the record insert into mySql, using the
rules you've desribed below .
If you have MS Visual Studio 6.0 floating around anywhere - VFP 6.0 is on CD
#1.
I've used VFP from versions 5 to 8 with mySql, with considerable results and
low time for coding .

I'm not suggesting you learn yet another toolset, though - but I am a bit
fan of that one.
mondo regards [Will]
--
William Sanders / Electronic Filing Group Remove the DOT BOB to reply via
email.
FREE LONG DISTANCE -> mailto:ex********@efgroup.net
mySql / VFP / MS-SQL
"Duncan Smith" <bu*****@urubu.freeserve.co.uk> wrote in message
news:bh**********@newsg3.svr.pol.co.uk...
Hello,
I'm not very experienced in SQL and I need some advice. I have a
comma separarated values file containing around 20 million records and about 20 fields. There are many missing values (blank). I am only directly
interested in 2 fields (1 & 4), but need to use two other fields (2 & 3) for imputing values for Field 1 (if missing). If Field 4 is missing I must
delete / ignore the record. The (Python) types are,

Field1 (string) Field2 (string) Field3 (string) Field4 (integer)
For imputation:

[snip]
Jul 19 '05 #2
Duncan - if you think about using another toolset -
I would recommend Visual FoxPro and odbc to your mySql db.

You can parse the csv file into vfp, then use either a remote view or
straight sql pass through to handle the record insert into mySql, using the
rules you've desribed below .
If you have MS Visual Studio 6.0 floating around anywhere - VFP 6.0 is on CD
#1.
I've used VFP from versions 5 to 8 with mySql, with considerable results and
low time for coding .

I'm not suggesting you learn yet another toolset, though - but I am a bit
fan of that one.
mondo regards [Will]
--
William Sanders / Electronic Filing Group Remove the DOT BOB to reply via
email.
FREE LONG DISTANCE -> mailto:ex********@efgroup.net
mySql / VFP / MS-SQL
"Duncan Smith" <bu*****@urubu.freeserve.co.uk> wrote in message
news:bh**********@newsg3.svr.pol.co.uk...
Hello,
I'm not very experienced in SQL and I need some advice. I have a
comma separarated values file containing around 20 million records and about 20 fields. There are many missing values (blank). I am only directly
interested in 2 fields (1 & 4), but need to use two other fields (2 & 3) for imputing values for Field 1 (if missing). If Field 4 is missing I must
delete / ignore the record. The (Python) types are,

Field1 (string) Field2 (string) Field3 (string) Field4 (integer)
For imputation:

[snip]
Jul 19 '05 #3

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

1 post views Thread by Marcus | last post: by
reply views Thread by David Otton | last post: by
reply views Thread by Duncan Smith | last post: by
reply views Thread by Daniel Mercier | last post: by
2 posts views Thread by Daniel Mercier | last post: by
2 posts views Thread by Ike | last post: by
reply views Thread by zhoujie | last post: by
xarzu
1 post views Thread by xarzu | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.