By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
446,206 Members | 1,035 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 446,206 IT Pros & Developers. It's quick & easy.

imputing missing values in MySQL?

P: n/a
Hello,
I'm not very experienced in SQL and I need some advice. I have a
comma separarated values file containing around 20 million records and about
20 fields. There are many missing values (blank). I am only directly
interested in 2 fields (1 & 4), but need to use two other fields (2 & 3) for
imputing values for Field 1 (if missing). If Field 4 is missing I must
delete / ignore the record. The (Python) types are,

Field1 (string) Field2 (string) Field3 (string) Field4 (integer)
For imputation:

If Field 1 is blank I substitute a value from another record which matches
on Field 2 and partially matches (all but first 4 characters) on Field 3.
Failing this I want to randomly select a value for Field 1 by choosing
randomly from the (Field 1) values of those (complete) records with matching
Field 2. Otherwise the record is ignored / deleted.

I tried doing this in pure Python, but ran into memory problems. If I can
manage to construct a suitable MySQL table I can pull out the relevant
records and get on with my analysis. But (due to my limited knowledge of
SQL) I find that I'm still (I think) doing far too much in Python.

e.g. I'm parsing each record individually and inserting into one of two
tables (one for records with values for Field 1 and one for records with
missing values for Field 1). Then I'm going to iterate over all the records
in the second table, impute, and add the resulting record to the first
table. Somehow I think there must be a better way. Any ideas? TIA.

Duncan
Jul 19 '05 #1
Share this Question
Share on Google+
2 Replies


P: n/a
Duncan - if you think about using another toolset -
I would recommend Visual FoxPro and odbc to your mySql db.

You can parse the csv file into vfp, then use either a remote view or
straight sql pass through to handle the record insert into mySql, using the
rules you've desribed below .
If you have MS Visual Studio 6.0 floating around anywhere - VFP 6.0 is on CD
#1.
I've used VFP from versions 5 to 8 with mySql, with considerable results and
low time for coding .

I'm not suggesting you learn yet another toolset, though - but I am a bit
fan of that one.
mondo regards [Will]
--
William Sanders / Electronic Filing Group Remove the DOT BOB to reply via
email.
FREE LONG DISTANCE -> mailto:ex********@efgroup.net
mySql / VFP / MS-SQL
"Duncan Smith" <bu*****@urubu.freeserve.co.uk> wrote in message
news:bh**********@newsg3.svr.pol.co.uk...
Hello,
I'm not very experienced in SQL and I need some advice. I have a
comma separarated values file containing around 20 million records and about 20 fields. There are many missing values (blank). I am only directly
interested in 2 fields (1 & 4), but need to use two other fields (2 & 3) for imputing values for Field 1 (if missing). If Field 4 is missing I must
delete / ignore the record. The (Python) types are,

Field1 (string) Field2 (string) Field3 (string) Field4 (integer)
For imputation:

[snip]
Jul 19 '05 #2

P: n/a
Duncan - if you think about using another toolset -
I would recommend Visual FoxPro and odbc to your mySql db.

You can parse the csv file into vfp, then use either a remote view or
straight sql pass through to handle the record insert into mySql, using the
rules you've desribed below .
If you have MS Visual Studio 6.0 floating around anywhere - VFP 6.0 is on CD
#1.
I've used VFP from versions 5 to 8 with mySql, with considerable results and
low time for coding .

I'm not suggesting you learn yet another toolset, though - but I am a bit
fan of that one.
mondo regards [Will]
--
William Sanders / Electronic Filing Group Remove the DOT BOB to reply via
email.
FREE LONG DISTANCE -> mailto:ex********@efgroup.net
mySql / VFP / MS-SQL
"Duncan Smith" <bu*****@urubu.freeserve.co.uk> wrote in message
news:bh**********@newsg3.svr.pol.co.uk...
Hello,
I'm not very experienced in SQL and I need some advice. I have a
comma separarated values file containing around 20 million records and about 20 fields. There are many missing values (blank). I am only directly
interested in 2 fields (1 & 4), but need to use two other fields (2 & 3) for imputing values for Field 1 (if missing). If Field 4 is missing I must
delete / ignore the record. The (Python) types are,

Field1 (string) Field2 (string) Field3 (string) Field4 (integer)
For imputation:

[snip]
Jul 19 '05 #3

This discussion thread is closed

Replies have been disabled for this discussion.