Bytes IT Community

converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

Hello,

I'm importing large text files of data using csv. I would like to add
some more auto-sensing abilities. I'm considering sampling the data
file and doing some fuzzy logic scoring on the attributes (columns in a
database / csv file, e.g. height, weight, income, etc.) to determine the
most efficient 'type' to convert the attribute column into for further
processing and efficient storage...

Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
there', '100,000,000,000'], [next row...] ....]

Aside from a missing attribute designator, we can assume that the same
type of data continues through a column. For example, a string, int8,
int16, float, etc.

1. What is the most efficient way in python to test whether a string
can be converted into a given numeric type, or left alone if it's
really a string like 'A' or 'hello'? Speed is key. Any thoughts?

2. Is there anything out there already which deals with this issue?

Thanks,
Conor

May 18 '07 #1
21 Replies


On May 18, 6:07 pm, py_genetic <conor.robin...@gmail.com> wrote:
> [original question snipped]
>
> 1. What is the most efficient way in python to test whether a string
> can be converted into a given numeric type, or left alone if it's
> really a string like 'A' or 'hello'? Speed is key. Any thoughts?
given the string s:

    try:
        integerValue = int(s)
    except ValueError:
        try:
            floatValue = float(s)
        except ValueError:
            pass
        else:
            s = floatValue
    else:
        s = integerValue

Note that int(s) on a string only parses decimal digits; to recognise
base 8/16 literals like '0x1f' you would need int(s, 0) (and there is no
equivalent for base 8/16 floats).
> 2. Is there anything out there already which deals with this issue?
May 18 '07 #2

py_genetic wrote:
> [original question snipped]
This is untested, but here is an outline to do what you want.

First convert rows to columns:
columns = zip(*rows)
Okay, that was a lot of typing. Now, you should run down the columns,
testing with the most restrictive type and working to less restrictive
types. You will also need to keep in mind the potential for commas in
your numbers--so you will need to write your own converters, determining
for yourself what literals map to what values. Only you can decide what
you really want here. Here is a minimal idea of how I would do it:
def make_int(astr):
    if not astr:
        return 0
    else:
        return int(astr.replace(',', ''))

def make_float(astr):
    if not astr:
        return 0.0
    else:
        return float(astr.replace(',', ''))

make_str = lambda s: s
Now you can put the converters in a list, remembering to order them.
converters = [make_int, make_float, make_str]
Now, go down the columns checking, moving to the next, less restrictive,
converter when a particular converter fails. We assume that the make_str
identity operator will never fail. We could leave it out and have a
flag, etc., for efficiency, but that is left as an exercise.
new_columns = []
for column in columns:
    for converter in converters:
        try:
            new_column = [converter(v) for v in column]
            break
        except:
            continue
    new_columns.append(new_column)
For no reason at all, convert back to rows:
new_rows = zip(*new_columns)
You must decide for yourself how to deal with ambiguities. For example,
will '1.0' be a float or an int? The above assumes you want all values
in a column to have the same type. Reordering the loops can give mixed
types in columns, but would not fulfill your stated requirements. Some
things are not as efficient as they might be (for example, eliminating
the clumsy make_str). But adding tests to improve efficiency would cloud
the logic.
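Pulling the untested outline above together into one runnable sketch (on a trimmed version of the sample row from the original post; the narrower bare except is replaced with ValueError here):

```python
def make_int(astr):
    # Empty cells become 0; commas in numbers are stripped first.
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    return float(astr.replace(',', '')) if astr else 0.0

make_str = lambda s: s

# Most restrictive first; make_str never fails.
converters = [make_int, make_float, make_str]

rows = [['8', '2.33', 'A', '100,000,000,000'],
        ['9', '4.66', 'B', '200,000,000,000']]

new_columns = []
for column in zip(*rows):          # rows -> columns
    for converter in converters:
        try:
            new_column = [converter(v) for v in column]
            break                  # first converter that handles the whole column wins
        except ValueError:
            continue
    new_columns.append(new_column)

new_rows = list(zip(*new_columns))  # back to rows
print(new_rows[0])  # (8, 2.33, 'A', 100000000000)
```

The column types come out as int, float, str, int, which is what the ordering of the converter list dictates.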

James
May 19 '07 #3

On 19/05/2007 10:04 AM, James Stroud wrote:
> [quoted question and conversion outline snipped]
[apologies in advance if this appears more than once]

This approach is quite reasonable, IF:
(1) the types involved follow a simple "ladder" hierarchy [ints pass the
float test, floats pass the str test]
(2) the supplier of the data has ensured that all values in a column are
actually instances of the intended type.

Constraint (1) falls apart if you need dates. Consider 31/12/99,
31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's
before you allow for dates in three different orders (dmy, mdy, ymd).
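To give a flavour of the date problem: a sketch that tries a list of candidate dmy formats in order (the format list is illustrative, built from the examples above):

```python
from datetime import datetime

# Candidate day-month-year formats, from the examples above (illustrative).
DMY_FORMATS = ['%d/%m/%y', '%d/%m/%Y', '%d%m%y', '%d%m%Y', '%d%b%y']

def try_dmy(s):
    """Return a date if s matches any candidate format, else None."""
    for fmt in DMY_FORMATS:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None

print(try_dmy('31/12/99'), try_dmy('31DEC99'), try_dmy('hello'))
# 1999-12-31 1999-12-31 None
```

The mdy and ymd orders would each need their own format list, which is exactly why a simple "ladder" of types can't accommodate dates.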

Constraint (2) just falls apart -- with user-supplied data, there seem
to be no rules but Rafferty's and no laws but Murphy's.

The approach that I've adopted is to test the values in a column for all
types, and choose the non-text type that has the highest success rate
(provided the rate is greater than some threshold e.g. 90%, otherwise
it's text).

For large files, taking a 1/N sample can save a lot of time with little
chance of misdiagnosis.
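A sketch of that idea: score every candidate type over a 1/N sample and keep the best-scoring non-text type only if it clears the threshold (the converter names and the 90% cutoff are illustrative):

```python
def make_int(s):
    return int(s.replace(',', ''))

def make_float(s):
    return float(s.replace(',', ''))

# Most restrictive first, so ties resolve to the tighter type.
CANDIDATES = [('int', make_int), ('float', make_float)]

def diagnose(column, step=1, threshold=0.9):
    """Guess a column's type from success rates over a 1/step sample."""
    sample = column[::step]
    best_name, best_rate = 'text', 0.0
    for name, conv in CANDIDATES:
        ok = 0
        for value in sample:
            try:
                conv(value)
                ok += 1
            except ValueError:
                pass
        rate = ok / len(sample)
        if rate > best_rate:
            best_name, best_rate = name, rate
    return best_name if best_rate >= threshold else 'text'

print(diagnose(['1', '2', 'N/A', '4', '5', '6', '7', '8', '9', '10']))  # int
print(diagnose(['1.5', '2.0', '3']))                                    # float
print(diagnose(['A', 'BB', 'hello there']))                             # text
```

Note how the first column comes out as int despite the stray 'N/A': a 90% success rate clears the threshold, which is the point of voting rather than demanding a clean sweep.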

Example: file of 1,079,000 records, with 15 columns, ultimately
diagnosed as being 8 x text, 3 x int, 1 x float, 2 x date (dmy order),
and [no kidding] 1 x date (ymd order). Using N==101 took about 15
seconds [Python 2.5.1, Win XP Pro SP2, 3.2GHz dual-core]; N==1 takes
about 900 seconds. The "converter" function for dates is written in C.

Cheers,
John

May 19 '07 #4

On May 19, 12:07 am, py_genetic <conor.robin...@gmail.com> wrote:
> [original question snipped]
You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.

- Paddy.

May 19 '07 #5

John Machin wrote:
> The approach that I've adopted is to test the values in a column for all
> types, and choose the non-text type that has the highest success rate
> (provided the rate is greater than some threshold e.g. 90%, otherwise
> it's text).
>
> For large files, taking a 1/N sample can save a lot of time with little
> chance of misdiagnosis.

Why stop there? You could lower the minimum 1/N by straightforward
application of Bayesian statistics, using results from previous tables
as priors.
James
May 19 '07 #6

On 19/05/2007 9:17 PM, James Stroud wrote:
> John Machin wrote:
>> [success-rate-with-threshold approach and 1/N sampling snipped]
>
> Why stop there? You could lower the minimum 1/N by straightforward
> application of Bayesian statistics, using results from previous tables
> as priors.
The example I gave related to one file out of several files prepared at
the same time by the same organisation from the same application by the
same personnel using the same query tool for a yearly process which has
been going on for several years. All files for a year should be in the
same format, and the format should not change year by year, and the
format should match the agreed specifications ... but this doesn't
happen. Against that background, please explain to me how I can use
"results from previous tables as priors".

Cheers,
John
May 20 '07 #7

On 19/05/2007 3:14 PM, Paddy wrote:
> On May 19, 12:07 am, py_genetic wrote:
>> [original question snipped]
>
> You might try investigating what can generate your data. With luck,
> it could turn out that the data generator is methodical and column
> data-types are consistent and easily determined by testing the
> first or second row. At worst, you will get to know how much you
> must check for human errors.
Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?

"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"

N.B. I've kindly given you five lines instead of one or two :-)

Cheers,
John
May 20 '07 #8

John Machin wrote:
> Against that background, please explain to me how I can use
> "results from previous tables as priors".
It depends on how you want to model your probabilities, but, as an
example, you might find the following frequencies of columns in all
tables you have parsed from this organization: 35% Strings, 25% Floats,
20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD. Let's say that you have
also used prior counting statistics to find that there is a 2% error
rate in the columns (2% of the values of a typical Float column fail to
cast to Float, 2% of values in Int columns fail to cast to Int, and
so on, though these need not all be equal). Let's also say that for
non-Int columns, 1% of cells randomly selected cast to Int.

These percentages could be converted to probabilities and these
probabilities could be used as priors in a Bayesian scheme to determine a
column type. Let's say you take one cell randomly and it can be cast to
an Int. What is the probability that the column is an Int? (See
<http://tinyurl.com/2bdn38>.)

P_1(H) = 0.20 --Prior (20% prior columns are Int columns)
P(D|H) = 0.98
P(D|H') = 0.01

P_1(H|D) = 0.9607843 --Posterior & new prior "P_2(H)"
Now with one test positive for Int, you are getting pretty certain you
have an Int column. Now we take a second cell randomly from the same
column and find that it too casts to Int.

P_2(H) = 0.9607843 --Confidence it's an Int column from round 1
P(D|H) = 0.98
P(D|H') = 0.02

P_2(H|D) = 0.9995836
Yikes! But I'm still not convinced it's an Int because I haven't even had
to wait a millisecond to get the answer. Let's burn some more clock cycles.

Let's say we really have an Int column and get "lucky" with our tests (P
= 0.98**4 = 92% chance) and find two more random cells successfully cast
to Int:

P_4(H) = 0.9999957
P(D|H) = 0.98
P(D|H') = 0.02

P(H|D) = 0.9999999
I don't know about you, but after only four positives, my calculator ran
out of significant digits, so I am at least 99.99999% convinced it's
an Int column and I'm going to stop wasting CPU cycles and move on to
testing the next column. How do you know it's not a float? Well, given
floats with only one decimal place, you would expect only 1/10th could
be cast to Int (were the tenths-decimal place to vary randomly). You
could generate a similar statistical model to convince yourself with
vanishing uncertainty that the column that tests positive for Int four
times in a (random sample) is not actually a Float (even with only one
decimal place known).
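The arithmetic above is a plain application of Bayes' theorem; a sketch, using the corrected false-positive rate P(D|H') = 0.01 from the follow-up post:

```python
def bayes_update(prior, p_pos_given_int, p_pos_given_not_int):
    """Posterior P(Int | cell casts to Int), by Bayes' theorem."""
    evidence = (prior * p_pos_given_int
                + (1 - prior) * p_pos_given_not_int)
    return prior * p_pos_given_int / evidence

p = 0.20                       # prior: 20% of past columns were Int
p = bayes_update(p, 0.98, 0.01)
print(round(p, 7))             # 0.9607843 -- matches round 1 above
p = bayes_update(p, 0.98, 0.01)
print(round(p, 7))             # 0.9995837 -- round 2 (shown truncated as 0.9995836 above)
```

Each posterior becomes the prior for the next sampled cell, which is why the confidence saturates after a handful of positives.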
James
May 20 '07 #9

James Stroud wrote:
> [probability calculations snipped]

I had typos. P(D|H') should be 0.01 for all rounds.

Also, I should clarify that 4 of 4 are positive with no fails observed.
Integrating fails would use the last posterior as a prior in a similar
scheme.

Also, given a 1% false positive rate, after only 4 rounds you are 1 -
(0.01**4) = 99.999999% sure your observations aren't because you
accidentally pulled 4 false positives in succession.

James
May 20 '07 #10

On 20/05/2007 5:47 PM, James Stroud wrote:
> John Machin wrote:
>> Against that background, please explain to me how I can use "results
>> from previous tables as priors".
>
> It depends on how you want to model your probabilities, but, as an
> example, you might find the following frequencies of columns in all
> tables you have parsed from this organization: 35% Strings, 25% Floats,
> 20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD.
The model would have to be a lot more complicated than that. There is a
base number of required columns. The kind suppliers of the data randomly
add extra columns, randomly permute the order in which the columns
appear, and, for date columns, randomly choose the day-month-year order,
how much punctuation to sprinkle between the digits, and whether to
append some bonus extra bytes like " 00:00:00".
> Let's say that you have
> also used prior counting statistics to find that there is a 2% error
> rate in the columns (2% of the values of a typical Float column fail to
> cast to Float, 2% of values in Int columns fail to cast to Int, and
> so on, though these need not all be equal). Let's also say that for
> non-Int columns, 1% of cells randomly selected cast to Int.
Past stats on failure to cast are no guide to the future ... a sudden
change in the failure rate can be caused by the kind folk introducing a
new null designator i.e. outside the list ['', 'NULL', 'NA', 'N/A',
'#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]
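One way to keep surprise null designators from silently skewing cast-failure counts is to screen a (necessarily incomplete, user-extensible) null list before attempting any cast; the helper below is illustrative, with the designators taken from the list above:

```python
NULLS = {'', 'NULL', 'NA', 'N/A', '#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN'}

def screened_cast(conv, value):
    """Return (status, result) where status is 'null', 'ok', or 'fail'."""
    if value.strip().upper() in NULLS:
        return 'null', None        # known null: don't count as a cast failure
    try:
        return 'ok', conv(value)
    except ValueError:
        return 'fail', value       # genuine failure: evidence against the type

print(screened_cast(int, '42'))    # ('ok', 42)
print(screened_cast(int, 'n/a'))   # ('null', None)
print(screened_cast(int, 'hello')) # ('fail', 'hello')
```

A brand-new designator will still land in the 'fail' bucket, of course; the list has to be grown as each new supplier's inventiveness is discovered.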
There is also the problem of first-time-participating organisations --
in police parlance, they have no priors :-)

So, all in all, Bayesian inference doesn't seem much use in this scenario.
> These percentages could be converted to probabilities and these
> probabilities could be used as priors in a Bayesian scheme to determine a
> column type. Let's say you take one cell randomly and it can be cast to
> an Int. What is the probability that the column is an Int? (See
> <http://tinyurl.com/2bdn38>.)
That's fancy -- a great improvement on the slide rule and squared paper :-)

Cheers,
John
May 20 '07 #11

P: n/a
John Machin wrote:
> So, all in all, Bayesian inference doesn't seem much use in this scenario.
This is equivalent to saying that any statistical analysis doesn't seem
much use in this scenario--but you go ahead and use statistics anyway?
May 20 '07 #12

On May 20, 2:16 am, John Machin <sjmac...@lexicon.net> wrote:
> On 19/05/2007 3:14 PM, Paddy wrote:
>> [original question and reply snipped]
>
> Here you go, Paddy, the following has been generated very methodically;
> what data type is the first column? What is the value in the first
> column of the 6th row likely to be?
>
> "$39,082.00","$123,456.78"
> "$39,113.00","$124,218.10"
> "$39,141.00","$124,973.76"
> "$39,172.00","$125,806.92"
> "$39,202.00","$126,593.21"
>
> N.B. I've kindly given you five lines instead of one or two :-)
John,
I've had cases where some investigation of the source of the data has
completely removed any ambiguity. I've found that data was generated
from one or two sources and been able to know what every field type is
by just examining a field that I have determined will tell me the
source program that generated the data.

I have also found that the flow generating some data is subject to
hand editing so have had to both put in extra checks in my reader, and
on some occasions created specific editors to replace hand edits by
checked assisted hand edits.
I stand by my statement, "Know the source of your data"; it's less
likely to bite!

- Paddy.
May 20 '07 #13

John Machin wrote:
> The model would have to be a lot more complicated than that. There is a
> base number of required columns. The kind suppliers of the data randomly
> add extra columns, randomly permute the order in which the columns
> appear, and, for date columns
I'm going to ignore this because these things have absolutely no effect
on the analysis whatsoever. Random order of columns? How could this
influence any statistics, counting, Bayesian, or otherwise?

> randomly choose the day-month-year order,
> how much punctuation to sprinkle between the digits, and whether to
> append some bonus extra bytes like " 00:00:00".
I absolutely do not understand how bonus bytes or any of the above would
selectively adversely affect any single type of statistics--if your
converter doesn't recognize it then your converter doesn't recognize it
and so it will fail under every circumstance and influence any and all
statistical analysis. Under such conditions, I want very robust
analysis--probably more robust than simple counting statistics. And I
definitely want something more efficient.
> Past stats on failure to cast are no guide to the future
Not true when using Bayesian statistics (and any type of inference for
that matter). For example, where did you get 90% cutoff? From
experience? I thought that past stats are no guide to future expectations?

> ... a sudden
> change in the failure rate can be caused by the kind folk introducing a
> new null designator i.e. outside the list ['', 'NULL', 'NA', 'N/A',
> '#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]
Using the rough model and having no idea that they threw in a few weird
designators so that you might suspect a 20% failure (instead of the 2% I
modeled previously), the *low probabilities of false positives* (say 5%
of the non-Int columns evaluate to integer--after you've eliminated
dates because you remembered to test more restrictive types first) would
still *drive the statistics*. Remember, the posteriors become priors
after the first test.

P_1(H) = 0.2 (Just a guess, it'll wash after about 3 tests.)
P(D|H) = 0.8 (Are you sure they have it together enough to pay you?)
P(D|H') = 0.05 (5% of the names, salaries, etc., evaluate to float?)

Let's model failures, since the companies you work with have bad typists.
We have to reverse the probabilities for this:

Pf_1(H) = 0.2 (Only if this is round 1.)
Pf(D|H) = 0.2 (We *guess* a 20% chance by random any column is Int.)
Pf(D|H') = 0.80 (80% of Ints fail because of carpal tunnel, ennui, etc.)

You might take issue with Pf(D|H) = 0.2. I encourage you to try a range
of values here to see what the posteriors look like. You'll find that
this is not as important as the *low false positive rate*.

For example, let's not stop until we are 99.9% sure one way or the other.
With this cutoff, let's suppose this deplorable display of typing integers:

pass-fail-fail-pass-pass-pass

which might be expected from the above very pessimistic priors (maybe
you got data from the _Apathy_Coalition_ or the _Bad_Typists_Union_ or
the _Put_a_Quote_Around_Every_5th_Integer_League_):

P_1(H|D) = 0.800 (pass)
P_2(H|D) = 0.500 (fail)
P_3(H|D) = 0.200 (fail--don't stop, not 99.9% sure)
P_4(H|D) = 0.800 (pass)
P_5(H|D) = 0.9846153 (pass--not there yet)
P_6(H|D) = 0.9990243 (pass--got it!)

Now this is with 5% all salaries, names of people, addresses, favorite
colors, etc., evaluating to integers. (Pausing while I remember fondly
Uncle 41572--such a nice guy...funny name, though.)
> There is also the problem of first-time-participating organisations --
> in police parlance, they have no priors :-)
Yes, because they teleported from Alpha Centauri where organizations are
fundamentally different from here on Earth and we can not make any
reasonable assumptions about them--like that they will indeed cough up
money when the time comes or that they speak a dialect of an earth
language or that they even generate spreadsheets for us to parse.

James
May 20 '07 #14

On 20/05/2007 8:52 PM, Paddy wrote:
> [quoted challenge snipped]
>
> I've had cases where some investigation of the source of the data has
> completely removed any ambiguity. I've found that data was generated
> from one or two sources and been able to know what every field type is
> by just examining a field that I have determined will tell me the
> source program that generated the data.
The source program that produced my sample dataset was Microsoft Excel
(or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
operation. Does that help you determine the true nature of the first column?
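For readers who want the punch line spelled out: the first column looks like Excel date serial numbers that picked up a currency format on the way out. Assuming that reading (the 1899-12-30 epoch is the usual trick for accommodating Excel's phantom 1900-02-29), a sketch:

```python
from datetime import date, timedelta

def excel_serial_to_date(serial):
    # Excel day 1 is 1900-01-01; the 1899-12-30 base absorbs both the
    # off-by-one and the fake 1900 leap day for serials after Feb 1900.
    return date(1899, 12, 30) + timedelta(days=serial)

for cell in ['$39,082.00', '$39,113.00', '$39,202.00']:
    serial = int(float(cell.strip('$').replace(',', '')))
    print(cell, '->', excel_serial_to_date(serial))
# $39,082.00 -> 2006-12-31
# $39,113.00 -> 2007-01-31
# $39,202.00 -> 2007-04-30
```

The column is a run of month-end dates, so the sixth row would presumably be 39233, i.e. 2007-05-31 -- not something any success-rate voting over int/float casts would ever discover.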

> I have also found that the flow generating some data is subject to
> hand editing so have had to both put in extra checks in my reader, and
> on some occasions created specific editors to replace hand edits by
> checked assisted hand edits.
> I stand by my statement, "Know the source of your data"; it's less
> likely to bite!
My dataset has a known source, and furthermore meets your "lucky"
criteria (methodically generated, column type is consistent) -- I'm
waiting to hear from you about the "easily determined" part :-)

Cheers,
John

May 20 '07 #15

P: n/a
On May 20, 1:12 pm, John Machin <sjmac...@lexicon.netwrote:
On 20/05/2007 8:52 PM, Paddy wrote:
On May 20, 2:16 am, John Machin <sjmac...@lexicon.netwrote:
On 19/05/2007 3:14 PM, Paddy wrote:
>On May 19, 12:07 am, py_genetic <conor.robin...@gmail.comwrote:
Hello,
I'm importing large text files of data using csv. I would like to add
some more auto sensing abilities. I'm considing sampling the data
file and doing some fuzzy logic scoring on the attributes (colls in a
data base/ csv file, eg. height weight income etc.) to determine the
most efficient 'type' to convert the attribute coll into for further
processing and efficient storage...
Example row from sampled file data: [ ['8','2.33', 'A', 'BB', 'hello
there' '100,000,000,000'], [next row...] ....]
Aside from a missing attribute designator, we can assume that the same
type of data continues through a coll. For example, a string, int8,
int16, float etc.
1. What is the most efficient way in python to test weather a string
can be converted into a given numeric type, or left alone if its
really a string like 'A' or 'hello'? Speed is key? Any thoughts?
2. Is there anything out there already which deals with this issue?
Thanks,
Conor
You might try investigating what can generate your data. With luck,
it could turn out that the data generator is methodical and column
data-types are consistent and easily determined by testing the
first or second row. At worst, you will get to know how much you
must check for human errors.
Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?
"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"
N.B. I've kindly given you five lines instead of one or two :-)
Cheers,
John
John,
I've had cases where some investigation of the source of the data has
completely removed any ambiguity. I've found that data was generated
from one or two sources and been able to know what every field type is
by just examining a field that I have determined wil tell me the
source program that generated the data.

The source program that produced my sample dataset was Microsoft Excel
(or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
operation. Does that help you determine the true nature of the first column?
I have also found that the flow generating some data is subject to
hand editing so have had to both put in extra checks in my reader, and
on some occasions created specific editors to replace hand edits by
checked assisted hand edits.
I stand by my statement: "Know the source of your data"; it's less
likely to bite!

My dataset has a known source, and furthermore meets your "lucky"
criteria (methodically generated, column type is consistent) -- I'm
waiting to hear from you about the "easily determined" part :-)

Cheers,
John
John,
Open up your Excel spreadsheet and check what the format is for the
column. It's not a contest. If you KNOW what generated the data then
USE that knowledge. It would be counter-productive to do otherwise
surely?

(I know, don't call you Shirley :-)

- Paddy.

May 20 '07 #16

P: n/a
On May 21, 2:04 am, Paddy <paddy3...@googlemail.com> wrote:
On May 20, 1:12 pm, John Machin <sjmac...@lexicon.net> wrote:
[snip: the earlier exchange, quoted in full]

John,
Open up your Excel spreadsheet and check what the format is for the
column. It's not a contest. If you KNOW what generated the data then
USE that knowledge. It would be counter-productive to do otherwise
surely?

(I know, don't call you Shirley :-)
.... and I won't call you Patsy more than this once :-)

Patsy, re-read. The scenario is that I don't have the Excel
spreadsheet; I have a CSV file. The format is rather obviously
"currency", but that is not correct. The point is that it was
methodically [mis-]produced by a known source [your criteria], yet the
correct type of column 1 can't be determined by inspection of a value
or two.

Yeah, it's not a contest, but I was kinda expecting that you might
have taken first differences of column 1 by now ...

Cheers,
John


May 20 '07 #17

P: n/a
On May 18, 7:07 pm, py_genetic <conor.robin...@gmail.com> wrote:
[snip: original post quoted in full]
There are several replies to your immediate column type-guessing
problem, so I'm not going to address that. Once you decide the
converters for each column, you have to pass the dataset through them
(and optionally rearrange or omit some of them). That's easy to
hardcode for a few datasets with the same or similar structure but it
soon gets tiring.

I had a similar task recently so I wrote a general and efficient (at
least as far as pure python goes) row transformer that does the
repetitive work. Below are some examples from an IPython session; let
me know if this might be useful and I'll post it here or at the
Cookbook.

George
#======= RowTransformer examples ============================

In [1]: from transrow import RowTransformer
In [2]: rows = [row.split(',') for row in "1,3.34,4-3.2j,John",
"4,4,4,4", "0,-1.1,3.4,None"]
In [3]: rows
Out[3]:
[['1', '3.34', '4-3.2j', 'John'],
['4', '4', '4', '4'],
['0', '-1.1', '3.4', 'None']]

# adapt the first three columns; the rest are omitted
In [4]: for row in RowTransformer([int,float,complex])(rows):
...: print row
...:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4.0, (4+0j)]
[0, -1.1000000000000001, (3.3999999999999999+0j)]

# return the 2nd column as float, followed by the 4th column as is
In [5]: for row in RowTransformer({1:float, 3:None})(rows):
....: print row
....:
[3.3399999999999999, 'John']
[4.0, '4']
[-1.1000000000000001, 'None']
# return the 3rd column as complex, followed by the 1st column as int
In [6]: for row in RowTransformer([(2,complex),(0,int)])(rows):
....: print row
....:
[(4-3.2000000000000002j), 1]
[(4+0j), 4]
[(3.3999999999999999+0j), 0]

# return the first three columns, adapted by eval()
# XXX: use eval() only for trusted data
In [7]: for row in RowTransformer(include=range(3),
default_adaptor=eval)(rows):
....: print row
....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]

# equivalent to the previous
In [8]: for row in RowTransformer(default_adaptor=eval, exclude=[3])
(rows):
....: print row
....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]

May 21 '07 #18

P: n/a

I need to correct myself here before someone else does. I didn't
actually reverse the probabilities as promised for the failing case. It
was late last night and I was starting to get a little cloudy.
Pf(D|H) = 0.2 (We *guess* a 20% chance that any random column is an Int.)
This can be read instead as "probability that it will fail the test
given that it is really from an Int column", which is 20% of the time.
Pf(D|H') = 0.80 (80% of Ints fail because of carpal tunnel, ennui, etc.)
This should instead be read as "probability it will fail the test if it
is not really from an Int column". That would be Pf(D|H') = 0.95 (e.g.
testing the inability to cast to Int is a pretty bad test for Int
because it gives false positives 95% of the time).

This change doesn't change the conclusions of the example, with the
P_3(H|D) = 0.1505882 (lower than 20%, but no where near the 0.001 cutoff
to conclude the column is not Int) and the final probability P_7(H|D) =
0.9986247 (rounding up to our 0.999 criteria for confidence that it is
an Int ;).

James
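The sequential update James describes can be sketched as follows. This is a minimal illustration, not his actual code; the pass/fail rates (0.8 pass for true Ints, 0.05 pass for non-Ints, prior 0.2) are the guessed figures from the discussion, not measured values:

```python
def update(prior, passed, p_pass_given_int=0.8, p_pass_given_not=0.05):
    """One Bayesian update of P(column is Int) after testing whether a
    single sampled value casts to int. Likelihoods are illustrative:
    a true Int value passes int() 80% of the time (typos account for
    the rest), while a non-Int value passes only 5% of the time."""
    if passed:
        like_h, like_not = p_pass_given_int, p_pass_given_not
    else:
        like_h, like_not = 1 - p_pass_given_int, 1 - p_pass_given_not
    evidence = like_h * prior + like_not * (1 - prior)
    return like_h * prior / evidence

p = 0.2  # prior: guess 20% of columns are Int
for value_passed in [True, True, True]:
    p = update(p, value_passed)
print(p)  # three passing samples push the posterior above 0.999
```

With these rates, a handful of passing samples is enough to clear a 0.999 confidence criterion, which is why only 5-15 sampled values per column are needed.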
May 21 '07 #19

P: n/a
On 2007-05-20, John Machin <sj******@lexicon.net> wrote:
[snip: original post and Paddy's reply quoted in full]

Here you go, Paddy, the following has been generated very methodically;
what data type is the first column? What is the value in the first
column of the 6th row likely to be?

"$39,082.00","$123,456.78"
"$39,113.00","$124,218.10"
"$39,141.00","$124,973.76"
"$39,172.00","$125,806.92"
"$39,202.00","$126,593.21"

N.B. I've kindly given you five lines instead of one or two :-)
My experience with Excel-related mistakes leads me to think that
column one contains dates that got somehow misformatted on
export.

--
Neil Cerutti
May 21 '07 #20
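Neil's reading checks out if you decode the first column as Excel 1900-date-system serial numbers (a hedged sketch; the helper name is mine):

```python
from datetime import date, timedelta

def from_excel_serial(serial):
    """Decode an Excel 1900-date-system serial number. Excel's day 1
    is 1900-01-01; because of Excel's historical leap-year quirk,
    serials >= 61 convert cleanly from an 1899-12-30 epoch."""
    return date(1899, 12, 30) + timedelta(days=serial)

for raw in ['$39,082.00', '$39,113.00', '$39,141.00',
            '$39,172.00', '$39,202.00']:
    # peel off the spurious currency formatting, then decode
    serial = int(float(raw.strip('$').replace(',', '')))
    print(raw, '->', from_excel_serial(serial))
```

The serials decode to consecutive month-ends (2006-12-31 through 2007-04-30), so the sixth row is most likely $39,233.00, i.e. 2007-05-31, which is consistent with John's "first differences" hint (31, 28, 31, 30 days).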

P: n/a
This is excellent advice; thank you, gentlemen.

Paddy:

We can't really, in this arena, make assumptions about the data source.
I fully agree with your point, but if we had the luxury of really
knowing the source we wouldn't be having this conversation. Files we
can deal with could be consumer data files, log files, financial
files... all from different users, BCP-ed out, or CSV from Excel, etc.
However, I agree that we can make one basic assumption: for each column
there is a correct and furthermore optimal format. In many cases we
may have a supplied "data dictionary" with the data, in which case you
are right and we can override much of this process, except we still
need to find the optimal format, like int8 vs int16.

James:

Using a Bayesian method was my initial thought as well. The key to
this method, I feel, is getting a solid random sample of the entire
file without having to load the whole beast into memory.

What are your thoughts on other techniques? For example, training a
neural net and feeding it a sample; this might be nice and very fast
since after training (we would have to create a good global training
set) we could just do a quick transform on a column sample and average
the probabilities of the output units (one output unit for each type).
The question here would be encoding; any ideas? A binary representation
of the vars? Furthermore, naive Bayes, decision trees, etc.?
John:
> The approach that I've adopted is to test the values in a column for all
> types, and choose the non-text type that has the highest success rate
> (provided the rate is greater than some threshold e.g. 90%, otherwise
> it's text).
> For large files, taking a 1/N sample can save a lot of time with little
> chance of misdiagnosis.

I like your approach; this could be simple. Initially, I was thinking of
a loop that did exactly this: just test the sample columns for "hits"
and take the best. Thanks for the sample code.
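John's success-rate idea can be sketched as a loop like the following (a minimal illustration, not John's actual code; the `guess_column_type` name, the comma-stripping, and the 90% threshold are assumptions):

```python
def success_rate(values, converter):
    """Fraction of sampled values that the converter accepts."""
    wins = 0
    for v in values:
        try:
            converter(v.replace(',', ''))
            wins += 1
        except ValueError:
            pass
    return wins / float(len(values))

def guess_column_type(values, threshold=0.9):
    """Pick the non-text type with the highest success rate over the
    sample; fall back to str if nothing clears the threshold. Ties go
    to the narrower type because max() keeps the first maximum."""
    best = max((int, float), key=lambda c: success_rate(values, c))
    if success_rate(values, best) >= threshold:
        return best
    return str

# one bad value in ten still clears the 90% bar
print(guess_column_type(['1', '2', 'x', '4', '5', '6', '7', '8', '9', '10']))
```

Running each candidate converter over a small 1/N sample per column keeps the cost low even for large files.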
George:

Thank you for offering to share your transform function. I'm very
interested.


May 21 '07 #21

P: n/a
py_genetic wrote:
Using a Bayesian method was my initial thought as well. The key to
this method, I feel, is getting a solid random sample of the entire
file without having to load the whole beast into memory.
If you feel only the first 1000 rows are representative, then you can
take a random sample from the first 200-1000 rows depending on how good
you think the typing was. At 10,000 bytes per row, you are only reading
in a 10MB file if you read 1000 rows. I just timed reading a 28MB file
(source tgz file of Open Office 2.0.1) from a local drive at about 1s.
As I hope I have demonstrated, you will need only a small random sample
from this 200-1000 for testing (maybe 5-15 depending on quality and priors).
What are your thoughts on other techniques? For example, training a
neural net and feeding it a sample; this might be nice and very fast
since after training (we would have to create a good global training
set) we could just do a quick transform on a column sample and average
the probabilities of the output units (one output unit for each type).
The question here would be encoding; any ideas? A binary representation
of the vars? Furthermore, naive Bayes, decision trees, etc.?
I think these latter ideas require more characterization of your input
than is necessary, especially with judicious ordering of simple
converter tests. What properties, aside from passing and failing certain
converter tests, would you use for the data types? For example, you
might use fraction of the string that are integer digits as a property
to classify integers. But how many CPU cycles would you spend counting
these digits and calculating the fractions for your sample? Do you
expect that your data quality is poor enough to warrant expending CPU
cycles to quantify the several properties that might characterize each type?

However, you might need some advanced learning tools if you want to
automatically decide whether a column is a last name or a first name.
I'm guessing this is not required and you would only want to know
whether to make such a column an int or a string, in which case the
testing is pretty straightforward and quick (number of tests per column
< 10).

I've never converted tables from organizations on a routine basis, but I
have a feeling that the quality of these tables is not as poor as one
might fear, especially given reasonable foreknowledge of how data types
are typically encoded.

James
May 21 '07 #22
