469,622 Members | 1,462 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,622 developers. It's quick & easy.

flattening/rolling up/aggregating a large sorted text file

Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

Mar 21 '07 #1
3 1550
ad****@yahoo.com wrote:
Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...
It seems a bit wrong-headed to force this problem to fit a solution
where you define relations with a variable number of columns when the
natural way to solve it would seem to be to sum the msr1 values for each
unique combination of ID, color and shape. That's a pretty
straightforward relational problem.

So, is there some reason the result *has* to have that variable number
of columns?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Mar 21 '07 #2
Apparently you want to use this data to know how many blue circles, blue
squares, red circles and red squares. In other words, I doubt you want
to output redundant data columns, you just want this data in a more
usable format and that you don't actually need to do multiple passes
over it.

This is a fun problem to solve because it uses two very powerful tools:
cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID. I assumes
the ID is the unique key. So I changed the data slightly.

ad****@yahoo.com wrote:
Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

--
Shane Geiger
IT Director
National Council on Economic Education
sg*****@ncee.net | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

"""

Apparently you want to use this data to know how many blue circles, blue squares, red circles and red squares. In other words, I doubt you want to output redundant data columns, you just want this data in a more usable format and that you don't actually need to do multiple passes over it.

This is a fun problem to solve because it uses two very powerful tools: cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID. I assumes the ID is the unique key. So I changed the data slightly.

------

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

"""

import string
## BITWISE CATEGORIZATION STUFF

def gNextBit(val=0):
while True:
y = 2**val
val += 1
yield y

nb = gNextBit()

categories = ['blue','red','square','circle']
#categories_value = ['blue','red','square','circle']

def bitwise_categorize(items):
d = {}
for item in items:
d[item] = nb.next()
return d

categories_dict = bitwise_categorize(categories)

#print categories_dict # {'blue': 1, 'circle': 8, 'square': 4, 'red': 2}

def get_properties(category_int):
p_list = []
for k,v in categories_dict.items():
if category_int & v == v:
p_list.append(k)
return p_list

def list_properties():
for i in range(len(categories)**2):
print "Properties for something with category_int of",str(i),str(get_properties(i))

#list_properties()

### EXAMPLE DATA

header_fields = ['id','color','shape','msr1']

example_data = """
001, blue, square, 4
002, red , circle, 5
003, red, circle, 6
"""

# write out the example
import os
def writefile(f, data, perms=750): open(f, 'w').write(data) and os.chmod(f, perms)
csv_file = "/Users/shanegeiger/temp.csv"
writefile(csv_file, example_data)


### READING IN THE DATA AND CATEGORIZING IT WITH BITWISE CATEGORIZATION

import csv
reader = csv.DictReader(open(csv_file), [], delimiter=",")

data = []
info = {}

while True:
try:
# Read next "header" line (if there isn't one then exit the loop)
reader.fieldnames = header_fields
rdr = reader.next()
data.append(rdr)

except StopIteration: break

categories_int = 0

# print "categories_dict:",categories_dict

for rec in data: # for each record
color_cat = categories_dict[string.strip(rec['color'])] # should add them to the color category
shape_cat = categories_dict[string.strip(rec['shape'])] # should add them to the color category
combined_cat = color_cat + shape_cat

#print "color_category:", color_cat,
#print " shape_category:", shape_cat,
#print " combined categories:", combined_cat

rec['categories_int'] = combined_cat
if rec['id']:
info[rec['id']] = rec
elif rec['color']:
info[rec['color']] = rec
elif rec['shape']:
info[rec['shape']] = rec
elif rec['msr1']:
info[rec['msr1']] = rec
else:
print "Warning: Not all fields found for record:",str(rec)

for k,v in info.items():
print k,"-->",v
cat_int = v['categories_int']
print " -- The categories_int variable has value ",cat_int,"which means this record has these properties: ", get_properties(cat_int)

print

print "Just to be clear, here are the bitwise categories:"
list_properties()
Mar 21 '07 #3
ad****@yahoo.com writes:
>Hi,
>Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.
>If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.
>Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?
>I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.
>Thanks.
>ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
>ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...
Something like:

import sys
from sets import Set

ids = {}
keys = Set()

for line in sys.stdin:
ID,COL,SHAPE,VAL = [s.strip() for s in line.split(',')]
ids.setdefault(ID,{})
key = '%s_%s'%(COL,SHAPE)
ids[ID].setdefault(key,0)
ids[ID][key] += int(VAL)
keys.add(key)

print 'id',',',','.join([str(key) for key in keys])
for id,cols in ids.items():
print id,',', ', '.join([str(cols.get(k,0)) for k in keys])
Doesn't keep all possible keys just those that are actually used.
Needs to sort() things here and there.

Incidentally I don't think you could do it in SQL at all in this way but you
could do it in a more vertical fashion (eg
001, red, circle, 11
001, blue, square, 4
002, red, rhombus, 99) etc.

Eddie
Mar 22 '07 #4

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

8 posts views Thread by robin | last post: by
24 posts views Thread by Joerg Schuster | last post: by
1 post views Thread by solex | last post: by
2 posts views Thread by gangesmaster | last post: by
8 posts views Thread by js | last post: by
9 posts views Thread by pargat.singh | last post: by
3 posts views Thread by farhadtarapore | last post: by
reply views Thread by devrayhaan | last post: by
reply views Thread by gheharukoh7 | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.