Hi,
Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.
If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.
Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?
I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.
Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ... 3 1697 ad****@yahoo.com wrote:
Hi,
Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.
If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.
Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?
I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.
Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...
It seems a bit wrong-headed to force this problem to fit a solution
where you define relations with a variable number of columns when the
natural way to solve it would seem to be to sum the msr1 values for each
unique combination of ID, color and shape. That's a pretty
straightforward relational problem.
So, is there some reason the result *has* to have that variable number
of columns?
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com
Apparently you want to use this data to know how many blue circles, blue
squares, red circles and red squares. In other words, I doubt you want
to output redundant data columns, you just want this data in a more
usable format and that you don't actually need to do multiple passes
over it.
This is a fun problem to solve because it uses two very powerful tools:
cvs.dictreader and bitwise categorization.
Note: your initial data has three records with the same ID. I assumes
the ID is the unique key. So I changed the data slightly. ad****@yahoo.com wrote:
Hi,
Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.
If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.
Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?
I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.
Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...
--
Shane Geiger
IT Director
National Council on Economic Education sg*****@ncee.net | 402-438-8958 | http://www.ncee.net
Leading the Campaign for Economic and Financial Literacy
"""
Apparently you want to use this data to know how many blue circles, blue squares, red circles and red squares. In other words, I doubt you want to output redundant data columns, you just want this data in a more usable format and that you don't actually need to do multiple passes over it.
This is a fun problem to solve because it uses two very powerful tools: cvs.dictreader and bitwise categorization.
Note: your initial data has three records with the same ID. I assumes the ID is the unique key. So I changed the data slightly.
------
Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.
If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.
Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?
I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.
Thanks.
ID,color,shape,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...
"""
import string
## BITWISE CATEGORIZATION STUFF
def gNextBit(val=0):
while True:
y = 2**val
val += 1
yield y
nb = gNextBit()
categories = ['blue','red','square','circle']
#categories_value = ['blue','red','square','circle']
def bitwise_categorize(items):
d = {}
for item in items:
d[item] = nb.next()
return d
categories_dict = bitwise_categorize(categories)
#print categories_dict # {'blue': 1, 'circle': 8, 'square': 4, 'red': 2}
def get_properties(category_int):
p_list = []
for k,v in categories_dict.items():
if category_int & v == v:
p_list.append(k)
return p_list
def list_properties():
for i in range(len(categories)**2):
print "Properties for something with category_int of",str(i),str(get_properties(i))
#list_properties()
### EXAMPLE DATA
header_fields = ['id','color','shape','msr1']
example_data = """
001, blue, square, 4
002, red , circle, 5
003, red, circle, 6
"""
# write out the example
import os
def writefile(f, data, perms=750): open(f, 'w').write(data) and os.chmod(f, perms)
csv_file = "/Users/shanegeiger/temp.csv"
writefile(csv_file, example_data)
### READING IN THE DATA AND CATEGORIZING IT WITH BITWISE CATEGORIZATION
import csv
reader = csv.DictReader(open(csv_file), [], delimiter=",")
data = []
info = {}
while True:
try:
# Read next "header" line (if there isn't one then exit the loop)
reader.fieldnames = header_fields
rdr = reader.next()
data.append(rdr)
except StopIteration: break
categories_int = 0
# print "categories_dict:",categories_dict
for rec in data: # for each record
color_cat = categories_dict[string.strip(rec['color'])] # should add them to the color category
shape_cat = categories_dict[string.strip(rec['shape'])] # should add them to the color category
combined_cat = color_cat + shape_cat
#print "color_category:", color_cat,
#print " shape_category:", shape_cat,
#print " combined categories:", combined_cat
rec['categories_int'] = combined_cat
if rec['id']:
info[rec['id']] = rec
elif rec['color']:
info[rec['color']] = rec
elif rec['shape']:
info[rec['shape']] = rec
elif rec['msr1']:
info[rec['msr1']] = rec
else:
print "Warning: Not all fields found for record:",str(rec)
for k,v in info.items():
print k,"-->",v
cat_int = v['categories_int']
print " -- The categories_int variable has value ",cat_int,"which means this record has these properties: ", get_properties(cat_int)
print
print "Just to be clear, here are the bitwise categories:"
list_properties() ad****@yahoo.com writes:
>Hi,
>Given a large ascii file (delimited or fixed width) with one ID field and dimensions/measures fields, sorted by dimensions, I'd like to "flatten" or "rollup" the file by creating new columns: one for each combination of dimension level, and summing up measures over all records for a given ID.
>If the wheel has already been invented, great, please point me in the right direction. If not, please share some pointers on how to think about this problem in order to write efficient code.
>Is a hash with dimension level combinations a good approach, with values reset at each new ID level?
>I know mysql, Oracle etc will do this , but they all have a cap on # of columns allowed. SAS will allow unlimited columns, but I don't own SAS.
>Thanks.
>ID,color,shape,msr1 ------------------------------ 001, blue, square, 4 001, red , circle, 5 001, red, circle, 6
>ID, blue_circle, blue_square, red_circle, red_square -------------------------------------------------------------------------- 001,0,4,11,0 002 ...
Something like:
import sys
from sets import Set
ids = {}
keys = Set()
for line in sys.stdin:
ID,COL,SHAPE,VAL = [s.strip() for s in line.split(',')]
ids.setdefault(ID,{})
key = '%s_%s'%(COL,SHAPE)
ids[ID].setdefault(key,0)
ids[ID][key] += int(VAL)
keys.add(key)
print 'id',',',','.join([str(key) for key in keys])
for id,cols in ids.items():
print id,',', ', '.join([str(cols.get(k,0)) for k in keys])
Doesn't keep all possible keys just those that are actually used.
Needs to sort() things here and there.
Incidentally I don't think you could do it in SQL at all in this way but you
could do it in a more vertical fashion (eg
001, red, circle, 11
001, blue, square, 4
002, red, rhombus, 99) etc.
Eddie This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics
by: robin |
last post by:
I need to do a search through about 50 million records, each of which
are less than 100 bytes wide. A database is actually too slow for
this, so I thought of optimising the data and putting it all...
|
by: Joerg Schuster |
last post by:
Hello,
I am looking for a method to "shuffle" the lines of a large file.
I have a corpus of sorted and "uniqed" English sentences that has been
produced with (1):
(1) sort corpus | uniq >...
|
by: dub |
last post by:
Hello web folks... I've been desigining web pages for 13 years using my
trusty text editor (UltraEdit) and in depth knowledge of HTML. I'm
truly a text editor ninja at this point. I am frequently...
|
by: solex |
last post by:
Hello,
Part of my XML application will allow the user to select any field(s) from
the hiearchy as output to a CSV file for viewing/ manipulation. The
datastore consists of many XML files that...
|
by: gangesmaster |
last post by:
as we all know, * (asterisk) can be used to "inline" or "flatten" a
tuple into an argument list, i.e.:
def f(a, b, c):
...
x = (1,2,3)
f(*x)
so... mainly for symmetry's sake, why not make a...
|
by: js |
last post by:
Hello, list.
I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.
Let's say the list has a sentence...
|
by: Alan |
last post by:
Hi. I have programmed in C++ before, but I`m a couple of years out
of practice. I am seeking some advice on getting started on a quickie
project. . . .
I have to read a 54MB text file and do a...
|
by: pargat.singh |
last post by:
Hi Everyone:
I have a C# program which loop through No. of files. In
UAT i had medium size files and every thing goes ok but in Production
files are big and i notice that my program is very...
|
by: farhadtarapore |
last post by:
I have a very large C++ application that has been converted into a
windows service. This application writes a lot of statements to the
console i.e. cout and cerr.
I have used
std::ofstream...
|
by: CloudSolutions |
last post by:
Introduction:
For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
|
by: Faith0G |
last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome former...
|
by: ryjfgjl |
last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: aa123db |
last post by:
Variable and constants
Use var or let for variables and const fror constants.
Var foo ='bar';
Let foo ='bar';const baz ='bar';
Functions
function $name$ ($parameters$) {
}
...
|
by: ryjfgjl |
last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
|
by: BarryA |
last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
|
by: Hystou |
last post by:
There are some requirements for setting up RAID:
1. The motherboard and BIOS support RAID configuration.
2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
| |