473,888 Members | 1,551 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

flattening/rolling up/aggregating a large sorted text file

Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

Mar 21 '07 #1
3 1722
ad****@yahoo.co m wrote:
Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...
It seems a bit wrong-headed to force this problem to fit a solution
where you define relations with a variable number of columns when the
natural way to solve it would seem to be to sum the msr1 values for each
unique combination of ID, color and shape. That's a pretty
straightforward relational problem.

So, is there some reason the result *has* to have that variable number
of columns?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Mar 21 '07 #2
Apparently you want to use this data to know how many blue circles, blue
squares, red circles and red squares. In other words, I doubt you want
to output redundant data columns, you just want this data in a more
usable format and that you don't actually need to do multiple passes
over it.

This is a fun problem to solve because it uses two very powerful tools:
cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID. I assumes
the ID is the unique key. So I changed the data slightly.

ad****@yahoo.co m wrote:
Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

--
Shane Geiger
IT Director
National Council on Economic Education
sg*****@ncee.ne t | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

"""

Apparently you want to use this data to know how many blue circles, blue squares, red circles and red squares. In other words, I doubt you want to output redundant data columns, you just want this data in a more usable format and that you don't actually need to do multiple passes over it.

This is a fun problem to solve because it uses two very powerful tools: cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID. I assumes the ID is the unique key. So I changed the data slightly.

------

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

"""

import string
## BITWISE CATEGORIZATION STUFF

def gNextBit(val=0) :
while True:
y = 2**val
val += 1
yield y

nb = gNextBit()

categories = ['blue','red','s quare','circle']
#categories_val ue = ['blue','red','s quare','circle']

def bitwise_categor ize(items):
d = {}
for item in items:
d[item] = nb.next()
return d

categories_dict = bitwise_categor ize(categories)

#print categories_dict # {'blue': 1, 'circle': 8, 'square': 4, 'red': 2}

def get_properties( category_int):
p_list = []
for k,v in categories_dict .items():
if category_int & v == v:
p_list.append(k )
return p_list

def list_properties ():
for i in range(len(categ ories)**2):
print "Properties for something with category_int of",str(i),str( get_properties( i))

#list_propertie s()

### EXAMPLE DATA

header_fields = ['id','color','s hape','msr1']

example_data = """
001, blue, square, 4
002, red , circle, 5
003, red, circle, 6
"""

# write out the example
import os
def writefile(f, data, perms=750): open(f, 'w').write(data ) and os.chmod(f, perms)
csv_file = "/Users/shanegeiger/temp.csv"
writefile(csv_f ile, example_data)


### READING IN THE DATA AND CATEGORIZING IT WITH BITWISE CATEGORIZATION

import csv
reader = csv.DictReader( open(csv_file), [], delimiter=",")

data = []
info = {}

while True:
try:
# Read next "header" line (if there isn't one then exit the loop)
reader.fieldnam es = header_fields
rdr = reader.next()
data.append(rdr )

except StopIteration: break

categories_int = 0

# print "categories_dic t:",categories_ dict

for rec in data: # for each record
color_cat = categories_dict[string.strip(re c['color'])] # should add them to the color category
shape_cat = categories_dict[string.strip(re c['shape'])] # should add them to the color category
combined_cat = color_cat + shape_cat

#print "color_category :", color_cat,
#print " shape_category: ", shape_cat,
#print " combined categories:", combined_cat

rec['categories_int '] = combined_cat
if rec['id']:
info[rec['id']] = rec
elif rec['color']:
info[rec['color']] = rec
elif rec['shape']:
info[rec['shape']] = rec
elif rec['msr1']:
info[rec['msr1']] = rec
else:
print "Warning: Not all fields found for record:",str(re c)

for k,v in info.items():
print k,"-->",v
cat_int = v['categories_int ']
print " -- The categories_int variable has value ",cat_int,"whic h means this record has these properties: ", get_properties( cat_int)

print

print "Just to be clear, here are the bitwise categories:"
list_properties ()
Mar 21 '07 #3
ad****@yahoo.co m writes:
>Hi,
>Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.
>If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.
>Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?
>I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.
>Thanks.
>ID,color,shape ,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
>ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...
Something like:

import sys
from sets import Set

ids = {}
keys = Set()

for line in sys.stdin:
ID,COL,SHAPE,VA L = [s.strip() for s in line.split(',')]
ids.setdefault( ID,{})
key = '%s_%s'%(COL,SH APE)
ids[ID].setdefault(key ,0)
ids[ID][key] += int(VAL)
keys.add(key)

print 'id',',',','.jo in([str(key) for key in keys])
for id,cols in ids.items():
print id,',', ', '.join([str(cols.get(k, 0)) for k in keys])
Doesn't keep all possible keys just those that are actually used.
Needs to sort() things here and there.

Incidentally I don't think you could do it in SQL at all in this way but you
could do it in a more vertical fashion (eg
001, red, circle, 11
001, blue, square, 4
002, red, rhombus, 99) etc.

Eddie
Mar 22 '07 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
3863
by: robin | last post by:
I need to do a search through about 50 million records, each of which are less than 100 bytes wide. A database is actually too slow for this, so I thought of optimising the data and putting it all in memory. There is a single key field, so a dictionary is an obvious choice for a structure, since Python optimises these nicely. But is there a better choice? Is it worth building some sort of tree?
24
7321
by: Joerg Schuster | last post by:
Hello, I am looking for a method to "shuffle" the lines of a large file. I have a corpus of sorted and "uniqed" English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only
21
4846
by: dub | last post by:
Hello web folks... I've been desigining web pages for 13 years using my trusty text editor (UltraEdit) and in depth knowledge of HTML. I'm truly a text editor ninja at this point. I am frequently using regular expressions to search and replace in multiple files. I also use many different copy buffers and know all the keyboard shortcuts like the back of my hand. The point is... it's comfortable. Unfortunately, it's become just too...
1
2666
by: solex | last post by:
Hello, Part of my XML application will allow the user to select any field(s) from the hiearchy as output to a CSV file for viewing/ manipulation. The datastore consists of many XML files that adhere to a single schema. If possible I would like a generic xslt file that will create the CSV file by flattening the hiearchy. I suggest a generic transform file because the selected fields will change.
2
1639
by: gangesmaster | last post by:
as we all know, * (asterisk) can be used to "inline" or "flatten" a tuple into an argument list, i.e.: def f(a, b, c): ... x = (1,2,3) f(*x) so... mainly for symmetry's sake, why not make a "flattening" operator that also works outside the context of function calls? for example:
8
1633
by: js | last post by:
Hello, list. I have a list of sentence in text files that I use to filter-out some data. I managed the list so badly that now it's become literally a mess. Let's say the list has a sentence below 1. "Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. "
7
2518
by: Alan | last post by:
Hi. I have programmed in C++ before, but I`m a couple of years out of practice. I am seeking some advice on getting started on a quickie project. . . . I have to read a 54MB text file and do a pairwise comparison among 2500 items or so in the file. Each of those items have to be compared to every other item. Many of the comparison will only require comparing one field of the items. I will probably sort on this field before I do the...
9
11671
by: pargat.singh | last post by:
Hi Everyone: I have a C# program which loop through No. of files. In UAT i had medium size files and every thing goes ok but in Production files are big and i notice that my program is very slow and it's taking 100% CPU time. Is there anyway i can improve the performance. Something like Do Events etc. Any suggestion is much appreciated.
3
9624
by: farhadtarapore | last post by:
I have a very large C++ application that has been converted into a windows service. This application writes a lot of statements to the console i.e. cout and cerr. I have used std::ofstream out(coutFilePath.c_str (), ofstream::out | ofstream::app); if(!out.is_open()) throw std::runtime_error("Failed to open cout file"); //Save the previous target before we redirect
0
11182
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10886
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10439
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9597
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5824
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
6014
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4642
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
4245
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
3252
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.