flattening/rolling up/aggregating a large sorted text file

adtvff

Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

Mar 21 '07 #1

Subscribe Reply

1710

Steve Holden

ad****@yahoo.co m wrote:

Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

It seems a bit wrong-headed to force this problem to fit a solution
where you define relations with a variable number of columns when the
natural way to solve it would seem to be to sum the msr1 values for each
unique combination of ID, color and shape. That's a pretty
straightforward relational problem.

So, is there some reason the result *has* to have that variable number
of columns?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Mar 21 '07 #2

Shane Geiger

Apparently you want to use this data to know how many blue circles, blue
squares, red circles and red squares. In other words, I doubt you want
to output redundant data columns, you just want this data in a more
usable format and that you don't actually need to do multiple passes
over it.

This is a fun problem to solve because it uses two very powerful tools:
cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID. I assumes
the ID is the unique key. So I changed the data slightly.

ad****@yahoo.co m wrote:

Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

--
Shane Geiger
IT Director
National Council on Economic Education
sg*****@ncee.ne t | 402-438-8958 | http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

"""

Apparently you want to use this data to know how many blue circles, blue squares, red circles and red squares. In other words, I doubt you want to output redundant data columns, you just want this data in a more usable format and that you don't actually need to do multiple passes over it.

This is a fun problem to solve because it uses two very powerful tools: cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID. I assumes the ID is the unique key. So I changed the data slightly.

------

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.
ID,color,shape, msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6
ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

"""

import string
## BITWISE CATEGORIZATION STUFF

def gNextBit(val=0) :
while True:
y = 2**val
val += 1
yield y

nb = gNextBit()

categories = ['blue','red','s quare','circle']
#categories_val ue = ['blue','red','s quare','circle']

def bitwise_categor ize(items):
d = {}
for item in items:
d[item] = nb.next()
return d

categories_dict = bitwise_categor ize(categories)

#print categories_dict # {'blue': 1, 'circle': 8, 'square': 4, 'red': 2}

def get_properties( category_int):
p_list = []
for k,v in categories_dict .items():
if category_int & v == v:
p_list.append(k )
return p_list

def list_properties ():
for i in range(len(categ ories)**2):
print "Properties for something with category_int of",str(i),str( get_properties( i))

#list_propertie s()

### EXAMPLE DATA

header_fields = ['id','color','s hape','msr1']

example_data = """
001, blue, square, 4
002, red , circle, 5
003, red, circle, 6
"""

# write out the example
import os
def writefile(f, data, perms=750): open(f, 'w').write(data ) and os.chmod(f, perms)
csv_file = "/Users/shanegeiger/temp.csv"
writefile(csv_f ile, example_data)

### READING IN THE DATA AND CATEGORIZING IT WITH BITWISE CATEGORIZATION

import csv
reader = csv.DictReader( open(csv_file), [], delimiter=",")

data = []
info = {}

while True:
try:
# Read next "header" line (if there isn't one then exit the loop)
reader.fieldnam es = header_fields
rdr = reader.next()
data.append(rdr )

except StopIteration: break

categories_int = 0

# print "categories_dic t:",categories_ dict

for rec in data: # for each record
color_cat = categories_dict[string.strip(re c['color'])] # should add them to the color category
shape_cat = categories_dict[string.strip(re c['shape'])] # should add them to the color category
combined_cat = color_cat + shape_cat

#print "color_category :", color_cat,
#print " shape_category: ", shape_cat,
#print " combined categories:", combined_cat

rec['categories_int '] = combined_cat
if rec['id']:
info[rec['id']] = rec
elif rec['color']:
info[rec['color']] = rec
elif rec['shape']:
info[rec['shape']] = rec
elif rec['msr1']:
info[rec['msr1']] = rec
else:
print "Warning: Not all fields found for record:",str(re c)

for k,v in info.items():
print k,"-->",v
cat_int = v['categories_int ']
print " -- The categories_int variable has value ",cat_int,"whic h means this record has these properties: ", get_properties( cat_int)

print

print "Just to be clear, here are the bitwise categories:"
list_properties ()

Mar 21 '07 #3

Eddie Corns

ad****@yahoo.co m writes:

>Hi,

>Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

>If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

>Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

>I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

>Thanks.

>ID,color,shape ,msr1
------------------------------
001, blue, square, 4
001, red , circle, 5
001, red, circle, 6

>ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

Something like:

import sys
from sets import Set

ids = {}
keys = Set()

for line in sys.stdin:
ID,COL,SHAPE,VA L = [s.strip() for s in line.split(',')]
ids.setdefault( ID,{})
key = '%s_%s'%(COL,SH APE)
ids[ID].setdefault(key ,0)
ids[ID][key] += int(VAL)
keys.add(key)

print 'id',',',','.jo in([str(key) for key in keys])
for id,cols in ids.items():
print id,',', ', '.join([str(cols.get(k, 0)) for k in keys])
Doesn't keep all possible keys just those that are actually used.
Needs to sort() things here and there.

Incidentally I don't think you could do it in SQL at all in this way but you
could do it in a more vertical fashion (eg
001, red, circle, 11
001, blue, square, 4
002, red, rhombus, 99) etc.

Eddie

Mar 22 '07 #4

Similar topics

3850

very large dictionaries

by: robin | last post by:

I need to do a search through about 50 million records, each of which are less than 100 bytes wide. A database is actually too slow for this, so I thought of optimising the data and putting it all in memory. There is a single key field, so a dictionary is an obvious choice for a structure, since Python optimises these nicely. But is there a better choice? Is it worth building some sort of tree?

Python

7296

shuffle the lines of a large file

by: Joerg Schuster | last post by:

Hello, I am looking for a method to "shuffle" the lines of a large file. I have a corpus of sorted and "uniqed" English sentences that has been produced with (1): (1) sort corpus | uniq > corpus.uniq corpus.uniq is 80G large. The fact that every sentence appears only

Python

4840

Tools and Techniques for Managing Large Websites

by: dub | last post by:

Hello web folks... I've been desigining web pages for 13 years using my trusty text editor (UltraEdit) and in depth knowledge of HTML. I'm truly a text editor ninja at this point. I am frequently using regular expressions to search and replace in multiple files. I also use many different copy buffers and know all the keyboard shortcuts like the back of my hand. The point is... it's comfortable. Unfortunately, it's become just too...

HTML / CSS

2663

XML to CSV was Flattening XML tree

by: solex | last post by:

Hello, Part of my XML application will allow the user to select any field(s) from the hiearchy as output to a CSV file for viewing/ manipulation. The datastore consists of many XML files that adhere to a single schema. If possible I would like a generic xslt file that will create the CSV file by flattening the hiearchy. I suggest a generic transform file because the selected fields will change.

.NET Framework

1635

a flattening operator?

by: gangesmaster | last post by:

as we all know, * (asterisk) can be used to "inline" or "flatten" a tuple into an argument list, i.e.: def f(a, b, c): ... x = (1,2,3) f(*x) so... mainly for symmetry's sake, why not make a "flattening" operator that also works outside the context of function calls? for example:

Python

1626

prefix search on a large file

by: js | last post by:

Hello, list. I have a list of sentence in text files that I use to filter-out some data. I managed the list so badly that now it's become literally a mess. Let's say the list has a sentence below 1. "Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. "

Python

2512

parsing large text file and pairwise comparison

by: Alan | last post by:

Hi. I have programmed in C++ before, but I`m a couple of years out of practice. I am seeking some advice on getting started on a quickie project. . . . I have to read a 54MB text file and do a pairwise comparison among 2500 items or so in the file. Each of those items have to be compared to every other item. Many of the comparison will only require comparing one field of the items. I will probably sort on this field before I do the...

C / C++

11669

Large file handling in C#

by: pargat.singh | last post by:

Hi Everyone: I have a C# program which loop through No. of files. In UAT i had medium size files and every thing goes ok but in Production files are big and i notice that my program is very slow and it's taking 100% CPU time. Is there anyway i can improve the performance. Something like Do Events etc. Any suggestion is much appreciated.

C# / C Sharp

9603

Rolling files in C++

by: farhadtarapore | last post by:

I have a very large C++ application that has been converted into a windows service. This application writes a lot of statements to the console i.e. cout and cerr. I have used std::ofstream out(coutFilePath.c_str (), ofstream::out | ofstream::app); if(!out.is_open()) throw std::runtime_error("Failed to open cout file"); //Save the previous target before we redirect

C / C++

9568

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9404

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...

Windows Server

10168

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

10008

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

9959

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...

Windows Server

9837

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

5279

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

5423

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

2806

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General