Starting at the beginning...

Hello...

I m very new to programming but not to computers. I am trying to expand my skills beyond what I do in Excel. I started working with Python about a month ago. I feel things are close to clicking for me but I need a bit of a nudge from an expert to get me over the hump.

I am not looking for help writing a script per se, but actually looking for insight on how to approach my project. I have been reading about tuples and lists and pickles and shelves and I find myself a bit perplexed on which way to proceed.

The project starts with a .csv file that I download from a website. There are 14 columns of data and roughly 1500 rows or records.

Initially, I need to read the file, reorder the columns, strip some characters from one column, compare the values of 2 different fields and highlight a deviation of greater than 20%, sum the value of one column and print revised records to a file.

I have been playing around with and have created pickle files and shelve files but have had problems visualizing the order in which to work this out...do I even want to go that route or should I just be looking at re, maybe???

Your thoughts would be appreciated!!

Feb 14 '08 #1

Subscribe Post Reply

1255

bvdet

2,851

Expert Mod 2GB

Python built-in module csv may be ideal for your application. Instead of that, the approach to this partly depends on the data itself. If your data has a header line, you can read all the data into a dictionary using the column headers as keys. I have a simple example. This will not work if any of your data has embedded commas (the data with embedded commas must be enclosed in quotes). Following is the sample file:

Expand|Select|Wrap|Line Numbers

 Count,Dia.,Type,Length,Hd Wash,Nt Wash,TC

24, 7/8,A325,3 1/2,,1 Hardened,No

448, 7/8,A325,3 1/4,,1 Hardened,No

484, 7/8,A325,3,,1 Hardened,No

508, 7/8,A325,2 3/4,,1 Hardened,No

243, 7/8,A325,2 1/2,,1 Hardened,No

479, 7/8,A325,2 1/4,,1 Hardened,No

2632, 7/8,A325,2,,1 Hardened,No

3378, 7/8,A325,1 3/4,,1 Hardened,No

8, 3/4,A325,2,,1 Hardened,No

8, 7/8,A307,11,,,No

24, 7/8,A307,9,,,No

36, 7/8,A307,7,,,No

The following code parses and compiles the data so it can be manipulated:

Expand|Select|Wrap|Line Numbers

 fn = 'sample.csv'

f = open(fn)

# read first line of file

s = f.readline().strip()
 
# the first line contains the column headers

# use the column headers as dictionary keys

hdrList = s.split(',')

colList = [[] for i in range(len(hdrList))]

for line in f:

    for i, item in enumerate(line.strip().split(',')):

        colList[i].append(item)

f.close()
 
# create the dictionary

dd = dict(zip(hdrList, colList))
 
# reorder the columnar data 

# in this case, the length of hdrList = 7

# column numbers = 1-7

# list indices begin with 0

neworder = [1,3,5,2,4,7]

newhdrList = [hdrList[i-1] for i in neworder]
 
# compile the reordered data

outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]

for key in newhdrList:

    for i, item in enumerate(dd[key]):

        outList[i+1].append(item)
 
# print the data

# also suitable for writing to file

print '\n'.join([','.join([s for s in item]) for item in outList])

# print the sum of column 'Count'

print 'Total number of bolts = %d' % sum([int(i) for i in dd['Count']])

Printed output:

Expand|Select|Wrap|Line Numbers

 >>> Count,Type,Hd Wash,Dia.,Length,TC

24,A325,, 7/8,3 1/2,No

448,A325,, 7/8,3 1/4,No

484,A325,, 7/8,3,No

508,A325,, 7/8,2 3/4,No

243,A325,, 7/8,2 1/2,No

479,A325,, 7/8,2 1/4,No

2632,A325,, 7/8,2,No

3378,A325,, 7/8,1 3/4,No

8,A325,, 3/4,2,No

8,A307,, 7/8,11,No

24,A307,, 7/8,9,No

36,A307,, 7/8,7,No

Total number of bolts = 8272

>>>

HTH

Feb 14 '08 #2

hjnathan

Thanks for the quick response!

Breaking this up by columns was definitely a different direction than I was seeing...I was focused on the rows...

It took me a while to reply back because I tried it out with my data, not to mention I'm at work now. It works great! I'm going to spend a little time getting more comfortable with what you sent and try to add to it. I saw your column total and had to comment it out until I figure out the adaptation for my data.

I have one question, though...being that the data is in a dictionary, when I compare fields I'll be referencing the dictionary keys in all formulas...correct?

I do believe this may have been the nudge I needed...thanks very much!!

Feb 15 '08 #3

bvdet

2,851

Expert Mod 2GB

Thanks for the quick response!

Breaking this up by columns was definitely a different direction than I was seeing...I was focused on the rows...

It took me a while to reply back because I tried it out with my data, not to mention I'm at work now. It works great! I'm going to spend a little time getting more comfortable with what you sent and try to add to it. I saw your column total and had to comment it out until I figure out the adaptation for my data.

I have one question, though...being that the data is in a dictionary, when I compare fields I'll be referencing the dictionary keys in all formulas...correct?

I do believe this may have been the nudge I needed...thanks very much!!

You are very welcome. I am pleased hearing from someone with your enthusiasm.

Let's say you want to display the difference between column 'Data 1' and column 'Data 3' if it exceeds 1.0, and both columns contain floating point numbers.

Expand|Select|Wrap|Line Numbers

 compList = [[float(item), float(dd['Data 1'][i])] for i, item in enumerate(dd['Data 2'])]
 
for i, item in enumerate(compList):

    diff = item[0]-item[1]

    if abs(diff) > 1:

        print 'Record %d difference is %0.4f' % (i, diff)

Feb 15 '08 #4

hjnathan

You are very welcome. I am pleased hearing from someone with your enthusiasm.

Let's say you want to display the difference between column 'Data 1' and column 'Data 3' if it exceeds 1.0, and both columns contain floating point numbers.

Expand|Select|Wrap|Line Numbers

compList = [[float(item), float(dd['Data 1'][i])] for i, item in enumerate(dd['Data 2'])]

for i, item in enumerate(compList):

    diff = item[0]-item[1]

    if abs(diff) > 1:

        print 'Record %d difference is %0.4f' % (i, diff)

Okay...I am back at it this morning. I understand how you created the dictionary and how to reorder the columns so far, but I am having a little trouble figuring out the "Comparison" code. In your code where you are defining compList, what list does 'data 1' and 'data 2' represent? I understand that the list is part of the dictionary named "dd" and I have tried replacing those entries with key names and key position numbers with no success. Obviously I am not quite understanding what I'm looking at...??

Feb 15 '08 #5

bvdet

2,851

Expert Mod 2GB

Okay...I am back at it this morning. I understand how you created the dictionary and how to reorder the columns so far, but I am having a little trouble figuring out the "Comparison" code. In your code where you are defining compList, what list does 'data 1' and 'data 2' represent? I understand that the list is part of the dictionary named "dd" and I have tried replacing those entries with key names and key position numbers with no success. Obviously I am not quite understanding what I'm looking at...??

My sample.csv file was not very well suited for the comparison of two fields, so I made up two field names which would contain some sort of data that could be compared - like numbers. The sample CSV file now looks like this:

Count,Dia.,Type,Length,Hd Wash,Nt Wash,TC,Data 1,Data 2
24, 7/8,A325,3 1/2,,1 Hardened,No,3,6
448, 7/8,A325,3 1/4,,1 Hardened,No,4,6
....................

Feb 15 '08 #6

hjnathan

My sample.csv file was not very well suited for the comparison of two fields, so I made up two field names which would contain some sort of data that could be compared - like numbers. The sample CSV file now looks like this:

Count,Dia.,Type,Length,Hd Wash,Nt Wash,TC,Data 1,Data 2
24, 7/8,A325,3 1/2,,1 Hardened,No,3,6
448, 7/8,A325,3 1/4,,1 Hardened,No,4,6
....................

So, those should be keys then...I'l play with it some more. Thanks!

Feb 15 '08 #7

hjnathan

So, those should be keys then...I'l play with it some more. Thanks!

Well I have spent a few hours trying to get the column comparison code to work.
Here is the error I get...

compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]

for i, item in enumerate(compList):
diff = item[0]-item[1]
if abs(diff) > 1:
print 'Record %d difference is %0.4f' % (i, diff)

Here is a bit of my data...

['product_order_qty', 'shipping_actual', 'PackageNotes', 'order_id', 'ShipToState', '**ProductLocation', 'shipping', 'product_price', 'ShipToAttn', 'number_boxes', 'product_ship_date', 'ship_method', 'ShipToCompany', 'product_name', 'customer_name']

26.66,30.31,649R2671,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:21:33 PM,0,4,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.22,649R2671,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:36 PM,0,3,1,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.91,649R2671,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:06:38 PM,0,4,1,CUNY School of Law /Queens College,Angela Joseph,NY,ALS-RI
26.66,30.31,649R2671,Ray Villalona (Ray),Product Card - Zero-Fee Wachovia Education Loan,100/pack,UPS Ground,12/12/2007 7:21:33 PM,0,2,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
13.99,13.39,649R2672,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:22:05 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:59 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:07:07 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,ALS-RI

...and my (your) code...

Expand|Select|Wrap|Line Numbers

 fn = 'newtest.csv'

f = open(fn)

# read first line of file

s = f.readline().strip()
 
# the first line contains the column headers

# use the column headers as dictionary keys

hdrList = s.split(',')

colList = [[] for i in range(len(hdrList))]

for line in f:

    for i, item in enumerate(line.strip().split(',')):

        colList[i].append(item)

f.close()
 
# create the dictionary

dd = dict(zip(hdrList, colList))

print dd.keys() 
 
# reorder the columnar data 

# in this case, the length of hdrList = 15

# column numbers = 1-15

# list indices begin with 0

neworder = [9,10,1,2,3,4,5,6,7,8,11,12,13,14,15]

newhdrList = [hdrList[i-1] for i in neworder]
 
# compile the reordered data

outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]

for key in newhdrList:

    for i, item in enumerate(dd[key]):

        outList[i+1].append(item)        
 
# print the data

# also suitable for writing to file

print '\n'.join([','.join([s for s in item]) for item in outList])

# print the sum of column 'Count'

print 'Total value of all orders = %d' % sum([float(i) for i in dd['product_price']])
 
compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]
 
for i, item in enumerate(compList):

    diff = item[0]-item[1]

    if abs(diff) > 1:

        print 'Record %d difference is %0.4f' % (i, diff)

I am not sure what I am doing wrong, but this is the key to being able to complete my project. I need to do several comparisons based on order number and date which will allow me to remove duplicate data from individual cells and flag shipment amounts that vary +- 20%.

If you could tell me why I am getting this error I would really appreciate it.

Thanks again for your assistance!!

Feb 16 '08 #8

bvdet

2,851

Expert Mod 2GB

Well I have spent a few hours trying to get the column comparison code to work.
Here is the error I get...

compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]

for i, item in enumerate(compList):
diff = item[0]-item[1]
if abs(diff) > 1:
print 'Record %d difference is %0.4f' % (i, diff)

Here is a bit of my data...

['product_order_qty', 'shipping_actual', 'PackageNotes', 'order_id', 'ShipToState', '**ProductLocation', 'shipping', 'product_price', 'ShipToAttn', 'number_boxes', 'product_ship_date', 'ship_method', 'ShipToCompany', 'product_name', 'customer_name']

26.66,30.31,649R2671,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:21:33 PM,0,4,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.22,649R2671,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:36 PM,0,3,1,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.91,649R2671,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:06:38 PM,0,4,1,CUNY School of Law /Queens College,Angela Joseph,NY,ALS-RI
26.66,30.31,649R2671,Ray Villalona (Ray),Product Card - Zero-Fee Wachovia Education Loan,100/pack,UPS Ground,12/12/2007 7:21:33 PM,0,2,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
13.99,13.39,649R2672,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:22:05 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:59 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:07:07 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,ALS-RI

...and my (your) code...

Expand|Select|Wrap|Line Numbers

fn = 'newtest.csv'

f = open(fn)

# read first line of file

s = f.readline().strip()

# the first line contains the column headers

# use the column headers as dictionary keys

hdrList = s.split(',')

colList = [[] for i in range(len(hdrList))]

for line in f:

    for i, item in enumerate(line.strip().split(',')):

        colList[i].append(item)

f.close()

# create the dictionary

dd = dict(zip(hdrList, colList))

print dd.keys()

# reorder the columnar data

# in this case, the length of hdrList = 15

# column numbers = 1-15

# list indices begin with 0

neworder = [9,10,1,2,3,4,5,6,7,8,11,12,13,14,15]

newhdrList = [hdrList[i-1] for i in neworder]

# compile the reordered data

outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]

for key in newhdrList:

    for i, item in enumerate(dd[key]):

        outList[i+1].append(item)

# print the data

# also suitable for writing to file

print '\n'.join([','.join([s for s in item]) for item in outList])

# print the sum of column 'Count'

print 'Total value of all orders = %d' % sum([float(i) for i in dd['product_price']])

compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]

for i, item in enumerate(compList):

    diff = item[0]-item[1]

    if abs(diff) > 1:

        print 'Record %d difference is %0.4f' % (i, diff)

I am not sure what I am doing wrong, but this is the key to being able to complete my project. I need to do several comparisons based on order number and date which will allow me to remove duplicate data from individual cells and flag shipment amounts that vary +- 20%.

If you could tell me why I am getting this error I would really appreciate it.

Thanks again for your assistance!!

I assume that your error occurred at the line of code shown. It would be helpful if you show the error message you received. The sample data you gave appears to be reordered with respect to the header labels. It would be helpful if you show part of the original csv file including the header line.

Feb 17 '08 #9

hjnathan

I assume that your error occurred at the line of code shown. It would be helpful if you show the error message you received. The sample data you gave appears to be reordered with respect to the header labels. It would be helpful if you show part of the original csv file including the header line.

I finally figured out the problem...it turns out it was a data problem. There were two cells in the shipping_actual column that were empty. Took a while but I learned how to inspect the stack during this exercise...:-) I still need to comprehend the error messages a little better. I'll keep working on it, though!
I'll show you what I come up with as I continue my project. Thanks again for the great headstart you gave me!!

Feb 17 '08 #10

bvdet

2,851

Expert Mod 2GB

I finally figured out the problem...it turns out it was a data problem. There were two cells in the shipping_actual column that were empty. Took a while but I learned how to inspect the stack during this exercise...:-) I still need to comprehend the error messages a little better. I'll keep working on it, though!
I'll show you what I come up with as I continue my project. Thanks again for the great headstart you gave me!!

I'm glad you were able to solve the problem. Thanks for the update.

Feb 17 '08 #11

by: invni | last post by:

I have a nested while. How do I go from the inner while to the beginning of the outer while? Can this be done without using goto? while_1() { some codes here while_2() { if true go to the...

C / C++

System crash on starting ASP .NET web application

by: satish.kishenchand | last post by:

Hi, I am just starting to learn ASP .NET programming and have been having a weird problem. When I run any ASP .NET program from Visual Studio 2003 my machine crashes. I am forced to do a...

ASP.NET

Recomended Starting Point

by: Brent Borovan | last post by:

Good day, We are at the beginning stages of developing a new product which is essentially a cutomized web application, where users will purchase access and log into our site to use the...

ASP.NET

sys.tables starting with #

by: Mark | last post by:

Hello, if there are tables with names starting with a # I can´t look into them, e.g. by doing "select * from tempdb.sys.tables". What is the reason for such tables, who creates them and how I...

Microsoft SQL Server

Can someone tell if Sitemap always require a starting root?

by: rlueneberg | last post by:

Can someone tell if Sitemap always require a starting root? Can it have more than one root? I am having a problem trying to hide a root node without hiding its children using a Treeview component....

ASP.NET

How to extract a string starting with 'abc' & ending with 'xyz' ?

by: Umesh | last post by:

I want to extract a string abc*xyz from a text file. * indicates arbitrary no. of characters. I'm only able to do it when the string has definite no. of characters or the string length is...

C / C++

Reducing complexity for starting a project

by: dataangel | last post by:

After spending a lot of time writing code in other languages, returning to C++ and starting a new project has been painful. It seems in order to get anything done you need to #include, add linker...

C / C++

String starting with ' won't concatenate

by: Gustaf | last post by:

Some error handling code: catch (System.Xml.XmlException e) { Console.WriteLine("Error in " + e.SourceUri + ": " + e.Message); return; } The output:

C# / C Sharp

error on starting outlook 2007

by: =?Utf-8?B?ZnJmYWxsYXM=?= | last post by:

Hello, I had an error starting outlook that says my mailbox.pst file is corrupted, I have runned the scanpst tool as te dialog box says, nothing happened, I created other data file named...

.NET Framework

EXE is very slow by starting (>10sec.) (build with PyInstaller)

by: Mark Delon | last post by:

Hi, My Python executable created with PyInstaller is too slow by starting... It takes about 15 secs.!!! I am using PyQt4 libraries for creating my very simple GUI-application. -How can I...

Python

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

Starting at the beginning...

Similar topics