473,387 Members | 1,504 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

Starting at the beginning...

Hello...

I m very new to programming but not to computers. I am trying to expand my skills beyond what I do in Excel. I started working with Python about a month ago. I feel things are close to clicking for me but I need a bit of a nudge from an expert to get me over the hump.

I am not looking for help writing a script per se, but actually looking for insight on how to approach my project. I have been reading about tuples and lists and pickles and shelves and I find myself a bit perplexed on which way to proceed.

The project starts with a .csv file that I download from a website. There are 14 columns of data and roughly 1500 rows or records.

Initially, I need to read the file, reorder the columns, strip some characters from one column, compare the values of 2 different fields and highlight a deviation of greater than 20%, sum the value of one column and print revised records to a file.

I have been playing around with and have created pickle files and shelve files but have had problems visualizing the order in which to work this out...do I even want to go that route or should I just be looking at re, maybe???

Your thoughts would be appreciated!!
Feb 14 '08 #1
10 1255
bvdet
2,851 Expert Mod 2GB
Python built-in module csv may be ideal for your application. Instead of that, the approach to this partly depends on the data itself. If your data has a header line, you can read all the data into a dictionary using the column headers as keys. I have a simple example. This will not work if any of your data has embedded commas (the data with embedded commas must be enclosed in quotes). Following is the sample file:
Expand|Select|Wrap|Line Numbers
  1. Count,Dia.,Type,Length,Hd Wash,Nt Wash,TC
  2. 24, 7/8,A325,3 1/2,,1 Hardened,No
  3. 448, 7/8,A325,3 1/4,,1 Hardened,No
  4. 484, 7/8,A325,3,,1 Hardened,No
  5. 508, 7/8,A325,2 3/4,,1 Hardened,No
  6. 243, 7/8,A325,2 1/2,,1 Hardened,No
  7. 479, 7/8,A325,2 1/4,,1 Hardened,No
  8. 2632, 7/8,A325,2,,1 Hardened,No
  9. 3378, 7/8,A325,1 3/4,,1 Hardened,No
  10. 8, 3/4,A325,2,,1 Hardened,No
  11. 8, 7/8,A307,11,,,No
  12. 24, 7/8,A307,9,,,No
  13. 36, 7/8,A307,7,,,No
The following code parses and compiles the data so it can be manipulated:
Expand|Select|Wrap|Line Numbers
  1. fn = 'sample.csv'
  2. f = open(fn)
  3. # read first line of file
  4. s = f.readline().strip()
  5.  
  6. # the first line contains the column headers
  7. # use the column headers as dictionary keys
  8. hdrList = s.split(',')
  9. colList = [[] for i in range(len(hdrList))]
  10. for line in f:
  11.     for i, item in enumerate(line.strip().split(',')):
  12.         colList[i].append(item)
  13. f.close()
  14.  
  15. # create the dictionary
  16. dd = dict(zip(hdrList, colList))
  17.  
  18. # reorder the columnar data 
  19. # in this case, the length of hdrList = 7
  20. # column numbers = 1-7
  21. # list indices begin with 0
  22. neworder = [1,3,5,2,4,7]
  23. newhdrList = [hdrList[i-1] for i in neworder]
  24.  
  25. # compile the reordered data
  26. outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]
  27. for key in newhdrList:
  28.     for i, item in enumerate(dd[key]):
  29.         outList[i+1].append(item)
  30.  
  31. # print the data
  32. # also suitable for writing to file
  33. print '\n'.join([','.join([s for s in item]) for item in outList])
  34. # print the sum of column 'Count'
  35. print 'Total number of bolts = %d' % sum([int(i) for i in dd['Count']])
Printed output:
Expand|Select|Wrap|Line Numbers
  1. >>> Count,Type,Hd Wash,Dia.,Length,TC
  2. 24,A325,, 7/8,3 1/2,No
  3. 448,A325,, 7/8,3 1/4,No
  4. 484,A325,, 7/8,3,No
  5. 508,A325,, 7/8,2 3/4,No
  6. 243,A325,, 7/8,2 1/2,No
  7. 479,A325,, 7/8,2 1/4,No
  8. 2632,A325,, 7/8,2,No
  9. 3378,A325,, 7/8,1 3/4,No
  10. 8,A325,, 3/4,2,No
  11. 8,A307,, 7/8,11,No
  12. 24,A307,, 7/8,9,No
  13. 36,A307,, 7/8,7,No
  14. Total number of bolts = 8272
  15. >>> 
HTH
Feb 14 '08 #2
Thanks for the quick response!

Breaking this up by columns was definitely a different direction than I was seeing...I was focused on the rows...

It took me a while to reply back because I tried it out with my data, not to mention I'm at work now. It works great! I'm going to spend a little time getting more comfortable with what you sent and try to add to it. I saw your column total and had to comment it out until I figure out the adaptation for my data.

I have one question, though...being that the data is in a dictionary, when I compare fields I'll be referencing the dictionary keys in all formulas...correct?

I do believe this may have been the nudge I needed...thanks very much!!
Feb 15 '08 #3
bvdet
2,851 Expert Mod 2GB
Thanks for the quick response!

Breaking this up by columns was definitely a different direction than I was seeing...I was focused on the rows...

It took me a while to reply back because I tried it out with my data, not to mention I'm at work now. It works great! I'm going to spend a little time getting more comfortable with what you sent and try to add to it. I saw your column total and had to comment it out until I figure out the adaptation for my data.

I have one question, though...being that the data is in a dictionary, when I compare fields I'll be referencing the dictionary keys in all formulas...correct?

I do believe this may have been the nudge I needed...thanks very much!!
You are very welcome. I am pleased hearing from someone with your enthusiasm.

Let's say you want to display the difference between column 'Data 1' and column 'Data 3' if it exceeds 1.0, and both columns contain floating point numbers.
Expand|Select|Wrap|Line Numbers
  1. compList = [[float(item), float(dd['Data 1'][i])] for i, item in enumerate(dd['Data 2'])]
  2.  
  3. for i, item in enumerate(compList):
  4.     diff = item[0]-item[1]
  5.     if abs(diff) > 1:
  6.         print 'Record %d difference is %0.4f' % (i, diff)
Feb 15 '08 #4
You are very welcome. I am pleased hearing from someone with your enthusiasm.

Let's say you want to display the difference between column 'Data 1' and column 'Data 3' if it exceeds 1.0, and both columns contain floating point numbers.
Expand|Select|Wrap|Line Numbers
  1. compList = [[float(item), float(dd['Data 1'][i])] for i, item in enumerate(dd['Data 2'])]
  2.  
  3. for i, item in enumerate(compList):
  4.     diff = item[0]-item[1]
  5.     if abs(diff) > 1:
  6.         print 'Record %d difference is %0.4f' % (i, diff)
Okay...I am back at it this morning. I understand how you created the dictionary and how to reorder the columns so far, but I am having a little trouble figuring out the "Comparison" code. In your code where you are defining compList, what list does 'data 1' and 'data 2' represent? I understand that the list is part of the dictionary named "dd" and I have tried replacing those entries with key names and key position numbers with no success. Obviously I am not quite understanding what I'm looking at...??
Feb 15 '08 #5
bvdet
2,851 Expert Mod 2GB
Okay...I am back at it this morning. I understand how you created the dictionary and how to reorder the columns so far, but I am having a little trouble figuring out the "Comparison" code. In your code where you are defining compList, what list does 'data 1' and 'data 2' represent? I understand that the list is part of the dictionary named "dd" and I have tried replacing those entries with key names and key position numbers with no success. Obviously I am not quite understanding what I'm looking at...??
My sample.csv file was not very well suited for the comparison of two fields, so I made up two field names which would contain some sort of data that could be compared - like numbers. The sample CSV file now looks like this:

Count,Dia.,Type,Length,Hd Wash,Nt Wash,TC,Data 1,Data 2
24, 7/8,A325,3 1/2,,1 Hardened,No,3,6
448, 7/8,A325,3 1/4,,1 Hardened,No,4,6
....................
Feb 15 '08 #6
My sample.csv file was not very well suited for the comparison of two fields, so I made up two field names which would contain some sort of data that could be compared - like numbers. The sample CSV file now looks like this:

Count,Dia.,Type,Length,Hd Wash,Nt Wash,TC,Data 1,Data 2
24, 7/8,A325,3 1/2,,1 Hardened,No,3,6
448, 7/8,A325,3 1/4,,1 Hardened,No,4,6
....................
So, those should be keys then...I'l play with it some more. Thanks!
Feb 15 '08 #7
So, those should be keys then...I'l play with it some more. Thanks!
Well I have spent a few hours trying to get the column comparison code to work.
Here is the error I get...

compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]

for i, item in enumerate(compList):
diff = item[0]-item[1]
if abs(diff) > 1:
print 'Record %d difference is %0.4f' % (i, diff)


Here is a bit of my data...

['product_order_qty', 'shipping_actual', 'PackageNotes', 'order_id', 'ShipToState', '**ProductLocation', 'shipping', 'product_price', 'ShipToAttn', 'number_boxes', 'product_ship_date', 'ship_method', 'ShipToCompany', 'product_name', 'customer_name']

26.66,30.31,649R2671,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:21:33 PM,0,4,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.22,649R2671,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:36 PM,0,3,1,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.91,649R2671,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:06:38 PM,0,4,1,CUNY School of Law /Queens College,Angela Joseph,NY,ALS-RI
26.66,30.31,649R2671,Ray Villalona (Ray),Product Card - Zero-Fee Wachovia Education Loan,100/pack,UPS Ground,12/12/2007 7:21:33 PM,0,2,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
13.99,13.39,649R2672,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:22:05 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:59 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:07:07 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,ALS-RI


...and my (your) code...

Expand|Select|Wrap|Line Numbers
  1. fn = 'newtest.csv'
  2. f = open(fn)
  3. # read first line of file
  4. s = f.readline().strip()
  5.  
  6. # the first line contains the column headers
  7. # use the column headers as dictionary keys
  8. hdrList = s.split(',')
  9. colList = [[] for i in range(len(hdrList))]
  10. for line in f:
  11.     for i, item in enumerate(line.strip().split(',')):
  12.         colList[i].append(item)
  13. f.close()
  14.  
  15. # create the dictionary
  16. dd = dict(zip(hdrList, colList))
  17. print dd.keys() 
  18.  
  19.  
  20. # reorder the columnar data 
  21. # in this case, the length of hdrList = 15
  22. # column numbers = 1-15
  23. # list indices begin with 0
  24. neworder = [9,10,1,2,3,4,5,6,7,8,11,12,13,14,15]
  25. newhdrList = [hdrList[i-1] for i in neworder]
  26.  
  27. # compile the reordered data
  28. outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]
  29. for key in newhdrList:
  30.     for i, item in enumerate(dd[key]):
  31.         outList[i+1].append(item)        
  32.  
  33. # print the data
  34. # also suitable for writing to file
  35. print '\n'.join([','.join([s for s in item]) for item in outList])
  36. # print the sum of column 'Count'
  37. print 'Total value of all orders = %d' % sum([float(i) for i in dd['product_price']])
  38.  
  39. compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]
  40.  
  41. for i, item in enumerate(compList):
  42.     diff = item[0]-item[1]
  43.     if abs(diff) > 1:
  44.         print 'Record %d difference is %0.4f' % (i, diff)

I am not sure what I am doing wrong, but this is the key to being able to complete my project. I need to do several comparisons based on order number and date which will allow me to remove duplicate data from individual cells and flag shipment amounts that vary +- 20%.

If you could tell me why I am getting this error I would really appreciate it.

Thanks again for your assistance!!
Feb 16 '08 #8
bvdet
2,851 Expert Mod 2GB
Well I have spent a few hours trying to get the column comparison code to work.
Here is the error I get...

compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]

for i, item in enumerate(compList):
diff = item[0]-item[1]
if abs(diff) > 1:
print 'Record %d difference is %0.4f' % (i, diff)


Here is a bit of my data...

['product_order_qty', 'shipping_actual', 'PackageNotes', 'order_id', 'ShipToState', '**ProductLocation', 'shipping', 'product_price', 'ShipToAttn', 'number_boxes', 'product_ship_date', 'ship_method', 'ShipToCompany', 'product_name', 'customer_name']

26.66,30.31,649R2671,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:21:33 PM,0,4,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.22,649R2671,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:36 PM,0,3,1,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
26.66,6.91,649R2671,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:06:38 PM,0,4,1,CUNY School of Law /Queens College,Angela Joseph,NY,ALS-RI
26.66,30.31,649R2671,Ray Villalona (Ray),Product Card - Zero-Fee Wachovia Education Loan,100/pack,UPS Ground,12/12/2007 7:21:33 PM,0,2,2,CUNY School of Law /Queens College,Angela Joseph,NY,RAA
13.99,13.39,649R2672,Ray Villalona (Ray),College Ready Magazine,35/pack,UPS Ground,12/12/2007 7:22:05 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Product Card,100/pack,UPS Ground,1/31/2008 6:30:59 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,RAA
13.99,6.22,649R2672,Ray Villalona (Ray),Past Due Balance Quick Guide,50/pack,UPS Ground,1/15/2008 5:07:07 PM,0,2,1,SUNY Downstate Medical Center,Ken Rodell,NY,ALS-RI


...and my (your) code...

Expand|Select|Wrap|Line Numbers
  1. fn = 'newtest.csv'
  2. f = open(fn)
  3. # read first line of file
  4. s = f.readline().strip()
  5.  
  6. # the first line contains the column headers
  7. # use the column headers as dictionary keys
  8. hdrList = s.split(',')
  9. colList = [[] for i in range(len(hdrList))]
  10. for line in f:
  11.     for i, item in enumerate(line.strip().split(',')):
  12.         colList[i].append(item)
  13. f.close()
  14.  
  15. # create the dictionary
  16. dd = dict(zip(hdrList, colList))
  17. print dd.keys() 
  18.  
  19.  
  20. # reorder the columnar data 
  21. # in this case, the length of hdrList = 15
  22. # column numbers = 1-15
  23. # list indices begin with 0
  24. neworder = [9,10,1,2,3,4,5,6,7,8,11,12,13,14,15]
  25. newhdrList = [hdrList[i-1] for i in neworder]
  26.  
  27. # compile the reordered data
  28. outList = [newhdrList[:]] + [[] for i in range(len(dd.values()[0]))]
  29. for key in newhdrList:
  30.     for i, item in enumerate(dd[key]):
  31.         outList[i+1].append(item)        
  32.  
  33. # print the data
  34. # also suitable for writing to file
  35. print '\n'.join([','.join([s for s in item]) for item in outList])
  36. # print the sum of column 'Count'
  37. print 'Total value of all orders = %d' % sum([float(i) for i in dd['product_price']])
  38.  
  39. compList = [[float(item), float(dd['shipping_actual'][i])] for i, item in enumerate(dd['shipping'])]
  40.  
  41. for i, item in enumerate(compList):
  42.     diff = item[0]-item[1]
  43.     if abs(diff) > 1:
  44.         print 'Record %d difference is %0.4f' % (i, diff)

I am not sure what I am doing wrong, but this is the key to being able to complete my project. I need to do several comparisons based on order number and date which will allow me to remove duplicate data from individual cells and flag shipment amounts that vary +- 20%.

If you could tell me why I am getting this error I would really appreciate it.

Thanks again for your assistance!!
I assume that your error occurred at the line of code shown. It would be helpful if you show the error message you received. The sample data you gave appears to be reordered with respect to the header labels. It would be helpful if you show part of the original csv file including the header line.
Feb 17 '08 #9
I assume that your error occurred at the line of code shown. It would be helpful if you show the error message you received. The sample data you gave appears to be reordered with respect to the header labels. It would be helpful if you show part of the original csv file including the header line.
I finally figured out the problem...it turns out it was a data problem. There were two cells in the shipping_actual column that were empty. Took a while but I learned how to inspect the stack during this exercise...:-) I still need to comprehend the error messages a little better. I'll keep working on it, though!
I'll show you what I come up with as I continue my project. Thanks again for the great headstart you gave me!!
Feb 17 '08 #10
bvdet
2,851 Expert Mod 2GB
I finally figured out the problem...it turns out it was a data problem. There were two cells in the shipping_actual column that were empty. Took a while but I learned how to inspect the stack during this exercise...:-) I still need to comprehend the error messages a little better. I'll keep working on it, though!
I'll show you what I come up with as I continue my project. Thanks again for the great headstart you gave me!!
I'm glad you were able to solve the problem. Thanks for the update.
Feb 17 '08 #11

Sign in to post your reply or Sign up for a free account.

Similar topics

36
by: invni | last post by:
I have a nested while. How do I go from the inner while to the beginning of the outer while? Can this be done without using goto? while_1() { some codes here while_2() { if true go to the...
2
by: satish.kishenchand | last post by:
Hi, I am just starting to learn ASP .NET programming and have been having a weird problem. When I run any ASP .NET program from Visual Studio 2003 my machine crashes. I am forced to do a...
2
by: Brent Borovan | last post by:
Good day, We are at the beginning stages of developing a new product which is essentially a cutomized web application, where users will purchase access and log into our site to use the...
1
by: Mark | last post by:
Hello, if there are tables with names starting with a # I canīt look into them, e.g. by doing "select * from tempdb.sys.tables". What is the reason for such tables, who creates them and how I...
6
by: rlueneberg | last post by:
Can someone tell if Sitemap always require a starting root? Can it have more than one root? I am having a problem trying to hide a root node without hiding its children using a Treeview component....
34
by: Umesh | last post by:
I want to extract a string abc*xyz from a text file. * indicates arbitrary no. of characters. I'm only able to do it when the string has definite no. of characters or the string length is...
7
by: dataangel | last post by:
After spending a lot of time writing code in other languages, returning to C++ and starting a new project has been painful. It seems in order to get anything done you need to #include, add linker...
11
by: Gustaf | last post by:
Some error handling code: catch (System.Xml.XmlException e) { Console.WriteLine("Error in " + e.SourceUri + ": " + e.Message); return; } The output:
0
by: =?Utf-8?B?ZnJmYWxsYXM=?= | last post by:
Hello, I had an error starting outlook that says my mailbox.pst file is corrupted, I have runned the scanpst tool as te dialog box says, nothing happened, I created other data file named...
2
by: Mark Delon | last post by:
Hi, My Python executable created with PyInstaller is too slow by starting... It takes about 15 secs.!!! I am using PyQt4 libraries for creating my very simple GUI-application. -How can I...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.