473,408 Members | 2,813 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,408 software developers and data experts.

merging excel files - but a bit trickier than usual

4
Advanced thanks for any helping. I'm running Python on a Mac OS X.

Here's the basic situation. A single group of people had various health measurements performed on them over the course of a few decades. But individuals dropped out of the study between examinations. For instance, for the first exam there were 3,000 individuals, then for the second exam (a few years later), there were 2,500, and so on. By the last exam, exam 26, there were just a few hundred individuals of the initial group left. No individuals were added to the study at any point.

The records from each exam are saved as separate Excel files. The number of rows in the file depends upon how many patients there were for that exam - e.g., the file for exam 1 has 3,000 rows; for exam 2, 2,500; for exam 26, 200. The rows correspond to the number of subjects because each subject was given a patient ID number, and they were put into the last column of the file. So each row has one subject's health records, with the ID at the end of the row indicating which patient it is.

Basically, what happened is that if a patient wasn't at a particular exam, their patient ID was left off of that file. They did not keep the same number of rows in each file, so the files get shorter and shorter as people dropped out. Each column in the file is either the patient ID column or is a variable that was measured - e.g., blood pressure, cholesterol, etc.

I want to create a masterfile that has 3,000 rows (for the 3,000 original patient ID numbers) and has information from specified columns for all of the exams. Specifically, cholesterol is recorded in a single column in every exam. I want to create a masterfile that has 3,000 rows and 27 columns (records for 26 exams, plus one column for the patient ID column) and call it "cholesterol_masterfile". If a subject wasn't at a later examination, I just want to put in "was_not_here" in those cells for which there are no records. I hope this is clear so far.

Now, the first exam has the 3,000 rows, and I just deleted all the columns that weren't cholesterol records. So it has 3,000 rows and two columns - cholesterol, plus patient ID. So my task comes in two steps. First, I need to import columns from later exams into this first Excel file. But also - and this is trickier - I need to match up the patient ID numbers so that the information gets put in the right row. So for instance, in the fourth examination, we can imagine that there was a subject with patient ID 45554. I want the information in the choleterol column - say, column 5 - of this row and I want to put it in the row in the first examination that is for patient ID 45554. And I want to do this for all of the patients. I figured when this is all done, I can loop through it and fill in blank cells with "was_not_there".

Don't hesitate to ask for clarification if this doesn't make sense.
Jan 13 '10 #1
1 2487
bvdet
2,851 Expert Mod 2GB
The experts on this site are more than happy to help you with your problems but they cannot write your program for you. Attempt the program yourself first and post questions regarding any difficulties you have or about a particular function of the code that you don't know how to achieve.

Please read the Posting Guidelines and particularly the Coursework Posting Guidelines.

Having said that, I would approach the problem in this manner:
Initialize an empty dictionary to hold all data
Iterate on the Excel file names
Use module xlrd to create an xlrd.Book instance of the current file
Create an xlrd.sheet.Sheet instance from the xlrd.Book instance work sheet that contains the data
Iterate on the xlrd.sheet.Sheet rows
Get the current row values
Assign to the dictionary the subject ID as the key and a list of the values that you want to tabulate

Example iteration on the current workbook sheet rows where the ID is at index 0 and the data is at index 2:
Expand|Select|Wrap|Line Numbers
  1. dd = {}
  2. for j, file_name in enumerate(list_of_file_names):
  3.     wb = xlrd.open_workbook(file_name)
  4.     sheet1 = wb.sheet_by_name(u'Name of Worksheet')
  5.  
  6.     for i in range(sheet1.nrows):
  7.         rowList = sheet1.row_values(i)
  8.         # skip rows with no ID
  9.         if str(rowList[0]).strip():
  10.             dd.setdefault(str(rowList[0]), []).append(str(rowList[2]))
  11.  
Assuming the first file_name has all 3000 ID's, if an individual dd[ID#] does not have a listing in the current file, pad the value list with string "was_not_there". You may be able to use index value j for that task.

After iterating on list_of_file_names, you should have all the data to create a CSV file.

Example:
Expand|Select|Wrap|Line Numbers
  1. keys = dd.keys()
  2. keys.sort()
  3. result = "\n".join(["%s,%s" % (key, ",".join(dd[key])) for key in keys])
Then it's a matter of writing result to disk.

BV - Moderator
Jan 13 '10 #2

Sign in to post your reply or Sign up for a free account.

Similar topics

0
by: Jack | last post by:
Windows 2K Pro Access/Excel 2003 Hi there, I have a large number of password protected Excel Workbooks. The files are protected by code that runs when the workbook opens. The code asks the...
0
by: steve | last post by:
Hi there, I am trying to import data from 2 dbf files into excel using the 'get external data' option which launches ms query. Ultimately I am merging data with a right join statement. I can...
3
by: Damian Arntzen | last post by:
I'm a beginner to moderate programmer who's fiddling around with automating Excel, in particular after being able to have the user fill out a form and it then generate the workbook. I can't quite...
1
by: adamrace | last post by:
Hi, I've got two excel files, one has a list of products and their current prices and they all have a product ID, I have another file with a list of price's that need updating. I was wondering...
10
by: Ben | last post by:
Hi, i have a weird problem and i don't know who is responsible for this: IIS, excel or asp.net. My problem: we use an asp.net 2.0 application under IIS 6.0 (server 2003 sp2) which must write...
4
by: jack | last post by:
Hi. My project requires to upload the excel file and populate it into the database. By doing so it should also check whether the excel file selected is in the right format. for example if the...
3
by: =?iso-8859-1?Q?Tine_M=FCller?= | last post by:
On this site http://www.tinemuller.dk/sikkerbyg/ the dropdown is functioning the way it should for the different part in Denmark but now I want to show the same informations for the hole Denmark...
5
by: Albert-jan Roskam | last post by:
Hi, I wrote the program below to merge all xls files in a given directory into one multisheet xls file. It uses xlwt and xlrd. The xls files I use for input are generated by Spss. When I open and...
0
by: Albert-jan Roskam | last post by:
Hi John, Thanks! Using a higher xlrd version did the trick! Regarding your other remarks: -yep, input files with multiple sheets don't work yet. I kinda repressed that ;-) Spss outputs only...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.