473,403 Members | 2,284 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,403 software developers and data experts.

best way to match values in two tables...

58
Hey all,
Sorry the subject should have said..
"best way to match values in TWO tables"
I have two tables that I need to match based off an Unique ID in both tables. Im running this process using hadoop streaming with python, so the actual code is a bit different (ie using csv files to debug locally). I've tried a couple different methods and both are not quite fast enough... ha!
First was like this, using the two tables as lists and comparing the ID's

Expand|Select|Wrap|Line Numbers
  1. reader = open("D:\\temp\\table1.csv",'r')
  2. for line in reader:
  3.     line = line.strip()
  4.     TmpArr.append( line.split(',') )    
  5. reader.close()
  6.  
  7. reader = open("D:\\temp\\table2.csv",'r')
  8. for line in reader:
  9.     line = line.strip()
  10.     Tmp2Arr = line.split(',')
  11.     For line2 in TmpArr:
  12.         If Tmp2Arr[0] == TmpArr[i][0]:
  13.             Do some stuff...
  14.  
This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)

I also tried using a list and a dictionary
Expand|Select|Wrap|Line Numbers
  1. reader = open("D:\\temp\\table1.csv",'r')
  2. for line in reader:
  3.     line = line.strip()
  4.     TmpArr=line.split(',')
  5. TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
  6. reader.close()
  7.  
  8. reader = open("D:\\temp\\table2.csv",'r')
  9. for line in reader:
  10.     line = line.strip()
  11.     Tmp2Arr = line.split(',')
  12.     for k,v in TmmDict.iteritems():
  13.         If Tmp2Arr[0] == k:
  14.             Do some stuff...
  15.  
This is better, with a loop through table1 taking about 4 seconds...
This is still too long..
Any thoughts on how to best match my two ID's? I haven't tried using two dictionaries? would that improve performance?
Thanks ahead of time!
Aug 3 '10 #1

✓ answered by dwblas

You want to do this the other way around, i.e, look up the items in the dictionary. I would also suggest that you strip() the keys before using them to get rid of any spaces.
Expand|Select|Wrap|Line Numbers
  1. reader = open("D:\\temp\\table2.csv",'r')
  2. for line in reader:
  3.     line = line.strip()
  4.     Tmp2Arr = line.split(',')
  5.     If Tmp2Arr[0].strip() in TmpDict:
  6.         print TmpDict[Tmp2Arr[0]] 
You can also use the intersection of 2 sets if you only want to identify the elements that are the same.
This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)
If a dictionary does not work for whatever reason, the next step up is an SQL database. SQLite comes with Python, so post back if you want some help in that area.
Expand|Select|Wrap|Line Numbers
  1. reader = open("D:\\temp\\table1.csv",'r')
  2. for line in reader:
  3.     line = line.strip()
  4.     TmpArr=line.split(',')
  5.     ##
  6.     ## indentation error here in the code as posted (should be indented)
  7.     TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
  8. ##
  9. ## also a list is slightly faster
  10.     TmpDict[TmpArr[0]]=[TmpArr[1], TmpArr[5], TmpArr[6]]
  11. ## then you can just join them
  12. ## an example
  13. test_list = ["one", "two", "three"]
  14. print ",".join(test_list) 

reader.close()

3 13875
dwblas
626 Expert 512MB
You want to do this the other way around, i.e, look up the items in the dictionary. I would also suggest that you strip() the keys before using them to get rid of any spaces.
Expand|Select|Wrap|Line Numbers
  1. reader = open("D:\\temp\\table2.csv",'r')
  2. for line in reader:
  3.     line = line.strip()
  4.     Tmp2Arr = line.split(',')
  5.     If Tmp2Arr[0].strip() in TmpDict:
  6.         print TmpDict[Tmp2Arr[0]] 
You can also use the intersection of 2 sets if you only want to identify the elements that are the same.
This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)
If a dictionary does not work for whatever reason, the next step up is an SQL database. SQLite comes with Python, so post back if you want some help in that area.
Expand|Select|Wrap|Line Numbers
  1. reader = open("D:\\temp\\table1.csv",'r')
  2. for line in reader:
  3.     line = line.strip()
  4.     TmpArr=line.split(',')
  5.     ##
  6.     ## indentation error here in the code as posted (should be indented)
  7.     TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
  8. ##
  9. ## also a list is slightly faster
  10.     TmpDict[TmpArr[0]]=[TmpArr[1], TmpArr[5], TmpArr[6]]
  11. ## then you can just join them
  12. ## an example
  13. test_list = ["one", "two", "three"]
  14. print ",".join(test_list) 

reader.close()
Aug 3 '10 #2
erbrose
58
wow, thanks dwblas!
That dramatically increased the search speed! Still alot to learn with python..
Very much appreciated!
Eric
Aug 3 '10 #3
MMcCarthy
14,534 Expert Mod 8TB
If your query has been answered please mark as Best Answer the post you feel gave you the solution.

Mary
Aug 4 '10 #4

Sign in to post your reply or Sign up for a free account.

Similar topics

5
by: Daniel Pryde | last post by:
Hi everyone. I was wondering if anyone might be able to help me out here. I'm currently looking to find the quickest way to find a best fit match in a large array. My problem is that I have an...
18
by: Dan Cernat | last post by:
Hi there, A few threads I had a little chat about default values. I am starting this thread because I want to hear more opinions about the default values of function parameters. Some say they...
3
by: BPDudeMan | last post by:
Hi There, I've got one table that is constantly being added to (every few seconds). I've got a bunch of users that need to report from this table. What's the best way to setup the tables? ...
0
by: fariba123 | last post by:
there are two arrays: $players = array( "23" => "Michael Jordan", "32" => "Michael Johnson" ); $current_player = array( "player_id" => "23", "age" => "22"
0
by: chaitanya02 | last post by:
Hi All, Could we write a stored Procedure in asp connected to an Access db. I have four table sin Ms Access sql1-account id, account name sql2-account id, email id sql3- account id, contract...
2
by: RiverstoneJebin | last post by:
Hi, I have 2 tables with same fields like event, venue, date and cost. Table 1 has 5 records and Table 2 has 100 records. How can i match a record from Table 1 with Table 2 to get the exact match...
3
by: mkjame | last post by:
Hi everyone! I am actually coding a php script and I want to return all the values that doesn't have a relation between two tables. Basically I have two tables, one is called users, the other...
11
by: ankitmathur | last post by:
Hi, I'm trying to overcome a situation whereby I have to search through 4-5 columns and produce the results with an order by according to the values matched in these columns. Example: My...
3
rsrinivasan
by: rsrinivasan | last post by:
Hi all, I am using SQLServer2000. I have two same tables. I want a query to check whether these two tables have same data. How can i check it? Thanks,
1
by: mfaisalwarraich | last post by:
Hello, I am trying to match the values of a combo box which are typed inside a combo box with the comb box items, which are bind to a datasource. i can see the items in the combo box and if i get...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.