Hey all,
Sorry the subject should have said..
"best way to match values in TWO tables"
I have two tables that I need to match based off an Unique ID in both tables. Im running this process using hadoop streaming with python, so the actual code is a bit different (ie using csv files to debug locally). I've tried a couple different methods and both are not quite fast enough... ha!
First was like this, using the two tables as lists and comparing the ID's -
reader = open("D:\\temp\\table1.csv",'r')
-
for line in reader:
-
line = line.strip()
-
TmpArr.append( line.split(',') )
-
reader.close()
-
-
reader = open("D:\\temp\\table2.csv",'r')
-
for line in reader:
-
line = line.strip()
-
Tmp2Arr = line.split(',')
-
For line2 in TmpArr:
-
If Tmp2Arr[0] == TmpArr[i][0]:
-
Do some stuff...
-
This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)
I also tried using a list and a dictionary -
reader = open("D:\\temp\\table1.csv",'r')
-
for line in reader:
-
line = line.strip()
-
TmpArr=line.split(',')
-
TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
-
reader.close()
-
-
reader = open("D:\\temp\\table2.csv",'r')
-
for line in reader:
-
line = line.strip()
-
Tmp2Arr = line.split(',')
-
for k,v in TmmDict.iteritems():
-
If Tmp2Arr[0] == k:
-
Do some stuff...
-
This is better, with a loop through table1 taking about 4 seconds...
This is still too long..
Any thoughts on how to best match my two ID's? I haven't tried using two dictionaries? would that improve performance?
Thanks ahead of time!
You want to do this the other way around, i.e, look up the items in the dictionary. I would also suggest that you strip() the keys before using them to get rid of any spaces. - reader = open("D:\\temp\\table2.csv",'r')
-
for line in reader:
-
line = line.strip()
-
Tmp2Arr = line.split(',')
-
If Tmp2Arr[0].strip() in TmpDict:
-
print TmpDict[Tmp2Arr[0]]
You can also use the intersection of 2 sets if you only want to identify the elements that are the same.
This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)
If a dictionary does not work for whatever reason, the next step up is an SQL database. SQLite comes with Python, so post back if you want some help in that area. - reader = open("D:\\temp\\table1.csv",'r')
-
for line in reader:
-
line = line.strip()
-
TmpArr=line.split(',')
-
##
-
## indentation error here in the code as posted (should be indented)
-
TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
-
##
-
## also a list is slightly faster
-
TmpDict[TmpArr[0]]=[TmpArr[1], TmpArr[5], TmpArr[6]]
-
## then you can just join them
-
## an example
-
test_list = ["one", "two", "three"]
-
print ",".join(test_list)
reader.close()
3 13875
You want to do this the other way around, i.e, look up the items in the dictionary. I would also suggest that you strip() the keys before using them to get rid of any spaces. - reader = open("D:\\temp\\table2.csv",'r')
-
for line in reader:
-
line = line.strip()
-
Tmp2Arr = line.split(',')
-
If Tmp2Arr[0].strip() in TmpDict:
-
print TmpDict[Tmp2Arr[0]]
You can also use the intersection of 2 sets if you only want to identify the elements that are the same.
This works but slow... and very memory intensive (my table1 is 500MB and table2 is extremely large)
If a dictionary does not work for whatever reason, the next step up is an SQL database. SQLite comes with Python, so post back if you want some help in that area. - reader = open("D:\\temp\\table1.csv",'r')
-
for line in reader:
-
line = line.strip()
-
TmpArr=line.split(',')
-
##
-
## indentation error here in the code as posted (should be indented)
-
TmpDict[TmpArr[0]]=TmpArr[1]+str(',')+TmpArr[5]+str(',')+TmpArr[6]
-
##
-
## also a list is slightly faster
-
TmpDict[TmpArr[0]]=[TmpArr[1], TmpArr[5], TmpArr[6]]
-
## then you can just join them
-
## an example
-
test_list = ["one", "two", "three"]
-
print ",".join(test_list)
reader.close()
wow, thanks dwblas!
That dramatically increased the search speed! Still alot to learn with python..
Very much appreciated!
Eric
If your query has been answered please mark as Best Answer the post you feel gave you the solution.
Mary
Sign in to post your reply or Sign up for a free account.
Similar topics
by: Daniel Pryde |
last post by:
Hi everyone.
I was wondering if anyone might be able to help me out here. I'm currently
looking to find the quickest way to find a best fit match in a large array.
My problem is that I have an...
|
by: Dan Cernat |
last post by:
Hi there,
A few threads I had a little chat about default values. I am starting this
thread because I want to hear more opinions about the default values of
function parameters. Some say they...
|
by: BPDudeMan |
last post by:
Hi There,
I've got one table that is constantly being added to (every few
seconds). I've got a bunch of users that need to report from this
table. What's the best way to setup the tables? ...
|
by: fariba123 |
last post by:
there are two arrays:
$players = array(
"23" => "Michael Jordan",
"32" => "Michael Johnson"
);
$current_player = array(
"player_id" => "23",
"age" => "22"
|
by: chaitanya02 |
last post by:
Hi All,
Could we write a stored Procedure in asp connected to an Access db.
I have four table sin Ms Access
sql1-account id, account name
sql2-account id, email id
sql3- account id, contract...
|
by: RiverstoneJebin |
last post by:
Hi,
I have 2 tables with same fields like event, venue, date and cost. Table 1 has 5 records and Table 2 has 100 records. How can i match a record from Table 1 with Table 2 to get the exact match...
|
by: mkjame |
last post by:
Hi everyone!
I am actually coding a php script and I want to return all the values that doesn't have a relation between two tables.
Basically I have two tables, one is called users, the other...
|
by: ankitmathur |
last post by:
Hi,
I'm trying to overcome a situation whereby I have to search through 4-5 columns and produce the results with an order by according to the values matched in these columns.
Example:
My...
|
by: rsrinivasan |
last post by:
Hi all,
I am using SQLServer2000. I have two same tables. I want a query to check whether these two tables have same data. How can i check it?
Thanks,
|
by: mfaisalwarraich |
last post by:
Hello,
I am trying to match the values of a combo box which are typed inside a combo box with the comb box items, which are bind to a datasource. i can see the items in the combo box and if i get...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: BarryA |
last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: Hystou |
last post by:
There are some requirements for setting up RAID:
1. The motherboard and BIOS support RAID configuration.
2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers,...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
| |