
Faster search algorithm for lists

I have a file containing 10,000 words, which I read in as strings and store in a list. I have another list of strings, and I need to delete the words that appear in both lists. When I run the program it takes a long time. Please let me know if there is a faster way to do it.

For now, I am using:

for item in list_1[:]:        # iterate over a copy so removal is safe
    if item in list_2:        # linear scan of list_2 for every item
        list_1.remove(item)   # do the deletion
Jun 15 '18 #1
4 Replies


husoski
It's much faster to search a set than a list. Since that list doesn't seem to change during the program run, simply use
common_words = set(list_2)       # build the set once, after reading list_2 from the file

for word in list_1[:]:           # iterate over a copy so removal is safe
    if word in common_words:     # set membership is an O(1) hash lookup on average
        list_1.remove(word)      # do the deletion
If you are just making a reduced list, then you can do that without any explicit looping:

uncommon = [w for w in list_1 if w not in common_words]
That's a "list comprehension". It builds a list of every element of list_1 that is not an element of the common_words set. (Note that it filters list_1; filtering list_2 against a set built from list_2 itself would always produce an empty list.)
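
For a rough sense of the difference, here is a small, self-contained timing sketch; the vocabulary is randomly generated purely for illustration:

import random
import string
import timeit

# build an illustrative 10,000-word vocabulary
words = ["".join(random.choices(string.ascii_lowercase, k=8))
         for _ in range(10000)]
as_list = list(words)
as_set = set(words)

probe = words[-1]   # a word near the end, close to the list's worst case

print("list:", timeit.timeit(lambda: probe in as_list, number=1000))
print("set: ", timeit.timeit(lambda: probe in as_set, number=1000))

On a typical machine the set lookup comes out orders of magnitude faster, and the gap grows with the size of the list.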
Jun 15 '18 #2

Thanks a lot. I'll definitely try it out.
Jun 18 '18 #3

I would use the big list to build sub-lists, one sub-list for each starting letter, reducing the average search time by a factor of roughly 26. I don't know whether Python's built-in search does something like that or just loops through every element.
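
A minimal sketch of that idea, assuming non-empty words (the function names here are made up for illustration):

from collections import defaultdict

def build_buckets(words):
    # group words by first letter so a search only scans one bucket
    buckets = defaultdict(list)
    for w in words:
        buckets[w[0].lower()].append(w)
    return buckets

def bucket_contains(buckets, word):
    # scan only the bucket for this word's first letter
    return word in buckets.get(word[0].lower(), [])

Each lookup still does a linear scan within its bucket, so this is a constant-factor improvement at best; the set suggested above replaces the scan entirely.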
Jun 21 '18 #4

husoski
Searching a set is much faster than searching lists, even with the sublists you describe. A set lookup is a hash-table search, and even if the hashing algorithm is a horrible match for the actual key distribution (say nearly all of the words happen to hash to the same number, modulo the table size), that worst case degrades to a sequential search, which is what you would be doing anyway. Heads, you win every time; tails thirty times in a row, you lose a little.
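
That worst case is easy to force artificially. A small sketch, using a made-up str subclass whose instances all hash to the same value:

import timeit

class BadHash(str):
    # every instance lands in the same hash bucket: the set's worst case
    def __hash__(self):
        return 0

words = ["word%d" % i for i in range(2000)]
good_set = set(words)
bad_set = set(BadHash(w) for w in words)

probe_good = "word1999"
probe_bad = BadHash("word1999")

print("normal hash:", timeit.timeit(lambda: probe_good in good_set, number=1000))
print("worst case: ", timeit.timeit(lambda: probe_bad in bad_set, number=1000))

The degenerate set behaves like the sequential list search described above, while the normal one stays fast regardless of size.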
Jul 1 '18 #5
