
append docname & linenumber to dic-value

Hey guys,
I am comparing two documents: if a word occurs in both documents, it is added as a new key to a dictionary.
As the dictionary value, I would like to store the document's name and the line number the word was found on.
Here is what I have so far, with comments:

    dic = {}
    def matchtermer():
        f3 = open('korpus/avis.txt')
        f4 = open("ordliste_output_kort.txt")
        text3 = f3.read()
        text4 = f4.read()
        ordliste2 = text3.split()
        ordliste3 = text4.split()
        wordlist2 = []

        for word1 in ordliste2: #this part removes end characters that aren't part of the word and makes all lowercase
            # last character of each word
            lastchar = word1[-1:]
            # use a list of punctuation marks
            if lastchar in [",", ".", "!", "?", ";"]:
                word2 = word1.rstrip(lastchar)
            else:
                word2 = word1
            # build a wordList of lower case modified words
            wordlist2.append(word2.lower())

        for word in wordlist2: # and finally this compares the two documents
            if word in ordliste3:
                if word not in dic.keys():
                    dic[word]=[]  #if word not in dic, create it
                #dic[word].append(docname, linenumber) - this is what I want to do - obviously this does not work
        return dic
Dec 17 '07 #1
bvdet
I think this will do it:
    import re
    import string

    def wordList(words):
        patt = re.compile(r'\d+')
        # eliminate words with digits, strip punctuation and whitespace, lowercase
        word_list = [word.strip().strip(string.punctuation).lower()
                     for word in words.split() if not patt.search(word)]
        # eliminate blank words
        return [word for word in word_list if word != '']

    def matchtermer(fn1, fn2):
        dd = {}
        # file to compare against
        with open(fn1) as f:
            f1 = f.read()
        # file to compare
        with open(fn2) as f:
            f2 = f.readlines()
        word_list = wordList(f1)
        for i, line in enumerate(f2):
            for word in line.split():
                word = word.strip().strip(string.punctuation).lower()
                if word in word_list:
                    # store (document name, line number) under the matched word
                    dd.setdefault(word, []).append((fn2, i+1))
        return dd
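
The `dd.setdefault(word, []).append(...)` line takes the place of the `if word not in dic.keys()` check from the question. Here is a minimal sketch of that idiom on its own. The `hits` data and the `index` name are made up for illustration and are not part of the original code:

```python
# Hypothetical (word, document name, line number) hits, made up for illustration.
hits = [
    ("avis", "doc_a.txt", 1),
    ("ord", "doc_a.txt", 1),
    ("avis", "doc_a.txt", 3),
]

index = {}
for word, docname, lineno in hits:
    # setdefault returns the list already stored for word, or stores and
    # returns a new empty list the first time the word is seen.
    index.setdefault(word, []).append((docname, lineno))

print(index)
```

Note that the document name and line number go in as one tuple, because `list.append` accepts only a single argument.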
Usage:
    wordDict = matchtermer('words1.txt', 'words2.txt')
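
To see what the returned dictionary looks like, here is a self-contained sketch that can be run as-is. It assumes a condensed variant of `matchtermer` (the `clean` helper and the use of a set for lookups are my additions, not part of the code above), and the file contents are made up and written to a temporary directory:

```python
import os
import re
import string
import tempfile


def clean(word):
    # Same normalisation as in the reply: strip whitespace and punctuation, lowercase.
    return word.strip().strip(string.punctuation).lower()


def matchtermer(fn1, fn2):
    # Condensed variant for illustration; a set gives faster membership tests.
    with open(fn1) as f:
        known = {clean(w) for w in f.read().split() if not re.search(r'\d', w)}
    known.discard('')  # drop words that were punctuation only
    dd = {}
    with open(fn2) as f:
        for lineno, line in enumerate(f, start=1):
            for word in line.split():
                word = clean(word)
                if word in known:
                    dd.setdefault(word, []).append((fn2, lineno))
    return dd


# Made-up sample files, written to a temporary directory.
tmpdir = tempfile.mkdtemp()
fn1 = os.path.join(tmpdir, 'words1.txt')
fn2 = os.path.join(tmpdir, 'words2.txt')
with open(fn1, 'w') as f:
    f.write('apple banana cherry\n')
with open(fn2, 'w') as f:
    f.write('An Apple a day.\nNo match here\nBanana, apple!\n')

wordDict = matchtermer(fn1, fn2)
for word in sorted(wordDict):
    # Each value is a list of (document name, line number) tuples.
    print(word, [lineno for _, lineno in wordDict[word]])
# prints:
# apple [1, 3]
# banana [3]
```

Each key maps to a list of `(document name, line number)` tuples, so a word that occurs on several lines gets one entry per occurrence.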
Dec 17 '07 #2
