append docname & linenumber to dic-value

Hey guys,
I am comparing two documents - if a word is in both documents, it gets added as a new key to a dictionary.
As the dictionary value I would like to store the document's name and the line number the word was found on.
Here is what I have so far with comments:

    dic = {}
    def matchtermer():
        f3 = open('korpus/avis.txt')
        f4 = open("ordliste_output_kort.txt")
        text3 = f3.read()
        text4 = f4.read()
        ordliste2 = text3.split()
        ordliste3 = text4.split()
        wordlist2 = []

        for word1 in ordliste2: #this part removes end characters that aren't part of the word and makes all lowercase
            # last character of each word
            lastchar = word1[-1:]
            # use a list of punctuation marks
            if lastchar in [",", ".", "!", "?", ";"]:
                word2 = word1.rstrip(lastchar)
            else:
                word2 = word1
            # build a wordList of lower case modified words
            wordlist2.append(word2.lower())

        for word in wordlist2: # and finally this compares the two documents
            if word in ordliste3:
                if word not in dic.keys():
                    dic[word]=[]  #if word not in dic, create it
                #dic[word].append(docname, linenumber) - this is what I want to do - obviously this does not work
        return dic
Dec 17 '07 #1
I think this will do it:
    import re
    import string

    def wordList(words):
        patt = re.compile(r'\d')
        # skip words containing digits, strip punctuation, lowercase
        word_list = [word.strip(string.punctuation).lower()
                     for word in words.split() if not patt.search(word)]
        # eliminate blank words
        return [word for word in word_list if word]

    def matchtermer(fn1, fn2):
        dd = {}
        # file to compare against; a set keeps each membership test fast
        with open(fn1) as f1:
            word_set = set(wordList(f1.read()))
        # file to compare, read line by line so the line number is known
        with open(fn2) as f2:
            for i, line in enumerate(f2):
                for word in line.split():
                    word = word.strip(string.punctuation).lower()
                    if word in word_set:
                        # store (document name, 1-based line number)
                        dd.setdefault(word, []).append((fn2, i + 1))
        return dd

    wordDict = matchtermer('words1.txt', 'words2.txt')
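
To make the shape of the result concrete, here is a minimal, self-contained sketch of the same idea. It is only an illustration under stated assumptions: the names match_terms and demo, the file names wordlist.txt and text.txt, and the sample text are all made up for this example. It swaps setdefault for collections.defaultdict, and unlike wordList above it does not skip words that contain digits.

```python
import os
import string
import tempfile
from collections import defaultdict


def match_terms(reference_path, target_path):
    """Map each shared word to a list of (document name, line number) tuples."""
    # Words of the reference document, normalized the same way as below.
    with open(reference_path) as ref:
        reference_words = {
            w.strip(string.punctuation).lower() for w in ref.read().split()
        }
    reference_words.discard("")  # drop tokens that were pure punctuation

    matches = defaultdict(list)
    with open(target_path) as target:
        # enumerate(..., start=1) yields 1-based line numbers directly
        for line_number, line in enumerate(target, start=1):
            for token in line.split():
                word = token.strip(string.punctuation).lower()
                if word in reference_words:
                    matches[word].append((target_path, line_number))
    return dict(matches)


def demo():
    # Hypothetical sample documents, written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        wordlist_path = os.path.join(tmp, "wordlist.txt")
        text_path = os.path.join(tmp, "text.txt")
        with open(wordlist_path, "w") as f:
            f.write("cat dog\n")
        with open(text_path, "w") as f:
            f.write("The cat sat.\nA dog, a Cat!\n")
        result = match_terms(wordlist_path, text_path)
        # Replace the temporary path with the bare file name for readability.
        return {w: [(os.path.basename(p), n) for p, n in hits]
                for w, hits in result.items()}


print(demo())
# {'cat': [('text.txt', 1), ('text.txt', 2)], 'dog': [('text.txt', 2)]}
```

Each key is a word found in both documents, and each value lists every place it occurs in the second document, so a word that appears twice on one line would be listed twice with the same line number.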
Dec 17 '07 #2
