append docname & linenumber to dic-value

Hey guys,
I am comparing two documents: if a word appears in both, it is added as a new key to a dictionary.
As the dictionary value, I would like to store the document's name and the line number the word was found on.
Here is what I have so far, with comments:

    dic = {}
    def matchtermer():
        f3 = open('korpus/avis.txt')
        f4 = open("ordliste_output_kort.txt")
        text3 = f3.read()
        text4 = f4.read()
        ordliste2 = text3.split()
        ordliste3 = text4.split()
        wordlist2 = []

        for word1 in ordliste2: #this part removes end characters that aren't part of the word and makes all lowercase
            # last character of each word
            lastchar = word1[-1:]
            # use a list of punctuation marks
            if lastchar in [",", ".", "!", "?", ";"]:
                word2 = word1.rstrip(lastchar)
            else:
                word2 = word1
            # build a wordList of lower case modified words
            wordlist2.append(word2.lower())

        for word in wordlist2: # and finally this compares the two documents
            if word in ordliste3:
                if word not in dic.keys():
                    dic[word]=[]  #if word not in dic, create it
                #dic[word].append(docname, linenumber) - this is what I want to do - obviously this does not work
        return dic
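The commented-out call fails because list.append accepts exactly one argument. One way around that is to pass the pair as a single tuple. A minimal sketch, where the helper name record and the sample word, file name and line numbers are made up for illustration:

```python
dic = {}

def record(word, docname, linenumber):
    """Store one (docname, linenumber) pair under word (hypothetical helper)."""
    # list.append takes a single argument, so the pair is wrapped in a tuple;
    # setdefault creates the empty list the first time a word is seen
    dic.setdefault(word, []).append((docname, linenumber))

# Made-up sample values, not taken from the real files
record("avis", "korpus/avis.txt", 3)
record("avis", "korpus/avis.txt", 7)
print(dic)
```

Each value is then a list of (docname, linenumber) tuples, so the same word can be recorded on several lines.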
Dec 17 '07 #1
bvdet
I think this will do it:
    import re
    import string

    def wordList(words):
        patt = re.compile(r'\d+')
        # eliminate words with digits, strip punctuation and whitespace, lowercase
        word_list = [word.strip().strip(string.punctuation).lower()
                     for word in words.split() if not patt.search(word)]
        # eliminate blank words
        return [word for word in word_list if word != '']

    def matchtermer(fn1, fn2):
        dd = {}
        # file to compare against; a set keeps the membership test fast
        with open(fn1) as f1:
            word_set = set(wordList(f1.read()))
        # file to compare, read line by line so the line number is known
        with open(fn2) as f2:
            for i, line in enumerate(f2):
                for word in line.split():
                    word = word.strip().strip(string.punctuation).lower()
                    if word in word_set:
                        # store (document name, 1-based line number) as one tuple
                        dd.setdefault(word, []).append((fn2, i+1))
        return dd
Usage:
    wordDict = matchtermer('words1.txt', 'words2.txt')
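Note that matchtermer only records where each word occurs in the second file. If locations from both documents are wanted, one possible variant is to index each document separately and keep the words the two indexes share. This is only a sketch under assumptions: the function names normalize, index_lines and common_words, the file names a.txt and b.txt, and the in-memory sample lines are all hypothetical.

```python
import string
from collections import defaultdict

def normalize(word):
    # strip surrounding punctuation and lowercase the word
    return word.strip(string.punctuation).lower()

def index_lines(docname, lines):
    """Map each normalized word to a list of (docname, line number) pairs."""
    index = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        for raw in line.split():
            word = normalize(raw)
            if word:  # skip tokens that were punctuation only
                index[word].append((docname, lineno))
    return index

def common_words(index_a, index_b):
    """Keep only words present in both indexes, merging their locations."""
    return {word: index_a[word] + index_b[word]
            for word in index_a.keys() & index_b.keys()}

# Hypothetical in-memory documents standing in for the two files
doc_a = ["The cat sat.", "A dog barked!"]
doc_b = ["My dog sleeps", "the end"]

result = common_words(index_lines("a.txt", doc_a),
                      index_lines("b.txt", doc_b))
print(result)
```

With real files, the list of lines could come from iterating over an open file object instead of the sample lists.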
Dec 17 '07 #2
