Bytes IT Community

How to remove the duplicate lines retaining first occurences

P: 8
Let's say an input text file "input_msg.txt" (file size ~70,000 KB) contains the following records:

Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:56 python is good scripting language

The expected output file (let's say "outputfile.txt") should contain the records below:

Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world

Note: I need all records starting with "Jan 1" (including their duplicates). For records not starting with "Jan 1", I only want the first occurrence, with later duplicates removed.
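The rule above — keep every "Jan 1" line as-is, deduplicate everything else — can be sketched as a single pass over the lines (shown here in Python 3 syntax; the function name and structure are illustrative, not the poster's code):

```python
def filter_lines(lines):
    """Keep all "Jan 1" records (duplicates included); for other
    records, keep only the first occurrence."""
    seen = set()
    kept = []
    for line in lines:
        if line.startswith("Jan 1"):
            kept.append(line)          # duplicates allowed here
        elif line not in seen:
            seen.add(line)             # remember non-"Jan 1" records
            kept.append(line)
    return kept
```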

I have tried the following program, but it deletes all the duplicate records.
from collections import OrderedDict

def remove_Duplicate_Lines(inputfile, outputfile):
    with open(inputfile) as fin, open(outputfile, 'w') as out:
        lines = (line.rstrip() for line in fin)
        unique_lines = OrderedDict.fromkeys(line for line in lines if line)
        out.writelines("\n".join(unique_lines.iterkeys()))
    return 0
Output of my program is below:

Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world

Your help would be appreciated!!!
Jul 15 '15 #1

✓ answered by bvdet

You are iterating over the lines in the file twice. Try eliminating one of them. It is possible OrderedDict may be slower than a for loop. I don't know one way or the other. You can use module timeit to check different methods.

7 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
Use a for loop and conditionally append to a list.
data = """Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:56 python is good scripting language"""

output = []

for line in data.split("\n"):
    if line.startswith("Jan 1"):
        output.append(line)
    elif line not in output:
        output.append(line)

print "\n".join(output)
The output:
>>> Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world
>>> 
Jul 15 '15 #2

P: 8
@bvdet: Thank you very much! I have already tried this solution, but the problem is that when the input file is large it takes a long time.

Below is the program which I have tried:
Expand|Select|Wrap|Line Numbers
  1. inputFile = open("in.txt", "r")
  2. log = []
  3. for line in inputFile:
  4.     if line in log and line[0:5] != "Jan 1":
  5.         pass
  6.     else:
  7.         log.append(line)
  8. inputFile.close()
  9. outFile = open("out.txt", "w")
  10. for item in log:
  11.     outFile.write(item)
  12. outFile.close()
  13.  
Note: With an input file of ~70,000 KB, execution takes about 9 minutes.

Please let me know if there is a more elegant way to do this.
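The slowdown in the program above most likely comes from `line in log`: membership testing on a list scans the whole list for every input line, so the loop is roughly quadratic in the number of unique lines. Tracking seen lines in a set gives constant-time lookups instead. A minimal sketch of that change (Python 3 syntax; the function and file names are placeholders, not the poster's code):

```python
def remove_duplicates(inputfile, outputfile):
    seen = set()
    with open(inputfile) as fin, open(outputfile, "w") as out:
        for line in fin:
            # "Jan 1" records always pass through, duplicates included;
            # other records are written only on first sight
            if line.startswith("Jan 1") or line not in seen:
                seen.add(line)
                out.write(line)
```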
Jul 15 '15 #3

bvdet
Expert Mod 2.5K+
P: 2,851
Try writing to the file one time.
outFile.write("\n".join(log))
Jul 15 '15 #4

P: 8
@bvdet: You mean something like the following:
inputFile = open("in.txt", "r")
outFile = open("out.txt", "w")
log = []
for line in inputFile:
    if line in log and line[0:5] != "Jan 1":
        pass
    else:
        log.append(line)
    outFile.write("\n".join(log))
inputFile.close()
outFile.close()
Please correct me if I am wrong.
Jul 16 '15 #5

bvdet
Expert Mod 2.5K+
P: 2,851
No, write to the file outside of the for loop:
outFile.write("\n".join(log))
inputFile.close()
outFile.close()
Jul 17 '15 #6

P: 8
@bvdet: Thank you for your help! This works, but there is still a performance issue when the input file is large.

Could you please take a look at the code below...
from collections import OrderedDict

def remove_Duplicate_Lines(inputfile, outputfile):
    with open(inputfile) as fin, open(outputfile, 'w') as out:
        lines = (line.rstrip() for line in fin)
        unique_lines = OrderedDict.fromkeys(line for line in lines if line)
        out.writelines("\n".join(unique_lines.iterkeys()))
    return 0
Jul 18 '15 #7

bvdet
Expert Mod 2.5K+
P: 2,851
You are iterating over the lines in the file twice. Try eliminating one of them. It is possible OrderedDict may be slower than a for loop. I don't know one way or the other. You can use module timeit to check different methods.
Jul 18 '15 #8
