How to remove duplicate lines while retaining the first occurrences

8 New Member

Let's say an input text file "input_msg.txt" (file size about 70,000 KB) contains the following records:

Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:56 python is good scripting language

The expected output file (say, outputfile.txt) should contain the following records:

Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world

Note: I need to keep all records that start with "Jan 1", including duplicates. For records that do not start with "Jan 1", I want duplicates removed so that only the first occurrence remains.

I have tried the following program, but it deletes all duplicate records, including the "Jan 1" ones.

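from collections import OrderedDict

def remove_Duplicate_Lines(inputfile, outputfile):
    with open(inputfile) as fin, open(outputfile, 'w') as out:
        # Strip trailing whitespace from every line and skip empty lines
        lines = (line.rstrip() for line in fin)
        # OrderedDict keeps only the first occurrence of each line, in file order
        unique_lines = OrderedDict.fromkeys(line for line in lines if line)
        out.write("\n".join(unique_lines))
    return 0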

The output of my program is below:

Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world

Your help would be appreciated!!!

Jul 15 '15 #1

bvdet

2,851

Recognized Expert Moderator Specialist

Use a for loop and conditionally append to a list.

data = """Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:56 python is good scripting language"""

output = []

for line in data.split("\n"):
    # Always keep "Jan 1" lines; keep any other line only the first time it appears
    if line.startswith("Jan 1"):
        output.append(line)
    elif line not in output:
        output.append(line)

print("\n".join(output))

The output:

Jan 1 02:32:40 hello welcome to python world
Jan 1 02:32:40 hello welcome to python world
Mar 31 23:31:55 learn python
Mar 31 23:31:55 learn python be smart
Mar 31 23:31:56 python is good scripting language
Jan 1 00:00:01 hello welcome to python world
Jan 1 00:00:02 hello welcome to python world

Jul 15 '15 #2

helloR

New Member

@bvdet: Thank you very much! I have already tried this solution, but the problem is that it takes a long time when the input file is large.

Below is the program which I have tried:

inputFile = open("in.txt", "r")
log = []
for line in inputFile:
    if line in log and line[0:5] != "Jan 1":
        pass
    else:
        log.append(line)
inputFile.close()

outFile = open("out.txt", "w")
for item in log:
    outFile.write(item)
outFile.close()

Note: I tried this with an input file of about 70,000 KB, and it takes about 9 minutes to complete.

Please let me know if there is a more elegant way to do this.
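A likely cause of the slowdown is the `line in log` test: membership on a list scans the list from the start, so the total work grows roughly with the square of the number of lines. The sketch below keeps the same rule but tracks the lines already seen in a set, where lookups take roughly constant time. It assumes the distinct lines fit in memory; the name `dedupe_keep_prefix` and the `keep_prefix` parameter are made up for illustration and do not come from the thread.

```python
def dedupe_keep_prefix(lines, keep_prefix="Jan 1"):
    """Yield lines in order, dropping repeats unless a line starts with keep_prefix."""
    seen = set()  # non-"Jan 1" lines that have already been emitted
    for line in lines:
        if line.startswith(keep_prefix):
            # Lines with the prefix are always kept, duplicates included
            yield line
        elif line not in seen:
            seen.add(line)
            yield line


sample = [
    "Jan 1 02:32:40 hello welcome to python world",
    "Jan 1 02:32:40 hello welcome to python world",
    "Mar 31 23:31:55 learn python",
    "Mar 31 23:31:55 learn python be smart",
    "Mar 31 23:31:56 python is good scripting language",
    "Jan 1 00:00:01 hello welcome to python world",
    "Jan 1 00:00:02 hello welcome to python world",
    "Mar 31 23:31:55 learn python",
    "Mar 31 23:31:56 python is good scripting language",
]

result = list(dedupe_keep_prefix(sample))
print("\n".join(result))
```

Note that a prefix of "Jan 1" also matches "Jan 10" through "Jan 19", exactly as in the code above it; pass "Jan 1 " (with a trailing space) if only the first of January should be exempt.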

Jul 15 '15 #3

bvdet

2,851

Recognized Expert Moderator Specialist

Try writing to the file one time.

outFile.write("\n".join(log))

Jul 15 '15 #4

helloR

New Member

@bvdet: Do you mean something like the code below?

inputFile = open("in.txt", "r")
outFile = open("out.txt", "w")
log = []
for line in inputFile:
    if line in log and line[0:5] != "Jan 1":
        pass
    else:
        log.append(line)
    outFile.write("\n".join(log))
inputFile.close()
outFile.close()

Please correct me if I am wrong.

Jul 16 '15 #5

bvdet

2,851

Recognized Expert Moderator Specialist

No, write to the file outside of the for loop:

outFile.write("\n".join(log))
inputFile.close()
outFile.close()
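Put together, the program with a single write after the loop might look like the sketch below. Two things here are assumptions rather than part of the thread: lines read from a file still carry their trailing newline, so this version uses `writelines` instead of joining on "\n" (which would add a blank line after every record), and it tracks seen lines in a set instead of searching the list. The names `filter_file`, `in_path`, `out_path` and `keep_prefix` are illustrative only.

```python
def filter_file(in_path, out_path, keep_prefix="Jan 1"):
    """Copy in_path to out_path, dropping repeated lines that lack keep_prefix.

    Returns the number of lines written.
    """
    seen = set()   # lines (without the newline) that were already kept
    kept = []      # lines to write, in their original order
    with open(in_path) as fin:
        for line in fin:
            key = line.rstrip("\n")
            if key.startswith(keep_prefix) or key not in seen:
                seen.add(key)
                kept.append(line)
    # One write call, made only after the whole input has been read
    with open(out_path, "w") as fout:
        fout.writelines(kept)
    return len(kept)
```

It could be called as `filter_file("in.txt", "out.txt")`, using the file names from the posts above.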

Jul 17 '15 #6

helloR

New Member

@bvdet: Thank you for your help! This works, but there is still a performance problem when the input file is large.

Could you please take a look at the code below?

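from collections import OrderedDict

def remove_Duplicate_Lines(inputfile, outputfile):
    with open(inputfile) as fin, open(outputfile, 'w') as out:
        # Strip trailing whitespace from every line and skip empty lines
        lines = (line.rstrip() for line in fin)
        # OrderedDict keeps only the first occurrence of each line, in file order
        unique_lines = OrderedDict.fromkeys(line for line in lines if line)
        out.write("\n".join(unique_lines))
    return 0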

Jul 18 '15 #7

bvdet

2,851

Recognized Expert Moderator Specialist

You are iterating over the lines in the file twice. Try eliminating one of the passes. It is possible that OrderedDict is slower than a for loop; I don't know one way or the other. You can use the timeit module to compare different methods.
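As one way to follow the timeit suggestion, the sketch below times a list-based and a set-based version of the same rule on synthetic data. The function names and the generated lines are hypothetical, and the timings depend on the machine and on how many distinct lines the real file contains, so treat the numbers as a rough comparison only.

```python
import timeit


def dedupe_with_list(lines, keep_prefix="Jan 1"):
    """Membership test against the output list itself (linear scan per line)."""
    output = []
    for line in lines:
        if line.startswith(keep_prefix) or line not in output:
            output.append(line)
    return output


def dedupe_with_set(lines, keep_prefix="Jan 1"):
    """Membership test against a set of lines that were already kept."""
    seen = set()
    output = []
    for line in lines:
        if line.startswith(keep_prefix):
            output.append(line)
        elif line not in seen:
            seen.add(line)
            output.append(line)
    return output


# Synthetic input: 2,000 distinct lines, each appearing three times
lines = ["Mar 31 23:31:55 message %d" % (i % 2000) for i in range(6000)]

list_time = timeit.timeit(lambda: dedupe_with_list(lines), number=3)
set_time = timeit.timeit(lambda: dedupe_with_set(lines), number=3)
print("list: %.3fs  set: %.3fs" % (list_time, set_time))
```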

Jul 18 '15 #8
