
Theoretical Problem

I have been given the task of developing a program to sit next to a
CGI-based C program (I know this is off-topic, but my question refers only
to the standard C part of the code).

Basically, what the program needs to do is open a text file containing
hundreds, maybe even thousands, of items and remove duplicate items.
Also, if the line length exceeds a certain number of characters, I wish to
remove the line.

I was thinking of using malloc to create a dynamically allocated array
to load the elements into, then loop through them all, removing duplicate
items.

This seems a little resource-heavy considering the size of the file I am
dealing with. Does anyone have any suggestions on how to better handle
the task at hand?

Thanks in advance
Mick

Nov 13 '05 #1
Materialised <ma**********@privacy.net> writes:
Basically, what the program needs to do is open a text file
containing hundreds, maybe even thousands, of items and remove
duplicate items.
Also, if the line length exceeds a certain number of characters, I wish
to remove the line.


I think you'd be best off combining existing tools to do this.
For example, on most OSes you could use `sort < input | uniq >
output', or something similar, to do the removal of duplicates.
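
If shelling out from the C program itself is acceptable, a thin wrapper
over that pipeline could look like this minimal sketch (the file names are
placeholders, and it assumes a POSIX-like system where `sort -u' sorts and
drops duplicate lines in one step):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Delegate the work to the system's sort utility.
       input.txt and output.txt are placeholder names. */
    if (system("sort -u < input.txt > output.txt") != 0) {
        fprintf(stderr, "sort failed\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}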
--
"Some people *are* arrogant, and others read the FAQ."
--Chris Dollin
Nov 13 '05 #2
Materialised <ma**********@privacy.net> wrote:
Does anyone have any suggestions on how to better handle
the task at hand?


Assuming you cannot use existing tools, as Ben suggested, I cannot think
of a better method to solve your problem.

I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.
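
A minimal sketch of that scheme, reading newline-terminated items from
stdin. For brevity it uses a plain (unbalanced) binary search tree; a real
implementation would balance it (red-black, AVL) so that inserts stay
O(log n) even when the input is already sorted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    char *line;
    struct node *left, *right;
};

/* Insert a copy of line; returns 1 if inserted, 0 if it was a duplicate. */
static int insert(struct node **root, const char *line)
{
    if (*root == NULL) {
        struct node *n = malloc(sizeof *n);
        char *copy = malloc(strlen(line) + 1);
        if (n == NULL || copy == NULL) {
            perror("malloc");
            exit(EXIT_FAILURE);
        }
        strcpy(copy, line);
        n->line = copy;
        n->left = n->right = NULL;
        *root = n;
        return 1;
    }
    int cmp = strcmp(line, (*root)->line);
    if (cmp == 0)
        return 0;                 /* duplicate: abort the insert */
    return insert(cmp < 0 ? &(*root)->left : &(*root)->right, line);
}

/* In-order traversal writes the surviving lines out in sorted order. */
static void write_sorted(const struct node *t, FILE *out)
{
    if (t == NULL)
        return;
    write_sorted(t->left, out);
    fprintf(out, "%s\n", t->line);
    write_sorted(t->right, out);
}

int main(void)
{
    struct node *root = NULL;
    char buf[1024];   /* caps the line length; a full version would
                         detect and drop over-long lines here */

    while (fgets(buf, sizeof buf, stdin) != NULL) {
        buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
        insert(&root, buf);
    }
    write_sorted(root, stdout);
    return 0;
}

Note the output comes back sorted rather than in the original file order.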

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #3
eg*************@verizon.net (Eric) writes:
I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.


I can't think why a red-black tree would be superior to a hash
table here. Hash tables work fine for finding duplicates.
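
For illustration, a minimal chained hash table along those lines (the
table size and hash function are arbitrary choices here; unlike a
sort-based approach, this one preserves the original line order):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536   /* illustrative; size the table to the data */

struct entry {
    char *line;
    struct entry *next;
};

static struct entry *buckets[NBUCKETS];

/* FNV-1a, a simple and well-known string hash. */
static unsigned long hash(const char *s)
{
    unsigned long h = 2166136261UL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619UL;
    }
    return h % NBUCKETS;
}

/* Returns 1 the first time a line is seen, 0 for every duplicate. */
static int first_sighting(const char *line)
{
    unsigned long b = hash(line);
    struct entry *e;
    for (e = buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->line, line) == 0)
            return 0;
    e = malloc(sizeof *e);
    char *copy = malloc(strlen(line) + 1);
    if (e == NULL || copy == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    strcpy(copy, line);
    e->line = copy;
    e->next = buckets[b];
    buckets[b] = e;
    return 1;
}

int main(void)
{
    char buf[1024];
    while (fgets(buf, sizeof buf, stdin) != NULL)
        if (first_sighting(buf))
            fputs(buf, stdout);   /* only the first occurrence survives */
    return 0;
}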
--
"Welcome to the wonderful world of undefined behavior, where the demons
are nasal and the DeathStation users are nervous." --Daniel Fox
Nov 13 '05 #4
eg*************@verizon.net (Eric) wrote in message news:<1g4quh4.156ilwu1bpw5w8N%eg*************@verizon.net>...
Materialised <ma**********@privacy.net> wrote:
Does anyone have any suggestions on how to better handle
the task at hand?


Assuming you cannot use existing tools, as Ben suggested, I cannot think
of a better method to solve your problem.

I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.

If there are millions of lines, then you will probably do better
storing a hash[1] than storing the actual data. You can write out each
line as you read it in, unless its hash has already been seen.
Of course, this doesn't sort the data, which may be an additional
requirement.

[1] e.g. MD5 or SHA-1. You don't want to get any collisions.
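
A sketch of that idea; a 64-bit FNV-1a stands in for MD5/SHA-1 here just
to keep the example self-contained (with a real cryptographic digest the
collision risk is negligible). Only the 8-byte digests are kept in memory,
in a fixed-size open-addressed set:

#include <stdio.h>
#include <stdint.h>

/* 64-bit FNV-1a, standing in for MD5/SHA-1. */
static uint64_t fnv1a64(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

#define TABSIZE (1u << 20)   /* assumes far fewer distinct lines than this */
static uint64_t seen[TABSIZE];   /* 0 marks an empty slot */

/* Adds h to the set; returns 0 if it was already present. */
static int add_hash(uint64_t h)
{
    uint32_t i;
    if (h == 0)
        h = 1;   /* keep 0 reserved as the empty-slot marker */
    i = (uint32_t)h & (TABSIZE - 1);
    while (seen[i] != 0) {
        if (seen[i] == h)
            return 0;
        i = (i + 1) & (TABSIZE - 1);   /* linear probing; no resizing here */
    }
    seen[i] = h;
    return 1;
}

int main(void)
{
    char buf[4096];
    while (fgets(buf, sizeof buf, stdin) != NULL)
        if (add_hash(fnv1a64(buf)))
            fputs(buf, stdout);   /* write each line out the first time
                                     its hash is seen */
    return 0;
}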
Nov 13 '05 #5
Materialised wrote:

(snip)
Basically, what the program needs to do is open a text file containing
hundreds, maybe even thousands, of items and remove duplicate items.
Also, if the line length exceeds a certain number of characters, I wish to
remove the line.


(snip)

Does it need to keep 100% of the non-duplicated lines, or is
99.99999999% good enough? I would keep a hash table of the hashed
values of the lines, maybe a 64-bit or even 96- or 128-bit CRC. There
is an extremely small chance that two different lines would generate the
same hash value, but it is not zero. Storing only the hash, not the
lines, saves a lot of memory. Note that N lines have about N**2/2
possible pairs (the birthday problem), so a 32-bit CRC may not be good
enough (see the rough numbers below).

Otherwise, doing it as an external sort, such as with the Unix sort
command, is pretty fast and almost unlimited in the length of file you
can process. (You need enough disk space to hold the file and the
intermediate sort work files.)
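
To put rough numbers on the birthday problem: with N lines hashed into M
possible values, P(any collision) is approximately 1 - exp(-N**2/(2*M)).
A small worked example (the million-line figure is just for illustration;
compile with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1e6;   /* example: one million lines */
    int bits[] = { 32, 64, 128 };
    int i;
    for (i = 0; i < 3; i++) {
        double m = pow(2.0, (double)bits[i]);
        double p = 1.0 - exp(-n * (n - 1.0) / (2.0 * m));
        printf("%3d-bit hash: P(any collision) ~= %g\n", bits[i], p);
    }
    return 0;
}

For a million lines, a 32-bit hash is essentially guaranteed to collide,
while a 64-bit one has odds of roughly 1 in 40 million.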

-- glen

Nov 13 '05 #6
Ben Pfaff <bl*@cs.stanford.edu> wrote:
eg*************@verizon.net (Eric) writes:
I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.


I can't think why a red-black tree would be superior to a hash
table here. Hash tables work fine for finding duplicates.


That works too, as long as one is careful to avoid an inordinate number
of collisions.

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #7
Eric <eg*************@verizon.net> wrote:
Ben Pfaff <bl*@cs.stanford.edu> wrote:
eg*************@verizon.net (Eric) writes:
I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.


I can't think why a red-black tree would be superior to a hash
table here. Hash tables work fine for finding duplicates.


That works too, as long as one is careful to avoid an inordinate number
of collisions.


Of course, you need some kind of data structure to hold the hash table,
and since we apparently do not know how big that table needs to be, this
could lead back to a red-black tree. The two ideas are not necessarily
mutually exclusive.

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #8
>Subject: Theoretical Problem
From: Materialised ma**********@privacy.net
Date: 11/20/03 3:27 PM Hawaiian Standard Time
Message-id: <bp*************@ID-204621.news.uni-berlin.de>

I have been given the task of developing a program to sit next to a
CGI-based C program (I know this is off-topic, but my question refers only
to the standard C part of the code).

Basically, what the program needs to do is open a text file containing
hundreds, maybe even thousands, of items and remove duplicate items.
Also, if the line length exceeds a certain number of characters, I wish to
remove the line.

I was thinking of using malloc to create a dynamically allocated array
to load the elements into, then loop through them all, removing duplicate
items.

This seems a little resource-heavy considering the size of the file I am
dealing with. Does anyone have any suggestions on how to better handle
the task at hand?


A hash table will make searching quicker.

Stuart

Dr. Stuart A. Weinstein
Ewa Beach Institute of Tectonics
"To err is human, but to really foul things up
requires a creationist"
Nov 13 '05 #9
What you can do is sort the items and then remove duplicate
entries from the list, which, as my friends have said, is slower because
it requires multiple passes through the data. What you can do instead is
an insertion sort into another file, and that would help you solve the
problem faster.
hp*****@vcustomer.net
Nov 13 '05 #10
