
Theoretical Problem

I have been given the task of developing a program to sit next to a
CGI-based C program (I know this is off-topic, but my question refers
only to the standard C part of the code).

Basically, what the program needs to do is open a text file containing
hundreds, maybe even thousands, of items and remove the duplicates.
Also, if a line's length exceeds a certain number of characters, I wish
to remove that line.

I was thinking of using malloc to create a dynamically allocated array
to load the elements into, then looping through them all removing
duplicate items.
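
For the sake of discussion, this is roughly what I had in mind (just a
rough sketch; MAX_LINE and the growth strategy are arbitrary
placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 256   /* arbitrary cut-off; longer lines are dropped */

int main(void)
{
    char buf[4096];    /* lines longer than this buffer are not handled here */
    char **lines = NULL;
    size_t count = 0, cap = 0, i, j;

    /* read every line, skipping the ones that are too long */
    while (fgets(buf, sizeof buf, stdin) != NULL) {
        size_t len = strlen(buf);
        if (len > MAX_LINE)
            continue;
        if (count == cap) {                     /* grow the array as needed */
            char **tmp;
            cap = cap ? cap * 2 : 64;
            tmp = realloc(lines, cap * sizeof *lines);
            if (tmp == NULL)
                return EXIT_FAILURE;
            lines = tmp;
        }
        lines[count] = malloc(len + 1);
        if (lines[count] == NULL)
            return EXIT_FAILURE;
        strcpy(lines[count], buf);
        count++;
    }

    /* O(n*n) pass: print a line only if no earlier line matches it */
    for (i = 0; i < count; i++) {
        for (j = 0; j < i; j++)
            if (strcmp(lines[i], lines[j]) == 0)
                break;
        if (j == i)
            fputs(lines[i], stdout);
    }
    return 0;
}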

This seems a little resource heavy considering the size of the file I am
dealing with. Does anyone have any suggestions on how to better handle
the task at hand?

Thanks in advance
Mick

Nov 13 '05 #1
Materialised <ma**********@privacy.net> writes:
Basically, what the program needs to do is open a text file containing
hundreds, maybe even thousands, of items and remove the duplicates.
Also, if a line's length exceeds a certain number of characters, I wish
to remove that line.


I think you'd be best off combining existing tools to do this.
For example, on most OSes you could use `sort < input | uniq >
output' (or simply `sort -u input > output'), or something similar,
to do the removal of duplicates.
--
"Some people *are* arrogant, and others read the FAQ."
--Chris Dollin
Nov 13 '05 #2
Materialised <ma**********@privacy.net> wrote:
Does anyone have any suggestions on how to better handle
the task at hand?


Assuming you cannot use existing tools, as Ben suggested, I cannot think
of a better method to solve your problem.

I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.
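
A bare-bones sketch of that insert-with-duplicate-check idea, using a
plain (unbalanced) binary search tree so the rebalancing machinery does
not obscure the point; a real red-black tree would add a colour field
and rotations on insert:

#include <stdlib.h>
#include <string.h>

struct node {
    char        *line;
    struct node *left, *right;
};

/* Insert line into the tree rooted at *rootp, keeping it sorted.
   Returns 1 if the line was new, 0 if it was a duplicate. */
static int insert_unique(struct node **rootp, const char *line)
{
    while (*rootp != NULL) {
        int cmp = strcmp(line, (*rootp)->line);
        if (cmp == 0)
            return 0;                        /* duplicate: abort the insert */
        rootp = (cmp < 0) ? &(*rootp)->left : &(*rootp)->right;
    }
    *rootp = malloc(sizeof **rootp);
    if (*rootp == NULL)
        abort();
    (*rootp)->line = malloc(strlen(line) + 1);
    if ((*rootp)->line == NULL)
        abort();
    strcpy((*rootp)->line, line);
    (*rootp)->left = (*rootp)->right = NULL;
    return 1;
}

Calling insert_unique() on each line as it is read, and writing the
line out only when it returns 1, gives you the single pass described
above and also preserves the original order of the file.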

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #3
eg*************@verizon.net (Eric) writes:
I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.


I can't think why a red-black tree would be superior to a hash
table here. Hash tables work fine for finding duplicates.
--
"Welcome to the wonderful world of undefined behavior, where the demons
are nasal and the DeathStation users are nervous." --Daniel Fox
Nov 13 '05 #4
eg*************@verizon.net (Eric) wrote in message news:<1g4quh4.156ilwu1bpw5w8N%eg*************@verizon.net>...
Materialised <ma**********@privacy.net> wrote:
Does anyone have any suggestions on how to better handle
the task at hand?


Assuming you cannot use existing tools, as Ben suggested, I cannot think
of a better method to solve your problem.

I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.

If there are millions of lines, then you will probably do better
storing a hash[1] than storing the actual data. You can write out each
line as you read it in, unless its hash has been seen before.
Of course, this doesn't sort the data, which may be an additional
requirement.

[1] e.g. MD5 or SHA-1. You don't want to get any collisions.
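
A sketch of that one-pass scheme, using a 64-bit FNV-1a hash rather
than MD5 or SHA-1 purely to keep the example self-contained (the
collision risk is correspondingly higher, though still tiny for
thousands of lines):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* 64-bit FNV-1a hash of a NUL-terminated string */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void)
{
    char      buf[4096];
    uint64_t *seen = NULL;      /* hashes of the lines already written */
    size_t    count = 0, cap = 0;

    while (fgets(buf, sizeof buf, stdin) != NULL) {
        uint64_t h = fnv1a(buf);
        size_t   i;

        /* linear scan of the stored hashes; a sorted array or a real
           hash table would scale better, but the idea is the same */
        for (i = 0; i < count; i++)
            if (seen[i] == h)
                break;
        if (i < count)
            continue;           /* hash already seen: drop the duplicate */

        if (count == cap) {
            uint64_t *tmp;
            cap = cap ? cap * 2 : 1024;
            tmp = realloc(seen, cap * sizeof *seen);
            if (tmp == NULL)
                return EXIT_FAILURE;
            seen = tmp;
        }
        seen[count++] = h;
        fputs(buf, stdout);     /* write the line straight back out */
    }
    return 0;
}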
Nov 13 '05 #5
Materialised wrote:

(snip)
Basically, what the program needs to do is open a text file containing
hundreds, maybe even thousands, of items and remove the duplicates.
Also, if a line's length exceeds a certain number of characters, I wish
to remove that line.


(snip)

Does it need to keep 100% of the non-duplicated lines, or is
99.99999999% good enough? I would keep a hash table of the hashed
values of the lines, maybe a 64-bit or even 96- or 128-bit CRC. There
is an extremely small chance that two different lines would generate the
same hash value, but it is not zero. Storing only the hash but not the
lines saves a lot of memory. Note that N lines have about N**2/2
possible pairs (the birthday problem), so a 32-bit CRC may not be good
enough.

Otherwise, doing it as an external sort, such as the Unix sort command,
is pretty fast and almost unlimited in the length of file you can
process. (You need enough disk space to hold the file and the
intermediate sort work files.)

-- glen

Nov 13 '05 #6
Ben Pfaff <bl*@cs.stanford.edu> wrote:
eg*************@verizon.net (Eric) writes:
I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.


I can't think why a red-black tree would be superior to a hash
table here. Hash tables work fine for finding duplicates.


That works too, as long as one is careful to avoid an inordinate number
of collisions.

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #7
Eric <eg*************@verizon.net> wrote:
Ben Pfaff <bl*@cs.stanford.edu> wrote:
eg*************@verizon.net (Eric) writes:
I would suggest using a dynamic data structure such that as you read in
each item, you insert the item and keep everything sorted.

As you do the insert then, you can also check to see if it is a
duplicate and abort the insert.

This way, you only have to do one pass over the data to remove the
duplicates and once complete you can just write everything back out.

As for the data structure, I might suggest using a red-black tree.


I can't think why a red-black tree would be superior to a hash
table here. Hash tables work fine for finding duplicates.


That works too, as long as one is careful to avoid an inordinate number
of collisions.


Of course, you need some kind of data structure to hold the hash table
and since we do not apparently know how big that table needs to be, this
could lead back to a red-black tree. The two ideas are not necessarily
mutually exclusive.

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #8
>Subject: Theoretical Problem
From: Materialised ma**********@privacy.net
Date: 11/20/03 3:27 PM Hawaiian Standard Time
Message-id: <bp*************@ID-204621.news.uni-berlin.de>

I have been given the task of developing a program to sit next to a
CGI-based C program (I know this is off-topic, but my question refers
only to the standard C part of the code).

Basically, what the program needs to do is open a text file containing
hundreds, maybe even thousands, of items and remove the duplicates.
Also, if a line's length exceeds a certain number of characters, I wish
to remove that line.

I was thinking of using malloc to create a dynamically allocated array
to load the elements into, then looping through them all removing
duplicate items.

This seems a little resource heavy considering the size of the file I am
dealing with. Does anyone have any suggestions on how to better handle
the task at hand?


A hash table will make searching quicker.

Stuart
Dr. Stuart A. Weinstein
Ewa Beach Institute of Tectonics
"To err is human, but to really foul things up
requires a creationist"
Nov 13 '05 #9
What you can do is sort the items and then remove duplicate entries
from the list, which, as my friends have said, is slower because it
requires multiple passes through the data. What you could do instead is
an insertion sort into another file, and that would help you solve the
problem faster.
hp*****@vcustomer.net
Nov 13 '05 #10
Anuj Heer wrote:
What you can do is sort the items and then remove duplicate entries
from the list, which, as my friends have said, is slower because it
requires multiple passes through the data. What you could do instead is
an insertion sort into another file, and that would help you solve the
problem faster.


No need for multiple passes. If you like trees, pour the file into a tree,
and duplicates will be eliminated automagically. If you prefer hash, hash
each line and filter out the dups that way. Or, better still, use existing
tools, like Ben suggested.

--
Richard Heathfield : bi****@eton.powernet.co.uk
"Usenet is a strange place." - Dennis M Ritchie, 29 July 1999.
C FAQ: http://www.eskimo.com/~scs/C-faq/top.html
K&R answers, C books, etc: http://users.powernet.co.uk/eton
Nov 13 '05 #11
In article <1g************************************@verizon.net>,
eg*************@verizon.net says...

Of course, you need some kind of data structure to hold the hash table
and since we do not apparently know how big that table needs to be,


Simple hash chaining seems like a perfectly suitable method for handling
the dynamic size issue. Any decent data structures book should have
example code to get the OP started.
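
Something like this, for instance (a rough sketch; the bucket count and
the hash function are arbitrary choices here, and each bucket is just a
singly linked list):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096           /* fixed bucket count; the chains absorb growth */

struct entry {
    char         *line;
    struct entry *next;
};

static struct entry *table[NBUCKETS];

static unsigned hash_line(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns 1 if line was newly added, 0 if it was already in the table. */
static int add_line(const char *line)
{
    unsigned      b = hash_line(line);
    struct entry *e;

    for (e = table[b]; e != NULL; e = e->next)
        if (strcmp(e->line, line) == 0)
            return 0;           /* duplicate found in this chain */

    e = malloc(sizeof *e);
    if (e == NULL)
        abort();
    e->line = malloc(strlen(line) + 1);
    if (e->line == NULL)
        abort();
    strcpy(e->line, line);
    e->next  = table[b];
    table[b] = e;
    return 1;
}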

--
Randy Howard _o
2reply remove FOOBAR \<,
______________________()/ ()______________________________________________
SCO Spam-magnet: po********@sco.com
Nov 13 '05 #12
Randy Howard <ra**********@FOOmegapathdslBAR.net> wrote:
In article <1g************************************@verizon.ne t>,
eg*************@verizon.net says...

Of course, you need some kind of data structure to hold the hash table
and since we do not apparently know how big that table needs to be,


Simple hash chaining seems like a perfectly suitable method for handling
the dynamic size issue.


Yes, and I would probably suggest a red-black tree be used for the
chaining. As I mentioned...hash tables and red-black trees are not
mutually exclusive concepts.
--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #13
In article <1g************************************@verizon.net>,
eg*************@verizon.net says...
Randy Howard <ra**********@FOOmegapathdslBAR.net> wrote:
In article <1g4rocq.1qa8btxigvw0vN%eg*************@verizon.net>,
eg*************@verizon.net says...

Of course, you need some kind of data structure to hold the hash table
and since we do not apparently know how big that table needs to be,


Simple hash chaining seems like a perfectly suitable method for handling
the dynamic size issue.


Yes, and I would probably suggest a red-black tree be used for the
chaining. As I mentioned...hash tables and red-black trees are not
mutually exclusive concepts.


OK, you've established (repeatedly) that you like RB trees. Move on.

--
Randy Howard _o
2reply remove FOOBAR \<,
______________________()/ ()______________________________________________
SCO Spam-magnet: po********@sco.com
Nov 13 '05 #14
Randy Howard <ra**********@FOOmegapathdslBAR.net> wrote:
In article <1g************************************@verizon.net>,
eg*************@verizon.net says...
Randy Howard <ra**********@FOOmegapathdslBAR.net> wrote:
In article <1g4rocq.1qa8btxigvw0vN%eg*************@verizon.net>,
eg*************@verizon.net says...
>
> Of course, you need some kind of data structure to hold the hash table
> and since we do not apparently know how big that table needs to be,

Simple hash chaining seems like a perfectly suitable method for handling
the dynamic size issue.


Yes, and I would probably suggest a red-black tree be used for the
chaining. As I mentioned...hash tables and red-black trees are not
mutually exclusive concepts.


OK, you've established (repeatedly) that you like RB trees. Move on.


Do you see a flaw in something I said?

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #15
In article <1g***********************************@verizon.net>,
eg*************@verizon.net says...
Randy Howard <ra**********@FOOmegapathdslBAR.net> wrote:
In article <1g4zmf6.jp3bqj1tg8lq7N%eg*************@verizon.net>,
eg*************@verizon.net says...
Randy Howard <ra**********@FOOmegapathdslBAR.net> wrote:

> In article <1g4rocq.1qa8btxigvw0vN%eg*************@verizon.net>,
> eg*************@verizon.net says...
> >
> > Of course, you need some kind of data structure to hold the hash table
> > and since we do not apparently know how big that table needs to be,
>
> Simple hash chaining seems like a perfectly suitable method for handling
> the dynamic size issue.

Yes, and I would probably suggest a red-black tree be used for the
chaining. As I mentioned...hash tables and red-black trees are not
mutually exclusive concepts.


OK, you've established (repeatedly) that you like RB trees. Move on.


Do you see a flaw in something I said?


*sigh* Are you a relative of R. Bayer (symmetric B trees)?

There is nothing wrong with RB trees per se, same goes for a lot of other
similar data structures. However, since you won't let it go, perhaps a
review of section 14.6 of Sedgewick's Algorithms in C, Third Edition,
Parts 1-4 will be enlightening. Pay particular attention to Table 14.1,
in which RB trees used with hash chaining are demonstrated to NOT give
particularly good performance, either for construction or search times
as compared to other methods. In fact, none of the other 4 methods
considered have worse performance than RB hashing. Go figure.

As the empirical study in the text points out, simple linear probing
with table expansion by doubling is considerably faster, and far simpler
to code. (RB hashing took 2.5X longer for table construction, and 5.6X
longer to handle missed searches than the method you recommend ad nauseam.)
The LP method also uses considerably less memory overhead while providing
performance benefits.
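
For what it's worth, the linear-probing-with-doubling scheme is only a
screenful of code. A rough sketch (the initial table size, the
load-factor threshold, and the hash function are arbitrary choices
here):

#include <stdlib.h>
#include <string.h>

static char   **slots;           /* NULL entry means the slot is empty */
static size_t   nslots = 1024;   /* always a power of two */
static size_t   nused;

static size_t hash_str(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Returns 1 if s was inserted, 0 if it was already in the table. */
static int probe_insert(const char *s)
{
    size_t i;

    if (slots == NULL) {                    /* lazy first-time allocation */
        slots = calloc(nslots, sizeof *slots);
        if (slots == NULL)
            abort();
    }

    if (nused * 2 >= nslots) {              /* load factor 0.5: double and rehash */
        char  **old  = slots;
        size_t  oldn = nslots;

        nslots *= 2;
        slots = calloc(nslots, sizeof *slots);
        if (slots == NULL)
            abort();
        for (i = 0; i < oldn; i++)
            if (old[i] != NULL) {
                size_t j = hash_str(old[i]) & (nslots - 1);
                while (slots[j] != NULL)
                    j = (j + 1) & (nslots - 1);
                slots[j] = old[i];
            }
        free(old);
    }

    i = hash_str(s) & (nslots - 1);
    while (slots[i] != NULL) {
        if (strcmp(slots[i], s) == 0)
            return 0;                       /* already present */
        i = (i + 1) & (nslots - 1);
    }
    slots[i] = malloc(strlen(s) + 1);
    if (slots[i] == NULL)
        abort();
    strcpy(slots[i], s);
    nused++;
    return 1;
}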

--
Randy Howard _o
2reply remove FOOBAR \<,
______________________()/ ()______________________________________________
SCO Spam-magnet: po********@sco.com
Nov 13 '05 #16
Randy Howard <ra**********@FOOmegapathdslBAR.net> wrote:
Pay particular attention to Table 14.1,
in which RB trees used with hash chaining are demonstrated to NOT give
particularly good performance, either for construction or search times
as compared to other methods.

As is specifically mentioned, this table only demonstrates this for
32-bit ints that are easily hashed. The case being dealt with in this
thread is quite different. As such, Table 14.1 is simply irrelevant.

RB hashing took 2.5X longer for table construction, and 5.6X
longer to handle missed searches than the method you recommend ad nauseam.


The kind of hashing being talked about by Sedgewick does not consider
the need to avoid inserting duplicates. The hashing being considered
there assumes that each new node to be inserted is unique. Note the
automatic 'N++' in the sample code, which is a total node count.

In other words, after one has computed the hash value and goes to
insert, when not using an RB tree, one must check each and every node
(let's say there are M nodes) at that hash value. With RB, one only
needs to check log(M) nodes. Now considering that we are inserting
strings and must do string comparisons, the use of RB can be a
significant savings.

Furthermore, hashing performs well only when the hash function can both:

1. be computed quickly and easily, and
2. distribute keys evenly over the table.

Such a function is dependent upon knowing the characteristics of the
data being fed into it and there does not exist a universal hash
function.

Now, in 14.6, Sedgewick does talk about 'Ordered Hashing', which may be
useful for this case, but he does not perform any analysis with respect
to Ordered Hashing and instead informs the reader to look elsewhere.

I am not familiar with the various aspects of Ordered Hashing, but my
intuition and a bit of thought tell me that an RB tree will likely
have very good performance in this case and will deal with worst-case
situations just as easily.

So, my recommendation remains to just use an RB tree.

However, if a good hash function can be found (and there are no
guarantees here), it can speed things up quite nicely: use the hash to
find a bucket to stick things into, and then use an RB tree for the
nodes in that bucket.

--
== Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
"Therefore the considerations of the intelligent always include both
benefit and harm." - Sun Tzu
== Insults, like violence, are the last refuge of the incompetent... ===
Nov 13 '05 #17
