
Avoiding dupes when merging files

Hi all.

I currently have two text files which contain lists of file names. These
text files are updated by my code. What I want to do is merge these
text files, discarding the duplicates.

And to make it harder (or not?!), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.

Help, as always, is greatly appreciated!

Thanks

Nov 21 '05 #1
go************@hotmail.com wrote in news:1101328833.131813.52400@c13g2000cwb.googlegroups.com:
Hi all.

I currently have two text files which contain lists of file names. These
text files are updated by my code. What I want to do is merge these
text files, discarding the duplicates.

And to make it harder (or not?!), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.
Help, as always, is greatly appreciated!

Take a look at the Microsoft Text Driver - you can run SQL queries on the
text file. Perhaps you can just query each file checking for dupes?
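
For example, a minimal sketch of that idea using the Jet OLE DB provider's text handling (a close cousin of the ODBC text driver; the folder and file names are made up here, the driver exposes each line of a headerless file as column F1, and the UNION drops exact duplicates as it merges):

' Sketch only - assumes the two lists live in C:\Data\list1.txt and
' C:\Data\list2.txt. Requires Imports System.Data.OleDb and
' Imports System.Diagnostics.
Dim conn As New OleDbConnection( _
    "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data;" & _
    "Extended Properties=""text;HDR=No;FMT=Delimited""")
conn.Open()

' Jet names a text table with # in place of the extension dot.
' UNION (without ALL) removes exact duplicates while merging.
Dim cmd As New OleDbCommand( _
    "SELECT F1 FROM [list1#txt] UNION SELECT F1 FROM [list2#txt]", conn)
Dim reader As OleDbDataReader = cmd.ExecuteReader()
Do While reader.Read()
    Debug.WriteLine(reader.GetString(0))
Loop
reader.Close()
conn.Close()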

Or you could load the data into a datatable (or hash table type object?),
with the PK set as the filename... if a duplicate shows up, the datatable
should throw a duplicate PK exception which you would catch and ignore.

Or lastly... perhaps you should think of a different method of storing the
data? Maybe a database is a better idea than text files?

--
Lucas Tam (RE********@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
http://members.ebay.com/aboutme/coolspot18/
Nov 21 '05 #2
> Take a look at the Microsoft Text Driver - you can run SQL queries on the
text file. Perhaps you can just query each file checking for dupes?

Or you could load the data into a datatable (or hash table type object?),
with the PK set as the filename... if a duplicate shows up, the datatable
should throw a duplicate PK exception which you would catch and ignore.

Or lastly... perhaps you should think of a different method of storing the
data? Maybe a database is a better idea than text files?


I like the idea of the PK exception as it will give an error that I can
trap. I am being forced to use text files, though, for simplicity. Do you
have any sample code for implementing a DataTable/PK exception, as this is
new to me!

Bob
Nov 21 '05 #3
"Bob Hollness" <bo*@blockbuster.com> wrote in
news:uH**************@TK2MSFTNGP11.phx.gbl:
I like the idea of the PK exception as it will give an error that I
can trap. I am being forced to use text files, though, for simplicity.
Do you have any sample code for implementing a DataTable/PK exception,
as this is new to me!


Here's the example from MSDN:

http://msdn.microsoft.com/library/de...l=/library/en-
us/cpref/html/frlrfsystemdatadatatableclassprimarykeytopic.asp

I've used it a couple of times and it works fine.

Here is what you do in short:

1. Add your columns to a DataTable.
2. Add the key column from step 1 into a primary-key array.
3. Assign that array to the DataTable.PrimaryKey property.
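
A minimal sketch of those three steps (the column name "FileName" and the sample values are made up; the duplicate insert surfaces as a System.Data.ConstraintException, which you catch and ignore):

' Sketch only. Requires Imports System.Data.
Dim table As New DataTable("Files")
Dim keyCol As DataColumn = table.Columns.Add("FileName", GetType(String))
table.PrimaryKey = New DataColumn() {keyCol}   ' steps 1-3 above

Dim line As String
For Each line In New String() {"a.txt", "b.txt", "a.txt"}
    Try
        table.Rows.Add(line)
    Catch ex As ConstraintException
        ' Duplicate key - ignore it and keep going.
    End Try
Next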

--
Lucas Tam (RE********@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
http://members.ebay.com/aboutme/coolspot18/
Nov 21 '05 #4
>
Take a look at the Microsoft Text Driver - you can run SQL queries on the
text file. Perhaps you can just query each file checking for dupes?

Or you could load the data into a datatable (or hash table type object?),
with the PK set as the filename... if a duplicate shows up, the datatable
should throw a duplicate PK exception which you would catch and ignore.

Or lastly... perhaps you should think of a different method of storing the
data? Maybe a database is a better idea than text files?

--
Lucas Tam (RE********@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
http://members.ebay.com/aboutme/coolspot18/


Thanks for the fast reply. I have to use text files, so that really is not
an option. Any pointers or some sample code on how to use the DataTable? I
like the idea of being able to trap a duplicate PK error.

Bob
Nov 21 '05 #5
"Bob Hollness" <bo*@blockbuster.com> wrote in
news:#U**************@TK2MSFTNGP11.phx.gbl:

Take a look at the Microsoft Text Driver - you can run SQL queries on
the text file. Perhaps you can just query each file checking for
dupes?

Or you could load the data into a datatable (or hash table type
object?), with the PK set as the filename... if a duplicate shows up,
the datatable should throw a duplicate PK exception which you would
catch and ignore.

Or lastly... perhaps you should think of a different method of
storing the data? Maybe a database is a better idea than text files?

--
Lucas Tam (RE********@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
http://members.ebay.com/aboutme/coolspot18/


Thanks for the fast reply. I have to use text files, so that really is
not an option. Any pointers or some sample code on how to use the
DataTable? I like the idea of being able to trap a duplicate PK error.


I replied to your message earlier in the day, but I'm not sure whether
you received it:

Here's the example from MSDN (particularly the SetPrimaryKeys Sub):

http://msdn.microsoft.com/library/de...l=/library/en-
us/cpref/html/frlrfsystemdatadatatableclassprimarykeytopic.asp

I've used it a couple of times and it works fine.

Here is what you do in short:

1. Add your columns to a DataTable.
2. Add the key column from step 1 into a primary-key array.
3. Assign that array to the DataTable.PrimaryKey property.
--
Lucas Tam (RE********@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
http://members.ebay.com/aboutme/coolspot18/
Nov 21 '05 #6
Thanks for this, but I guess I need something a little more basic, and a
way to do it in memory or straight to disk. I guess I'll keep playing with
the loops.

--
Bob Hollness

-------------------------------------
I'll have a B please Bob
"Lucas Tam" <RE********@rogers.com> wrote in message
news:Xn***************************@140.99.99.130...
"Bob Hollness" <bo*@blockbuster.com> wrote in
news:uH**************@TK2MSFTNGP11.phx.gbl:
I like the idea of the PK exception as it will give an error that I
can trap. I am being forced to use text files, though, for simplicity.
Do you have any sample code for implementing a DataTable/PK exception,
as this is new to me!


Here's the example from MSDN:

http://msdn.microsoft.com/library/de...l=/library/en-
us/cpref/html/frlrfsystemdatadatatableclassprimarykeytopic.asp

I've used it a couple of times and it works fine.

Here is what you do in short:

1. Add your columns to a DataTable.
2. Add the key column from step 1 into a primary-key array.
3. Assign that array to the DataTable.PrimaryKey property.

--
Lucas Tam (RE********@rogers.com)
Please delete "REMOVE" from the e-mail address when replying.
http://members.ebay.com/aboutme/coolspot18/

Nov 21 '05 #7
> Hi all.

I currently have two text files which contain lists of file names. These
text files are updated by my code. What I want to do is merge these
text files, discarding the duplicates.

And to make it harder (or not?!), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.
Help, as always, is greatly appreciated!

Thanks


OK. This is the solution I came up with. Not as elegant as one would have
hoped, but then again, only I get to see how it functions under the bonnet
(hood, for the Americans)! And of course, this is still to be tidied up
and made pretty. Feel free to pull it apart and embarrass me...

' Writes the lines of File2Compare that do not already appear in
' OriginalFile. Requires Imports System.IO.
Sub FindDupes(ByVal File2Compare As String, ByVal OriginalFile As String, _
        ByVal OutputFile As String)

    Dim File1Reader As New StreamReader(File2Compare)
    Dim File2Reader As StreamReader
    Dim File3Writer As New StreamWriter(OutputFile)
    Dim Line1 As String = ""
    Dim Line2 As String = ""
    Dim Found As Boolean

    Do
        Line1 = File1Reader.ReadLine()
        Found = False

        If Not Line1 Is Nothing Then

            ' Re-scan the original file for every line of the new file.
            File2Reader = New StreamReader(OriginalFile)

            Do
                Line2 = File2Reader.ReadLine()
                If Line1 = Line2 Then
                    Found = True
                    Exit Do
                End If
            Loop Until Line2 Is Nothing

            ' Only lines not found in the original file are written out.
            If Not Found Then
                File3Writer.WriteLine(Line1)
            End If

            File2Reader.Close()

        End If
    Loop Until Line1 Is Nothing

    File1Reader.Close()
    File3Writer.Close()

End Sub
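
A hypothetical call, just to show the argument order (the paths are made up):

FindDupes("C:\Lists\new.txt", "C:\Lists\original.txt", "C:\Lists\merged.txt")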

--
Bob Hollness

-------------------------------------
I'll have a B please Bob
Nov 21 '05 #8
"Bob Hollness" <bo*@blockbuster.com> wrote in news:uUD3YV00EHA.1392
@TK2MSFTNGP14.phx.gbl:
Feel free to pull it apart and embarrass me.......


Very inefficient when compared to Cor's elegant example of a hash table!

Nov 21 '05 #9

"Bob Hollness" <bo*@blockbuster.com> wrote

I currently have two text files which contain lists of file names. These
text files are updated by my code. What I want to do is merge these
text files, discarding the duplicates.
And to make it harder (or not?!), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.
Help, as always, is greatly appreciated!


OK. This is the solution I came up with. Not as elegant as one would have
hoped, but then again, only I get to see how it functions under the bonnet
(hood, for the Americans)! And of course, this is still to be tidied up
and made pretty. Feel free to pull it apart and embarrass me...


As Cor suggested, use a Hashtable (or you might call it a dictionary); it
will be much more efficient, and easier to code....

Paste the following into a routine to see it in action:

HTH
LFS
Dim item As String
Dim hash As New System.Collections.Hashtable

' Two string arrays stand in for the contents of the two files.
Dim file1 As String() = New String() { _
    "Pretend this is text from a file.", _
    "It is contained in an array only for", _
    "demo purposes."}
Dim file2 As String() = New String() { _
    "This is the text from a second file.", _
    "The next line is a duplicate line and", _
    "will overwrite the original entry:", _
    "It is contained (DUPLICATE)", _
    "Only the first 10 characters", _
    "were used toward duplicate testing."}

' Key each line by its first 10 characters; a later line with the same
' key simply overwrites the earlier entry.
For Each item In file1
    hash.Item(item.Substring(0, 10)) = item
Next

For Each item In file2
    hash.Item(item.Substring(0, 10)) = item
Next

' Dump the merged, de-duplicated result.
Dim entry As System.Collections.DictionaryEntry
For Each entry In hash
    Debug.WriteLine(entry.Value)
Next

Debug.WriteLine("")
Debug.WriteLine("Note that the order is not maintained, and")
Debug.WriteLine("the duplicate line's original value was")
Debug.WriteLine("overwritten by the later (duplicate) entry.")

Nov 21 '05 #10
"Bob Hollness" <bo*@blockbuster.com> wrote in message
news:uU**************@TK2MSFTNGP14.phx.gbl...
Hi all.

I currently have two text files which contain lists of file names. These
text files are updated by my code. What I want to do is merge these
text files, discarding the duplicates.

And to make it harder (or not?!), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.
Help, as always, is greatly appreciated!

Thanks


OK. This is the solution I came up with. Not as elegant as one would
have hoped, but then again, only I get to see how it functions under the
bonnet (hood, for the Americans)! And of course, this is still to be
tidied up and made pretty. Feel free to pull it apart and embarrass
me...

' Writes the lines of File2Compare that do not already appear in
' OriginalFile. Requires Imports System.IO.
Sub FindDupes(ByVal File2Compare As String, ByVal OriginalFile As String, _
        ByVal OutputFile As String)

    Dim File1Reader As New StreamReader(File2Compare)
    Dim File2Reader As StreamReader
    Dim File3Writer As New StreamWriter(OutputFile)
    Dim Line1 As String = ""
    Dim Line2 As String = ""
    Dim Found As Boolean

    Do
        Line1 = File1Reader.ReadLine()
        Found = False

        If Not Line1 Is Nothing Then

            ' Re-scan the original file for every line of the new file.
            File2Reader = New StreamReader(OriginalFile)

            Do
                Line2 = File2Reader.ReadLine()
                If Line1 = Line2 Then
                    Found = True
                    Exit Do
                End If
            Loop Until Line2 Is Nothing

            ' Only lines not found in the original file are written out.
            If Not Found Then
                File3Writer.WriteLine(Line1)
            End If

            File2Reader.Close()

        End If
    Loop Until Line1 Is Nothing

    File1Reader.Close()
    File3Writer.Close()

End Sub

--
Bob Hollness

-------------------------------------
I'll have a B please Bob


P.S. Yes I know that half the code is missing. It was late when I posted
this. I will update it with the missing parts this weekend.

--

Bob

--------------------------------------
I'll have a B please Bob.
Nov 21 '05 #11
"Anon-E-Moose" <an**********@yahoo.com> wrote in message
news:Xn********************************@140.99.99. 130...
"Bob Hollness" <bo*@blockbuster.com> wrote in news:uUD3YV00EHA.1392
@TK2MSFTNGP14.phx.gbl:
Feel free to pull it apart and embarrass me.......


Very inefficient when compared to Cor's elegant example of a hash table!


I thought hashing would work because when the hash is calculated for two
identical strings, the hashes would be the same, so it would just be a case
of comparing hashes. But someone else told me that this is not the case,
because hashes are generated rather than calculated, so the hashes would be
different. Is this not so?
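
For what it is worth: within a single process, equal strings always produce
equal hash codes; that contract is what makes a Hashtable work at all, and
collisions between different strings are resolved with Equals, so distinct
keys are never confused. A one-line check:

Debug.WriteLine("abc".GetHashCode() = "abc".GetHashCode())   ' prints True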

--

Bob

--------------------------------------
I'll have a B please Bob.
Nov 21 '05 #12

<go************@hotmail.com> wrote in message news:11*********************@c13g2000cwb.googlegroups.com...
Hi all.

I currently have two text files which contain lists of file names. These
text files are updated by my code. What I want to do is merge these
text files, discarding the duplicates.

And to make it harder (or not?!), my criterion for defining a
duplicate is the left 15 (or so) characters of the file path.
Help, as always, is greatly appreciated!

Thanks


OK, you got me. I have not been thinking straight. I have since given it
further thought, and the hash table is clearly the better way to go,
especially because of the file sizes I will eventually be using. So, I am
currently writing it all now (or just customising the samples placed
here...? ;-) )

Thanks for the code samples, Cor; as always, you have been helpful.
--
Bob Hollness

-------------------------------------
I'll have a B please Bob
Nov 21 '05 #13
