Joining Big Files

I have 4 text files each approx 50mb.

I need to join these into one large text file.

I only need to do this very occasionally, as the problem has occurred
because of upload limitations.

Bearing in mind file size and memory usage, would I be better reading
every line in every file and writing each line to the output file, or
is there some way I could execute some shell command?

I am working on a hosted website with no direct telnet or similar
access.

I would appreciate any tips, best code, as I do not want to break
anything on the website.

Richard

Aug 25 '07 #1
On Sat, 25 Aug 2007 02:57:24 -0700, mosscliffe wrote:
> I have 4 text files each approx 50mb.
>
> I need to join these into one large text file.
>
> I only need to do this very occasionally, as the problem has occurred
> because of upload limitations.
>
> Bearing in mind file size and memory usage, would I be better reading
> every line in every file and writing each line to the output file, or
> is there some way I could execute some shell command?

There are some copy functions that work with file like objects in the
`shutil` module.

Ciao,
Marc 'BlackJack' Rintsch
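
A minimal sketch of the `shutil` route, assuming Python 3 (the other code in this thread is Python 2). The function name `join_files` is made up for illustration; `shutil.copyfileobj` is the standard-library call, and it copies between open file objects in fixed-size chunks, so memory use should stay small whatever the file size.

```python
import shutil


def join_files(input_names, output_name):
    """Append each input file to output_name, in the order given."""
    # Binary mode copies the bytes unchanged (no newline translation).
    with open(output_name, "wb") as outfile:
        for name in input_names:
            with open(name, "rb") as infile:
                # copyfileobj reads and writes in chunks internally.
                shutil.copyfileobj(infile, outfile)
```

Usage might look like `join_files(["file1.dat", "file2.dat", "file3.dat", "file4.dat"], "temp.dat")`, with the file names being placeholders.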
Aug 25 '07 #2
I would probably open the files in binary mode and copy 128 KB - 512
KB at a time. I wouldn't copy a line at a time as it is possible that
one of the files contains a very long line.
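
The chunked copy described above could be sketched roughly as follows, assuming Python 3. The name `append_in_chunks` and the 256 KB default are illustrative choices only; any size in the suggested 128 KB - 512 KB range should behave similarly.

```python
CHUNK_SIZE = 256 * 1024  # 256 KB, inside the 128 KB - 512 KB range suggested


def append_in_chunks(input_names, output_name, chunk_size=CHUNK_SIZE):
    """Concatenate files while holding at most chunk_size bytes in memory."""
    with open(output_name, "wb") as outfile:
        for name in input_names:
            with open(name, "rb") as infile:
                while True:
                    chunk = infile.read(chunk_size)
                    if not chunk:  # empty bytes object means end of file
                        break
                    outfile.write(chunk)
```

Because the read size is fixed, a single very long line costs no more memory than any other part of the file.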
Aug 26 '07 #3
On Aug 25, 4:57 am, mosscliffe <mcl.off...@googlemail.com> wrote:
> I have 4 text files each approx 50mb.

<yawn> 50mb? Really? Did you actually try this and find out it was a
problem?

Try this:
import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname,"w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    # read the whole input file into memory, then append it to the output
    outfile.write( infile.read() )
    infile.close()
outfile.close()
end = time.clock()

print end-start,"seconds"

For 4 30Mb files, this takes just over 1.3 seconds on my system. (You
may need to open files in binary mode, depending on the contents, but
I was in a hurry.)

-- Paul

Aug 26 '07 #4
My bad, my test file was not a text file, but a binary file.
Retesting with a 50Mb text file took 24.6 seconds on my machine.

Still in your working range? If not, then you will need to pursue
more exotic approaches. But 25 seconds on an infrequent basis does
not sound too bad, especially since I don't think you will really get
any substantial boost from them (to benchmark this, I timed a raw
"copy" command at the OS level of the resulting 200Mb file, and this
took about 20 seconds).

Keep it simple.

-- Paul

Aug 26 '07 #5
There are (at least) another couple of approaches possible, each with
some possible tradeoffs or requirements:

Approach 1. (Least amount of code to write - not that the others are
large :)

Just use os.system() and the UNIX cat command - the requirement here
is that:
a) your web site is hosted on *nix (ok, you can do it if on Windows
too - use copy instead of cat, you might have to add a "cmd /c "
prefix in front of the copy command, and you have to use the right
copy command syntax for concatenating multiple input files into one
output file).

b) your hosting plan allows you to execute OS level commands like cat,
and cat is in your OS PATH (not PYTHONPATH). (Similar comments apply
for Windows hosts).

import os
os.system("cat file1.txt file2.txt file3.txt file4.txt > file_out.txt")

cat will take care of buffering, etc. transparently to you.
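
A variation on the same idea, sketched for Python 3 with the `subprocess` module instead of `os.system()`. It still assumes a *nix host where `cat` is on the PATH and where the hosting plan allows running it; `concat_with_cat` is a made-up name for illustration.

```python
import subprocess


def concat_with_cat(input_names, output_name):
    """Run cat on the inputs and send its output to output_name."""
    with open(output_name, "wb") as outfile:
        # Passing a list avoids the shell, so file names containing
        # spaces or special characters need no quoting.
        subprocess.run(["cat", *input_names], stdout=outfile, check=True)
```

With `check=True`, a failing `cat` (for example, a missing input file) raises `subprocess.CalledProcessError` rather than passing silently.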

Approach 2: Read (in a loop, as you originally thought of doing) each
line of each of the 4 input files and write it to the output file:

("Reusing" Paul McGuire's code above:)

import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname,"w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    # copy one line at a time
    for lin in infile:
        outfile.write(lin)
    infile.close()
outfile.close()
end = time.clock()

print end-start,"seconds"

# Newlines are not removed by the above code: iterating over a file
yields each line with its trailing newline intact, so
outfile.write(lin) is all that is needed. Writing

outfile.write(lin + "\n") instead of outfile.write(lin) would double-space the output.

( Code not tested, test it locally first, though looks ok to me. )
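
On the newline question, a small Python 3 sketch that can be run locally. The helper name `lines_of` and the sample text are invented for the demonstration; it shows that iterating over a file yields each line with its line ending still attached.

```python
import os
import tempfile


def lines_of(path):
    """Return the lines of a text file exactly as file iteration yields them."""
    with open(path, "r", newline="") as f:  # newline="" disables translation
        return [line for line in f]


# Write a small sample file, then read it back line by line.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", newline="") as f:
    f.write("first\nsecond\nlast line without newline")

lines = lines_of(sample_path)
os.remove(sample_path)

# Every line except the last still ends in "\n".
print(lines)  # prints ['first\n', 'second\n', 'last line without newline']
```

Joining the lines back together reproduces the original text, which is why writing each line unchanged is enough.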

The reason why this _may_ not be much slower than manually coded
buffering approaches, is that:

a) Python's built-in file object is implemented in C (which is fast),
including use of stdio (the C Standard IO library, which already does
intelligent buffering)
b) OSes do I/O buffering anyway, and so do hard disk controllers
c) since some fairly recent Python version (I think it was 2.2), the
idiom "for lin in infile" has been stated (based on something I read in
the Python Cookbook) to be pretty efficient anyway, and it is (slightly)
more readable than the earlier approaches to reading a text
file.

Given all the above facts, it probably isn't worth your while to try
and optimize the code unless and until you find (by measurements) that
it's too slow - which is a good practice anyway:

http://en.wikipedia.org/wiki/Optimiz...mputer_science)

Excerpt from the above page (it's long but worth reading, IMO):

"Donald Knuth said, paraphrasing Hoare[1],

"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." (Knuth, Donald.
Structured Programming with go to Statements, ACM Journal Computing
Surveys, Vol 6, No. 4, Dec. 1974. p.268.)

Charles Cook commented,

"I agree with this. It's usually not worth spending a lot of time
micro-optimizing code before it's obvious where the performance
bottlenecks are. But, conversely, when designing software at a system
level, performance issues should always be considered from the
beginning. A good software developer will do this automatically,
having developed a feel for where performance issues will cause
problems. An inexperienced developer will not bother, misguidedly
believing that a bit of fine tuning at a later stage will fix any
problems." [2]
"

HTH
Vasudev
-----------------------------------------
Vasudev Ram
http://www.dancingbison.com
http://jugad.livejournal.com
http://sourceforge.net/projects/xtopdf
-----------------------------------------
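
On the advice to measure before optimizing: a rough Python 3 sketch of how the line-by-line and chunked copies could be timed against each other on a generated test file. All names here (`copy_by_line`, `copy_by_chunk`, `timed`, `compare`) and the 5 MB default are illustrative assumptions, and the timings will depend entirely on the machine and its disk cache.

```python
import os
import tempfile
import time


def copy_by_line(src, dst):
    """Copy src to dst one line at a time."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            fout.write(line)


def copy_by_chunk(src, dst, chunk_size=256 * 1024):
    """Copy src to dst in fixed-size binary chunks."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)


def timed(func, src, dst):
    """Return the wall-clock seconds that func(src, dst) takes."""
    start = time.perf_counter()
    func(src, dst)
    return time.perf_counter() - start


def compare(size_mb=5):
    """Time both copies on a generated file of roughly size_mb megabytes."""
    results = {}
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "sample.txt")
        line = b"x" * 79 + b"\n"  # 80 bytes per line
        with open(src, "wb") as f:
            for _ in range(size_mb * 1024 * 1024 // len(line)):
                f.write(line)
        for func in (copy_by_line, copy_by_chunk):
            dst = os.path.join(workdir, func.__name__ + ".out")
            results[func.__name__] = timed(func, src, dst)
            # Both strategies must produce a copy of the same size.
            assert os.path.getsize(dst) == os.path.getsize(src)
    return results
```

Calling `print(compare())` shows the seconds taken by each approach; if the two numbers are close for realistic file sizes, the simpler code is probably the one to keep.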
Aug 26 '07 #6
mcl
All,

Thank you very much.

As my background is in machines with much less memory than today's giants
(64k being a big machine and 640k being gigantic), I get very worried
about crashing machines when copying or editing big files, especially
in a multi-user environment.

Mr Knuth - that brings back memories. I remember implementing some
of his sort routines on a mainframe with 24 tape units and an 8k drum
and almost eliminating one shift per day of computer operator time.

Thanks again

Richard

Aug 26 '07 #7
I can imagine ... though I don't go back that far.
Cool ...

Vasudev

Aug 31 '07 #8
