Joining Big Files

I have 4 text files, each approx 50 MB.

I need to join these into one large text file.

I only need to do this very occasionally, as the problem has occurred
because of upload limitations.

Bearing in mind file size and memory usage, would I be better off reading
every line in every file and writing each line to the output file, or is
there some way I could execute a shell command?

I am working on a hosted website with no direct telnet or similar
access.

I would appreciate any tips or example code, as I do not want to break
anything on the website.

Richard

Aug 25 '07 #1
On Sat, 25 Aug 2007 02:57:24 -0700, mosscliffe wrote:
I have 4 text files each approx 50 MB. I need to join these into one
large text file. [...]

There are some copy functions that work with file-like objects in the
`shutil` module.
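
For example, shutil.copyfileobj() copies between two already-open file
objects in fixed-size chunks, so a whole 50 MB file never has to sit in
memory at once. A minimal sketch (untested here, and the file names are
just placeholders for your own):

import shutil

outfile = open("joined.txt", "wb")
for inname in ["file1.txt", "file2.txt", "file3.txt", "file4.txt"]:
    infile = open(inname, "rb")
    # copies in buffered chunks rather than slurping the whole file
    shutil.copyfileobj(infile, outfile)
    infile.close()
outfile.close()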

Ciao,
Marc 'BlackJack' Rintsch
Aug 25 '07 #2
On Aug 25, 4:57 am, mosscliffe <mcl.off...@googlemail.com> wrote:
I have 4 text files each approx 50 MB. I need to join these into one
large text file. [...]

I would probably open the files in binary mode and copy 128 KB - 512
KB at a time. I wouldn't copy a line at a time as it is possible that
one of the files contains a very long line.
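
A rough sketch of that idea (untested, and the chunk size and file names
are just placeholders):

CHUNK_SIZE = 256 * 1024  # 256 KB per read, within the suggested 128-512 KB range

outfile = open("joined.txt", "wb")
for inname in ["file1.txt", "file2.txt", "file3.txt", "file4.txt"]:
    infile = open(inname, "rb")
    while True:
        chunk = infile.read(CHUNK_SIZE)
        if not chunk:  # empty string means end of file
            break
        outfile.write(chunk)
    infile.close()
outfile.close()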
Aug 26 '07 #3
On Aug 25, 4:57 am, mosscliffe <mcl.off...@googlemail.com> wrote:
I have 4 text files each approx 50 MB.

<yawn> 50 MB? Really? Did you actually try this and find out it was a
problem?

Try this:
import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname, "w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    outfile.write(infile.read())
    infile.close()
outfile.close()
end = time.clock()

print end - start, "seconds"

For four 30 MB files, this takes just over 1.3 seconds on my system. (You
may need to open the files in binary mode, depending on the contents, but
I was in a hurry.)

-- Paul

Aug 26 '07 #4
On Aug 25, 8:15 pm, Paul McGuire <pt...@austin.rr.com> wrote:
For four 30 MB files, this takes just over 1.3 seconds on my system. [...]

My bad - my test file was not a text file, but a binary file.
Retesting with a 50 MB text file took 24.6 seconds on my machine.

Still in your working range? If not, then you will need to pursue
more exotic approaches. But 25 seconds on an infrequent basis does
not sound too bad, especially since I don't think you will really get
any substantial boost from them (to benchmark this, I timed a raw
"copy" command at the OS level on the resulting 200 MB file, and this
took about 20 seconds).

Keep it simple.

-- Paul

Aug 26 '07 #5
On Aug 26, 6:48 am, Paul McGuire <pt...@austin.rr.com> wrote:
My bad - my test file was not a text file, but a binary file. Retesting
with a 50 MB text file took 24.6 seconds on my machine. [...] Keep it
simple.

There are (at least) another couple of approaches possible, each with
some possible tradeoffs or requirements:

Approach 1. (Least amount of code to write - not that the others are
large :)

Just use os.system() and the UNIX cat command - the requirements here
are that:
a) your web site is hosted on *nix (ok, you can do it if on Windows
too - use copy instead of cat, you might have to add a "cmd /c "
prefix in front of the copy command, and you have to use the right
copy command syntax for concatenating multiple input files into one
output file).

b) your hosting plan allows you to execute OS level commands like cat,
and cat is in your OS PATH (not PYTHONPATH). (Similar comments apply
for Windows hosts).

import os
os.system("cat file1.txt file2.txt file3.txt file4.txt > file_out.txt")

cat will take care of buffering, etc. transparently to you.
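
Since you said you don't want to break anything on the site, you could
also check the value os.system() returns - zero normally means the shell
command succeeded, anything else means cat (or the shell) failed, e.g. a
missing input file or no cat on the PATH. A small sketch of that (the file
names are just the example names from above):

import os

status = os.system("cat file1.txt file2.txt file3.txt file4.txt > file_out.txt")
if status != 0:
    # handle the failure however suits your site, e.g. log it or bail out
    print "concatenation failed, exit status:", status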

Approach 2: Read (in a loop, as you originally thought of doing) each
line of each of the 4 input files and write it to the output file:

("Reusing" Paul McGuire's code above:)

outname = "temp.dat"
outfile = file(outname,"w ")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
infile = file(inname)
for lin in infile:
outfile.write(l in)
infile.close()
outfile.close()
end = time.clock()

print end-start,"seconds"

# You may need to check that newlines are not removed in the above
code, in the output file. (In fact, iterating over a text file in Python
yields each line with its trailing newline intact, so this should not be
needed.) If they are, just add one back with:

outfile.write(lin + "\n") instead of outfile.write(lin).

( Code not tested, test it locally first, though looks ok to me. )

The reason why this _may_ not be much slower than manually coded
buffering approaches is that:

a) Python's standard library is written in C (which is fast),
including use of stdio (the C Standard IO library, which already does
intelligent buffering)
b) OS's do I/O buffering anyway, so do hard disk controllers
c) since some fairly recent Python version (2.2, I think), the idiom
"for lin in infile" has been (based on something I read in the Python
Cookbook) reported to be pretty efficient anyway, and it is (slightly)
more readable than the older approaches to reading a text file.

Given all the above facts, it probably isn't worth your while to try
and optimize the code unless and until you find (by measurements) that
it's too slow - which is a good practice anyway:

http://en.wikipedia.org/wiki/Optimiz...mputer_science)

Excerpt from the above page (it's long but worth reading, IMO):

"Donald Knuth said, paraphrasing Hoare[1],

"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." (Knuth, Donald.
Structured Programming with go to Statements, ACM Journal Computing
Surveys, Vol 6, No. 4, Dec. 1974. p.268.)

Charles Cook commented,

"I agree with this. It's usually not worth spending a lot of time
micro-optimizing code before it's obvious where the performance
bottlenecks are. But, conversely, when designing software at a system
level, performance issues should always be considered from the
beginning. A good software developer will do this automatically,
having developed a feel for where performance issues will cause
problems. An inexperienced developer will not bother, misguidedly
believing that a bit of fine tuning at a later stage will fix any
problems." [2]
"

HTH
Vasudev
-----------------------------------------
Vasudev Ram
http://www.dancingbison.com
http://jugad.livejournal.com
http://sourceforge.net/projects/xtopdf
-----------------------------------------
Aug 26 '07 #6
mcl
On 26 Aug, 15:45, vasudevram <vasudev...@gmail.com> wrote:
There are (at least) another couple of approaches possible, each with
some possible tradeoffs or requirements: [...]

All,

Thank you very much.

As my background is with much smaller-memory machines than today's
giants - 64 KB being a big machine and 640 KB being gigantic - I get very
worried about crashing machines when copying or editing big files,
especially in a multi-user environment.

Mr Knuth - that brings back memories. I remember implementing some
of his sort routines on a mainframe with 24 tape units and an 8k drum,
and almost eliminating one shift per day of computer operator time.

Thanks again

Richard

Aug 26 '07 #7
On Aug 27, 12:43 am, mcl <mcl.off...@googlemail.com> wrote:
[...] Mr Knuth - that brings back memories. I remember implementing some
of his sort routines on a mainframe with 24 tape units and an 8k drum,
and almost eliminating one shift per day of computer operator time. [...]

I can imagine ... though I don't go back that far.
Cool ...

Vasudev

Aug 31 '07 #8
