Joining Big Files

mosscliffe

I have 4 text files each approx 50mb.

I need to join these into one large text file.

I only need to do this very occasionally, as the problem has occurred
because of upload limitations.

Bearing in mind file size and memory usage, would I be better off
reading every line in every file and writing each line to the output
file, or is there some way I could execute a shell command?

I am working on a hosted website with no direct telnet or similar
access.

I would appreciate any tips or sample code, as I do not want to break
anything on the website.

Richard

Aug 25 '07 #1

Marc 'BlackJack' Rintsch

On Sat, 25 Aug 2007 02:57:24 -0700, mosscliffe wrote:

> Bearing in mind file size and memory usage, would I be better off
> reading every line in every file and writing each line to the output
> file, or is there some way I could execute a shell command?

There are some copy functions that work with file like objects in the
`shutil` module.
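
For illustration, here is a minimal Python 3 sketch of that idea using
shutil.copyfileobj. The function name join_files and the file names in
the usage note are hypothetical, not something from this thread.

```python
import shutil


def join_files(input_names, output_name):
    """Concatenate the input files, in order, into output_name."""
    # Binary mode copies the bytes untouched (no newline translation).
    with open(output_name, "wb") as outfile:
        for name in input_names:
            with open(name, "rb") as infile:
                # copyfileobj reads and writes in fixed-size chunks,
                # so memory use stays small regardless of file size.
                shutil.copyfileobj(infile, outfile)
```

It could be called as join_files(["file1.dat", "file2.dat"], "temp.dat").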

Ciao,
Marc 'BlackJack' Rintsch

Aug 25 '07 #2

beginner

On Aug 25, 4:57 am, mosscliffe <mcl.off...@googlemail.com> wrote:

> Bearing in mind file size and memory usage, would I be better off
> reading every line in every file and writing each line to the output
> file, or is there some way I could execute a shell command?

I would probably open the files in binary mode and copy 128 KB - 512
KB at a time. I wouldn't copy a line at a time as it is possible that
one of the files contains a very long line.
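
A rough Python 3 sketch of the chunked copy described above. The name
join_in_chunks and the 256 KB default are assumptions chosen for
illustration; any size in the suggested range should behave similarly.

```python
CHUNK_SIZE = 256 * 1024  # assumed value inside the 128 KB - 512 KB range


def join_in_chunks(input_names, output_name, chunk_size=CHUNK_SIZE):
    """Append each input file to output_name, chunk_size bytes at a time."""
    with open(output_name, "wb") as outfile:
        for name in input_names:
            with open(name, "rb") as infile:
                while True:
                    chunk = infile.read(chunk_size)
                    if not chunk:  # empty bytes object means end of file
                        break
                    outfile.write(chunk)
```

Because the read size is fixed, a file that consists of one very long
line costs no more memory than any other file.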

Aug 26 '07 #3

Paul McGuire

On Aug 25, 4:57 am, mosscliffe <mcl.off...@googlemail.com> wrote:

> I have 4 text files each approx 50mb.

<yawn> 50mb? Really? Did you actually try this and find out it was a
problem?

Try this:

import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname,"w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    outfile.write( infile.read() )
    infile.close()
outfile.close()
end = time.clock()

print end-start,"seconds"

For four 30 MB files, this takes just over 1.3 seconds on my system.
(You may need to open the files in binary mode, depending on the
contents, but I was in a hurry.)
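
The code above is Python 2; file() no longer exists in Python 3 and
time.clock() was removed in Python 3.8. A hedged Python 3 sketch of the
same read-everything approach follows. The name join_whole_files is
hypothetical, and binary mode is an assumption made so the bytes are
copied unchanged.

```python
import time


def join_whole_files(input_names, output_name):
    """Join files by reading each one fully; return elapsed seconds."""
    start = time.perf_counter()
    with open(output_name, "wb") as outfile:
        for name in input_names:
            with open(name, "rb") as infile:
                # read() loads the whole input file into memory at once.
                outfile.write(infile.read())
    return time.perf_counter() - start
```

Note that peak memory use is roughly the size of the largest input
file, which is the trade-off for the simplicity.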

-- Paul

Aug 26 '07 #4

Paul McGuire

My bad, my test file was not a text file, but a binary file.
Retesting with a 50 MB text file took 24.6 seconds on my machine.

Still within your working range? If not, then you will need to pursue
more exotic approaches. But 25 seconds on an infrequent basis does
not sound too bad, especially since I don't think those approaches
would give you any substantial boost. To benchmark this, I timed a raw
"copy" command at the OS level on the resulting 200 MB file, and it
took about 20 seconds.

Keep it simple.

-- Paul

Aug 26 '07 #5

vasudevram

There are (at least) another couple of approaches possible, each with
some possible tradeoffs or requirements:

Approach 1. (Least amount of code to write - not that the others are
large :)

Just use os.system() and the UNIX cat command - the requirement here
is that:
a) your web site is hosted on *nix (ok, you can do it if on Windows
too - use copy instead of cat, you might have to add a "cmd /c "
prefix in front of the copy command, and you have to use the right
copy command syntax for concatenating multiple input files into one
output file).

b) your hosting plan allows you to execute OS level commands like cat,
and cat is in your OS PATH (not PYTHONPATH). (Similar comments apply
for Windows hosts).

import os
os.system("cat file1.txt file2.txt file3.txt file4.txt > file_out.txt")

cat will take care of buffering, etc. transparently to you.
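
As a variation on Approach 1, here is a Python 3 sketch that uses
subprocess.run instead of os.system. It assumes, just as above, that
cat is on the PATH and that the host allows spawning processes; the
name join_with_cat is hypothetical.

```python
import subprocess


def join_with_cat(input_names, output_name):
    """Run `cat` on the inputs and send its output to output_name."""
    with open(output_name, "wb") as outfile:
        # Passing a list avoids the shell, so file names containing
        # spaces need no quoting. check=True raises CalledProcessError
        # if cat exits with a non-zero status.
        subprocess.run(["cat", *input_names], stdout=outfile, check=True)
```

If the host forbids running external commands, this will fail with an
exception rather than silently producing an empty file.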

Approach 2: Read (in a loop, as you originally thought of doing) each
line of each of the 4 input files and write it to the output file:

("Reusing" Paul McGuire's code above:)

import time

start = time.clock()
outname = "temp.dat"
outfile = file(outname,"w")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = file(inname)
    for lin in infile:
        outfile.write(lin)
    infile.close()
outfile.close()
end = time.clock()

print end-start,"seconds"

Note: iterating over a file object yields each line with its trailing
newline still attached, so the code above does not remove newlines and
outfile.write(lin) reproduces the input as-is. Do not use
outfile.write(lin + "\n"), which would double every line break.

( Code not tested, test it locally first, though looks ok to me. )
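
A small Python 3 check of the newline behaviour discussed above. It
uses io.StringIO objects as stand-ins for real text files (an
assumption made to keep the example self-contained), and copy_lines is
a hypothetical helper name.

```python
import io


def copy_lines(infile, outfile):
    """Write every line of infile to outfile without modification."""
    for line in infile:
        # Each line still ends with its newline, if it had one.
        outfile.write(line)


text = "first\nsecond\nthird, no trailing newline"
source = io.StringIO(text)
target = io.StringIO()
copy_lines(source, target)
print(target.getvalue() == text)  # True
```

Since the copy is identical to the input, nothing needs to be added
back when writing each line.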

The reasons why this _may_ not be much slower than manually coded
buffering approaches are:

a) Python's file I/O is implemented in C (which is fast) and uses
stdio (the C standard I/O library, which already does intelligent
buffering).
b) Operating systems do I/O buffering anyway, and so do hard disk
controllers.
c) Since some recent Python version (2.2, I think), the idiom
"for lin in infile" has been described (in something I read in the
Python Cookbook) as pretty efficient anyway, and it is also (slightly)
more readable than the earlier approaches to reading a text file.

Given all the above facts, it probably isn't worth your while to try
and optimize the code unless and until you find (by measurements) that
it's too slow - which is a good practice anyway:

http://en.wikipedia.org/wiki/Optimiz...mputer_science)

Excerpt from the above page (it's long but worth reading, IMO):

"Donald Knuth said, paraphrasing Hoare[1],

"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." (Knuth, Donald.
Structured Programming with go to Statements, ACM Journal Computing
Surveys, Vol 6, No. 4, Dec. 1974. p.268.)

Charles Cook commented,

"I agree with this. It's usually not worth spending a lot of time
micro-optimizing code before it's obvious where the performance
bottlenecks are. But, conversely, when designing software at a system
level, performance issues should always be considered from the
beginning. A good software developer will do this automatically,
having developed a feel for where performance issues will cause
problems. An inexperienced developer will not bother, misguidedly
believing that a bit of fine tuning at a later stage will fix any
problems." [2]
"

HTH
Vasudev
-----------------------------------------
Vasudev Ram
http://www.dancingbison.com
http://jugad.livejournal.com
http://sourceforge.net/projects/xtopdf
-----------------------------------------

Aug 26 '07 #6

mcl

All,

Thank you very much.

As my background is in machines with much less memory than today's
giants (64k being a big machine and 640k being gigantic), I get very
worried about crashing machines when copying or editing big files,
especially in a multi-user environment.

Mr Knuth - that brings back memories. I remember implementing some
of his sort routines on a mainframe with 24 tape units and an 8k drum
and almost eliminating one shift per day of computer operator time.

Thanks again

Richard

Aug 26 '07 #7

vasudevram

I can imagine ... though I don't go back that far.
Cool ...

Vasudev

Aug 31 '07 #8
