
Break up Large file into smaller ones

Hi,

I have a large text file (1 GB) which I need to break into smaller files based on the contents of a column in the file.

The values in the column of interest start low, increase, then start low again, e.g.:
1
2
3
0
3
4

What I need is to break this up into smaller files whenever the column does not increase, so the example becomes:

file1
1
2
3

file2
0
3
4

I have code that reads the large file into an array of arrays (takes ages!), then checks consecutive rows of the field of interest to see if they decrease; if not, it prints the row to the current file, otherwise it prints the row to a new file.

Here is the code; any suggestions to increase efficiency are much appreciated!


my(@ibdfile,@tmp) = ();
open( INFILE, "< $infile" ) or die "Can't open $infile : $!";
while( <INFILE> ) {
    next unless $. > 0;
    @tmp = split;
    push @ibdfile, [@tmp];
}

my $chr = 1;
my $n = @ibdfile;
for (my $j = 0; $j < $n; $j++) {
    if ( $ibdfile[$j][3] >= $ibdfile[$j+1][3] ) {
        my $outfile = "$chr chr.ibd";
        open OUTFILE, ">>$outfile" or die "can't open '$outfile': $!";
        print OUTFILE "@{$ibdfile[$j]}\n";
        close OUTFILE;
    }
    else {
        $chr++;
    }
}
Dec 10 '07 #1
numberwhun
3,509 Expert Mod 2GB
Well, if I had a sample of your data (and a little more time on my hands at the moment), I could probably give you a more definitive, code-driven answer, but here is what I would do (pseudocode-wise).

I would work with two counters: one to append to the end of a general file name, and one to compare against the column of interest. Initially set the column counter to 0; when you grab the appropriate column from the first record, compare it against that counter. If the column value is greater (assuming you start with zero and the first record starts with a number greater than zero), output the record to the current file. If it is smaller, create a new file with the other counter's number appended to it, and increment that other counter. Additionally, after each comparison, set the column counter to whatever value was in the column, so the next record can be compared against it.

Sorry, I tried to get this all down as clearly as I could, although it may be a bit sketchy. Let me know if you understand it or not.
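
Something like this, perhaps — an untested sketch of the two-counter idea, assuming the column of interest is the 4th whitespace-separated field (index 3), as in your code; the input and output file names are just placeholders:

use strict;
use warnings;

my $infile  = 'yourdata.txt';   # placeholder: your 1 GB input file
my $filenum = 1;                # counter appended to the output file name
my $last    = 0;                # counter holding the previous column value
                                # (assumes the first record's value is > 0)

open( my $in,  '<', $infile )                or die "Can't open $infile: $!";
open( my $out, '>', "outfile_$filenum.txt" ) or die "Can't open outfile: $!";

while ( my $line = <$in> ) {
    my $val = ( split /\s+/, $line )[3];    # 4th column, as in your code
    next unless defined $val;               # skip blank or short lines
    if ( $val <= $last ) {                  # column did not increase: new file
        close $out;
        $filenum++;
        open( $out, '>', "outfile_$filenum.txt" )
            or die "Can't open outfile: $!";
    }
    print $out $line;
    $last = $val;                           # remember this value for the next comparison
}
close $out;
close $in;

This processes the file one line at a time, so it never needs to hold the whole 1 GB in memory.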

Regards,

Jeff
Dec 10 '07 #2
KevinADC
4,059 Expert 2GB
untested code:

  1. my $chr = 0;
  2. open( INFILE, "<", $infile) or die "Can't open $infile : $!";
  3. open (my $OUTFILE, ">>", ++$chr . ' chr.ibd') or die "can't open outfile: $!";
  4. my $first_line =  <INFILE>;
  5. my $last = (split(/\s+/,$first_line))[3];
  6. while( <INFILE> ) {
  7.    my $next = (split(/\s+/))[3] );
  8.    if ( $last >= $next ) {
  9.       close $OUTFILE;
  10.       open (my $OUTFILE, ">>", ++$chr . ' chr.ibd') or die "can't open outfile': $!";
  11.    }
  12.    print $OUTFILE;
  13.    $last = $next; 
  14. }
Dec 11 '07 #3
KevinADC
4,059 Expert 2GB
I still had this code in my Perl IDE and noticed an error in line 7:

my $next = (split(/\s+/))[3] );

The last parenthesis needs to be removed; it should be:

my $next = (split(/\s+/))[3];
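
For completeness, here is the whole loop with that fix folded in — still untested. A few more small corrections along the way: the open inside the if shouldn't re-declare a new lexical my $OUTFILE (that inner variable goes out of scope at the closing brace, so the later print would hit the closed outer handle); print $OUTFILE; needs the line as an argument, otherwise Perl prints the handle itself to STDOUT; and the first line read should also be written out so the first record isn't dropped:

use strict;
use warnings;

my $infile = 'yourdata.txt';   # placeholder input file name
my $chr = 0;
open( my $INFILE, '<', $infile ) or die "Can't open $infile : $!";
open( my $OUTFILE, '>>', ++$chr . ' chr.ibd' ) or die "can't open outfile: $!";
my $first_line = <$INFILE>;
print $OUTFILE $first_line;                     # don't drop the first record
my $last = ( split( /\s+/, $first_line ) )[3];
while ( my $line = <$INFILE> ) {
    my $next = ( split( /\s+/, $line ) )[3];
    if ( $last >= $next ) {                     # column did not increase: new file
        close $OUTFILE;
        open( $OUTFILE, '>>', ++$chr . ' chr.ibd' )
            or die "can't open outfile: $!";
    }
    print $OUTFILE $line;
    $last = $next;
}
close $OUTFILE;
close $INFILE;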
Dec 11 '07 #4
jagjot
6
What you can do is:

1. Read the entire file into an array.
2. Use a loop to write the elements of the array to an output_file_var (where var is the loop variable) till the loop encounters a zero, and then start the loop again after incrementing var.
3. This way you will have multiple small files.
4. You have to take care to execute the loop as many times as there are 0s in the array, which can easily be counted (see the sketch after this list).
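
A rough, untested sketch of that array-based idea, done as a single pass over the array rather than re-running the loop per zero; file names are placeholders and the column checked is the 4th field, as elsewhere in the thread. Note that slurping a 1 GB file into an array is what made the original slow, so the streaming versions above may be the better fit:

use strict;
use warnings;

my $infile = 'yourdata.txt';    # placeholder input file name

# Step 1: read the entire file into an array (memory-hungry for 1 GB!)
open( my $in, '<', $infile ) or die "Can't open $infile: $!";
my @lines = <$in>;
close $in;

# Steps 2-4: walk the array once, starting a new output file at each zero.
my $var   = 1;                  # the var appended to output_file_
my $first = 1;                  # guard so a leading zero doesn't leave file 1 empty
open( my $out, '>', "output_file_$var" ) or die "Can't open output: $!";
for my $line (@lines) {
    my $val = ( split /\s+/, $line )[3];    # 4th column
    next unless defined $val;               # skip blank or short lines
    if ( $val == 0 && !$first ) {           # hit a zero: start the next small file
        close $out;
        $var++;
        open( $out, '>', "output_file_$var" ) or die "Can't open output: $!";
    }
    print $out $line;
    $first = 0;
}
close $out;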
Dec 13 '07 #5
