Large amount of files to parse/organize, tips on algorithm?

cnb
I have a bunch of files consisting of movie reviews.

For each file I construct a list of reviews and then for each new file
I merge the reviews so that in the end I have a list of reviewers and,
for each reviewer, all their reviews.

What is the fastest way to do this?

1. Create one file with reviews, open the next file and for each review see
if the reviewer exists; if so, add the review, else create a new reviewer
(sketched below).

2. Create all the separate files with reviews, then mergesort them?
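
Roughly, option 1 would look something like this (parse_reviews and
filenames are just placeholders for whatever reads one file and for the
list of files):

reviews_by_reviewer = {}                     # reviewer -> list of reviews

for path in filenames:                       # filenames: the review files
    for reviewer, review in parse_reviews(path):
        if reviewer in reviews_by_reviewer:  # reviewer already seen
            reviews_by_reviewer[reviewer].append(review)
        else:                                # first review from this reviewer
            reviews_by_reviewer[reviewer] = [review]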

Sep 2 '08 #1
On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:
I have a bunch of files consisting of movie reviews.

For each file I construct a list of reviews and then for each new file I
merge the reviews so that in the end I have a list of reviewers and, for
each reviewer, all their reviews.

What is the fastest way to do this?
Use the timeit module to find out.

1. Create one file with reviews, open the next file and for each review see
if the reviewer exists; if so, add the review, else create a new reviewer.

2. Create all the separate files with reviews, then mergesort them?
The answer will depend on whether you have three reviews or three
million, whether each review is twenty words or twenty thousand words,
and whether you have to do the merging once only or over and over again.
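
For example, something along these lines, where merge_with_dict and
merge_sorted_files are stand-ins for your two approaches and filenames
is the list of input files:

import timeit

setup = "from __main__ import merge_with_dict, merge_sorted_files, filenames"

# run each approach a few times and compare wall-clock totals
t1 = timeit.Timer("merge_with_dict(filenames)", setup).timeit(number=3)
t2 = timeit.Timer("merge_sorted_files(filenames)", setup).timeit(number=3)

print("dict merge:      %.2f s" % t1)
print("mergesort files: %.2f s" % t2)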
--
Steven
Sep 2 '08 #2
cnb
On Sep 2, 7:06 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
On Tue, 02 Sep 2008 09:48:32 -0700, cnb wrote:
I have a bunch of files consisting of movie reviews.
For each file I construct a list of reviews and then for each new file I
merge the reviews so that in the end I have a list of reviewers and, for
each reviewer, all their reviews.
What is the fastest way to do this?

Use the timeit module to find out.
1. Create one file with reviews, open the next file and for each review see
if the reviewer exists; if so, add the review, else create a new reviewer.
2. Create all the separate files with reviews, then mergesort them?

The answer will depend on whether you have three reviews or three
million, whether each review is twenty words or twenty thousand words,
and whether you have to do the merging once only or over and over again.

--
Steven


I merge once. Each review has 3 fields: date, rating, customerid. In
total I'll be parsing between 10K and 100K, eventually 450K reviews.
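
At that size everything fits in memory, so a dict keyed on customerid
seems simplest. A rough sketch (the reviews directory and the
date,rating,customerid line layout are placeholders for my real format):

import os
from collections import defaultdict

reviews_by_customer = defaultdict(list)      # customerid -> [(date, rating), ...]

for name in os.listdir("reviews"):            # "reviews": placeholder directory
    for line in open(os.path.join("reviews", name)):
        line = line.strip()
        if not line or line.endswith(":"):    # skip blanks and any per-movie header line
            continue
        date, rating, customerid = line.split(",")
        reviews_by_customer[customerid].append((date, int(rating)))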
Sep 2 '08 #3
cnb
over 17000 files...

netflixprize.
Sep 2 '08 #4
I think you really want to use a relational database of some sort for this.
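
Even sqlite3 from the standard library would do; a rough sketch (the
table layout and sample row are only illustrative):

import sqlite3

conn = sqlite3.connect("reviews.db")
conn.execute("""CREATE TABLE IF NOT EXISTS reviews
                (customerid TEXT, rating INTEGER, date TEXT)""")

# rows would come from whatever parses the review files
rows = [("12345", 4, "2005-09-06")]                      # placeholder data
conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
conn.commit()

# all reviews for one reviewer in a single query
for row in conn.execute(
        "SELECT date, rating FROM reviews WHERE customerid = ?", ("12345",)):
    print(row)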

On Tue, Sep 2, 2008 at 2:02 PM, cnb <ci**********@yahoo.se> wrote:
over 17000 files...

netflixprize.
--
http://mail.python.org/mailman/listinfo/python-list
Sep 2 '08 #5
cnb <ci**********@yahoo.se> writes:
For each file I construct a list of reviews and then for each new file
I merge the reviews so that in the end I have a list of reviewers and,
for each reviewer, all their reviews.

What is the fastest way to do this?
Scan through all the files sequentially, emitting records like

(movie, reviewer, review)

Then use an external sort utility to sort/merge that output file
on each of the 3 columns. Beats writing code.
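
For instance, emit tab-separated records from a short script and let the
system sort do the heavy lifting (parse_reviews is a placeholder for
whatever reads one file):

import sys

# emit one tab-separated record per review; redirect stdout to records.txt
for path in sys.argv[1:]:
    for movie, reviewer, review in parse_reviews(path):   # placeholder parser
        sys.stdout.write("%s\t%s\t%s\n" % (movie, reviewer, review))

# then, outside Python, e.g. with GNU sort ($'\t' is bash for a literal tab):
#   sort -t $'\t' -k2,2 -k1,1 records.txt > reviews_by_reviewer.txt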
Sep 2 '08 #6
On Sep 2, 1:02 pm, cnb <circularf...@yahoo.se> wrote:
over 17000 files...

netflixprize.
http://wiki.python.org/moin/NetflixPrizeBOF

specifically:

http://pyflix.python-hosting.com/
Sep 2 '08 #7
