Hello all;
I have a .csv file that contains messages exported from one discussion forum that I want to import into another forum (phpBB), but I need to do some data manipulation on the original export file first.
The original .csv file contains a subject or topic field on each line. Subsequent lines contain replies and other subjects. The file is in chronological order.
Here is a sample layout:
ForumID,Subject,UNIX TimeStamp,Poster_Name,Message
12,Wings,830799174,John Doe,Lorem ipsum...
13,Wheels,831213087,Jane Q. Public,Dolor sit amet...
12,RE: Wings,860225019,Bob Builder,Praesent adipiscing sapien ut purus....
I need to insert a topic_id in each line of the .csv file. The topic_id for the sample above would be 1 for the first and third line and 2 for the second line.
Here's how I imagine this automated insertion of topic_id would work:
A script would look at each row and read the Subject, then compare it to a table with Subjects. If the Subject does not exist, then it would create a topic_id for that subject and assign it to the row and write the topic_id and subject to the table. If the subject already exists, then it would look up the topic_id from the table and assign it to the row, and so on.
I think the "RE:" in each follow-up post may be a complicating factor, but since all posts are in chronological order, the "RE:" could be stripped from the Subject column.
A larger sample of the actual file can be seen here:
http://www.smofco.com/dicky/Sample_fields_for_import_Rev2.csv
The complete file contains about 35,000 lines or posts.
Is Perl a good programing language for such a task?
I hate to shamelessly ask for assistance in writing this, but my knowledge of Perl is minimal. However, I am willing to learn! So, any pointers, hints, ideas, code samples, etc. will be greatly appreciated.
Cheers,
Omar Filipovic
13 1606
Where does the topic id get inserted?
Where does the topic id get inserted?
Anywhere; it can be at the end of each line, for example.
Or it can be in a separate output file and I can later simply add this file as a column in Excel to the existing csv file, as long as that output file is in the same sort order as the original csv and each topic_id is on a separate line. I know it's not an elegant solution, but it would work.
However, if somehow a row got skipped and at the end there are 30,000 posts and 29,999 topic_ids then I'm stuck...
I can also separate out only the Subject column from the csv file if that will simplify things.
Are the "subjects" always unique?
Are the "subjects" always unique?
No. If I understand the question correctly...
For example, there may be a subject "Wings" in January of 1999 with several follow-ups of "RE: Wings", and then an unrelated discussion in December 2003 also with the subject "Wings" and its follow-ups.
However, this is rare and I'm willing to live with these cases being lumped into the same discussion.
Was that the question?
Yes, that was the question.
I know you said you did not want to just have the code written for you, and normally I never do that until a person shows some effort. But the problem was interesting, (and as you can see the code is not very simple), so I decided to write some code and use the sample data file you posted a link to. Now that file had some inconsistencies in it, like RE:, Re: and R:. And there are some lines that have no topic they are associated with, but that might be because it's just some sample lines from your archive file. You may also have to install the Text::CSV_XS module to make this code work. The topic_id is added as a new second column to the output file:
ForumID, TopicID,Subject,UNIX TimeStamp,Poster_Name,Message - use strict;
-
use warnings;
-
use Text::CSV_XS;
-
-
my %subjects = ();
-
my $topic_id = 1;
-
my $csv = Text::CSV_XS->new({binary=>1});
-
my $infile = 'c:/perl_test/sample.csv';
-
my $outfile = 'c:/perl_test/sample_out.csv';
-
my $errors = 'c:/perl_test/error.txt';
-
-
open IN, $infile or die "$!";
-
open OUT, ">", $outfile or die "$!";
-
open ERROR, ">", $errors or die "$!";
-
-
# Next two lines assume you want to get
-
# and print the headers from the archive csv file.
-
my @header = split(',', scalar <IN>); # get the headers
-
print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
-
-
LOOP: while (<IN>) {
-
chomp;
-
my $this_subject = '';
-
unless ( $csv->parse($_) ) {
-
print ERROR $csv->error_input();
-
next LOOP;
-
}
-
my @fields = $csv->fields;
-
my $s = lc $fields[1]; # lower case the subject and store in $s
-
$s =~ s/^s+//; # remove leading spaces
-
$s =~ s/\s+$//; # remove trailing spaces
-
if ($s =~ /^re?:\s*([\s\S]+)/o) {# assumes we have found a reply to a subject
-
$this_subject = $1; # this is for later when we "combine" the fields
-
}
-
elsif ($s) { # assumes we have a new subject
-
$this_subject = $s; # this is for later when we "combine" the fields
-
$subjects{$s}=$topic_id; # assign the topic_id to a new subject
-
$topic_id++;
-
}
-
else {
-
print ERROR "No subject found: $_";
-
next LOOP;
-
}
-
$csv->combine($fields[0],$subjects{$this_subject},@fields[1..$#fields]);
-
print OUT $csv->string(),"\n";;
-
}
-
close IN;
-
-
run the above code using your sample file to get a feel for what it does. Here is a link to the Text::CSV_XS module. cpan: Text::CSV_XS
Kevin, thank you! This is awesome and there is no way I could have done this myself.
It works but because the infile isn't perfect, the outfile suffers.
It seems there are some posts missing at the beginning of the original file, so for some subjects, the first occurrence begins with an "RE:"
How would I set it up so that a first unique Subject receives a new topic_id regardless if it has a RE: at the beginning or not?
Another option might be to strip the RE:, Re: and R: out completely and just go by the first occurrence of a unique Subject?
I've uploaded the entire original csv file here: http://www.smofco.com/dicky/ but it's ~38MB
Another issue I've noticed: There are topics that come up like this: seats
..
seats
..
Re: seats
Re: seats
..
seats The first two without an "Re:" received separate topic_id's, the two "Re: seats" topics did not receive a topic_id, and the third "seats" without an "Re:" received yet a third topic_id.
Like this: 15063 seats
..
15064 seats
..
Re: seats
Re: seats
..
15071 seats This just adds to my thinking that the "Re:" prefixes are going to be a stumbling block. I guess different mail clients apply them differently and/or some users edited the subject line for their replies.
Another option might be to strip the RE:, Re: and : out completely and just go by the first occurrence of a unique Subject?
That sounds like the best compromise to make. Try this: - use strict;
-
use warnings;
-
use Text::CSV_XS;
-
-
my %subjects = ();
-
my $topic_id = 0;
-
my $csv = Text::CSV_XS->new({binary=>1});
-
my $infile = 'c:/perl_test/sample.csv';
-
my $outfile = 'c:/perl_test/sample_out.csv';
-
my $errors = 'c:/perl_test/error.txt';
-
-
open IN, $infile or die "$!";
-
open OUT, ">", $outfile or die "$!";
-
-
# Next two lines assume you want to get
-
# and print the headers from the archive csv file.
-
my @header = split(',', scalar <IN>); # get the headers
-
print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
-
-
LOOP: while (<IN>) {
-
chomp;
-
unless ( $csv->parse($_) ) {
-
print ERROR $csv->error_input();
-
next LOOP;
-
}
-
my @fields = $csv->fields;
-
my $s = lc $fields[1]; # lower case the subject and store in $s
-
$s =~ s/^s+//; # remove leading spaces
-
$s =~ s/\s+$//; # remove trailing spaces
-
$s =~ s/^re?:\s+//;
-
if (exists $subjects{$s}) {# check for existing subject
-
$csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);
-
print OUT $csv->string(),"\n";;
-
}
-
else { # assume we have a new subject
-
$subjects{$s} = ++$topic_id; # assign the topic_id to a new subject
-
$csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);
-
print OUT $csv->string(),"\n";;
-
}
-
}
-
close IN;
-
close OUT;
Works!
The script was missing - open ERROR, ">", $errors or die "$!";
but there were no errors reported even after fixing it. Pretty cool!
Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.
Cheers,
Omar
Works!
The script was missing - open ERROR, ">", $errors or die "$!";
but there were no errors reported even after fixing it. Pretty cool!
Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.
Cheers,
Omar
Oh, yea, I missed that I still had this line in the modified script: print ERROR $csv->error_input();
So openinng the ERROR file should still be done just in case there was a parsing error.
You're welcome, but please note, this is not how things are generally done around here. I found your problem to be interesting enough that I wanted to figure it out, we normally only help people that show some effort in writing their code first. Plus, it seemed to me that you genuinely needed help and weren't just trying to take advantage of some free help.
Regards,
Kevin
So much for my education in Perl... give a man a fish...
Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.
Cheers,
Omar
So much for my education in Perl... give a man a fish...
Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.
Cheers,
Omar
When you have a one-off project, learning a language is really not practical. I understand this is a learning forum but I also understand that many people don't need or want to learn, they just want to get a job done. You could have probably hired a programmer for about $50 USD to do this work, which is easy enough for an experienced perl coder. It took me about 1 hour to come up with the first code I posted. Anyways, if you ever do want to learn perl, there are a number of good perl forums with people willing to help you along that journey.
Regards,
Kevin
When you have a one-off project, learning a language is really not practical.
Unless you really enjoy tinkering.
Sign in to post your reply or Sign up for a free account.
Similar topics
by: Keith |
last post by:
All:
What is the difference between Perl (CGI) and PHP (Apache module)? I thought both used servers to direct the
user to the appropiate Perl or PHP program in order to execute the program to...
|
by: Kevin Horton |
last post by:
I'm a perl newbie, with next to no programming experience (I did a bunch
of Fortran 25 years ago, but nothing since).
I have a problem I need to solve, and I'm wondering whether perl is the
best...
|
by: Martin Foster |
last post by:
Hi.
I would like to be able to mimic the unix tool 'uniq' within a Perl script.
I have a file with entries that look like this
4 10 21 37 58 83 111 145 184 226...
|
by: scadav |
last post by:
I am new to Perl and I am trying for figure out how to solve this
problem. If anyone can give me some suggestions, I would greatly
appreciate it.
I am trying to read a log file and generate...
|
by: pc |
last post by:
hi all,
I have been blessed with the task of writing a web based database
representing the state of our globally installed isam databases.
there are basically four steps in setting this up:
...
|
by: Timofmars |
last post by:
I'm try to Unload DB2 data from a table into a record sequential file
on NT. I can an unload on Mainframe, but it doesn't seem to be an
option in NT.
In NT, all I can do is export/import. I can...
|
by: codemaster |
last post by:
Hi,
I am a total novice to perl. I am moving files from one directory to another and trying to rename a file appending the timestamp. But for some reason, my code is not able to rename.
...
|
by: 8anos |
last post by:
Hello,
I am new at the community and newbie at programming :)
As you may know rapidshare provides a perl script for linux, to upload files at their servers.
You can find the original scripts at...
|
by: KevinADC |
last post by:
Note: You may skip to the end of the article if all you want is the perl code.
Introduction
Uploading files from a local computer to a remote web server has many useful purposes, the most...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: emmanuelkatto |
last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud.
Please let me know.
Thanks!
Emmanuel
|
by: nemocccc |
last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
|
by: Hystou |
last post by:
There are some requirements for setting up RAID:
1. The motherboard and BIOS support RAID configuration.
2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
|
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
| |