By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
455,847 Members | 1,326 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 455,847 IT Pros & Developers. It's quick & easy.

Is this possible in Perl? Mail archive data manipulation

P: 8
Hello all;

I have a .csv file that contains messages exported from one discussion forum that I want to import into another forum (phpBB), but I need to do some data manipulation on the original export file first.

The original .csv file contains a subject or topic field on each line. Subsequent lines contain replies and other subjects. The file is in chronological order.

Here is a sample layout:

ForumID,Subject,UNIX TimeStamp,Poster_Name,Message
12,Wings,830799174,John Doe,Lorem ipsum...
13,Wheels,831213087,Jane Q. Public,Dolor sit amet...
12,RE: Wings,860225019,Bob Builder,Praesent adipiscing sapien ut purus....

I need to insert a topic_id in each line of the .csv file. The topic_id for the sample above would be 1 for the first and third line and 2 for the second line.

Here's how I imagine this automated insertion of topic_id would work:

A script would look at each row and read the Subject, then compare it to a table with Subjects. If the Subject does not exist, then it would create a topic_id for that subject and assign it to the row and write the topic_id and subject to the table. If the subject already exists, then it would look up the topic_id from the table and assign it to the row, and so on.

I think the "RE:" in each follow-up post may be a complicating factor, but since all posts are in chronological order, the "RE:" could be stripped from the Subject column.

A larger sample of the actual file can be seen here:
http://www.smofco.com/dicky/Sample_fields_for_import_Rev2.csv

The complete file contains about 35,000 lines or posts.

Is Perl a good programing language for such a task?

I hate to shamelessly ask for assistance in writing this, but my knowledge of Perl is minimal. However, I am willing to learn! So, any pointers, hints, ideas, code samples, etc. will be greatly appreciated.

Cheers,
Omar Filipovic
Jan 5 '08 #1
Share this Question
Share on Google+
13 Replies


KevinADC
Expert 2.5K+
P: 4,059
Where does the topic id get inserted?
Jan 5 '08 #2

P: 8
Where does the topic id get inserted?
Anywhere; it can be at the end of each line, for example.

Or it can be in a separate output file and I can later simply add this file as a column in Excel to the existing csv file, as long as that output file is in the same sort order as the original csv and each topic_id is on a separate line. I know it's not an elegant solution, but it would work.

However, if somehow a row got skipped and at the end there are 30,000 posts and 29,999 topic_ids then I'm stuck...

I can also separate out only the Subject column from the csv file if that will simplify things.
Jan 6 '08 #3

KevinADC
Expert 2.5K+
P: 4,059
Are the "subjects" always unique?
Jan 6 '08 #4

P: 8
Are the "subjects" always unique?
No. If I understand the question correctly...

For example, there may be a subject "Wings" in January of 1999 with several follow-ups of "RE: Wings", and then an unrelated discussion in December 2003 also with the subject "Wings" and its follow-ups.

However, this is rare and I'm willing to live with these cases being lumped into the same discussion.

Was that the question?
Jan 6 '08 #5

KevinADC
Expert 2.5K+
P: 4,059
Yes, that was the question.

I know you said you did not want to just have the code written for you, and normally I never do that until a person shows some effort. But the problem was interesting, (and as you can see the code is not very simple), so I decided to write some code and use the sample data file you posted a link to. Now that file had some inconsistencies in it, like RE:, Re: and R:. And there are some lines that have no topic they are associated with, but that might be because it's just some sample lines from your archive file. You may also have to install the Text::CSV_XS module to make this code work. The topic_id is added as a new second column to the output file:

ForumID,TopicID,Subject,UNIX TimeStamp,Poster_Name,Message

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3. use Text::CSV_XS;
  4.  
  5. my %subjects = ();
  6. my $topic_id = 1;
  7. my $csv = Text::CSV_XS->new({binary=>1});
  8. my $infile  = 'c:/perl_test/sample.csv';
  9. my $outfile = 'c:/perl_test/sample_out.csv';
  10. my $errors  = 'c:/perl_test/error.txt';
  11.  
  12. open IN, $infile or die "$!";
  13. open OUT, ">", $outfile or die "$!";
  14. open ERROR, ">", $errors or die "$!";  
  15.  
  16. # Next two lines assume you want to get
  17. # and print the headers from the archive csv file.
  18. my @header = split(',', scalar <IN>); # get the headers 
  19. print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
  20.  
  21. LOOP: while (<IN>) {
  22.    chomp;
  23.    my $this_subject = '';
  24.    unless ( $csv->parse($_) ) {
  25.       print ERROR $csv->error_input();
  26.       next LOOP;
  27.    }
  28.    my @fields = $csv->fields;
  29.    my $s = lc $fields[1]; # lower case the subject and store in $s
  30.    $s =~ s/^s+//;  # remove leading spaces  
  31.    $s =~ s/\s+$//; # remove trailing spaces
  32.    if ($s =~ /^re?:\s*([\s\S]+)/o) {# assumes we have found a reply to a subject
  33.       $this_subject = $1; # this is for later when we "combine" the fields
  34.    }
  35.    elsif ($s) { # assumes we have a new subject
  36.       $this_subject = $s; # this is for later when we "combine" the fields
  37.       $subjects{$s}=$topic_id; # assign the topic_id to a new subject
  38.       $topic_id++;
  39.    }
  40.    else {
  41.       print ERROR "No subject found: $_";
  42.       next LOOP;
  43.    }
  44.    $csv->combine($fields[0],$subjects{$this_subject},@fields[1..$#fields]);
  45.    print OUT $csv->string(),"\n";;
  46. }
  47. close IN;
  48.  
  49.  
run the above code using your sample file to get a feel for what it does. Here is a link to the Text::CSV_XS module.

cpan: Text::CSV_XS
Jan 6 '08 #6

P: 8
Kevin, thank you! This is awesome and there is no way I could have done this myself.

It works but because the infile isn't perfect, the outfile suffers.

It seems there are some posts missing at the beginning of the original file, so for some subjects, the first occurrence begins with an "RE:"

How would I set it up so that a first unique Subject receives a new topic_id regardless if it has a RE: at the beginning or not?

Another option might be to strip the RE:, Re: and R: out completely and just go by the first occurrence of a unique Subject?

I've uploaded the entire original csv file here: http://www.smofco.com/dicky/ but it's ~38MB
Jan 7 '08 #7

P: 8
Another issue I've noticed: There are topics that come up like this:
seats
..
seats
..
Re: seats
Re: seats
..
seats
The first two without an "Re:" received separate topic_id's, the two "Re: seats" topics did not receive a topic_id, and the third "seats" without an "Re:" received yet a third topic_id.

Like this:
15063 seats
..
15064 seats
..
Re: seats
Re: seats
..
15071 seats
This just adds to my thinking that the "Re:" prefixes are going to be a stumbling block. I guess different mail clients apply them differently and/or some users edited the subject line for their replies.
Jan 7 '08 #8

KevinADC
Expert 2.5K+
P: 4,059
Another option might be to strip the RE:, Re: and : out completely and just go by the first occurrence of a unique Subject?
That sounds like the best compromise to make. Try this:

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3. use Text::CSV_XS;
  4.  
  5. my %subjects = ();
  6. my $topic_id = 0;
  7. my $csv = Text::CSV_XS->new({binary=>1});
  8. my $infile  = 'c:/perl_test/sample.csv';
  9. my $outfile = 'c:/perl_test/sample_out.csv';
  10. my $errors  = 'c:/perl_test/error.txt';
  11.  
  12. open IN, $infile or die "$!";
  13. open OUT, ">", $outfile or die "$!";
  14.  
  15. # Next two lines assume you want to get
  16. # and print the headers from the archive csv file.
  17. my @header = split(',', scalar <IN>); # get the headers 
  18. print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
  19.  
  20. LOOP: while (<IN>) {
  21.    chomp;
  22.    unless ( $csv->parse($_) ) {
  23.       print ERROR $csv->error_input();
  24.       next LOOP;
  25.    }
  26.    my @fields = $csv->fields;
  27.    my $s = lc $fields[1]; # lower case the subject and store in $s
  28.    $s =~ s/^s+//;  # remove leading spaces  
  29.    $s =~ s/\s+$//; # remove trailing spaces
  30.    $s =~ s/^re?:\s+//;
  31.    if (exists $subjects{$s}) {# check for existing subject
  32.       $csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);
  33.       print OUT $csv->string(),"\n";;
  34.    }
  35.    else { # assume we have a new subject
  36.       $subjects{$s} = ++$topic_id; # assign the topic_id to a new subject
  37.       $csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);
  38.       print OUT $csv->string(),"\n";;
  39.    }
  40. }
  41. close IN;
  42. close OUT;
Jan 7 '08 #9

P: 8
Works!

The script was missing
Expand|Select|Wrap|Line Numbers
  1. open ERROR, ">", $errors or die "$!";
but there were no errors reported even after fixing it. Pretty cool!

Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.

Cheers,
Omar
Jan 7 '08 #10

KevinADC
Expert 2.5K+
P: 4,059
Works!

The script was missing
Expand|Select|Wrap|Line Numbers
  1. open ERROR, ">", $errors or die "$!";
but there were no errors reported even after fixing it. Pretty cool!

Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.

Cheers,
Omar
Oh, yea, I missed that I still had this line in the modified script:

print ERROR $csv->error_input();

So openinng the ERROR file should still be done just in case there was a parsing error.

You're welcome, but please note, this is not how things are generally done around here. I found your problem to be interesting enough that I wanted to figure it out, we normally only help people that show some effort in writing their code first. Plus, it seemed to me that you genuinely needed help and weren't just trying to take advantage of some free help.

Regards,
Kevin
Jan 7 '08 #11

P: 8
So much for my education in Perl... give a man a fish...

Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.

Cheers,
Omar
Jan 7 '08 #12

KevinADC
Expert 2.5K+
P: 4,059
So much for my education in Perl... give a man a fish...

Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.

Cheers,
Omar
When you have a one-off project, learning a language is really not practical. I understand this is a learning forum but I also understand that many people don't need or want to learn, they just want to get a job done. You could have probably hired a programmer for about $50 USD to do this work, which is easy enough for an experienced perl coder. It took me about 1 hour to come up with the first code I posted. Anyways, if you ever do want to learn perl, there are a number of good perl forums with people willing to help you along that journey.

Regards,
Kevin
Jan 7 '08 #13

P: 8
When you have a one-off project, learning a language is really not practical.
Unless you really enjoy tinkering.
Jan 9 '08 #14

Post your reply

Sign in to post your reply or Sign up for a free account.