Is this possible in Perl? Mail archive data manipulation

Hello all;

I have a .csv file that contains messages exported from one discussion forum that I want to import into another forum (phpBB), but I need to do some data manipulation on the original export file first.

The original .csv file contains a subject or topic field on each line. Subsequent lines contain replies and other subjects. The file is in chronological order.

Here is a sample layout:

ForumID,Subject,UNIX TimeStamp,Poster_Name,Message
12,Wings,830799174,John Doe,Lorem ipsum...
13,Wheels,831213087,Jane Q. Public,Dolor sit amet...
12,RE: Wings,860225019,Bob Builder,Praesent adipiscing sapien ut purus....

I need to insert a topic_id in each line of the .csv file. The topic_id for the sample above would be 1 for the first and third line and 2 for the second line.

Here's how I imagine this automated insertion of topic_id would work:

A script would look at each row and read the Subject, then compare it to a table with Subjects. If the Subject does not exist, then it would create a topic_id for that subject and assign it to the row and write the topic_id and subject to the table. If the subject already exists, then it would look up the topic_id from the table and assign it to the row, and so on.

I think the "RE:" in each follow-up post may be a complicating factor, but since all posts are in chronological order, the "RE:" could be stripped from the Subject column.

A larger sample of the actual file can be seen here:
http://www.smofco.com/dicky/Sample_fields_for_import_Rev2.csv

The complete file contains about 35,000 lines or posts.

Is Perl a good programing language for such a task?

I hate to shamelessly ask for assistance in writing this, but my knowledge of Perl is minimal. However, I am willing to learn! So, any pointers, hints, ideas, code samples, etc. will be greatly appreciated.

Cheers,
Omar Filipovic

Jan 5 '08 #1

Subscribe Post Reply

1606

KevinADC

4,059

Expert 2GB

Where does the topic id get inserted?

Jan 5 '08 #2

filipo

Where does the topic id get inserted?

Anywhere; it can be at the end of each line, for example.

Or it can be in a separate output file and I can later simply add this file as a column in Excel to the existing csv file, as long as that output file is in the same sort order as the original csv and each topic_id is on a separate line. I know it's not an elegant solution, but it would work.

However, if somehow a row got skipped and at the end there are 30,000 posts and 29,999 topic_ids then I'm stuck...

I can also separate out only the Subject column from the csv file if that will simplify things.

Jan 6 '08 #3

KevinADC

4,059

Expert 2GB

Are the "subjects" always unique?

Jan 6 '08 #4

filipo

Are the "subjects" always unique?

No. If I understand the question correctly...

For example, there may be a subject "Wings" in January of 1999 with several follow-ups of "RE: Wings", and then an unrelated discussion in December 2003 also with the subject "Wings" and its follow-ups.

However, this is rare and I'm willing to live with these cases being lumped into the same discussion.

Was that the question?

Jan 6 '08 #5

KevinADC

4,059

Expert 2GB

Yes, that was the question.

I know you said you did not want to just have the code written for you, and normally I never do that until a person shows some effort. But the problem was interesting, (and as you can see the code is not very simple), so I decided to write some code and use the sample data file you posted a link to. Now that file had some inconsistencies in it, like RE:, Re: and R:. And there are some lines that have no topic they are associated with, but that might be because it's just some sample lines from your archive file. You may also have to install the Text::CSV_XS module to make this code work. The topic_id is added as a new second column to the output file:

ForumID,TopicID,Subject,UNIX TimeStamp,Poster_Name,Message

Expand|Select|Wrap|Line Numbers

 use strict;

use warnings;

use Text::CSV_XS;
 
my %subjects = ();

my $topic_id = 1;

my $csv = Text::CSV_XS->new({binary=>1});

my $infile  = 'c:/perl_test/sample.csv';

my $outfile = 'c:/perl_test/sample_out.csv';

my $errors  = 'c:/perl_test/error.txt';
 
open IN, $infile or die "$!";

open OUT, ">", $outfile or die "$!";

open ERROR, ">", $errors or die "$!";  
 
# Next two lines assume you want to get

# and print the headers from the archive csv file.

my @header = split(',', scalar <IN>); # get the headers 

print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
 
LOOP: while (<IN>) {

   chomp;

   my $this_subject = '';

   unless ( $csv->parse($_) ) {

      print ERROR $csv->error_input();

      next LOOP;

   }

   my @fields = $csv->fields;

   my $s = lc $fields[1]; # lower case the subject and store in $s

   $s =~ s/^s+//;  # remove leading spaces  

   $s =~ s/\s+$//; # remove trailing spaces

   if ($s =~ /^re?:\s*([\s\S]+)/o) {# assumes we have found a reply to a subject

      $this_subject = $1; # this is for later when we "combine" the fields

   }

   elsif ($s) { # assumes we have a new subject

      $this_subject = $s; # this is for later when we "combine" the fields

      $subjects{$s}=$topic_id; # assign the topic_id to a new subject

      $topic_id++;

   }

   else {

      print ERROR "No subject found: $_";

      next LOOP;

   }

   $csv->combine($fields[0],$subjects{$this_subject},@fields[1..$#fields]);

   print OUT $csv->string(),"\n";;

}

close IN;

run the above code using your sample file to get a feel for what it does. Here is a link to the Text::CSV_XS module.

cpan: Text::CSV_XS

Jan 6 '08 #6

filipo

Kevin, thank you! This is awesome and there is no way I could have done this myself.

It works but because the infile isn't perfect, the outfile suffers.

It seems there are some posts missing at the beginning of the original file, so for some subjects, the first occurrence begins with an "RE:"

How would I set it up so that a first unique Subject receives a new topic_id regardless if it has a RE: at the beginning or not?

Another option might be to strip the RE:, Re: and R: out completely and just go by the first occurrence of a unique Subject?

I've uploaded the entire original csv file here: http://www.smofco.com/dicky/ but it's ~38MB

Jan 7 '08 #7

filipo

Another issue I've noticed: There are topics that come up like this:

seats
..
seats
..
Re: seats
Re: seats
..
seats

The first two without an "Re:" received separate topic_id's, the two "Re: seats" topics did not receive a topic_id, and the third "seats" without an "Re:" received yet a third topic_id.

Like this:

15063 seats
..
15064 seats
..
Re: seats
Re: seats
..
15071 seats

This just adds to my thinking that the "Re:" prefixes are going to be a stumbling block. I guess different mail clients apply them differently and/or some users edited the subject line for their replies.

Jan 7 '08 #8

KevinADC

4,059

Expert 2GB

Another option might be to strip the RE:, Re: and : out completely and just go by the first occurrence of a unique Subject?

That sounds like the best compromise to make. Try this:

Expand|Select|Wrap|Line Numbers

 use strict;

use warnings;

use Text::CSV_XS;
 
my %subjects = ();

my $topic_id = 0;

my $csv = Text::CSV_XS->new({binary=>1});

my $infile  = 'c:/perl_test/sample.csv';

my $outfile = 'c:/perl_test/sample_out.csv';

my $errors  = 'c:/perl_test/error.txt';
 
open IN, $infile or die "$!";

open OUT, ">", $outfile or die "$!";
 
# Next two lines assume you want to get

# and print the headers from the archive csv file.

my @header = split(',', scalar <IN>); # get the headers 

print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
 
LOOP: while (<IN>) {

   chomp;

   unless ( $csv->parse($_) ) {

      print ERROR $csv->error_input();

      next LOOP;

   }

   my @fields = $csv->fields;

   my $s = lc $fields[1]; # lower case the subject and store in $s

   $s =~ s/^s+//;  # remove leading spaces  

   $s =~ s/\s+$//; # remove trailing spaces

   $s =~ s/^re?:\s+//;

   if (exists $subjects{$s}) {# check for existing subject

      $csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);

      print OUT $csv->string(),"\n";;

   }

   else { # assume we have a new subject

      $subjects{$s} = ++$topic_id; # assign the topic_id to a new subject

      $csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);

      print OUT $csv->string(),"\n";;

   }

}

close IN;

close OUT;

Jan 7 '08 #9

filipo

Works!

The script was missing

Expand|Select|Wrap|Line Numbers

open ERROR, ">", $errors or die "$!";

but there were no errors reported even after fixing it. Pretty cool!

Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.

Cheers,
Omar

Jan 7 '08 #10

KevinADC

4,059

Expert 2GB

Works!

The script was missing

Expand|Select|Wrap|Line Numbers

open ERROR, ">", $errors or die "$!";

but there were no errors reported even after fixing it. Pretty cool!

Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.

Cheers,
Omar

Oh, yea, I missed that I still had this line in the modified script:

print ERROR $csv->error_input();

So openinng the ERROR file should still be done just in case there was a parsing error.

You're welcome, but please note, this is not how things are generally done around here. I found your problem to be interesting enough that I wanted to figure it out, we normally only help people that show some effort in writing their code first. Plus, it seemed to me that you genuinely needed help and weren't just trying to take advantage of some free help.

Regards,
Kevin

Jan 7 '08 #11

filipo

So much for my education in Perl... give a man a fish...

Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.

Cheers,
Omar

Jan 7 '08 #12

KevinADC

4,059

Expert 2GB

So much for my education in Perl... give a man a fish...

Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.

Cheers,
Omar

When you have a one-off project, learning a language is really not practical. I understand this is a learning forum but I also understand that many people don't need or want to learn, they just want to get a job done. You could have probably hired a programmer for about $50 USD to do this work, which is easy enough for an experienced perl coder. It took me about 1 hour to come up with the first code I posted. Anyways, if you ever do want to learn perl, there are a number of good perl forums with people willing to help you along that journey.

Regards,
Kevin

Jan 7 '08 #13

filipo

When you have a one-off project, learning a language is really not practical.

Unless you really enjoy tinkering.

Jan 9 '08 #14

Similar topics

Perl vs PHP?

by: Keith | last post by:

All: What is the difference between Perl (CGI) and PHP (Apache module)? I thought both used servers to direct the user to the appropiate Perl or PHP program in order to execute the program to...

PHP

Perl performance?

by: Kevin Horton | last post by:

I'm a perl newbie, with next to no programming experience (I did a bunch of Fortran 25 years ago, but nothing since). I have a problem I need to solve, and I'm wondering whether perl is the best...

Perl

Perl script to mimic uniq

by: Martin Foster | last post by:

Hi. I would like to be able to mimic the unix tool 'uniq' within a Perl script. I have a file with entries that look like this 4 10 21 37 58 83 111 145 184 226...

Perl

Help for a new Perl User - Looking for suggestion

by: scadav | last post by:

I am new to Perl and I am trying for figure out how to solve this problem. If anyone can give me some suggestions, I would greatly appreciate it. I am trying to read a log file and generate...

Perl

perl and php, horses for courses?

by: pc | last post by:

hi all, I have been blessed with the task of writing a web based database representing the state of our globally installed isam databases. there are basically four steps in setting this up: ...

PHP

DB2 Unload (to record sequential file) on NT possible?

by: Timofmars | last post by:

I'm try to Unload DB2 data from a table into a record sequential file on NT. I can an unload on Mainframe, but it doesn't seem to be an option in NT. In NT, all I can do is export/import. I can...

DB2 Database

Unable to Rename a file in perl

by: codemaster | last post by:

Hi, I am a total novice to perl. I am moving files from one directory to another and trying to rename a file appending the timestamp. But for some reason, my code is not able to rename. ...

Perl

rapidshare upload perl script for linux

by: 8anos | last post by:

Hello, I am new at the community and newbie at programming :) As you may know rapidshare provides a perl script for linux, to upload files at their servers. You can find the original scripts at...

Perl

How to Upload Files using the CGI.pm Module and Perl

by: KevinADC | last post by:

Note: You may skip to the end of the article if all you want is the perl code. Introduction Uploading files from a local computer to a remote web server has many useful purposes, the most...

Perl

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice