473,396 Members | 1,982 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,396 software developers and data experts.

Is this possible in Perl? Mail archive data manipulation

8
Hello all;

I have a .csv file that contains messages exported from one discussion forum that I want to import into another forum (phpBB), but I need to do some data manipulation on the original export file first.

The original .csv file contains a subject or topic field on each line. Subsequent lines contain replies and other subjects. The file is in chronological order.

Here is a sample layout:

ForumID,Subject,UNIX TimeStamp,Poster_Name,Message
12,Wings,830799174,John Doe,Lorem ipsum...
13,Wheels,831213087,Jane Q. Public,Dolor sit amet...
12,RE: Wings,860225019,Bob Builder,Praesent adipiscing sapien ut purus....

I need to insert a topic_id in each line of the .csv file. The topic_id for the sample above would be 1 for the first and third line and 2 for the second line.

Here's how I imagine this automated insertion of topic_id would work:

A script would look at each row and read the Subject, then compare it to a table with Subjects. If the Subject does not exist, then it would create a topic_id for that subject and assign it to the row and write the topic_id and subject to the table. If the subject already exists, then it would look up the topic_id from the table and assign it to the row, and so on.

I think the "RE:" in each follow-up post may be a complicating factor, but since all posts are in chronological order, the "RE:" could be stripped from the Subject column.

A larger sample of the actual file can be seen here:
http://www.smofco.com/dicky/Sample_fields_for_import_Rev2.csv

The complete file contains about 35,000 lines or posts.

Is Perl a good programing language for such a task?

I hate to shamelessly ask for assistance in writing this, but my knowledge of Perl is minimal. However, I am willing to learn! So, any pointers, hints, ideas, code samples, etc. will be greatly appreciated.

Cheers,
Omar Filipovic
Jan 5 '08 #1
13 1606
KevinADC
4,059 Expert 2GB
Where does the topic id get inserted?
Jan 5 '08 #2
filipo
8
Where does the topic id get inserted?
Anywhere; it can be at the end of each line, for example.

Or it can be in a separate output file and I can later simply add this file as a column in Excel to the existing csv file, as long as that output file is in the same sort order as the original csv and each topic_id is on a separate line. I know it's not an elegant solution, but it would work.

However, if somehow a row got skipped and at the end there are 30,000 posts and 29,999 topic_ids then I'm stuck...

I can also separate out only the Subject column from the csv file if that will simplify things.
Jan 6 '08 #3
KevinADC
4,059 Expert 2GB
Are the "subjects" always unique?
Jan 6 '08 #4
filipo
8
Are the "subjects" always unique?
No. If I understand the question correctly...

For example, there may be a subject "Wings" in January of 1999 with several follow-ups of "RE: Wings", and then an unrelated discussion in December 2003 also with the subject "Wings" and its follow-ups.

However, this is rare and I'm willing to live with these cases being lumped into the same discussion.

Was that the question?
Jan 6 '08 #5
KevinADC
4,059 Expert 2GB
Yes, that was the question.

I know you said you did not want to just have the code written for you, and normally I never do that until a person shows some effort. But the problem was interesting, (and as you can see the code is not very simple), so I decided to write some code and use the sample data file you posted a link to. Now that file had some inconsistencies in it, like RE:, Re: and R:. And there are some lines that have no topic they are associated with, but that might be because it's just some sample lines from your archive file. You may also have to install the Text::CSV_XS module to make this code work. The topic_id is added as a new second column to the output file:

ForumID,TopicID,Subject,UNIX TimeStamp,Poster_Name,Message

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3. use Text::CSV_XS;
  4.  
  5. my %subjects = ();
  6. my $topic_id = 1;
  7. my $csv = Text::CSV_XS->new({binary=>1});
  8. my $infile  = 'c:/perl_test/sample.csv';
  9. my $outfile = 'c:/perl_test/sample_out.csv';
  10. my $errors  = 'c:/perl_test/error.txt';
  11.  
  12. open IN, $infile or die "$!";
  13. open OUT, ">", $outfile or die "$!";
  14. open ERROR, ">", $errors or die "$!";  
  15.  
  16. # Next two lines assume you want to get
  17. # and print the headers from the archive csv file.
  18. my @header = split(',', scalar <IN>); # get the headers 
  19. print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
  20.  
  21. LOOP: while (<IN>) {
  22.    chomp;
  23.    my $this_subject = '';
  24.    unless ( $csv->parse($_) ) {
  25.       print ERROR $csv->error_input();
  26.       next LOOP;
  27.    }
  28.    my @fields = $csv->fields;
  29.    my $s = lc $fields[1]; # lower case the subject and store in $s
  30.    $s =~ s/^s+//;  # remove leading spaces  
  31.    $s =~ s/\s+$//; # remove trailing spaces
  32.    if ($s =~ /^re?:\s*([\s\S]+)/o) {# assumes we have found a reply to a subject
  33.       $this_subject = $1; # this is for later when we "combine" the fields
  34.    }
  35.    elsif ($s) { # assumes we have a new subject
  36.       $this_subject = $s; # this is for later when we "combine" the fields
  37.       $subjects{$s}=$topic_id; # assign the topic_id to a new subject
  38.       $topic_id++;
  39.    }
  40.    else {
  41.       print ERROR "No subject found: $_";
  42.       next LOOP;
  43.    }
  44.    $csv->combine($fields[0],$subjects{$this_subject},@fields[1..$#fields]);
  45.    print OUT $csv->string(),"\n";;
  46. }
  47. close IN;
  48.  
  49.  
run the above code using your sample file to get a feel for what it does. Here is a link to the Text::CSV_XS module.

cpan: Text::CSV_XS
Jan 6 '08 #6
filipo
8
Kevin, thank you! This is awesome and there is no way I could have done this myself.

It works but because the infile isn't perfect, the outfile suffers.

It seems there are some posts missing at the beginning of the original file, so for some subjects, the first occurrence begins with an "RE:"

How would I set it up so that a first unique Subject receives a new topic_id regardless if it has a RE: at the beginning or not?

Another option might be to strip the RE:, Re: and R: out completely and just go by the first occurrence of a unique Subject?

I've uploaded the entire original csv file here: http://www.smofco.com/dicky/ but it's ~38MB
Jan 7 '08 #7
filipo
8
Another issue I've noticed: There are topics that come up like this:
seats
..
seats
..
Re: seats
Re: seats
..
seats
The first two without an "Re:" received separate topic_id's, the two "Re: seats" topics did not receive a topic_id, and the third "seats" without an "Re:" received yet a third topic_id.

Like this:
15063 seats
..
15064 seats
..
Re: seats
Re: seats
..
15071 seats
This just adds to my thinking that the "Re:" prefixes are going to be a stumbling block. I guess different mail clients apply them differently and/or some users edited the subject line for their replies.
Jan 7 '08 #8
KevinADC
4,059 Expert 2GB
Another option might be to strip the RE:, Re: and : out completely and just go by the first occurrence of a unique Subject?
That sounds like the best compromise to make. Try this:

Expand|Select|Wrap|Line Numbers
  1. use strict;
  2. use warnings;
  3. use Text::CSV_XS;
  4.  
  5. my %subjects = ();
  6. my $topic_id = 0;
  7. my $csv = Text::CSV_XS->new({binary=>1});
  8. my $infile  = 'c:/perl_test/sample.csv';
  9. my $outfile = 'c:/perl_test/sample_out.csv';
  10. my $errors  = 'c:/perl_test/error.txt';
  11.  
  12. open IN, $infile or die "$!";
  13. open OUT, ">", $outfile or die "$!";
  14.  
  15. # Next two lines assume you want to get
  16. # and print the headers from the archive csv file.
  17. my @header = split(',', scalar <IN>); # get the headers 
  18. print OUT "$header[0],TopicID,", join(',',@header[1..$#header]);
  19.  
  20. LOOP: while (<IN>) {
  21.    chomp;
  22.    unless ( $csv->parse($_) ) {
  23.       print ERROR $csv->error_input();
  24.       next LOOP;
  25.    }
  26.    my @fields = $csv->fields;
  27.    my $s = lc $fields[1]; # lower case the subject and store in $s
  28.    $s =~ s/^s+//;  # remove leading spaces  
  29.    $s =~ s/\s+$//; # remove trailing spaces
  30.    $s =~ s/^re?:\s+//;
  31.    if (exists $subjects{$s}) {# check for existing subject
  32.       $csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);
  33.       print OUT $csv->string(),"\n";;
  34.    }
  35.    else { # assume we have a new subject
  36.       $subjects{$s} = ++$topic_id; # assign the topic_id to a new subject
  37.       $csv->combine($fields[0],$subjects{$s},@fields[1..$#fields]);
  38.       print OUT $csv->string(),"\n";;
  39.    }
  40. }
  41. close IN;
  42. close OUT;
Jan 7 '08 #9
filipo
8
Works!

The script was missing
Expand|Select|Wrap|Line Numbers
  1. open ERROR, ">", $errors or die "$!";
but there were no errors reported even after fixing it. Pretty cool!

Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.

Cheers,
Omar
Jan 7 '08 #10
KevinADC
4,059 Expert 2GB
Works!

The script was missing
Expand|Select|Wrap|Line Numbers
  1. open ERROR, ">", $errors or die "$!";
but there were no errors reported even after fixing it. Pretty cool!

Kevin, I really appreciate your assistance. Well, not so much "assistance" but work since you did everything.

Cheers,
Omar
Oh, yea, I missed that I still had this line in the modified script:

print ERROR $csv->error_input();

So openinng the ERROR file should still be done just in case there was a parsing error.

You're welcome, but please note, this is not how things are generally done around here. I found your problem to be interesting enough that I wanted to figure it out, we normally only help people that show some effort in writing their code first. Plus, it seemed to me that you genuinely needed help and weren't just trying to take advantage of some free help.

Regards,
Kevin
Jan 7 '08 #11
filipo
8
So much for my education in Perl... give a man a fish...

Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.

Cheers,
Omar
Jan 7 '08 #12
KevinADC
4,059 Expert 2GB
So much for my education in Perl... give a man a fish...

Seriously though - before I posted here I didn't even have Perl installed on my machine. Much more to learn.

Cheers,
Omar
When you have a one-off project, learning a language is really not practical. I understand this is a learning forum but I also understand that many people don't need or want to learn, they just want to get a job done. You could have probably hired a programmer for about $50 USD to do this work, which is easy enough for an experienced perl coder. It took me about 1 hour to come up with the first code I posted. Anyways, if you ever do want to learn perl, there are a number of good perl forums with people willing to help you along that journey.

Regards,
Kevin
Jan 7 '08 #13
filipo
8
When you have a one-off project, learning a language is really not practical.
Unless you really enjoy tinkering.
Jan 9 '08 #14

Sign in to post your reply or Sign up for a free account.

Similar topics

4
by: Keith | last post by:
All: What is the difference between Perl (CGI) and PHP (Apache module)? I thought both used servers to direct the user to the appropiate Perl or PHP program in order to execute the program to...
2
by: Kevin Horton | last post by:
I'm a perl newbie, with next to no programming experience (I did a bunch of Fortran 25 years ago, but nothing since). I have a problem I need to solve, and I'm wondering whether perl is the best...
9
by: Martin Foster | last post by:
Hi. I would like to be able to mimic the unix tool 'uniq' within a Perl script. I have a file with entries that look like this 4 10 21 37 58 83 111 145 184 226...
2
by: scadav | last post by:
I am new to Perl and I am trying for figure out how to solve this problem. If anyone can give me some suggestions, I would greatly appreciate it. I am trying to read a log file and generate...
3
by: pc | last post by:
hi all, I have been blessed with the task of writing a web based database representing the state of our globally installed isam databases. there are basically four steps in setting this up: ...
11
by: Timofmars | last post by:
I'm try to Unload DB2 data from a table into a record sequential file on NT. I can an unload on Mainframe, but it doesn't seem to be an option in NT. In NT, all I can do is export/import. I can...
1
by: codemaster | last post by:
Hi, I am a total novice to perl. I am moving files from one directory to another and trying to rename a file appending the timestamp. But for some reason, my code is not able to rename. ...
9
by: 8anos | last post by:
Hello, I am new at the community and newbie at programming :) As you may know rapidshare provides a perl script for linux, to upload files at their servers. You can find the original scripts at...
21
KevinADC
by: KevinADC | last post by:
Note: You may skip to the end of the article if all you want is the perl code. Introduction Uploading files from a local computer to a remote web server has many useful purposes, the most...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.