Please Help!! More string manipulation Qs... in C++

Hp
Hi All,
Thanks a lot for all your replies.

My requirement is as follows:
I need to read a text file, eliminate certain special characters (like !
, - = + ), convert it to lower case, and then remove certain
stopwords (like and, a, an, by, the, etc.) which are listed in another text
file.
Then I need to run it through a stemmer (a program which converts words
like "running" to "run", i.e., converts them to root words).
Then I need to create a term-by-document matrix, that is, a
matrix where M(i,j) gives the number of times term j occurs
in document i.

My situation as of now is as follows:
I have read the file contents into a string variable, replaced
the special characters with spaces using the replace function, and
then converted the whole string to lower case using the transform
function.

I would really appreciate any help. Thanks in advance.

Thanks,
Hp

Oct 25 '05 #1
On 24 Oct 2005 19:45:33 -0700, "Hp" <pr****************@gmail.com>
wrote:
[snipped original post]

Is this homework??? Sure sounds like it.

If not, why do you have to use C++ at all? Perl or awk, using regular
expressions, is probably much easier for something like this.

At any rate, your question has to do with algorithms, not with the
language itself. Therefore, it is off-topic in this NG.

--
Bob Hairgrove
No**********@Home.com
Oct 25 '05 #2
Hp
It is a project, where I'm stuck at a particular point and I don't know
how to proceed. I know the algorithm; it's just the implementation that
I can't get, and hence it deserves a post in the C++ newsgroups.
Hey Bob, I would appreciate a solution to my question and can do
without unnecessary comments!

Oct 25 '05 #3

Hp wrote:
> It is a project, where I'm stuck at a particular point and I don't know
> how to proceed. I know the algorithm; it's just the implementation that
> I can't get, and hence it deserves a post in the C++ newsgroups.
> Hey Bob, I would appreciate a solution to my question and can do
> without unnecessary comments!

Why don't you show some code?
You have not shown any code for any of your "project" problems.

Do something! Get stuck, then ask questions!

The comments you get are not unnecessary. You are on a C++ _language_
newsgroup. Figure something out. Post again when you have _specific_
problems with a language construct and not a "write my program for me"
request!

Cheers,
Andre

Oct 25 '05 #4
Hp
I am sorry not to have posted my code; I apologize for that.

Here is the code:
-----------------------------------------------------------------
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <set>
#include <algorithm>
#include <cctype>

using namespace std;
using std::string;

int main(int argc, char *argv[])
{
    using std::cout;
    using std::endl;
    int var_len;

    FILE *fp;
    long len;
    char *buf;
    fp=fopen("01t.txt","rb");
    fseek(fp,0,SEEK_END);
    len=ftell(fp);
    fseek(fp,0,SEEK_SET);
    buf=(char *)malloc(len);
    fread(buf,len,1,fp);
    fclose(fp);
    string file;
    file=buf;

    cout<< file <<endl;

    vector<string> files;
    vector<string> punct;//Vector of strings to remove the punctuations from each files
    cout<<"This is a sample program"<<endl;
    punct.push_back(",");punct.push_back(":");punct.push_back(";");
    punct.push_back("'");
    punct.push_back("'");punct.push_back("=");punct.push_back("-");
    punct.push_back(".");punct.push_back(",");punct.push_back(",");

    for (int i=0;i<punct.size();i++)
    {
        cout<<punct.at(i)<<endl;
    }

    std::replace(file.begin(),file.end(),',','');
    std::replace(file.begin(),file.end(),';',' ');
    std::replace(file.begin(),file.end(),':','');
    std::replace(file.begin(),file.end(),'-',' ');
    std::replace(file.begin(),file.end(),'=','');
    std::replace(file.begin(),file.end(),'+',' ');
    std::replace(file.begin(),file.end(),')','');
    std::replace(file.begin(),file.end(),'(',' ');
    std::replace(file.begin(),file.end(),'&','');
    std::replace(file.begin(),file.end(),'!',' ');
    std::replace(file.begin(),file.end(),'.','');
    std::replace(file.begin(),file.end(),'/',' ');
    //Removing single and double quotes
    std::replace(file.begin(),file.end(),'\'','');
    std::replace(file.begin(),file.end(),'\"',' ');

    std::transform(file.begin(),file.end(),file.begin(),tolower);

    /*if((pos=file.find(remword,0))!=string::npos)
    {
        file.erase(pos,remword.length());
    }
    cout << "After removing 'the'" <<endl;
    */

}
-----------------------------------------------------------------------------------

Oct 25 '05 #5
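One thing worth flagging in the posted code: malloc(len) leaves no room for a terminating '\0', so file=buf reads past the end of the buffer. A minimal sketch of a stream-based alternative follows. The helper name readWholeFile is hypothetical and not from the thread.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the thread): read an entire file into a
// std::string. Throws std::runtime_error if the file cannot be opened.
std::string readWholeFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::ostringstream contents;
    contents << in.rdbuf();   // copy everything the stream buffer holds
    return contents.str();    // the string knows its own length; no '\0' needed
}
```

Called as std::string file = readWholeFile("01t.txt");, this would stand in for the whole fopen/fseek/malloc/fread sequence and leaves nothing to free.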

Hp wrote:
> I am sorry not to have posted my code; I apologize for that.
>
> Here is the code:

_Compiling_ code would be nice, too...

> using namespace std;
> using std::string;

This is redundant. If you import the whole namespace (std), you don't
need to list individual names as well. Pick one.

> int var_len;

Unused?

> FILE *fp;
> long len;
> char *buf;
> fp=fopen("01t.txt","rb");
> fseek(fp,0,SEEK_END);
> len=ftell(fp);
> fseek(fp,0,SEEK_SET);
> buf=(char *)malloc(len);
> fread(buf,len,1,fp);
> fclose(fp);
> string file;
> file=buf;

This code is pretty much unreadable. You should not mix variable
declarations with the code that reads in a file like that. Some error
checking would be useful as well.

fopen() feels very "C". You could use a more C++ approach here, like
"ifstream".

> vector<string> files;

Unused?

> vector<string> punct;//Vector of strings to remove the punctuations from each files

Looks like you fill this vector but then decided to replace them all
manually anyway?

It may be simpler (if you don't want to use boost::regex) to put all the
unwanted characters into a simple string (not a vector) and iterate
over that.

> std::replace(file.begin(),file.end(),',','');

You can't replace with a non-character...

> std::transform(file.begin(),file.end(),file.begin(),tolower);

"tolower" is unfortunately ambiguous. You'll have to cast it like this:
std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);

> /*if((pos=file.find(remword,0))!=string::npos)
> {
> file.erase(pos,remword.length());
> }
> */

You'll need a loop here. A single if won't do.

Cheers,
Andre

Oct 25 '05 #6
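The two suggestions above (keep the unwanted characters in one string, and avoid passing the overloaded name tolower directly) can be combined in one small function. This is only a sketch: kUnwanted, cleanChar and cleanText are made-up names, and the character list simply mirrors the characters replaced by hand in the posted code.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Characters to blank out (assumed list, taken from the posted replace calls).
const std::string kUnwanted = ",;:-=+)(&!./'\"";

// Hypothetical helper: map an unwanted character to a space and
// lower-case everything else.
char cleanChar(char c)
{
    if (kUnwanted.find(c) != std::string::npos)
        return ' ';
    // Convert through unsigned char: std::tolower is undefined for
    // negative values other than EOF.
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Apply cleanChar to every character and return the cleaned copy.
std::string cleanText(std::string text)
{
    std::transform(text.begin(), text.end(), text.begin(), cleanChar);
    return text;
}
```

Because cleanChar is an ordinary, non-overloaded function, std::transform can take it without a cast.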
Hp

in*****@gmail.com wrote:
> Hp wrote:
>> I am sorry not to have posted my code; I apologize for that.
>>
>> Here is the code:
>
> _Compiling_ code would be nice, too...
>
>> using namespace std;
>> using std::string;
>
> This is redundant. If you import the whole namespace (std), you don't
> need to list individual names as well. Pick one.
>
>> int var_len;
>
> Unused?

I had declared it for future use.

>> FILE *fp;
>> long len;
>> char *buf;
>> fp=fopen("01t.txt","rb");
>> fseek(fp,0,SEEK_END);
>> len=ftell(fp);
>> fseek(fp,0,SEEK_SET);
>> buf=(char *)malloc(len);
>> fread(buf,len,1,fp);
>> fclose(fp);
>> string file;
>> file=buf;
>
> This code is pretty much unreadable. You should not mix variable
> declarations with the code that reads in a file like that. Some error
> checking would be useful as well.
>
> fopen() feels very "C". You could use a more C++ approach here, like
> "ifstream".
>
>> vector<string> files;
>
> Unused?

I used this vector to read a set of files, each into its own string,
giving me a vector of strings, one per file that I need to read and
modify.

>> vector<string> punct;//Vector of strings to remove the punctuations from each files
>
> Looks like you fill this vector but then decided to replace them all
> manually anyway?
>
> It may be simpler (if you don't want to use boost::regex) to put all the
> unwanted characters into a simple string (not a vector) and iterate
> over that.

Thank you, I think I will do this.

>> std::replace(file.begin(),file.end(),',','');
>
> You can't replace with a non-character...

This is a typo; I have it replaced with a space, which got lost
while cutting and pasting.

>> std::transform(file.begin(),file.end(),file.begin(),tolower);
>
> "tolower" is unfortunately ambiguous. You'll have to cast it like this:
> std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);

Ironically, the code I have written works :-).

>> /*if((pos=file.find(remword,0))!=string::npos)
>> {
>> file.erase(pos,remword.length());
>> }
>> */
>
> You'll need a loop here. A single if won't do.

The above piece of code doesn't work. I had initialized remword = "the",
but it was removing 'the' from 'there' too, which I don't want. Also, I
want all the occurrences of it to be removed, which I can achieve
through a loop.

> Cheers,
> Andre


Oct 26 '05 #7
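The "the" inside "there" problem goes away if the comparison is made on whole tokens rather than on substrings. A rough sketch, assuming punctuation has already been replaced by spaces; removeWord is a hypothetical helper, not something from the thread:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: drop every whitespace-separated token equal to `word`.
// Runs of whitespace collapse to single spaces in the result.
std::string removeWord(const std::string& text, const std::string& word)
{
    std::istringstream in(text);
    std::string token;
    std::string result;

    while (in >> token) {          // the loop handles every occurrence
        if (token == word)
            continue;              // whole-token match only, so "there" survives
        if (!result.empty())
            result += ' ';
        result += token;
    }
    return result;
}
```

The trade-off is that the original spacing and line breaks are not preserved, which should not matter when the next step is counting terms.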

Hp wrote:
> [snipped posted code]
> The above piece of code doesn't work.

Alright, even at a pretty high risk of doing your homework for you,
I'll post my version of the program, which reads in a file
and removes the stopwords.

The program reads in only one file, though, and doesn't build the
document/term matrix for you; that's still up to you.

Please try to understand the code, and ask questions as necessary, so
that you learn something from it.

Here ya go:

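#include <iostream>
#include <ostream>
#include <fstream>
#include <sstream>
#include <algorithm>
#include <cctype>
#include <string>
#include <map>

using namespace std;

// Characters that are turned into a space before the text is tokenized.
const string InvalidChars = ",.!?;:=()+-\'\"&";

// Map an unwanted character to a space; lower-case everything else.
char sanitizeChar( const char & c )
{
    for( string::const_iterator inv=InvalidChars.begin();
         inv!=InvalidChars.end(); ++inv)
    {
        if ( *inv == c )
            return ' ';
    }

    // Cast via unsigned char: tolower() is undefined for negative values.
    return static_cast<char>( tolower( static_cast<unsigned char>( c ) ) );
}

int main()
{
    ifstream ff_swords( "stopwords.txt" );
    ifstream ff_text( "test.txt" );

    // Bail out if either file could not be opened.
    if ( !ff_swords || !ff_text )
    {
        cerr << "Cannot open input files" << endl;
        return 1;
    }

    // The map is used as a set; only the keys matter.
    map<string,char> stopwords;

    string token;

    while( ff_swords >> token )
        stopwords[ token ] = 1;

    while( ff_text >> token )
    {
        transform( token.begin(), token.end(), token.begin(),
                   sanitizeChar );

        // Sanitizing may have split the token ("a-b" becomes "a b"),
        // so tokenize it again.
        istringstream ss( token );
        while( ss >> token )
        {
            if ( stopwords.find( token ) != stopwords.end() )
                continue;

            // TODO: Run token through stemmer here.

            // TODO: Add stemmed token to your custom matrix now...

            cout << token << endl; // <-- Debug
        }
    }
}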

Oct 26 '05 #8
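For the part left as a TODO, one possible shape for the term-by-document matrix is sketched below. It assumes each document has already been reduced to a list of cleaned, stemmed tokens. Tokens, TermDocumentMatrix and buildMatrix are illustrative names only, and a dense matrix is just one option (a per-document map of counts may suit large vocabularies better).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::string> Tokens;   // one document's tokens

// Hypothetical container: one row per document, one column per distinct term.
struct TermDocumentMatrix {
    std::vector<std::string> terms;          // column j -> term
    std::vector<std::vector<int> > counts;   // counts[i][j] == M(i,j)
};

TermDocumentMatrix buildMatrix(const std::vector<Tokens>& documents)
{
    TermDocumentMatrix m;
    std::map<std::string, std::size_t> column;   // term -> column index

    // First pass: collect the distinct terms; std::map keeps them sorted.
    for (std::size_t i = 0; i < documents.size(); ++i)
        for (std::size_t k = 0; k < documents[i].size(); ++k)
            column[documents[i][k]] = 0;

    // Number the columns in sorted term order.
    for (std::map<std::string, std::size_t>::iterator it = column.begin();
         it != column.end(); ++it) {
        it->second = m.terms.size();
        m.terms.push_back(it->first);
    }

    // Second pass: count how often each term occurs in each document.
    m.counts.assign(documents.size(), std::vector<int>(m.terms.size(), 0));
    for (std::size_t i = 0; i < documents.size(); ++i)
        for (std::size_t k = 0; k < documents[i].size(); ++k)
            ++m.counts[i][column[documents[i][k]]];

    return m;
}
```

With this layout, m.counts[i][j] is the number of times m.terms[j] occurs in document i, which matches the M(i,j) definition in the original question.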
Hp wrote:
>
> I am sorry not to have posted my code; I apologize for that.
>
> Here is the code:
> -----------------------------------------------------------------
> #include <iostream>
> #include <string>
> #include <fstream>
> #include <vector>
> #include <set>
> #include <algorithm>
> #include <cctype>
>
> using namespace std;
> using std::string;
>
> int main(int argc, char *argv[])
> {
> using std::cout;
> using std::endl;
>
> int var_len;
>
> FILE *fp;
> long len;
> char *buf;
> fp=fopen("01t.txt","rb");
> fseek(fp,0,SEEK_END);
> len=ftell(fp);
> fseek(fp,0,SEEK_SET);
> buf=(char *)malloc(len);
> fread(buf,len,1,fp);
> fclose(fp);
> string file;
> file=buf;
>
> cout<< file <<endl;


Your problem gets much easier if you don't do this:
read the entire file into one single string variable.

Why don't you break the input stream into individual words
right at the input stage?

    ifstream Input( "01t.txt" );
    if( !Input ) {
        // report the error opening the file, etc.
        return EXIT_FAILURE;   // EXIT_FAILURE is declared in <cstdlib>
    }

    string Word;
    vector< string > Words;

    while( Input >> Word )
        Words.push_back( Word );

    // Now you have a vector of words. It is easy to manipulate
    // each one of them, e.g. discard special characters, transform
    // every one of the words to lowercase, and of course, discard
    // words which are listed in a second vector or map.

--
Karl Heinz Buchegger
kb******@gascad.at
Oct 27 '05 #9
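Once the words sit in a vector, dropping the stopwords can be done with the erase/remove_if idiom. A sketch, assuming a C++11 compiler (for the lambda) and that the stopwords were loaded into a std::set; dropStopwords is a hypothetical name:

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: remove from `words` every entry found in `stop`.
// The relative order of the remaining words is preserved.
void dropStopwords(std::vector<std::string>& words,
                   const std::set<std::string>& stop)
{
    words.erase(
        std::remove_if(words.begin(), words.end(),
                       [&stop](const std::string& w) {
                           return stop.count(w) != 0;   // exact match only
                       }),
        words.end());
}
```

Since the set lookup compares whole strings, "there" is kept even when "the" is a stopword.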
