How to tokenize a collection of text files?

I am working in the field of Information Retrieval. For this project I need to tokenize a collection of documents such as text files. I already know how to tokenize a string and a single text file, but in the text file I am only able to tokenize on whitespace, not on hyphens, commas, etc. So I need Java code that tokenizes on characters such as , or - or ' for a collection of text files. pls help pls....

Aug 13 '13 #1

chaarmann

785

Expert 512MB

Simply replace the whitespace " " in your code with a comma "," etc. to tokenize on other characters.
But most likely you want all of the mentioned characters together to act as token separators. In that case, tokenize using a regular expression:

String[] tokens = textString.split("[\\s\\-,']+");
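
To make the regular-expression approach concrete, here is a minimal, self-contained sketch. The class name SplitDemo, the helper tokenize and the sample sentence are made up for illustration; only String.split and the character class come from the answer above.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on runs of whitespace, hyphens, commas and apostrophes.
    // trim() only removes leading/trailing whitespace; a text that starts
    // with a comma or hyphen would still yield an empty first token.
    static String[] tokenize(String text) {
        return text.trim().split("[\\s\\-,']+");
    }

    public static void main(String[] args) {
        String sample = "state-of-the-art, isn't it";
        // prints [state, of, the, art, isn, t, it]
        System.out.println(Arrays.toString(tokenize(sample)));
    }
}
```

Because of the trailing +, a comma followed by a space counts as one separator, so no empty tokens appear in the middle of the result.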

If you still have problems, please post the code you have written so far, so that we can improve and change it.

One tip: to get more and faster answers in general, do NOT write "pls help pls". To me it gives the impression that you are in a hurry (hence the abbreviation) and will not value any answer. It's clear that we are here to help you, but I feel pressured when you state this self-evident fact and emphasize it with two pleas. You are just lucky that I am in a very good mood right now; otherwise I would not have answered because of this sentence.

Aug 13 '13 #2

29294

Thank you very much. I am now able to tokenize the text file on whitespace and other punctuation. Now I want to tokenize a collection of text files, not only a single text file.
I am attaching my code here. An error occurs and I am unable to find out why.

Attached Files

filetoken.txt (1.1 KB, 843 views)

Aug 13 '13 #3

chaarmann

785

Expert 512MB

OK, here is the code from the file. I put it here directly using code tags instead of a text file, because that makes it easier for others to read, understand, and give help based on line numbers. For the same reason, I cleaned it up by removing commented-out code and indenting it properly.
Cleaned-up original code:

package stemmer;
import java.util.*;  // Provides TreeMap, Iterator, Scanner
import java.io.*;    // Provides FileReader, FileNotFoundException

public class NewEmpty
{
   public static void main(String[ ] args)
   {
        Scanner br;

        //**READ THE DOCUMENTS**
        for (int x=0; x<Docs.length; x++)
        {
            br = new Scanner(new FileReader(Docs[x]));
        }

        try
        {
            String strLine= " ";
            String filedata="";
            while ( (strLine =br.readLine()) != null)
            {
                filedata+=strLine+" ";
            }
            StringTokenizer stk=new StringTokenizer(filedata," .,-'");
            while(stk.hasMoreTokens())
            {
               String token=stk.nextToken();
               System.out.println(token);
            }
            br.close();
        }
        catch (Exception e)
        {
            System.err.println("Error: " + e.getMessage());
        }

        // Array of documents
        String Docs [] = {"words.txt", "words2.txt","words3.txt", "words4.txt",};
    }
}

Aug 14 '13 #4

chaarmann

785

Expert 512MB

First, I wonder how it compiled at all. In line 39 you define the string array of docs that you loop through in line 12, so it must be defined BEFORE line 12.

Second, you open a text file for reading in your for-loop and assign it to "br", but instead of parsing its content you close the for-loop, which then assigns the next text file, and so on, until the last text file is assigned, and then you parse only that last one. To fix the code, you must do all the parsing (lines 17 to 36) inside the for-loop, not outside.

Third, you have a resource leak. If an exception occurs while reading a file, you never close the file, leaving it open for as long as the program runs. You must call close in the "finally" part of your try-catch statement. (Unfortunately, close can also throw an exception, so it needs a try-catch of its own.)

Fourth, the string array should be named "docs" instead of "Docs". Only class names should start with an uppercase letter; variable names should not. Professional Java programmers follow this coding convention for good reasons, which I will not explain here because it would lead too far off topic.

There are some other minor issues and possible enhancements, but they don't prevent you from getting the code running, so I will not mention them now.

Here is the corrected source code. (I cannot run it at the moment, but you should do so anyway, so tell me if it's OK now.)

package stemmer;

import java.util.*;  // Provides StringTokenizer
import java.io.*;    // Provides BufferedReader, FileReader, IOException

public class NewEmpty
{
   public static void main(String[ ] args)
   {
        // Array of documents
        String docs [] = {"words.txt", "words2.txt","words3.txt", "words4.txt",};

        // process all documents
        for (int x=0; x<docs.length; x++)
        {
            // read document and parse it
            // (BufferedReader is used because Scanner has no readLine() method)
            BufferedReader br = null;
            try
            {
                // opened inside the try block, because FileReader can throw FileNotFoundException
                br = new BufferedReader(new FileReader(docs[x]));
                String strLine= " ";
                String filedata="";
                while ( (strLine =br.readLine()) != null)
                {
                    filedata+=strLine+" ";
                }
                StringTokenizer stk=new StringTokenizer(filedata," .,-'");
                while(stk.hasMoreTokens())
                {
                   String token=stk.nextToken();
                   System.out.println(token);
                }
            }
            catch (Exception e)
            {
                System.err.println("Error: " + e.getMessage());
            }
            finally
            {
                try
                {
                    // br is still null if the file could not be opened
                    if (br != null)
                    {
                        br.close();
                    }
                }
                catch (Exception e2)
                {
                    // NOPMD
                }
            }
        }
    }
}
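
As a side note on the "finally" handling described above: on Java 7 or later, a try-with-resources statement closes the reader automatically, even when an exception is thrown. The following is only a sketch under that assumption; the class name TokenizeFiles and the helper tokenizeFile are hypothetical, and the file names are the ones from the thread.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizeFiles {
    // Reads one file line by line and returns its tokens.
    // The reader is closed automatically when the try block exits.
    static List<String> tokenizeFile(String fileName) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringTokenizer stk = new StringTokenizer(line, " .,-'");
                while (stk.hasMoreTokens()) {
                    tokens.add(stk.nextToken());
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] docs = {"words.txt", "words2.txt", "words3.txt", "words4.txt"};
        for (String doc : docs) {
            try {
                for (String token : tokenizeFile(doc)) {
                    System.out.println(token);
                }
            } catch (IOException e) {
                // A missing or unreadable file is reported, then the loop continues.
                System.err.println("Error reading " + doc + ": " + e.getMessage());
            }
        }
    }
}
```

Tokenizing each line separately gives the same tokens as joining the lines with a space first, because the space is itself a delimiter, and it avoids building one large string per file.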

Aug 14 '13 #5

Mousumi Dhar

The following code works fine in NetBeans for the above problem :D

package FinalizedPrograms;

import java.io.BufferedReader;
import java.util.*;  // Provides TreeMap, Iterator, Scanner
import java.io.*;    // Provides FileReader, FileNotFoundException

public class TokenizingMultipleFiles
{
    public static void main(String[ ] args)
    {
        // Scanner br;
        // Array of documents
        String Docs [] = {"temp.txt", "temp1.txt",};

        //**FOR LOOP TO READ THE DOCUMENTS**
        for (int x=0; x<Docs.length; x++)
        {
            try
            {
                File f=new File(Docs[x]);
                BufferedReader br = new BufferedReader(new FileReader(f));
                //br = new Scanner(new FileReader(Docs[x]));
                try{
                    String strLine= " ";
                    String filedata="";
                    while ( (strLine = br.readLine()) != null)   {
                        filedata+=strLine+" ";
                    }
                    StringTokenizer stk=new StringTokenizer(filedata," .,-'[]{}/|@#!$%^&*_-+=?<>:;()");
                    while(stk.hasMoreTokens()){
                        String token=stk.nextToken();
                        System.out.println(token);
                    }
                    br.close();
                }
                catch (Exception e){
                    System.err.println("Error: " + e.getMessage());
                }
            }
            catch (FileNotFoundException e)
            {
                System.err.println(e);
                return;
            }
        } //End of for loop
    }
}
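
A possible alternative to listing every punctuation character by hand (the delimiter list above contains no double quote or backslash, for example) is to split on anything that is not a letter or a digit. This is just a sketch; the class name WordSplitter, the helper words and the sample text are invented for illustration, and whether digits or underscores should count as part of a token depends on your retrieval task.

```java
import java.util.ArrayList;
import java.util.List;

class WordSplitter {
    // Splits on runs of characters that are neither Unicode letters nor digits.
    // split() can return an empty first element when the text starts with a
    // separator, so empty pieces are skipped.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, world, a, b, 42]
        System.out.println(words("\"Hello\" (world) a_b 42!"));
    }
}
```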

Aug 14 '13 #6

29294

package IR;

import java.io.BufferedReader;
import java.util.*;  // Provides TreeMap, Iterator, Scanner
import java.io.*;    // Provides FileReader, FileNotFoundException

public class FilesTokenization
{
    public static void main(String[ ] args)
    {
        // Scanner br;
        // Array of documents
        String Docs [] = {"words.txt", "words2.txt","words3.txt", "words4.txt",};

        //start for loop
        for (int x=0; x<Docs.length; x++)
        {
            try
            {
                File f=new File(Docs[x]);
                BufferedReader br = new BufferedReader(new FileReader(f));
                //br = new Scanner(new FileReader(Docs[x]));

                try{
                    String strLine= " ";
                    String filedata="";
                    while ( (strLine = br.readLine()) != null)
                    {
                        filedata+=strLine+" ";
                    }
                    StringTokenizer stk=new StringTokenizer(filedata," .,-';{}?()");
                    while(stk.hasMoreTokens())
                    {
                        String token=stk.nextToken();
                        System.out.println(token);
                    }
                    br.close();
                }
                catch (FileNotFoundException e)
                {
                    System.err.println(e);
                    return;
                }
            }
            catch (Exception e){
                System.err.println("Error: " + e.getMessage());
            }
        }
    }
}
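
If the collection grows, hard-coding the file names becomes tedious. One option, sketched here under the assumptions of Java 7 or later, UTF-8 encoded files, and all documents sitting in one directory, is to pick up every .txt file in that directory. The class name DirectoryTokenizer and the helper tokenizeDirectory are hypothetical; the delimiter string is the one from the code above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class DirectoryTokenizer {
    // Returns the tokens of every *.txt file in dir, keyed by file name.
    // A TreeMap is used because directory listing order is not guaranteed.
    static Map<String, List<String>> tokenizeDirectory(Path dir) throws IOException {
        Map<String, List<String>> tokensByFile = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                List<String> tokens = new ArrayList<>();
                // Assumption: the files are UTF-8; other encodings need another Charset.
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    StringTokenizer stk = new StringTokenizer(line, " .,-';{}?()");
                    while (stk.hasMoreTokens()) {
                        tokens.add(stk.nextToken());
                    }
                }
                tokensByFile.put(file.getFileName().toString(), tokens);
            }
        }
        return tokensByFile;
    }

    public static void main(String[] args) throws IOException {
        // Directory is taken from the first argument, defaulting to the current one.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, List<String>> entry : tokenizeDirectory(dir).entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Keeping the tokens grouped per file, rather than printing them in one stream, is usually what a later indexing step needs, since it has to know which document each token came from.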

Aug 14 '13 #7

29294

Thank you very much, chaarmann, for pointing out the faults. I am now able to run the program successfully and it gives the correct output. The code is in my post above.

Aug 14 '13 #8

devkumarOO7

This is very helpful content for me.
Thanks for providing it.

Aug 27 '13 #9

vinaykumar1994

Hi. I am currently doing a project on personalized web search, which relates to information retrieval concepts like stemming and tokenization. Can anyone help me by providing related code for my project?

Jan 6 '15 #10

vinaykumar1994

Please mail the code for tokenization to me. My mail id: mailmevinay1994@gmail.com

Jan 6 '15 #11
