I'm hoping you can help me with a problem in my sentiment analysis project. I've been trying to apply a simple sentiment analysis model to a dataset of movie reviews, but I'm getting some surprising results. Here's the relevant section of my code:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Load the movie reviews dataset
    data = pd.read_csv('movie_reviews.csv')

    # Preprocess the data
    # ... (code for data preprocessing)

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], test_size=0.2, random_state=42)

    # Vectorize the text data using CountVectorizer
    vectorizer = CountVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)

    # Train the Logistic Regression model
    model = LogisticRegression()
    model.fit(X_train_vectorized, y_train)

    # Evaluate the model
    accuracy = model.score(X_test_vectorized, y_test)
    print(f"Accuracy: {accuracy}")
I verified the dataset and read more about it in this article, and it appears to load successfully with both 'review' and 'sentiment' columns. I also tried a simple Naive Bayes classifier, but it didn't improve the accuracy much.
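To make "surprising" easier to judge, one check that might help is comparing the reported accuracy against a majority-class baseline, i.e. the score you would get by always predicting the most frequent label. The following is only a minimal sketch and not part of my original script: `majority_baseline` is a hypothetical helper name, and the example labels are made-up placeholders rather than values from movie_reviews.csv.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) yields a single (label, count) pair
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Placeholder labels for illustration only (not taken from the real dataset).
example_labels = ["positive", "positive", "negative", "positive"]
baseline_label, baseline_accuracy = majority_baseline(example_labels)
print(f"Always predicting {baseline_label!r} scores {baseline_accuracy:.2f}")
```

In the context of the code above, calling `majority_baseline(y_test)` should give the number to beat. If the model's accuracy sits close to that value, it may be predicting mostly one class, which could point to imbalanced labels rather than a bug in the pipeline.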
Could you kindly evaluate the code and let me know if you find any flaws or improvements that may help me enhance the accuracy of my sentiment analysis model?
Thank you in advance for your help!