Mintu Konwar
2 min read · Aug 16, 2021

Movie Recommendation System with Sentiment Analysis of the Reviews (Part 1)

File_1 of Preprocessing

In this preprocessing file, I will import a movie dataset from Kaggle and look at the number of movies produced each year. The next step is to keep only those features from the dataset that will be needed for model creation and analysis.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('2016.csv')
df.head(8)
# to see the number of rows and columns
df.shape
(5042, 28)
# count the number of movies produced in each year (NaN = year missing)
df.title_year.value_counts(dropna=False).sort_index()
1916.0 1
1920.0 1
1925.0 1
1927.0 1
1929.0 2
...
2013.0 237
2014.0 252
2015.0 226
2016.0 106
NaN 107
Name: title_year, Length: 92, dtype: int64
df = df.loc[:,['director_name','actor_1_name','actor_2_name','actor_3_name','genres','movie_title']]
df.head(5)
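
The year count and the column selection above can be tried on a tiny frame without downloading the Kaggle file. This is only a minimal sketch: the `sample` frame and its values are made up for illustration and stand in for '2016.csv', and only three of the real columns are used.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle file; the values are illustrative only.
sample = pd.DataFrame({
    "movie_title": ["a", "b", "c", "d"],
    "director_name": ["x", "y", None, "z"],
    "title_year": [2015.0, 2016.0, 2015.0, np.nan],
})

# dropna=False keeps a separate bucket for rows with a missing year,
# and sort_index() orders the result by year instead of by count.
movies_per_year = sample["title_year"].value_counts(dropna=False).sort_index()

# Keep only the columns needed later; .copy() makes the subset independent
# of the original frame.
wanted = ["director_name", "movie_title"]
subset = sample.loc[:, wanted].copy()

print(movies_per_year)
print(subset.shape)
```

Keeping the NaN bucket in the count is useful here because it shows how many rows have no release year before any of them are dropped or filled.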

The next step is to clean the dataset by replacing NULL values and removing stray characters.

# to replace the NULL values with 'unknown'
df['actor_1_name'] = df['actor_1_name'].replace(np.nan, 'unknown')
df['actor_2_name'] = df['actor_2_name'].replace(np.nan, 'unknown')
df['actor_3_name'] = df['actor_3_name'].replace(np.nan, 'unknown')
df['director_name'] = df['director_name'].replace(np.nan, 'unknown')
df
# to replace '|' with ' ' (regex=False treats '|' as a literal character, not as regex alternation)
df['genres'] = df['genres'].str.replace('|', ' ', regex=False)
df
# to convert all the titles to lowercase
df['movie_title'] = df['movie_title'].str.lower()
df['movie_title'][100]
'the fast and the furious\xa0'
# to remove the trailing non-breaking space ('\xa0') at the end of each title
df['movie_title'] = df['movie_title'].str.rstrip('\xa0')
df['movie_title'][100]
'the fast and the furious'
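
The cleaning steps above can also be collected into one function, which makes them easier to re-run and test. The following is a rough sketch rather than the code used for this project: the name `clean_movies` and the two-row `raw` sample (with placeholder names) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df (hypothetical helper, not from the post)."""
    out = df.copy()
    people = ["director_name", "actor_1_name", "actor_2_name", "actor_3_name"]
    # Replace missing names in all four columns in one call.
    out[people] = out[people].fillna("unknown")
    # regex=False treats '|' as a literal character, not regex alternation.
    out["genres"] = out["genres"].str.replace("|", " ", regex=False)
    # Lowercase, then drop only trailing '\xa0' characters, so a title
    # without one is left untouched (unlike slicing with x[:-1]).
    out["movie_title"] = out["movie_title"].str.lower().str.rstrip("\xa0")
    return out

# Hypothetical sample mimicking the quirks seen in the dataset.
raw = pd.DataFrame({
    "director_name": ["Director A", np.nan],
    "actor_1_name": ["Actor A", np.nan],
    "actor_2_name": ["Actor B", "Actor C"],
    "actor_3_name": [np.nan, "Actor D"],
    "genres": ["Action|Crime|Thriller", "Drama"],
    "movie_title": ["The Fast and the Furious\xa0", "Untitled"],
})
cleaned = clean_movies(raw)
print(cleaned)
```

Working on a copy leaves the raw frame intact, so the cleaned result can be compared with the original if something looks wrong.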

I will save the cleaned dataframe to a CSV file named '2016data.csv'.

df.to_csv('2016data.csv',index=False)
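
A quick way to check the saved file is to read it back and compare it with what was written. The sketch below assumes a one-row illustrative frame and writes to a temporary directory so that it does not touch the real '2016data.csv'.

```python
import os
import tempfile

import pandas as pd

# Illustrative one-row frame; the real dataframe has six columns.
frame = pd.DataFrame({
    "genres": ["Action Crime Thriller"],
    "movie_title": ["the fast and the furious"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "2016data.csv")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    frame.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.columns.tolist())
print(restored.shape)
```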

Written by Mintu Konwar

A third-year MCA student at Dibrugarh University with an interest in cybersecurity, software development, IT, and Machine Learning.
