Aug 16, 2021
Movie Recommendation System with Sentiment Analysis of the Reviews (Part 1)
File_1 of Preprocessing
In this preprocessing file, I import a dataset from Kaggle and look at the number of movies produced each year. The next step is to keep only those features from the dataset that are needed for model creation and analysis.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('2016.csv')
df.head(8)
# to see the number of rows and columns
df.shape
(5042, 28)
# count the number of movies produced in each year
df.title_year.value_counts(dropna=False).sort_index()
1916.0 1
1920.0 1
1925.0 1
1927.0 1
1929.0 2
...
2013.0 237
2014.0 252
2015.0 226
2016.0 106
NaN 107
Name: title_year, Length: 92, dtype: int64
# keep only the columns needed for the model
df = df.loc[:,['director_name','actor_1_name','actor_2_name','actor_3_name','genres','movie_title']]
df.head(5)
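To make the per-year count and the column selection easier to follow, here is a minimal sketch on a toy frame. The `toy` data, its values, and the `budget` column are made up for illustration and are not taken from the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values here are invented.
toy = pd.DataFrame({
    "movie_title": ["a", "b", "c", "d", "e"],
    "title_year": [2015.0, 2016.0, 2015.0, np.nan, 2016.0],
    "budget": [1, 2, 3, 4, 5],
})

# dropna=False keeps rows with a missing year as their own NaN bucket;
# sort_index() then orders the counts by year, with NaN placed last.
per_year = toy["title_year"].value_counts(dropna=False).sort_index()
print(per_year)

# Keep only the columns that are needed later on.
selected = toy.loc[:, ["movie_title", "title_year"]]
print(selected.shape)
```

Without `dropna=False`, the movies with no recorded year would silently disappear from the count.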
The next step is to clean the dataset by replacing NULL values and removing stray characters.
# to replace the NULL values with 'unknown'
df['actor_1_name'] = df['actor_1_name'].replace(np.nan, 'unknown')
df['actor_2_name'] = df['actor_2_name'].replace(np.nan, 'unknown')
df['actor_3_name'] = df['actor_3_name'].replace(np.nan, 'unknown')
df['director_name'] = df['director_name'].replace(np.nan, 'unknown')
df
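The four replacements above can also be written as a single step. This is only a sketch of an alternative using `fillna` on a list of columns; the `people` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with missing names.
people = pd.DataFrame({
    "director_name": ["Jane Doe", np.nan],
    "actor_1_name": [np.nan, "John Roe"],
    "genres": ["Action|Drama", "Comedy"],
})

# Fill missing values in all name columns at once.
name_cols = ["director_name", "actor_1_name"]
people[name_cols] = people[name_cols].fillna("unknown")
print(people)
```

Listing the columns explicitly leaves other columns, such as `genres`, untouched.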
# to replace the '|' separator with a space (regex=False treats '|' as a literal character)
df['genres'] = df['genres'].str.replace('|', ' ', regex=False)
df
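In a regular expression, `|` means alternation, so it is safer to state explicitly that the pattern is a literal character. A small sketch with made-up genre strings:

```python
import pandas as pd

# Invented sample values in the same 'A|B|C' shape as the genres column.
genres = pd.Series(["Action|Adventure|Sci-Fi", "Comedy"])

# regex=False makes pandas treat '|' as a plain character, not a regex operator.
cleaned = genres.str.replace("|", " ", regex=False)
print(cleaned.tolist())  # ['Action Adventure Sci-Fi', 'Comedy']
```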
# to convert all the titles to lowercase
df['movie_title'] = df['movie_title'].str.lower()
df['movie_title'][100]
'the fast and the furious\xa0'
# to remove the trailing non-breaking space ('\xa0') at the end of each title
df['movie_title'] = df['movie_title'].apply(lambda x : x[:-1])
df['movie_title'][100]
'the fast and the furious'
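Slicing off the last character works when every title ends with `'\xa0'`, which appears to be the case in this file. If some titles did not, the slice would cut off a real letter. A more defensive sketch, using invented titles, strips only that character:

```python
import pandas as pd

# Invented titles: one with a trailing non-breaking space, one without.
titles = pd.Series(["The Fast and the Furious\xa0", "Avatar"])

# Unconditional slice: also removes the final letter of a clean title.
sliced = titles.str.lower().apply(lambda x: x[:-1])

# rstrip removes the trailing '\xa0' only where it is present.
stripped = titles.str.lower().str.rstrip("\xa0")

print(sliced.tolist())    # ['the fast and the furious', 'avata']
print(stripped.tolist())  # ['the fast and the furious', 'avatar']
```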
Finally, I save the cleaned dataframe to a CSV file named '2016data.csv'.
df.to_csv('2016data.csv',index=False)
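As a quick sanity check, the saved data can be read back and compared with the original. The sketch below uses an in-memory buffer instead of a real file, and the `frame` contents are made up.

```python
import io

import pandas as pd

# Invented one-row frame standing in for the cleaned dataset.
frame = pd.DataFrame({"movie_title": ["avatar"], "genres": ["Action Adventure"]})

# index=False keeps the row index out of the CSV, so no extra
# unnamed column shows up when the data is read back.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```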