Mintu Konwar
4 min read · Aug 22, 2021


Movie Recommendation System with Sentiment Analysis of the Reviews (Part 3)

File_3 of preprocessing

In this preprocessing file, I will access the HTML of a webpage and extract useful data from it. This technique is called web scraping (also known as web harvesting or web data extraction).

import pandas as pd
import numpy as np
from tmdbv3api import TMDb
from tmdbv3api import Movie
import json
import requests
import warnings
warnings.filterwarnings("ignore")

Web Scraping 2018 movie details from Wikipedia

movielink = "https://en.wikipedia.org/wiki/List_of_American_films_of_2018"
df1 = pd.read_html(movielink, header=0)[2]
df2 = pd.read_html(movielink, header=0)[3]
df3 = pd.read_html(movielink, header=0)[4]
df4 = pd.read_html(movielink, header=0)[5]
df = df1.append(df2.append(df3.append(df4, ignore_index=True), ignore_index=True), ignore_index=True)
df
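A note for readers on newer pandas versions: `DataFrame.append` was deprecated and removed in pandas 2.0, so the chained `append` calls above will no longer run there. A minimal sketch of the same combine step with `pd.concat`, using two made-up stand-in tables in place of the scraped quarterly tables:

```python
import pandas as pd

# Toy stand-ins for two of the quarterly tables scraped from Wikipedia
q1 = pd.DataFrame({"Title": ["Movie A"], "Cast and crew": ["X (director); Y"]})
q2 = pd.DataFrame({"Title": ["Movie B"], "Cast and crew": ["Z (director); W"]})

# pd.concat does the same job as the chained append() calls
df = pd.concat([q1, q2], ignore_index=True)
```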

The dataset scraped from Wikipedia does not contain any genres for the movies. So in the next step I will use a third-party website (www.themoviedb.org) to collect the genres. For that I will need an API key for the TMDB website, which is required for fetching data from it.

Steps to get an API key:

Create an account at https://www.themoviedb.org/, then go to your account settings and click on the API link in the left-hand sidebar to apply for an API key. If a website URL is requested, just state "NA" if you do not have one. Once your request is accepted, the API key will appear in your API sidebar.

tmdb = TMDb()
tmdb.api_key = 'f82d3658e4fe4c21487f2c409f868517'

This get_genre() function fetches the genres of each movie from the TMDB website with the help of the API key.

tmdb_movie = Movie()
def get_genre(x):
    genres = []
    result = tmdb_movie.search(x)
    movie_id = result[0].id
    response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key={}'.format(movie_id, tmdb.api_key))
    data_json = response.json()
    if data_json['genres']:
        genre_str = " "
        for i in range(0, len(data_json['genres'])):
            genres.append(data_json['genres'][i]['name'])
        return genre_str.join(genres)
    else:
        return np.NaN
## Here I am mapping the genres to each movie title using a pandas lambda function
df['genres'] = df['Title'].map(lambda x: get_genre(str(x)))
df
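To see what get_genre() does with the API response without hitting the network, here is the same genre-joining logic applied to a hypothetical sample of TMDB's JSON (the payload below is made up for illustration):

```python
import numpy as np

# Hypothetical slice of the JSON TMDB returns for one movie lookup
data_json = {"genres": [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]}

def join_genres(data_json):
    # Mirrors get_genre(): genre names joined by a single space
    if data_json.get("genres"):
        return " ".join(g["name"] for g in data_json["genres"])
    return np.nan

join_genres(data_json)  # "Action Adventure"
```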
df_2018 = df[['Title','Cast and crew','genres']]
df_2018
def get_director(x):
    if " (director)" in x:
        return x.split(" (director)")[0]
    elif " (directors)" in x:
        return x.split(" (directors)")[0]
    else:
        return x.split(" (director/screenplay)")[0]
df_2018['director_name'] = df_2018['Cast and crew'].map(lambda x: get_director(x))

def get_actor1(x):
    return ((x.split("screenplay); ")[-1]).split(", ")[0])
df_2018['actor_1_name'] = df_2018['Cast and crew'].map(lambda x: get_actor1(x))

def get_actor2(x):
    if len((x.split("screenplay); ")[-1]).split(", ")) < 2:
        return np.NaN
    else:
        return ((x.split("screenplay); ")[-1]).split(", ")[1])
df_2018['actor_2_name'] = df_2018['Cast and crew'].map(lambda x: get_actor2(x))

def get_actor3(x):
    if len((x.split("screenplay); ")[-1]).split(", ")) < 3:
        return np.NaN
    else:
        return ((x.split("screenplay); ")[-1]).split(", ")[2])
df_2018['actor_3_name'] = df_2018['Cast and crew'].map(lambda x: get_actor3(x))
df_2018
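To make the splitting logic concrete, here is how it behaves on a hypothetical "Cast and crew" cell in the format Wikipedia uses (the names below are made up; the splits mirror get_director() and the get_actor functions):

```python
import numpy as np

# Hypothetical "Cast and crew" cell in the format Wikipedia uses
crew = "Ryan Coogler (director/screenplay); Chadwick Boseman, Michael B. Jordan"

# Same logic as get_director(): strip everything from the director tag onward
director = crew.split(" (director/screenplay)")[0]

# Same logic as get_actor1/2/3(): everything after "screenplay); " is the cast list
cast = crew.split("screenplay); ")[-1].split(", ")
actor_1 = cast[0]
actor_2 = cast[1] if len(cast) > 1 else np.nan
actor_3 = cast[2] if len(cast) > 2 else np.nan  # only two actors here, so NaN
```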
df_2018 = df_2018.rename(columns={'Title':'movie_title'})
new_df18 = df_2018.loc[:,['director_name','actor_1_name','actor_2_name','actor_3_name','genres','movie_title']]
new_df18
new_df18['actor_2_name'] = new_df18['actor_2_name'].replace(np.nan, 'unknown')
new_df18['actor_3_name'] = new_df18['actor_3_name'].replace(np.nan, 'unknown')
new_df18['movie_title'] = new_df18['movie_title'].str.lower()
new_df18['comb'] = new_df18['actor_1_name'] + ' ' + new_df18['actor_2_name'] + ' ' + new_df18['actor_3_name'] + ' ' + new_df18['director_name'] + ' ' + new_df18['genres']
new_df18
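For a single row, the combined-features column looks like this (toy values, not from the real dataset):

```python
import pandas as pd

# One toy row; every value here is made up for illustration
row = pd.DataFrame([{
    "director_name": "Jane Doe",
    "actor_1_name": "Actor One", "actor_2_name": "Actor Two", "actor_3_name": "unknown",
    "genres": "Drama", "movie_title": "some film",
}])
row["comb"] = (row["actor_1_name"] + " " + row["actor_2_name"] + " "
               + row["actor_3_name"] + " " + row["director_name"] + " " + row["genres"])
```

The 'comb' column simply packs cast, director, and genres into one space-separated string, which the recommender can later vectorize as a single text feature.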

The same steps are now repeated for the 2019 movies.

Web Scraping features of 2019 movies from Wikipedia

movielink = "https://en.wikipedia.org/wiki/List_of_American_films_of_2019"
df1 = pd.read_html(movielink, header=0)[3]
df2 = pd.read_html(movielink, header=0)[4]
df3 = pd.read_html(movielink, header=0)[5]
df4 = pd.read_html(movielink, header=0)[6]
df = df1.append(df2.append(df3.append(df4, ignore_index=True), ignore_index=True), ignore_index=True)
df
df['genres'] = df['Title'].map(lambda x: get_genre(str(x)))
df_2019 = df[['Title','Cast and crew','genres']]
df_2019
df_2019['director_name'] = df_2019['Cast and crew'].map(lambda x: get_director(str(x)))
df_2019['actor_1_name'] = df_2019['Cast and crew'].map(lambda x: get_actor1(x))
df_2019['actor_2_name'] = df_2019['Cast and crew'].map(lambda x: get_actor2(x))
df_2019['actor_3_name'] = df_2019['Cast and crew'].map(lambda x: get_actor3(x))
df_2019 = df_2019.rename(columns={'Title':'movie_title'})
new_df19 = df_2019.loc[:,['director_name','actor_1_name','actor_2_name','actor_3_name','genres','movie_title']]
new_df19['actor_2_name'] = new_df19['actor_2_name'].replace(np.nan, 'unknown')
new_df19['actor_3_name'] = new_df19['actor_3_name'].replace(np.nan, 'unknown')
new_df19['movie_title'] = new_df19['movie_title'].str.lower()
new_df19['comb'] = new_df19['actor_1_name'] + ' ' + new_df19['actor_2_name'] + ' ' + new_df19['actor_3_name'] + ' ' + new_df19['director_name'] + ' ' + new_df19['genres']
new_df19

Combining the 2018 and 2019 movie datasets

df_1819 = new_df18.append(new_df19, ignore_index=True)
df_1819
old_df = pd.read_csv('2017data.csv')
old_df

Combining the 2016 and 2017 movies with the 2018–19 dataset

df_16171819 = old_df.append(df_1819, ignore_index=True)
df_16171819
## Check for any NULL values 
df_16171819.isna().sum()
director_name 0
actor_1_name 0
actor_2_name 0
actor_3_name 0
genres 3
movie_title 0
comb 3
dtype: int64
df_16171819 = df_16171819.dropna(how='any')
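The effect of `dropna(how='any')` on rows like the three with missing genres above can be sketched on a toy frame (values made up):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing genre, mirroring the NULL check above
toy = pd.DataFrame({"movie_title": ["film a", "film b"],
                    "genres": ["Action", np.nan]})

# how='any' drops a row if any of its columns is NaN
toy = toy.dropna(how="any")
```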

This is the final dataset, containing all the movies up to 2019.

df_16171819.to_csv('final_16171819.csv',index=False)


Mintu Konwar

A third-year MCA student at Dibrugarh University with an interest in cybersecurity, software development, IT, and Machine Learning.