Mintu Konwar
4 min read · Aug 22, 2021


Movie Recommendation System with Sentiment Analysis of the Reviews (Part 3)

File_3 of preprocessing

In this preprocessing file, I will access the HTML of a webpage and extract useful data from it. This technique is called web scraping (also known as web harvesting or web data extraction).

import pandas as pd
import numpy as np
from tmdbv3api import TMDb
from tmdbv3api import Movie
import json
import requests
import warnings
warnings.filterwarnings("ignore")

Web Scraping 2018 movie details from Wikipedia

movielink = "https://en.wikipedia.org/wiki/List_of_American_films_of_2018"
df1 = pd.read_html(movielink, header=0)[2]
df2 = pd.read_html(movielink, header=0)[3]
df3 = pd.read_html(movielink, header=0)[4]
df4 = pd.read_html(movielink, header=0)[5]
df = df1.append(df2.append(df3.append(df4, ignore_index=True), ignore_index=True), ignore_index=True)
df
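A note for readers on newer pandas versions: `DataFrame.append` was deprecated and removed in pandas 2.0, so the chained `append` calls above will no longer run there. A minimal sketch of the same combine step with `pd.concat`, using two made-up stand-in tables in place of the scraped quarterly tables:

```python
import pandas as pd

# Toy stand-ins for two of the quarterly tables scraped from Wikipedia
q1 = pd.DataFrame({"Title": ["Movie A"], "Cast and crew": ["X (director); Y"]})
q2 = pd.DataFrame({"Title": ["Movie B"], "Cast and crew": ["Z (director); W"]})

# pd.concat does the same job as the chained append() calls
df = pd.concat([q1, q2], ignore_index=True)
```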

The dataset scraped from Wikipedia does not contain any genres for the movies. So in the next step I will use a third-party website (www.themoviedb.org) to collect the genres. For that I will need an API key for the TMDB website, which is required for fetching data from it.

Steps to get an API key:

Create an account at https://www.themoviedb.org/, then go to your account settings and click on the API link in the left-hand sidebar to apply for an API key. If a website URL is requested, just state "NA" if you do not have one. Once your request is accepted, the API key will appear in your API sidebar.

tmdb = TMDb()
tmdb.api_key = 'f82d3658e4fe4c21487f2c409f868517'

This get_genre() function fetches the genres of each movie from the TMDB website with the help of the API key.

tmdb_movie = Movie()
def get_genre(x):
    genres = []
    result = tmdb_movie.search(x)
    movie_id = result[0].id
    response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key={}'.format(movie_id, tmdb.api_key))
    data_json = response.json()
    if data_json['genres']:
        genre_str = " "
        for i in range(0, len(data_json['genres'])):
            genres.append(data_json['genres'][i]['name'])
        return genre_str.join(genres)
    else:
        return np.NaN
## Here I am mapping the genres to each movie title using a pandas lambda function
df['genres'] = df['Title'].map(lambda x: get_genre(str(x)))
df
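To see what get_genre() does with the API response without hitting the network, here is the same genre-joining logic applied to a hypothetical sample of TMDB's JSON (the payload below is made up for illustration):

```python
import numpy as np

# Hypothetical slice of the JSON TMDB returns for one movie lookup
data_json = {"genres": [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]}

def join_genres(data_json):
    # Mirrors get_genre(): genre names joined by a single space
    if data_json.get("genres"):
        return " ".join(g["name"] for g in data_json["genres"])
    return np.nan

join_genres(data_json)  # "Action Adventure"
```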
df_2018 = df[['Title','Cast and crew','genres']]
df_2018
def get_director(x):
    if " (director)" in x:
        return x.split(" (director)")[0]
    elif " (directors)" in x:
        return x.split(" (directors)")[0]
    else:
        return x.split(" (director/screenplay)")[0]
df_2018['director_name'] = df_2018['Cast and crew'].map(lambda x: get_director(x))

def get_actor1(x):
    return ((x.split("screenplay); ")[-1]).split(", ")[0])
df_2018['actor_1_name'] = df_2018['Cast and crew'].map(lambda x: get_actor1(x))

def get_actor2(x):
    if len((x.split("screenplay); ")[-1]).split(", ")) < 2:
        return np.NaN
    else:
        return ((x.split("screenplay); ")[-1]).split(", ")[1])
df_2018['actor_2_name'] = df_2018['Cast and crew'].map(lambda x: get_actor2(x))

def get_actor3(x):
    if len((x.split("screenplay); ")[-1]).split(", ")) < 3:
        return np.NaN
    else:
        return ((x.split("screenplay); ")[-1]).split(", ")[2])
df_2018['actor_3_name'] = df_2018['Cast and crew'].map(lambda x: get_actor3(x))
df_2018
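To make the splitting logic concrete, here is how it behaves on a hypothetical "Cast and crew" cell in the format Wikipedia uses (the names below are made up; the splits mirror get_director() and the get_actor functions):

```python
import numpy as np

# Hypothetical "Cast and crew" cell in the format Wikipedia uses
crew = "Ryan Coogler (director/screenplay); Chadwick Boseman, Michael B. Jordan"

# Same logic as get_director(): strip everything from the director tag onward
director = crew.split(" (director/screenplay)")[0]

# Same logic as get_actor1/2/3(): everything after "screenplay); " is the cast list
cast = crew.split("screenplay); ")[-1].split(", ")
actor_1 = cast[0]
actor_2 = cast[1] if len(cast) > 1 else np.nan
actor_3 = cast[2] if len(cast) > 2 else np.nan  # only two actors here, so NaN
```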
df_2018 = df_2018.rename(columns={'Title':'movie_title'})
new_df18 = df_2018.loc[:,['director_name','actor_1_name','actor_2_name','actor_3_name','genres','movie_title']]
new_df18
new_df18['actor_2_name'] = new_df18['actor_2_name'].replace(np.nan, 'unknown')
new_df18['actor_3_name'] = new_df18['actor_3_name'].replace(np.nan, 'unknown')
new_df18['movie_title'] = new_df18['movie_title'].str.lower()
new_df18['comb'] = new_df18['actor_1_name'] + ' ' + new_df18['actor_2_name'] + ' ' + new_df18['actor_3_name'] + ' ' + new_df18['director_name'] + ' ' + new_df18['genres']
new_df18
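For a single row, the combined-features column looks like this (toy values, not from the real dataset):

```python
import pandas as pd

# One toy row; every value here is made up for illustration
row = pd.DataFrame([{
    "director_name": "Jane Doe",
    "actor_1_name": "Actor One", "actor_2_name": "Actor Two", "actor_3_name": "unknown",
    "genres": "Drama", "movie_title": "some film",
}])
row["comb"] = (row["actor_1_name"] + " " + row["actor_2_name"] + " "
               + row["actor_3_name"] + " " + row["director_name"] + " " + row["genres"])
```

The 'comb' column simply packs cast, director, and genres into one space-separated string, which the recommender can later vectorize as a single text feature.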

The same steps are now repeated for the 2019 movies.

Web Scraping features of 2019 movies from Wikipedia

movielink = "https://en.wikipedia.org/wiki/List_of_American_films_of_2019"
df1 = pd.read_html(movielink, header=0)[3]
df2 = pd.read_html(movielink, header=0)[4]
df3 = pd.read_html(movielink, header=0)[5]
df4 = pd.read_html(movielink, header=0)[6]
df = df1.append(df2.append(df3.append(df4, ignore_index=True), ignore_index=True), ignore_index=True)
df
df['genres'] = df['Title'].map(lambda x: get_genre(str(x)))
df_2019 = df[['Title','Cast and crew','genres']]
df_2019
df_2019['director_name'] = df_2019['Cast and crew'].map(lambda x: get_director(str(x)))
df_2019['actor_1_name'] = df_2019['Cast and crew'].map(lambda x: get_actor1(x))
df_2019['actor_2_name'] = df_2019['Cast and crew'].map(lambda x: get_actor2(x))
df_2019['actor_3_name'] = df_2019['Cast and crew'].map(lambda x: get_actor3(x))
df_2019 = df_2019.rename(columns={'Title':'movie_title'})
new_df19 = df_2019.loc[:,['director_name','actor_1_name','actor_2_name','actor_3_name','genres','movie_title']]
new_df19['actor_2_name'] = new_df19['actor_2_name'].replace(np.nan, 'unknown')
new_df19['actor_3_name'] = new_df19['actor_3_name'].replace(np.nan, 'unknown')
new_df19['movie_title'] = new_df19['movie_title'].str.lower()
new_df19['comb'] = new_df19['actor_1_name'] + ' ' + new_df19['actor_2_name'] + ' ' + new_df19['actor_3_name'] + ' ' + new_df19['director_name'] + ' ' + new_df19['genres']
new_df19

Combining the 2018 and 2019 movie datasets

df_1819 = new_df18.append(new_df19, ignore_index=True)
df_1819
old_df = pd.read_csv('2017data.csv')
old_df

Combining the 2016 and 2017 movies with the 2018–19 dataset

df_16171819 = old_df.append(df_1819, ignore_index=True)
df_16171819
## Check for any NULL values 
df_16171819.isna().sum()
director_name 0
actor_1_name 0
actor_2_name 0
actor_3_name 0
genres 3
movie_title 0
comb 3
dtype: int64
df_16171819 = df_16171819.dropna(how='any')
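The effect of `dropna(how='any')` on rows like the three with missing genres above can be sketched on a toy frame (values made up):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing genre, mirroring the NULL check above
toy = pd.DataFrame({"movie_title": ["film a", "film b"],
                    "genres": ["Action", np.nan]})

# how='any' drops a row if any of its columns is NaN
toy = toy.dropna(how="any")
```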

This is the final dataset, containing all the movies up to 2019.

df_16171819.to_csv('final_16171819.csv',index=False)


Mintu Konwar

A third-year MCA student at Dibrugarh University with an interest in cybersecurity, software development, IT, and Machine Learning.