Introduction¶
Dataset Description¶
I chose the movies data set. This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
All Columns that appear in the raw data set:
1- id: Unique identifier for each movie.
2- imdb_id: IMDB identifier for the movie, useful for looking up movies on IMDB.
3- popularity: Numeric score representing the popularity of the movie.
4- budget: The production budget of the movie, usually in USD.
5- revenue: The total revenue (gross earnings) of the movie, usually in USD.
6- original_title: The original title of the movie.
7- cast: List of main actors in the movie.
8- homepage: URL of the movie's official homepage.
9- director: Name(s) of the director(s) of the movie.
10- tagline: A short promotional phrase or catchphrase for the movie.
11- keywords: Keywords or phrases related to the movie's content or themes.
12- overview: Brief description or synopsis of the movie.
13- runtime: Length of the movie in minutes.
14- genres: Genres associated with the movie, such as Action, Comedy, etc.
15- production_companies: Companies involved in producing the movie.
16- release_date: Release date of the movie.
17- vote_count: Total number of votes the movie received from users.
18- vote_average: Average rating of the movie based on user votes (often out of 10).
19- release_year: The year in which the movie was released.
20- budget_adj: Budget adjusted for inflation (expressed in 2010 dollars).
21- revenue_adj: Revenue adjusted for inflation (expressed in 2010 dollars).
I chose to keep columns 1, 3, 4, 5, 9, 13, 14, 17, 18, and 19, since these are the columns I need for my investigation.
Questions for Analysis¶
I'm planning to explore the following questions:
1- Does the runtime of a movie correlate with its popularity or revenue?
2- Which directors have the most high-revenue movies?
3- Which movies are most popular from year to year?
4- What kinds of properties are associated with movies that have high revenues?
#importing necessary libraries
import pandas as pd
import numpy as np
%matplotlib inline
Data Wrangling¶
# Load the data and check its size
df = pd.read_csv('tmdb-movies.csv')
df.shape #(10866, 21)
(10866, 21)
df.head(10)
 | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | ... | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | ... | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
5 | 281957 | tt1663202 | 9.110700 | 135000000 | 532950503 | The Revenant | Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn... | http://www.foxmovies.com/movies/the-revenant | Alejandro González Iñárritu | (n. One who has returned, as if from the dead.) | ... | In the 1820s, a frontiersman, Hugh Glass, sets... | 156 | Western|Drama|Adventure|Thriller | Regency Enterprises|Appian Way|CatchPlay|Anony... | 12/25/15 | 3929 | 7.2 | 2015 | 1.241999e+08 | 4.903142e+08 |
6 | 87101 | tt1340138 | 8.654359 | 155000000 | 440603537 | Terminator Genisys | Arnold Schwarzenegger|Jason Clarke|Emilia Clar... | http://www.terminatormovie.com/ | Alan Taylor | Reset the future | ... | The year is 2029. John Connor, leader of the r... | 125 | Science Fiction|Action|Thriller|Adventure | Paramount Pictures|Skydance Productions | 6/23/15 | 2598 | 5.8 | 2015 | 1.425999e+08 | 4.053551e+08 |
7 | 286217 | tt3659388 | 7.667400 | 108000000 | 595380321 | The Martian | Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ... | http://www.foxmovies.com/movies/the-martian | Ridley Scott | Bring Him Home | ... | During a manned mission to Mars, Astronaut Mar... | 141 | Drama|Adventure|Science Fiction | Twentieth Century Fox Film Corporation|Scott F... | 9/30/15 | 4572 | 7.6 | 2015 | 9.935996e+07 | 5.477497e+08 |
8 | 211672 | tt2293640 | 7.404165 | 74000000 | 1156730962 | Minions | Sandra Bullock|Jon Hamm|Michael Keaton|Allison... | http://www.minionsmovie.com/ | Kyle Balda|Pierre Coffin | Before Gru, they had a history of bad bosses | ... | Minions Stuart, Kevin and Bob are recruited by... | 91 | Family|Animation|Adventure|Comedy | Universal Pictures|Illumination Entertainment | 6/17/15 | 2893 | 6.5 | 2015 | 6.807997e+07 | 1.064192e+09 |
9 | 150540 | tt2096673 | 6.326804 | 175000000 | 853708609 | Inside Out | Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha... | http://movies.disney.com/inside-out | Pete Docter | Meet the little voices inside your head. | ... | Growing up can be a bumpy road, and it's no ex... | 94 | Comedy|Animation|Family | Walt Disney Pictures|Pixar Animation Studios|W... | 6/9/15 | 3935 | 8.0 | 2015 | 1.609999e+08 | 7.854116e+08 |
10 rows × 21 columns
#getting more info about columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   id                    10866 non-null  int64
 1   imdb_id               10856 non-null  object
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64
 4   revenue               10866 non-null  int64
 5   original_title        10866 non-null  object
 6   cast                  10790 non-null  object
 7   homepage              2936 non-null   object
 8   director              10822 non-null  object
 9   tagline               8042 non-null   object
 10  keywords              9373 non-null   object
 11  overview              10862 non-null  object
 12  runtime               10866 non-null  int64
 13  genres                10843 non-null  object
 14  production_companies  9836 non-null   object
 15  release_date          10866 non-null  object
 16  vote_count            10866 non-null  int64
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
Data Cleaning¶
# After reviewing the structure of the data and the problems that need cleaning,
# I start by removing the columns that are unnecessary for my investigation
# to reduce memory usage
df.drop(['imdb_id', 'release_date', 'original_title', 'cast', 'homepage', 'tagline', 'keywords', 'overview', 'production_companies' , 'revenue_adj' , 'budget_adj'], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            10866 non-null  int64
 1   popularity    10866 non-null  float64
 2   budget        10866 non-null  int64
 3   revenue       10866 non-null  int64
 4   director      10822 non-null  object
 5   runtime       10866 non-null  int64
 6   genres        10843 non-null  object
 7   vote_count    10866 non-null  int64
 8   vote_average  10866 non-null  float64
 9   release_year  10866 non-null  int64
dtypes: float64(2), int64(6), object(2)
memory usage: 849.0+ KB
#removing duplicates
sum(df.duplicated())
df.drop_duplicates(inplace=True)
# Dropping rows with null values in the director and genres columns:
# both are central to some of the questions, so rows missing them would affect my results
df.dropna(inplace=True)
print(df['director'].isnull().sum()) #to make sure no null values are left
print(df['genres'].isnull().sum()) #to make sure no null values are left
df.info()
0
0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10800 entries, 0 to 10865
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            10800 non-null  int64
 1   popularity    10800 non-null  float64
 2   budget        10800 non-null  int64
 3   revenue       10800 non-null  int64
 4   director      10800 non-null  object
 5   runtime       10800 non-null  int64
 6   genres        10800 non-null  object
 7   vote_count    10800 non-null  int64
 8   vote_average  10800 non-null  float64
 9   release_year  10800 non-null  int64
dtypes: float64(2), int64(6), object(2)
memory usage: 928.1+ KB
# checking if budget col has any 0 values
df[df['budget']==0].shape
(5636, 10)
#getting budget mean for non-zero values
budget_mean = df['budget'][df['budget'] > 0].mean()
budget_mean
30766902.224825718
# Replacing each 0 value with the mean (note: this affects 5636 of the 10800 rows)
df['budget'] = df['budget'].replace(0, budget_mean)
# Checking that all zero values were replaced
df[df['budget']==0].shape
(0, 10)
# checking if revenue col has any 0 values
df[df['revenue']==0].shape
(5952, 10)
#getting revenue mean for non-zero values
rev_mean =df['revenue'][df['revenue'] > 0].mean()
rev_mean
89254997.08642739
# Replacing each 0 value with the mean (note: this affects 5952 of the 10800 rows)
df['revenue'] = df['revenue'].replace(0, rev_mean)
df[df['revenue']==0].shape
(0, 10)
# Checking if the runtime column has any 0 values
df[df['runtime']==0].shape
# Getting the runtime mean for non-zero values
runtime_mean = df['runtime'][df['runtime'] > 0].mean()
# Replacing each 0 value with the mean
df['runtime'] = df['runtime'].replace(0, runtime_mean)
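Since more than half of the budget and revenue entries are zero, filling them with the mean makes an imputed number the most common value in each column. One alternative, sketched here on a small made-up frame (the values and the name `sample` are hypothetical, not taken from the data set), is to mark zeros as missing so that pandas skips them when computing statistics:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 0 stands for "not reported", as in the raw file
sample = pd.DataFrame({
    "budget": [0, 10_000_000, 0, 30_000_000],
    "revenue": [0, 25_000_000, 5_000_000, 0],
})

# Mark zeros as missing instead of filling them with the mean
cleaned = sample.replace(0, np.nan)

# mean() skips NaN by default, so only reported values are averaged
budget_mean = cleaned["budget"].mean()

# Count how many values are missing in each column
missing_per_column = cleaned.isna().sum()

print(budget_mean)
print(missing_per_column)
```

With this approach the distribution of the reported values is left untouched, at the cost of having fewer usable rows for each question.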
# Creating a function that prints the max & min values of a column
def Min_Max(a):
    print("Minimum Value:" , a.min())
    print("Maximum Value:", a.max())
# Creating functions that print the ranges of the float & int types
def floats_ranges():
    print("float16: min:" , np.finfo("float16").min , "| max: ", np.finfo("float16").max )
    print("float32: min:" , np.finfo("float32").min , "| max: ", np.finfo("float32").max )
    print("float64: min:" , np.finfo("float64").min , "| max: ", np.finfo("float64").max )
def ints_ranges():
    print(np.iinfo("int8"))
    print(np.iinfo("int16"))
    print(np.iinfo("int32"))
    print(np.iinfo("int64"))
# Checking the range of revenue (now float64 after the mean replacement)
Min_Max(df['revenue'])
# No change needed: the maximum exceeds the int32 limit (2147483647), so a 64-bit type is required
Minimum Value: 2.0 Maximum Value: 2781505847.0
#checking if the data type float64 suits the range of the value
Min_Max(df['popularity'])
Minimum Value: 0.000188 Maximum Value: 32.985763
#checking for a suitable data type range
floats_ranges()
float16: min: -65500.0 | max: 65500.0 float32: min: -3.4028235e+38 | max: 3.4028235e+38 float64: min: -1.7976931348623157e+308 | max: 1.7976931348623157e+308
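# The range fits in float16, so I change the data type for memory optimization
# Note: float16 keeps only about 3 significant digits, so values are rounded
# (e.g. 32.985763 becomes 33.0); float32 would preserve more precision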
df['popularity'] = df['popularity'].astype('float16')
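The range check only shows that the values fit in float16; it says nothing about precision. A minimal sketch of the rounding involved, using the maximum popularity value printed above (the variable names are just for illustration):

```python
import numpy as np

value = 32.985763  # maximum popularity reported by Min_Max

# Round-trip the value through each reduced-precision type
as_float16 = float(np.float16(value))
as_float32 = float(np.float32(value))

# float16 stores roughly 3 significant decimal digits
print(np.finfo(np.float16).precision)
print(as_float16)            # rounded to the nearest representable float16
print(abs(as_float32 - value))  # float32 error is far smaller
```

If the exact popularity scores matter for ranking movies, float32 is probably the safer choice.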
#checking if the data type int64 suits the range of the value
Min_Max(df['id'])
#checking for a better data type range
ints_ranges()
#changing its data type for memory optimization
df['id'] = df['id'].astype('int32')
Minimum Value: 5 Maximum Value: 417859 Machine parameters for int8 --------------------------------------------------------------- min = -128 max = 127 --------------------------------------------------------------- Machine parameters for int16 --------------------------------------------------------------- min = -32768 max = 32767 --------------------------------------------------------------- Machine parameters for int32 --------------------------------------------------------------- min = -2147483648 max = 2147483647 --------------------------------------------------------------- Machine parameters for int64 --------------------------------------------------------------- min = -9223372036854775808 max = 9223372036854775807 ---------------------------------------------------------------
#checking if the data type int64 suits the range of the value
Min_Max(df['budget'])
#checking for a better data type range
ints_ranges()
#changing its data type for memory optimization
df['budget'] = df['budget'].astype('int32')
Minimum Value: 1.0 Maximum Value: 425000000.0 Machine parameters for int8 --------------------------------------------------------------- min = -128 max = 127 --------------------------------------------------------------- Machine parameters for int16 --------------------------------------------------------------- min = -32768 max = 32767 --------------------------------------------------------------- Machine parameters for int32 --------------------------------------------------------------- min = -2147483648 max = 2147483647 --------------------------------------------------------------- Machine parameters for int64 --------------------------------------------------------------- min = -9223372036854775808 max = 9223372036854775807 ---------------------------------------------------------------
# Checking the unique values to see if converting to an integer type is possible
df['runtime'].unique()
# Checking the range of the values
Min_Max(df['runtime'])
# Checking for a better data type range
ints_ranges()
# Changing its data type for memory optimization
df['runtime'] = df['runtime'].astype('int16')
Minimum Value: 2.0 Maximum Value: 900.0 Machine parameters for int8 --------------------------------------------------------------- min = -128 max = 127 --------------------------------------------------------------- Machine parameters for int16 --------------------------------------------------------------- min = -32768 max = 32767 --------------------------------------------------------------- Machine parameters for int32 --------------------------------------------------------------- min = -2147483648 max = 2147483647 --------------------------------------------------------------- Machine parameters for int64 --------------------------------------------------------------- min = -9223372036854775808 max = 9223372036854775807 ---------------------------------------------------------------
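Instead of comparing the min and max against the printed type ranges by hand, pandas can pick the smallest integer type itself through `pd.to_numeric` with the `downcast` argument. A small sketch, using only the min/max values printed above as stand-ins for the full columns:

```python
import pandas as pd

# Stand-in series holding the min and max of each column
ids = pd.Series([5, 417859])     # id column range
runtimes = pd.Series([2, 900])   # runtime column range

# downcast="integer" selects the smallest signed integer dtype that fits
ids_small = pd.to_numeric(ids, downcast="integer")
runtimes_small = pd.to_numeric(runtimes, downcast="integer")

print(ids_small.dtype, runtimes_small.dtype)
```

This agrees with the manual choices of int32 for id and int16 for runtime.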
# Since the genres column holds multiple values separated by '|',
# I decided to split them and store the genres as a list
df['genres'] = df['genres'].str.split('|')
df['genres']
0 [Action, Adventure, Science Fiction, Thriller] 1 [Action, Adventure, Science Fiction, Thriller] 2 [Adventure, Science Fiction, Thriller] 3 [Action, Adventure, Science Fiction, Fantasy] 4 [Action, Crime, Thriller] ... 10861 [Documentary] 10862 [Action, Adventure, Drama] 10863 [Mystery, Comedy] 10864 [Action, Comedy] 10865 [Horror] Name: genres, Length: 10800, dtype: object
# Checking the number of unique values to see if converting to category is worthwhile
print(df['director'].nunique())
print(df['release_year'].nunique())
# Better to convert them to category to save memory
df['director'] = df['director'].astype('category')
df['release_year'] = df['release_year'].astype('category')
5056 56
# Checking the range of the values
Min_Max(df['vote_count'])
# Checking for a better data type range
ints_ranges()
# Changing its data type for memory optimization
df['vote_count'] = df['vote_count'].astype('int16')
Minimum Value: 10 Maximum Value: 9767 Machine parameters for int8 --------------------------------------------------------------- min = -128 max = 127 --------------------------------------------------------------- Machine parameters for int16 --------------------------------------------------------------- min = -32768 max = 32767 --------------------------------------------------------------- Machine parameters for int32 --------------------------------------------------------------- min = -2147483648 max = 2147483647 --------------------------------------------------------------- Machine parameters for int64 --------------------------------------------------------------- min = -9223372036854775808 max = 9223372036854775807 ---------------------------------------------------------------
# Checking the range of the values
Min_Max(df['vote_average'])
# Checking for a better data type range
floats_ranges()
# Changing its data type for memory optimization (float16 rounds values, e.g. 7.1 is stored as 7.101562)
df['vote_average'] = df['vote_average'].astype('float16')
Minimum Value: 1.5 Maximum Value: 9.2 float16: min: -65500.0 | max: 65500.0 float32: min: -3.4028235e+38 | max: 3.4028235e+38 float64: min: -1.7976931348623157e+308 | max: 1.7976931348623157e+308
# To make sure no zero values are left in any column (null counts are shown by df.info())
for c in df.columns:
    print(df[df[c]==0].shape , df[df[c]==0].isnull() , c)
(0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] id (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] popularity (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] budget (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] revenue (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] director (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] runtime (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] genres (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] vote_count (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] vote_average (0, 10) Empty DataFrame Columns: [id, popularity, budget, revenue, director, runtime, genres, vote_count, vote_average, release_year] Index: [] release_year
# Data set after cleaning and memory optimization
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10800 entries, 0 to 10865
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            10800 non-null  int32
 1   popularity    10800 non-null  float16
 2   budget        10800 non-null  int32
 3   revenue       10800 non-null  float64
 4   director      10800 non-null  category
 5   runtime       10800 non-null  int16
 6   genres        10800 non-null  object
 7   vote_count    10800 non-null  int16
 8   vote_average  10800 non-null  float16
 9   release_year  10800 non-null  category
dtypes: category(2), float16(2), float64(1), int16(2), int32(2), object(1)
memory usage: 624.5+ KB
df.head(100)
 | id | popularity | budget | revenue | director | runtime | genres | vote_count | vote_average | release_year |
---|---|---|---|---|---|---|---|---|---|---|
0 | 135397 | 33.000000 | 150000000 | 1.513529e+09 | Colin Trevorrow | 124 | [Action, Adventure, Science Fiction, Thriller] | 5562 | 6.500000 | 2015 |
1 | 76341 | 28.421875 | 150000000 | 3.784364e+08 | George Miller | 120 | [Action, Adventure, Science Fiction, Thriller] | 6185 | 7.101562 | 2015 |
2 | 262500 | 13.109375 | 110000000 | 2.952382e+08 | Robert Schwentke | 119 | [Adventure, Science Fiction, Thriller] | 2480 | 6.300781 | 2015 |
3 | 140607 | 11.171875 | 200000000 | 2.068178e+09 | J.J. Abrams | 136 | [Action, Adventure, Science Fiction, Fantasy] | 5292 | 7.500000 | 2015 |
4 | 168259 | 9.335938 | 190000000 | 1.506249e+09 | James Wan | 137 | [Action, Crime, Thriller] | 2947 | 7.300781 | 2015 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 258509 | 1.841797 | 30766902 | 2.337556e+08 | Walt Becker | 92 | [Adventure, Animation, Comedy, Family] | 278 | 5.699219 | 2015 |
96 | 298382 | 1.823242 | 11930000 | 1.834000e+07 | Jocelyn Moorhouse | 118 | [Drama] | 197 | 6.898438 | 2015 |
97 | 272693 | 1.758789 | 8500000 | 4.352863e+07 | Ari Sandel | 100 | [Romance, Comedy] | 753 | 6.800781 | 2015 |
98 | 283445 | 1.742188 | 10000000 | 5.288202e+07 | Ciaran Foy | 97 | [Horror] | 331 | 5.500000 | 2015 |
99 | 256961 | 1.735352 | 30000000 | 1.075972e+08 | Andy Fickman | 94 | [Action, Adventure, Comedy, Family] | 422 | 5.300781 | 2015 |
100 rows × 10 columns
Exploratory Data Analysis¶
Question 1: Does the runtime of a movie correlate with its popularity or revenue?¶
df.plot(y='runtime' , x='popularity' , kind="scatter");
df.plot(y='runtime' , x='revenue', kind="scatter" );
Answer:¶
In Summary:¶
Runtime does not appear to correlate strongly with either popularity or revenue.
Runtime Vs. Popularity:¶
The points are concentrated in a dense cluster where movies with shorter runtimes (under 200 minutes) appear across the entire range of popularity scores.
There doesn’t seem to be a strong positive or negative correlation between runtime and popularity. Some movies with shorter runtimes are highly popular, while others are not, and the popularity score doesn’t consistently increase with runtime.
The data suggests that runtime does not have a significant correlation with popularity; popular movies can be either long or short.
Runtime vs. Revenue:¶
Similar to the popularity plot, there is a dense cluster of movies with runtimes under 200 minutes, which span a wide range of revenue values.
There is no clear upward or downward trend, indicating that longer movies do not necessarily earn more revenue, and shorter movies do not necessarily earn less.
Therefore, runtime does not show a strong correlation with revenue either; high-revenue movies can have various runtimes.
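The scatter plots give a visual impression; a correlation coefficient puts a number on it. The sketch below shows the call on a small hypothetical frame (the values are invented only to illustrate the API, so the printed numbers say nothing about the real data); on the cleaned data the same `corr()` call could be applied to `df[['runtime', 'popularity', 'revenue']]`.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the call
movies = pd.DataFrame({
    "runtime": [90, 100, 110, 120, 130],
    "popularity": [2.0, 0.5, 3.0, 1.0, 2.5],
    "revenue": [40e6, 10e6, 90e6, 20e6, 60e6],
})

# Pearson correlation between every pair of numeric columns
corr = movies.corr()

# Keep only the correlations of runtime with the two columns of interest
runtime_corr = corr.loc["runtime", ["popularity", "revenue"]]
print(runtime_corr)
```

Values close to 0 would support the reading of the plots, while values close to 1 or -1 would contradict it.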
# Looking further at the correlations of the runtime column with the other columns
pd.plotting.scatter_matrix(df.iloc[:, 1:] , figsize=(15,15));
# I used df.iloc[:, 1:] to draw the matrix without the id column
Further Investigation¶
Looking at the relationship between runtime and all the other columns, there is again no clear positive or negative correlation, which suggests that runtime is not associated with a movie's revenue or popularity.
Question 2: Which directors have the most high-revenue movies?¶
# Assuming the high-revenue movies are those at or above the median (50th percentile)
# Note: over half of the revenue values were zeros replaced by the mean, so the median equals that mean
high_revenue = df['revenue'].quantile(0.5)
# Getting the top 50% revenue movies
high_revenue_df = df[df['revenue'] >= high_revenue]
# Counting directors with high-revenue movies
director_counts = high_revenue_df['director'].value_counts()
director_counts.head()
Woody Allen 26 Steven Spielberg 25 Clint Eastwood 19 Martin Scorsese 19 Ron Howard 16 Name: director, dtype: int64
# Getting the top 15 directors
top_directors = director_counts.head(15)
# plotting high revenue movies Vs their directors
top_directors.plot(figsize=(5,5), title="Top 15 Directors with the Most High-Revenue Movies" , xlabel="Director", ylabel="Number of High Revenue Movies" , kind='bar', color='blue');
Answer:¶
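I defined high-revenue movies as those with revenue at or above the median (50th percentile) revenue of all movies. Because more than half of the revenue values were zero and were replaced with the mean, this group also includes every movie whose revenue was imputed.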
According to the plot:
It appears that Woody Allen, Steven Spielberg, Clint Eastwood, Martin Scorsese, and Ron Howard are among the directors with the highest number of high-revenue movies. This suggests that these directors frequently direct movies that perform well financially, although a raw count also favors directors who simply have many films in the data set.
This plot provides a clear, ranked view of which directors have directed the most high-revenue movies.
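The median cutoff depends on how the zero revenues are handled. A sketch of a stricter alternative on hypothetical data: keep only movies with a reported (non-zero) revenue and use the 75th percentile as the threshold. Both choices, as well as the director labels and numbers, are assumptions for illustration and not the method used above.

```python
import pandas as pd

# Hypothetical data; 0 means the revenue was not reported
movies = pd.DataFrame({
    "director": ["A", "A", "B", "B", "C", "C", "C", "D"],
    "revenue": [0, 500, 100, 0, 900, 800, 50, 300],
})

# Keep only movies with a reported revenue
reported = movies[movies["revenue"] > 0]

# Use the 75th percentile of reported revenues as the cutoff
cutoff = reported["revenue"].quantile(0.75)
high = reported[reported["revenue"] >= cutoff]

# Count high-revenue movies per director
director_counts = high["director"].value_counts()
print(cutoff)
print(director_counts)
```

A higher cutoff gives a shorter list, but every movie on it then has an actual reported revenue above the threshold.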
Question 3: Which movies are most popular from year to year?¶
# Let's get the most popular movies, defined as those at or above the median popularity
most_pop = df['popularity'].quantile(0.5)
# Getting the top 50% most popular movies
most_pop_df = df[df['popularity'] >= most_pop]
# Counting the most popular movies by release year
years_count = most_pop_df['release_year'].value_counts()
years_count
2014 339 2015 320 2013 298 2011 292 2009 264 2012 261 2010 253 2008 245 2007 233 2006 210 2005 194 2004 174 2003 150 2002 149 2001 133 1996 121 1998 120 1997 115 1995 113 2000 110 1999 108 1994 105 1993 103 1992 72 1990 66 1989 62 1986 62 1988 58 1985 58 1991 57 1987 55 1984 49 1982 39 1983 37 1980 30 1981 28 1979 28 1978 28 1973 25 1977 22 1971 20 1975 19 1968 18 1976 18 1974 16 1963 14 1967 14 1969 13 1972 12 1962 12 1966 12 1960 11 1970 11 1964 10 1961 9 1965 7 Name: release_year, dtype: int64
# Plotting the 15 release years with the most popular movies
years_count.head(15).plot(figsize=(5,5), title="Top 15 Release Years by Number of Popular Movies" , xlabel="Release Year", ylabel="Number of Popular Movies" , kind='bar', color='green');
Answer:¶
It appears that the years 2014, 2015, 2013, 2011, and 2009 have the most popular movies, that is, the most movies at or above the median popularity.
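The count per year shows which years have many popular movies. To name the single most popular movie of each year, a group-wise `idxmax` could be used. This sketch assumes the title column is kept during cleaning (it was dropped in this notebook) and uses hypothetical titles and scores:

```python
import pandas as pd

# Hypothetical data; original_title would need to be kept in the cleaned frame
movies = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [2014, 2014, 2015, 2015],
    "popularity": [3.2, 7.9, 5.1, 4.4],
})

# Index label of the most popular movie within each release year
top_idx = movies.groupby("release_year")["popularity"].idxmax()

# Look up the matching rows
top_per_year = movies.loc[top_idx, ["release_year", "original_title", "popularity"]]
print(top_per_year)
```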
Question 4: What kinds of properties are associated with movies that have high revenues?¶
# We already have the high-revenue movies in the variable high_revenue_df
# Now I want to look at their genres first
# Genres are stored as lists, so I explode them into one row per genre for easier investigation
high_revenue_df = high_revenue_df.explode('genres')
#getting genres with high revenue movies
top_genres = high_revenue_df['genres'].value_counts()
top_genres
Drama 2941 Comedy 2500 Thriller 1873 Action 1606 Horror 1176 Romance 1041 Adventure 1031 Family 955 Science Fiction 865 Crime 769 Fantasy 672 Animation 587 Mystery 516 Documentary 410 Music 262 History 216 War 181 TV Movie 161 Foreign 155 Western 118 Name: genres, dtype: int64
#plotting graph
top_genres.head(15).plot(figsize=(5,5), title="Top 15 Genres with the Most High-Revenue Movies" , xlabel="Genre", ylabel="Number of High Revenue Movies" , kind='bar', color='red');
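Counting genres shows how often each genre appears, but not how much those movies earn. A sketch of how the exploded frame could also give the mean revenue per genre, on hypothetical data (the frame `movies` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical data in the same '|'-separated format as the raw file
movies = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Action|Drama"],
    "revenue": [300.0, 100.0, 200.0],
})

# Split into lists, then create one row per (movie, genre) pair
movies["genres"] = movies["genres"].str.split("|")
per_genre = movies.explode("genres")

# Number of movies and mean revenue for each genre
genre_counts = per_genre["genres"].value_counts()
mean_revenue = per_genre.groupby("genres")["revenue"].mean()
print(genre_counts)
print(mean_revenue)
```

Note that after exploding, a movie with several genres contributes its full revenue to each of them.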
# Since we have a high-revenue df,
# let me draw a scatter matrix to view its correlations better
pd.plotting.scatter_matrix(high_revenue_df.iloc[: , 1:] , figsize=(10,10));
Answer:¶
From the scatter plot matrix, we can observe the following trends:
Popularity and Revenue:¶
There seems to be a positive correlation between popularity and revenue, suggesting that more popular movies tend to generate higher revenue.
Budget and Revenue:¶
There's also a noticeable positive correlation between budget and revenue. Higher budgets may contribute to higher production quality, marketing reach, and distribution, all of which could drive up revenue.
Vote Count and Revenue:¶
A strong positive correlation exists between vote count and revenue, indicating that high-revenue movies are often rated by many viewers.
Runtime and Revenue:¶
There’s no strong correlation between runtime and revenue, suggesting that longer or shorter films are not necessarily associated with higher revenue.
Vote Average and Revenue:¶
There is a moderate positive trend, showing that higher-rated movies tend to have higher revenue, although this relationship is not as strong as with popularity and vote count.
Conclusions¶
I started by cleaning and organizing my movies data set (I removed unnecessary columns, removed duplicates, and converted data types to optimize memory usage).
I asked some questions about the data, and here is what I found:
1- Does the runtime of a movie correlate with its popularity or revenue?¶
The runtime of a movie does not significantly influence either its popularity or revenue. Scatter plot analysis shows no strong correlation between runtime and popularity or revenue, as movies with various runtimes are distributed across all popularity and revenue levels. High-popularity or high-revenue movies can have short or long runtimes, indicating that runtime is not a determining factor in a movie's success in terms of popularity or financial performance.
2- Which directors have the most high-revenue movies?¶
Directors like Woody Allen and Steven Spielberg have repeatedly directed high-revenue movies, highlighting their impact and popularity in the film industry. This analysis reflects that certain directors are strongly associated with financially successful films, possibly due to factors like established fan bases, genre specialization, or storytelling style.
3- Which movies are most popular from year to year?¶
The chart showed that the most popular movies are primarily from recent years, with a particular concentration in 2014, 2015, 2013, 2011, and 2009. This trend suggests that recent movies (from the early 2010s onward) have higher popularity ratings, which could be due to better accessibility through streaming platforms, increased marketing, or evolving audience preferences. It appears that newer releases tend to gain more attention and popularity, possibly influenced by contemporary trends in media consumption and digital access.
4- What kinds of properties are associated with movies that have high revenues?¶
It appeared that high-revenue movies tend to have high popularity, larger budgets, and higher vote counts. They are also associated with genres like Drama, Comedy, and Thriller. However, runtime did not appear to be a significant factor in revenue generation, meaning movies of varying lengths can perform well financially. These findings suggest that investing in popular genres and ensuring wide audience appeal can increase the likelihood of achieving high revenue.
Limitations:¶
Bias Toward More Recent Movies:¶
As seen in the popularity chart, recent movies are more represented among popular titles. This could be due to factors like improved global distribution or online marketing rather than intrinsic quality or appeal, leading to potential bias. Older movies might not have the same level of recorded popularity or revenue data, skewing analysis.
Limited Genre Information:¶
The genres column contains multiple genres per movie, which can make it difficult to isolate the impact of individual genres on revenue or popularity. The analysis assumes that all genres listed for a movie contribute equally to its popularity or revenue, which may not be accurate.
Further Investigation¶
Analyze Genre Combinations: Since movies often belong to multiple genres, you could explore which genre combinations are associated with higher revenue or popularity. For instance, comparing "Action | Adventure" movies against "Comedy | Romance" might yield insights into the profitability of genre blends.
Investigate Trends Over Time: Explore how certain attributes (like budget, runtime, and genre popularity) have evolved over different periods. Are there shifts in preferred movie lengths, budgets, or genre popularity over the years?
Study Director and Actor Influence: Examine the impact of certain directors or popular actors on a movie's success. Do movies directed by well-known directors or starring specific actors tend to have higher revenue or popularity?
Audience Reviews and Ratings Analysis: Look into the relationship between vote average, vote count, and popularity. Do highly rated movies also tend to be the most popular or profitable, or is there a discrepancy between audience ratings and revenue?
Analyze Changes in Movie-Making Trends: Investigate changes in film-making attributes (e.g., genre focus, runtime, budgets) over time to see how the industry adapts to audience preferences or technological advancements.
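As a starting point for the genre-combination idea above, each movie's genre list could be turned into an order-independent label and grouped on. This is a sketch on hypothetical data; the column name `genre_combo` is an illustrative choice.

```python
import pandas as pd

# Hypothetical data with genres already split into lists
movies = pd.DataFrame({
    "genres": [["Action", "Adventure"], ["Adventure", "Action"], ["Comedy", "Romance"]],
    "revenue": [300.0, 500.0, 100.0],
})

# Sort the genres so that the listing order does not create separate groups
movies["genre_combo"] = movies["genres"].apply(lambda g: " | ".join(sorted(g)))

# Number of movies and mean revenue per genre combination
combo_stats = movies.groupby("genre_combo")["revenue"].agg(["count", "mean"])
print(combo_stats)
```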
# Running this cell will execute a bash command to convert this notebook to an .html file
!python -m nbconvert --to html Investigate_a_Dataset.ipynb
[NbConvertApp] Converting notebook Investigate_a_Dataset.ipynb to html [NbConvertApp] WARNING | Alternative text is missing on 7 image(s). [NbConvertApp] Writing 1347661 bytes to Investigate_a_Dataset.html