Pandas Exercises - View and Explore Data

This is Pandas exercise about data analysis and exploration. It's the first exercise from a series (check the end of the post for more details).

You can test knowledge in several areas:

initial data analysis
"viewing data" - as advertised on the Pandas page: Viewing data
summarizing data

We can think of data exploration as getting to know your food. I think that most of us prefer to see, taste and smell the food - before eating. For me it's similar to data.

Data should be "seen, smelled and tasted".

Setup

In this tutorial we will work with the following dataset: The Movies Dataset.

The dataset contains data for 45,000 movies. Data points include cast, crew, plot keywords, revenue, release dates, languages, production companies, countries, vote averages and more.

We will focus on the file: movies_metadata.csv. Data from the CSV file can be read as follows:

import pandas as pddf = pd.read_csv('movies_metadata.csv', low_memory=False)

1: How many rows and columns?

Usually the first thing which I do is to get the DataFrame dimension.

It will give us a rough idea about the size of the DataFrame. Why rough? Because you may have nested hierarchical data - multilevel columns.

df.shape

This will give us:

(45466, 24)

This might save you time and effort. Sometimes it is good to start with smaller data samples.

Bonus question 1

Can you guess how many rows and columns?

df.shape[0]

Bonus Question 2

Can you name synonyms or other names of columns and rows?

rows - observations, records, trials

2: View the top and bottom rows

How can we see the first N records of the dataset? What about the last N rows?

- top rows

df.head()

- last rows

df.tail()

Bonus question 3

How can we see bottom and top rows simultaneously in Pandas?

import numpy as npdf.iloc[np.r_[0:5, -5:0]]

Did you solve that one? Good job, it was a tricky one!

Bonus question 4

Do you know why there are parentheses and sometimes not?
like:

df.head()

and:

df.shape

The reason is because

head()

is a method and should be with parentheses. While

shape

is an attribute - so it should be without parentheses

3: Display the index and columns

Can you display the column names? What about index/row names?

df.columns

df.index

We will get about the columns:

4: Generate descriptive statistics

What about getting a quick statistical summary for a DataFrame in Pandas? This would give us valuable information about:

count
mean
standard deviation
percentile

df.describe()

This would give us:

	revenue	runtime	vote_average	vote_count
count	4.546000e+04	45203.000000	45460.000000	45460.000000
mean	1.120935e+07	94.128199	5.618207	109.897338
std	6.433225e+07	38.407810	1.924216	491.310374
min	0.000000e+00	0.000000	0.000000	0.000000
25%	0.000000e+00	85.000000	5.000000	3.000000
50%	0.000000e+00	95.000000	6.000000	10.000000
75%	0.000000e+00	107.000000	6.800000	34.000000
max	2.787965e+09	1256.000000	10.000000	14075.000000

The method will analyze both numeric and object series, as well as DataFrame column sets of mixed data types.

5: What are the dtypes?

What are the data types of the columns in a DataFrame. Data types might be several kinds like:

numeric
datetime
string/object
nested data - dictionary, list or json
categorical

It can help us when we work with our data. Think of it like knowing the OS of the computer. Working on Linux differs from Windows? At least for me.

So how to get data types?

df.dtypes

This would give us:

adult objectbelongs_to_collection objectbudget objectgenres objecthomepage objectid objectimdb_id objectoriginal_language objectoriginal_title objectoverview objectpopularity objectposter_path objectproduction_companies objectproduction_countries objectrelease_date objectrevenue float64runtime float64spoken_languages objectstatus objecttagline objecttitle objectvideo objectvote_average float64vote_count float64dtype: object

6: Transpose rows

How can we see the first N records of the dataset - transposed?

I like to transpose rows when I'm working with few rows with many columns.

The first few rows looks like that(transposed):

	0	1
adult	False	False
belongs_to_collection	{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}	NaN
budget	30000000	65000000
genres	[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]	[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
homepage	http://toystory.disney.com/toy-story	NaN
id	862	8844
imdb_id	tt0114709	tt0113497
original_language	en	en
original_title	Toy Story	Jumanji
overview	Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circ*mstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.	When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.
popularity	21.946943	17.015539
poster_path	/rhIRbceoE9lR4veEXuwCC2wARtG.jpg	/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
production_companies	[{'name': 'Pixar Animation Studios', 'id': 3}]	[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]
production_countries	[{'iso_3166_1': 'US', 'name': 'United States of America'}]	[{'iso_3166_1': 'US', 'name': 'United States of America'}]
release_date	1995-10-30	1995-12-15
revenue	373554033.0	262797249.0
runtime	81.0	104.0
spoken_languages	[{'iso_639_1': 'en', 'name': 'English'}]	[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]
status	Released	Released
tagline	NaN	Roll the dice and unleash the excitement!
title	Toy Story	Jumanji
video	False	False
vote_average	7.7	6.9
vote_count	5415.0	2413.0

And transposing rows is possible by:

df.head(2).T

Which is the alias of method transpose(). This is another thing to have in mind working with Pandas - aliases.

7: Sort values

Can you get the top most rated movies from this dataset? To do so we will need to sort data by a column.

So the last exercise is to get the names and number of votes for the most voted movies in that dataset.

df.sort_values(by='vote_count', ascending=False)[['title', 'vote_count']].head()

Bonus question 5

Can you find the top 7 values for the numerical columns? Return the result as a DataFrame:

	revenue	runtime	vote_average	vote_count
0	2.787965e+09	1256.0	10.0	14075.0
1	2.068224e+09	1140.0	10.0	12269.0
2	1.845034e+09	1140.0	10.0	12114.0
3	1.519558e+09	931.0	10.0	12000.0
4	1.513529e+09	925.0	10.0	11444.0
5	1.506249e+09	900.0	10.0	11187.0
6	1.405404e+09	877.0	10.0	10297.0

Can you optimize the proposed solution below?

from pandas.api.types import is_numeric_dtypedfs = []for col in df.columns: top_values = [] if is_numeric_dtype(df[col]): top_values = df[col].nlargest(n=7) dfs.append(pd.DataFrame({col: top_values}).reset_index(drop=True))pd.concat(dfs, axis=1)

Were you able to solve it? Well done!

Bonus question 6

Method describe() is good for initial exploratory data analysis. For more details like:

type
unique values
missing values
most frequent values
histogram and more

Can you research and find Python library which can do "serious exploratory data analysis"?

We can use library: pandas-profiling · PyPI

ProfileReport(df, title="Pandas Profiling Report")

Conclusion

Congratulations! You have passed the first set of Pandas exercises. We've tasted the movie dataset.

Now you will have a better understanding of your data and how to do initial data analysis.

In this post, we learned about initial data analysis in Pandas. Most popular methods in Pandas for "viewing data".

We saw some advanced techniques and tricks very useful when data is touched for the first time.

Pandas exercises for beginners

Pandas exercises will be structured as a sample project plus exercises - a place where we can practice data science.

The goal is to test your knowledge and learn in 3 main areas:

data cleaning
data analysis
data collection

The areas above are often reported as problematic in data science.

Data preparation is blamed to consume up to 80% from Data science projects. Dirty data is one of the main factors for failed projects in data science and machine learning.

Pandas Exercises - View and Explore Data - Part 1 (2024)