Pandas Exercises - View and Explore Data - Part 1 (2024)

This is a Pandas exercise on data analysis and exploration. It is the first exercise in a series (see the end of the post for more details).

You can test your knowledge in several areas:

  • initial data analysis
  • "viewing data" - as advertised on the Pandas page: Viewing data
  • summarizing data

We can think of data exploration as getting to know your food. Most of us prefer to see, smell and taste food before eating it. For me, data is similar.

Data should be "seen, smelled and tasted".

Setup

In this tutorial we will work with the following dataset: The Movies Dataset.

The dataset contains data for 45,000 movies. Data points include cast, crew, plot keywords, revenue, release dates, languages, production companies, countries, vote averages and more.

We will focus on the file: movies_metadata.csv. Data from the CSV file can be read as follows:

import pandas as pd

df = pd.read_csv('movies_metadata.csv', low_memory=False)

1: How many rows and columns?

Usually the first thing I do is check the dimensions of the DataFrame.

This gives us a rough idea of the size of the DataFrame. Why rough? Because the data may be nested or hierarchical, for example with multilevel columns.

df.shape

This will give us:

(45466, 24)

Knowing the size up front might save you time and effort. For a large dataset it is often good to start with a smaller data sample.
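One way to start with a smaller sample is to limit how many rows are parsed. The sketch below assumes a tiny in-memory CSV (the `csv_text` string is made up for illustration) instead of the real movies file, and uses the `nrows` parameter of `pd.read_csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "title,vote_count\nA,10\nB,20\nC,30\nD,40\n"

# nrows limits how many data rows are parsed from the file.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(sample.shape)  # (2, 2)
```

With the real dataset the same idea would be `pd.read_csv('movies_metadata.csv', nrows=1000)`, where 1000 is an arbitrary sample size.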

Bonus question 1

Can you get only the number of rows? What about only the number of columns?

df.shape[0]  # rows; df.shape[1] gives the number of columns
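As a small illustration on a made-up frame (`demo` is not part of the movies dataset), the shape tuple can be unpacked, and `len()` returns the row count as well:

```python
import pandas as pd

# Tiny illustrative frame: 3 rows, 2 columns.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = demo.shape

print(n_rows, n_cols)  # 3 2
print(len(demo))       # 3
```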

Bonus Question 2

Can you name synonyms or other names of columns and rows?

2: View the top and bottom rows

How can we see the first N records of the dataset? What about the last N rows?

- top rows

df.head()

- last rows

df.tail()

Bonus question 3

How can we see bottom and top rows simultaneously in Pandas?

import numpy as np

df.iloc[np.r_[0:5, -5:0]]

Did you solve that one? Good job, it was a tricky one!
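A NumPy-free alternative is to stack the two views. This is only a sketch on a made-up 20-row frame, and it uses 3 rows from each end to keep the output short:

```python
import pandas as pd

# Illustrative frame with 20 rows.
demo = pd.DataFrame({"n": range(20)})

# Stack the first 3 and the last 3 rows into one DataFrame.
ends = pd.concat([demo.head(3), demo.tail(3)])

print(ends["n"].tolist())  # [0, 1, 2, 17, 18, 19]
```

Note that if the frame has fewer rows than the two slices combined, some rows will appear twice.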

Bonus question 4

Do you know why some calls use parentheses and others do not?
For example:

df.head()

and:

df.shape

The reason is that

head()

is a method, so it is called with parentheses, while

shape

is an attribute, so it is accessed without parentheses.
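The difference can be checked with Python's built-in `callable()`. A minimal sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3]})

# head is a method: an object that has to be called to do its work.
print(callable(demo.head))   # True

# shape is an attribute: a plain tuple that is already computed.
print(callable(demo.shape))  # False
print(type(demo.shape))      # <class 'tuple'>
```

Writing `demo.head` without parentheses does not raise an error. It simply returns the method object instead of the rows.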

3: Display the index and columns

Can you display the column names? What about index/row names?

df.columns
df.index

For the columns we get:

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count'], dtype='object')

and for the index:

RangeIndex(start=0, stop=45466, step=1)
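Both results are Index objects, so they can be converted to plain lists or used for membership checks. A short sketch on a made-up frame with custom row labels:

```python
import pandas as pd

# Illustrative frame with explicit row labels "x" and "y".
demo = pd.DataFrame(
    {"title": ["A", "B"], "votes": [1, 2]},
    index=["x", "y"],
)

print(list(demo.columns))       # ['title', 'votes']
print(list(demo.index))         # ['x', 'y']
print("votes" in demo.columns)  # True
```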

4: Generate descriptive statistics

What about getting a quick statistical summary for a DataFrame in Pandas? This would give us valuable information about:

  • count
  • mean
  • standard deviation
  • percentiles

df.describe()

This would give us:

            revenue       runtime  vote_average    vote_count
count  4.546000e+04  45203.000000  45460.000000  45460.000000
mean   1.120935e+07     94.128199      5.618207    109.897338
std    6.433225e+07     38.407810      1.924216    491.310374
min    0.000000e+00      0.000000      0.000000      0.000000
25%    0.000000e+00     85.000000      5.000000      3.000000
50%    0.000000e+00     95.000000      6.000000     10.000000
75%    0.000000e+00    107.000000      6.800000     34.000000
max    2.787965e+09   1256.000000     10.000000  14075.000000

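The method can analyze both numeric and object series, as well as DataFrame column sets of mixed data types. By default, for a DataFrame with mixed types only the numeric columns are summarized, which is why just four columns appear above. Passing include='all' also covers the object columns.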
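To see how the output changes with the `include` parameter, here is a sketch on a made-up frame with one text column and one numeric column:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B", "B"],
    "votes": [1.0, 2.0, 3.0],
})

# Default behaviour: only the numeric column is summarized.
default_summary = demo.describe()
print(list(default_summary.columns))  # ['votes']

# include="all" adds rows such as unique, top and freq for text columns.
all_summary = demo.describe(include="all")
print(all_summary.loc["top", "title"])  # B
```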

5: What are the dtypes?

What are the data types of the columns in a DataFrame? There are several kinds of data types, such as:

  • numeric
  • datetime
  • string/object
  • nested data - dictionary, list or json
  • categorical

Knowing the data types helps when we work with the data. Think of it like knowing the OS of a computer: working on Linux differs from working on Windows, at least for me.

So how to get data types?

df.dtypes

This would give us:

adult                    object
belongs_to_collection    object
budget                   object
genres                   object
homepage                 object
id                       object
imdb_id                  object
original_language        object
original_title           object
overview                 object
popularity               object
poster_path              object
production_companies     object
production_countries     object
release_date             object
revenue                  float64
runtime                  float64
spoken_languages         object
status                   object
tagline                  object
title                    object
video                    object
vote_average             float64
vote_count               float64
dtype: object
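Once the dtypes are known, columns can be picked by type with `select_dtypes`. In this sketch the frame is made up, and `budget` is deliberately stored as strings to mirror the object dtype seen in the output above:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B"],
    "budget": ["100", "200"],   # numbers stored as text
    "revenue": [1.5, 2.5],
})

# Only truly numeric columns are selected; text-encoded numbers are not.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['revenue']
print(other_cols)    # ['title', 'budget']
```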

6: Transpose rows

How can we see the first N records of the dataset - transposed?

I like to transpose rows when I'm working with a few rows that have many columns.

The first two rows look like this (transposed):

column | 0 | 1
adult | False | False
belongs_to_collection | {'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'} | NaN
budget | 30000000 | 65000000
genres | [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}] | [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
homepage | http://toystory.disney.com/toy-story | NaN
id | 862 | 8844
imdb_id | tt0114709 | tt0113497
original_language | en | en
original_title | Toy Story | Jumanji
overview | Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences. | When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.
popularity | 21.946943 | 17.015539
poster_path | /rhIRbceoE9lR4veEXuwCC2wARtG.jpg | /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
production_companies | [{'name': 'Pixar Animation Studios', 'id': 3}] | [{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]
production_countries | [{'iso_3166_1': 'US', 'name': 'United States of America'}] | [{'iso_3166_1': 'US', 'name': 'United States of America'}]
release_date | 1995-10-30 | 1995-12-15
revenue | 373554033.0 | 262797249.0
runtime | 81.0 | 104.0
spoken_languages | [{'iso_639_1': 'en', 'name': 'English'}] | [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]
status | Released | Released
tagline | NaN | Roll the dice and unleash the excitement!
title | Toy Story | Jumanji
video | False | False
vote_average | 7.7 | 6.9
vote_count | 5415.0 | 2413.0

Transposing the rows is possible with:

df.head(2).T

.T is a shorthand for the method transpose(). Such aliases are another thing to keep in mind when working with Pandas.
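A quick check on a made-up frame shows that the shorthand and the method give the same result:

```python
import pandas as pd

# Illustrative frame: 2 rows, 3 columns.
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Both spellings swap rows and columns.
print(demo.T.equals(demo.transpose()))  # True

# 2 rows x 3 columns becomes 3 rows x 2 columns.
print(demo.T.shape)  # (3, 2)
```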

7: Sort values

Can you find the movies with the most votes in this dataset? To do so, we need to sort the data by a column.

So the last exercise is to get the titles and the number of votes for the most voted movies in the dataset.

df.sort_values(by='vote_count', ascending=False)[['title', 'vote_count']].head()
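When only the top few rows are needed, `nlargest` is a possible alternative to a full sort. The sketch below compares both approaches on a made-up frame without tied vote counts:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_count": [10, 40, 20, 30],
})

# Approach from the exercise: sort everything, then take the top rows.
top_sorted = (
    demo.sort_values(by="vote_count", ascending=False)[["title", "vote_count"]]
    .head(2)
)

# Alternative: ask directly for the 2 largest values of the column.
top_nlargest = demo.nlargest(2, "vote_count")[["title", "vote_count"]]

print(top_nlargest["title"].tolist())   # ['B', 'D']
print(top_sorted.equals(top_nlargest))  # True
```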

Bonus question 5

Can you find the top 7 values for the numerical columns? Return the result as a DataFrame:

        revenue  runtime  vote_average  vote_count
0  2.787965e+09   1256.0          10.0     14075.0
1  2.068224e+09   1140.0          10.0     12269.0
2  1.845034e+09   1140.0          10.0     12114.0
3  1.519558e+09    931.0          10.0     12000.0
4  1.513529e+09    925.0          10.0     11444.0
5  1.506249e+09    900.0          10.0     11187.0
6  1.405404e+09    877.0          10.0     10297.0

Can you optimize the proposed solution below?

from pandas.api.types import is_numeric_dtype

dfs = []
for col in df.columns:
    top_values = []
    if is_numeric_dtype(df[col]):
        top_values = df[col].nlargest(n=7)
        dfs.append(pd.DataFrame({col: top_values}).reset_index(drop=True))
pd.concat(dfs, axis=1)

Were you able to solve it? Well done!
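One possible optimization, shown here as a sketch rather than the only answer, replaces the explicit loop with `select_dtypes` plus `apply`. The frame is made up and uses the top 3 values to keep it short:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": list("ABCDE"),
    "revenue": [5.0, 1.0, 4.0, 2.0, 3.0],
    "runtime": [90.0, 120.0, 80.0, 100.0, 110.0],
})

# Keep numeric columns only, then take the largest values of each column.
# Resetting the index lines the results up as rows 0, 1, 2.
top = demo.select_dtypes(include="number").apply(
    lambda col: col.nlargest(3).reset_index(drop=True)
)

print(top["revenue"].tolist())  # [5.0, 4.0, 3.0]
print(top["runtime"].tolist())  # [120.0, 110.0, 100.0]
```

For the movies dataset the same expression with `nlargest(7)` on `df` should produce the four-column table shown in the question.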

Bonus question 6

The describe() method is good for initial exploratory data analysis, but it does not cover details like:

  • type
  • unique values
  • missing values
  • most frequent values
  • histogram and more

Can you research and find a Python library that can do "serious exploratory data analysis"?

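We can use the library pandas-profiling from PyPI (newer releases are published under the name ydata-profiling):

from pandas_profiling import ProfileReport

ProfileReport(df, title="Pandas Profiling Report")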
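If installing an extra library is not an option, some of these details can be collected by hand. The sketch below is a rough approximation on a made-up frame, not a replacement for a profiling report, and the column names of `summary` are my own choice:

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B", "B", None],
    "votes": [1.0, None, 3.0, 3.0],
})

# One row per column: type, distinct values, missing values, most frequent value.
summary = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "unique": demo.nunique(),
    "missing": demo.isna().sum(),
    "most_frequent": demo.mode().iloc[0],
})

print(summary[["unique", "missing"]])
```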

Conclusion

Congratulations! You have completed the first set of Pandas exercises. We've tasted the movie dataset.

You should now have a better understanding of your data and of how to do initial data analysis.

In this post, we covered initial data analysis in Pandas and the most popular Pandas methods for "viewing data".

We also saw some advanced techniques and tricks that are very useful when data is touched for the first time.

Pandas exercises for beginners

Pandas exercises will be structured as a sample project plus exercises - a place where we can practice data science.

The goal is to test your knowledge and learn in 3 main areas:

  • data cleaning
  • data analysis
  • data collection

The areas above are often reported as problematic in data science.

Data preparation is often said to consume up to 80% of the time in data science projects. Dirty data is one of the main reasons why data science and machine learning projects fail.

Author: Francesca Jacobs Ret
