6. Introduction to DataFrames

In this exercise we introduce DataFrames, Python objects that we will use heavily in this class. DataFrames are how we will hold our data and get it ready for visualizing and modeling.

Outline:

  • Creating dataframes

  • Selecting from DataFrames

  • Creating new columns

  • Filtering rows

  • DataFrame functions

We’ll be using the Pandas library to do most of this work. Let’s take a look at how to create and modify dataframes using Pandas!

6.1. Creating DataFrames

#import python libraries that we will use
import numpy as np
import pandas as pd

Create a dataframe from scratch


## create a dictionary
d = {'A':[1,2,3,4], 'B':[5,6,7,8] }

## convert the dictionary to a dataframe
df_test = pd.DataFrame(data=d)

#take a look at the dataframe
df_test
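
As a quick check on how the dictionary maps onto the dataframe, the sketch below rebuilds the same df_test and inspects its size. It assumes only pandas. Each dictionary key becomes a column and each list supplies that column's values, one per row.

```python
import pandas as pd

# same dictionary as above: keys become columns, lists become column values
d = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}
df_test = pd.DataFrame(data=d)

# shape is (number of rows, number of columns)
print(df_test.shape)  # (4, 2)

# len() of a dataframe is its number of rows
print(len(df_test))   # 4
```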

Add a column to the data frame

#add list of names
df_test['ID'] = ['Steve','Sarojini','Emil','Sarah']

#add country
df_test['country'] = 'Canada'

#take a look (notice how it filled in the country!)
df_test
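
The country column was filled in because pandas repeats a single value for every row, while a list has to supply exactly one value per row. A minimal sketch of both cases, rebuilding df_test so that it runs on its own (the column name too_short is made up for the illustration):

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# a list must have one entry per row (4 here)
df_test['ID'] = ['Steve', 'Sarojini', 'Emil', 'Sarah']

# a single value is repeated for every row
df_test['country'] = 'Canada'
print(df_test['country'].tolist())  # ['Canada', 'Canada', 'Canada', 'Canada']

# a list of the wrong length raises a ValueError
try:
    df_test['too_short'] = [1, 2]  # hypothetical column, only 2 values for 4 rows
except ValueError as err:
    print('ValueError:', err)
```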

Every dataframe also has an index, which labels its rows, and a set of column names, which labels its columns.

#index of the rows
df_test.index
#index of the columns
df_test.columns
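
To see what these two objects hold, the sketch below converts them to plain lists. When no index is given at creation, pandas numbers the rows starting from 0.

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# default row index: the integers 0 .. n-1
print(list(df_test.index))    # [0, 1, 2, 3]

# column index: the column names
print(list(df_test.columns))  # ['A', 'B']
```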

6.2. Selecting from a Dataframe

#take a look at the full dataframe
df_test
#grab the A column
df_test['A']
#or similarly grab the A column
df_test.A
#grab the first three rows
df_test[0:3]

It is also possible to select by slicing, much like with a numpy array!

Here we use the loc indexer with square brackets to select rows and columns, [rows , columns], with similar slicing, [start:stop:step]. loc uses the index labels and the column names of the dataframe, and unlike numpy slicing it includes the stop label.

Below we take the first three rows (index labels 0 through 2) and only the A and B columns:

#first three rows and the A and B columns
df_test.loc[ 0:2 , ["A","B"] ]
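
One detail worth checking for yourself is that a loc slice includes its stop label, while plain square-bracket slicing excludes the stop position. A small sketch comparing the two on the same dataframe:

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# loc slices by label and includes the stop label: rows 0, 1 and 2
by_label = df_test.loc[0:2, ['A', 'B']]
print(len(by_label))     # 3

# plain [] slicing is positional and excludes the stop: rows 0 and 1
by_position = df_test[0:2]
print(len(by_position))  # 2
```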



Let's try out what you've learnt. Try to select the ID and country columns and only the third row:

#ID and country columns, and the third row
df_test.loc[?, ?]

6.3. Selecting from a Dataframe: using positions

#full dataframe
df_test
#first three rows and the second and third columns - using :
df_test.iloc[0:3,1:3]
#second, and fourth rows, and second and fourth columns - using lists
df_test.iloc[ [1,3] , [1,3] ]
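
The difference between positions (iloc) and labels (loc) is easiest to see when the row labels are not 0, 1, 2, and so on. The sketch below uses a hypothetical dataframe, df_demo, with made-up letter labels for its rows. It is only an illustration, not part of the exercise data.

```python
import pandas as pd

# hypothetical dataframe whose row labels are letters instead of 0, 1, 2, ...
df_demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]},
    index=['w', 'x', 'y', 'z'],
)

# iloc: integer positions, stop excluded (like numpy) -> rows 'w' and 'x'
first_two = df_demo.iloc[0:2, 0]
print(first_two.tolist())   # [1, 2]

# loc: labels, stop included -> rows 'w', 'x' and 'y'
w_through_y = df_demo.loc['w':'y', 'A']
print(w_through_y.tolist())  # [1, 2, 3]
```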



Try to select the ID and country columns and only the third row using iloc:

#ID and country columns, and the third row
df_test.iloc[?, ?]

6.4. Creating new variables


#multiply the B column by 4 to create a new variable C
df_test['C'] = df_test['B'] * 4

#multiply the C column by A to create a new variable D
df_test['D'] = df_test['C'] * df_test['A']

#take a look
df_test
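
These operations work element by element: a column times a number scales every row, and a column times a column multiplies the two values in each row. The sketch below repeats the two steps on a fresh copy of the starting dataframe and prints the resulting values.

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# a column times a number: every element is multiplied by 4
df_test['C'] = df_test['B'] * 4
print(df_test['C'].tolist())  # [20, 24, 28, 32]

# a column times a column: values are multiplied row by row
df_test['D'] = df_test['C'] * df_test['A']
print(df_test['D'].tolist())  # [20, 48, 84, 128]
```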

6.5. Filtering DataFrames

Keep only the rows/columns that you’d like

#keep only the rows where A is less than 3
df_test_sub = df_test[df_test.A<3]

#keep only the columns A and B
df_test_sub = df_test_sub[['A','B']]

#take a look
df_test_sub
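
It can help to look at the filter in two steps. The comparison produces a column of True/False values (often called a mask), and indexing with that mask keeps only the rows marked True. A minimal sketch, where the variable name mask is just a convenient choice:

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# the comparison gives one True/False value per row
mask = df_test['A'] < 3
print(mask.tolist())  # [True, True, False, False]

# indexing with the mask keeps only the True rows
df_test_sub = df_test[mask]
print(df_test_sub['A'].tolist())  # [1, 2]

# .copy() makes the subset independent of df_test if you plan to modify it
df_test_sub = df_test[mask].copy()
```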



Add the C and D columns together, store the result in D, and select only the rows where D is greater than 100:

#Add the columns C and D together
df_test['D'] = ? + ?

#keep only the rows where D is greater than 100
df_test_100 = ?

#take a look
df_test_100

6.6. Dataframe functions

When we create a dataframe we are creating an object, and this dataframe object has functions (called methods) and attributes built in. Let’s take a look at a few now, and we will see more as we go along in this course.

Try just typing df_test.

After typing the “.” you should see the list of functions available to you once you’ve created a dataframe object.

df_test.

One useful attribute is dtypes. It tells us what data type is stored in each column. Because it is an attribute rather than a function, it is written without parentheses.

#What data type is stored in each column
df_test.dtypes
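
The exact type names that dtypes prints can differ between platforms and pandas versions, so the sketch below checks the kind of data instead, using helper functions from pd.api.types. It rebuilds a small version of df_test with one text column.

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
df_test['ID'] = ['Steve', 'Sarojini', 'Emil', 'Sarah']

# dtypes holds one data type per column
print(df_test.dtypes)

# the A column holds integers (the exact name, e.g. int64, can vary)
a_is_int = pd.api.types.is_integer_dtype(df_test['A'])
print(a_is_int)       # True

# the ID column holds text, so it is not numeric
id_is_numeric = pd.api.types.is_numeric_dtype(df_test['ID'])
print(id_is_numeric)  # False
```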

The sum function is very useful for quickly summing up the values in a column.

#What is the sum of the A column
df_test.A.sum()
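
The same call also works on several columns at once, returning one total per column, and other summaries such as mean follow the same pattern. A short sketch on the starting dataframe:

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# sum of a single column
print(df_test['A'].sum())  # 10

# sum of several columns at once: one total per column
totals = df_test[['A', 'B']].sum()
print(totals['A'], totals['B'])  # 10 26

# the mean works the same way
print(df_test['A'].mean())  # 2.5
```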

The ? is a very useful notebook (IPython) feature in general, and can provide us with useful information about most objects.

#find help on functions, objects, ... etc
?df_test
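
The ? syntax only works inside a notebook or IPython session. In plain Python, the built-in help function shows the same kind of documentation, which is stored on each object as its docstring. A small sketch, assuming you want the documentation for the sum function used above:

```python
import pandas as pd

df_test = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# help() works in any Python session, not just notebooks;
# uncomment to print the full documentation for sum
# help(df_test.sum)

# the same text is stored on the method as its docstring
doc = df_test.sum.__doc__

# print just the first line of the documentation
print(doc.strip().splitlines()[0])
```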

6.7. Further reading

Pandas quick intro: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
Pandas longer intro (indexing and selecting data): https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

If you would like the notebook without missing code, check out the full code version.