6. Introduction to DataFrames

In this exercise we introduce dataframes, Python objects that we will work with a lot in this class. Dataframes are how we will work with data and get it ready for visualizing and modeling. This exercise covers:


  • Creating DataFrames

  • Selecting from DataFrames

  • Creating new columns

  • Filtering rows

  • DataFrame functions

We’ll be using the Pandas library to do most of this work. Let’s take a look at how to create and modify dataframes using Pandas!

6.1. Creating DataFrames

#import python libraries that we will use
import numpy as np
import pandas as pd

Create a dataframe from scratch

## create a dictionary
d = {'A':[1,2,3,4], 'B':[5,6,7,8] }

## convert the dictionary to a dataframe
df_test = pd.DataFrame(data=d)

#take a look at the dataframe
df_test

Add a column to the dataframe

#add list of names
df_test['ID'] = ['Steve','Sarojini','Emil','Sarah']

#add country
df_test['country'] = 'Canada'

#take a look (notice how it filled in the country!)
df_test
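
To see why the country was filled in for every row, here is a minimal sketch using a small made-up dataframe (df_demo and its values are illustrative, not part of the exercise data). A list gives one value per row, while a single value is repeated down the whole column:

```python
import pandas as pd

# Hypothetical demo data, kept separate from df_test
df_demo = pd.DataFrame({'A': [1, 2, 3]})

# A list supplies one value per row, so its length must match the number of rows
df_demo['name'] = ['x', 'y', 'z']

# A single (scalar) value is repeated for every row
df_demo['country'] = 'Canada'

print(df_demo)

# A list of the wrong length is rejected with a ValueError
try:
    df_demo['bad'] = [1, 2]
except ValueError as err:
    print('ValueError:', err)
```

The failed assignment leaves the dataframe unchanged, so no partial 'bad' column is created.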

All dataframes also have an index, which labels the rows, and a set of column names, which labels the columns.

#index of the rows
df_test.index

#index of the columns
df_test.columns
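
As a small sketch of what these labels look like, the example below builds a made-up dataframe (df_demo is illustrative only) and converts the row and column labels to plain lists. When no index is given, pandas numbers the rows starting from 0:

```python
import pandas as pd

# Hypothetical demo data, kept separate from df_test
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Row labels: a default index counts up from 0
row_labels = list(df_demo.index)

# Column labels: the dictionary keys, in order
col_labels = list(df_demo.columns)

print(row_labels)     # [0, 1, 2]
print(col_labels)     # ['A', 'B']
print(df_demo.shape)  # (3, 2), i.e. (number of rows, number of columns)
```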

6.2. Selecting from a Dataframe

#take a look at the full dataframe
df_test

#grab the A column
df_test['A']

#or similarly grab the A column
df_test.A

#grab the first three rows
df_test[0:3]
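
The sketch below, on a made-up dataframe (df_demo is illustrative only), checks that the two ways of grabbing a column give the same result. One caveat worth knowing: the dot form only works when the column name is a valid Python name and does not clash with an existing dataframe attribute, so the bracket form is the safer habit.

```python
import pandas as pd

# Hypothetical demo data, kept separate from df_test
df_demo = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# Two ways to select a single column
col_bracket = df_demo['A']
col_dot = df_demo.A

# Both return the same column
same = col_bracket.equals(col_dot)
print(same)  # True

# A slice inside plain square brackets selects rows by position
first_three = df_demo[0:3]
print(first_three)
```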

It is also possible to select by slicing, similar to a numpy array!

Here we use loc with square brackets to select rows and columns, [rows , columns], with similar slicing, [start:stop:step]. loc uses the index labels and the column names of the dataframe, and unlike numpy slicing it includes the stop label.

Below we take the first three rows and only the A and B columns:

#first three rows and the A and B columns
df_test.loc[ 0:2 , ["A","B"] ]
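
Because loc works with labels, its slices include the stop label, which differs from ordinary Python and numpy slicing. Here is a minimal sketch on a made-up dataframe (df_demo and its columns are illustrative only):

```python
import pandas as pd

# Hypothetical demo data, kept separate from df_test
df_demo = pd.DataFrame({'A': [10, 20, 30, 40],
                        'B': [5, 6, 7, 8],
                        'C': ['w', 'x', 'y', 'z']})

# Rows labelled 1, 2 and 3 (the stop label 3 is included), columns A and C
by_label = df_demo.loc[1:3, ['A', 'C']]

# A step of 2 keeps the rows labelled 0 and 2; 'A':'B' keeps columns A through B
every_other = df_demo.loc[0:3:2, 'A':'B']

# For comparison, a plain Python list slice excludes the stop position
plain = [10, 20, 30, 40][1:3]

print(by_label)
print(every_other)
print(plain)  # [20, 30]
```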

Let's try out what you've learnt. Try to select the ID and country columns and only the third row:
#ID and country columns, and the third row
df_test.loc[?, ?]

6.3. Selecting from a Dataframe: using positions

#full dataframe
df_test

#first three rows and the second and third columns - using :
df_test.iloc[ 0:3 , 1:3 ]

#second and fourth rows, and second and fourth columns - using lists
df_test.iloc[ [1,3] , [1,3] ]
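
iloc ignores the labels and counts positions from 0, and its slices exclude the stop position just like numpy. To make the difference from loc visible, the sketch below uses a made-up dataframe with text row labels (df_demo and the labels r1 to r4 are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical demo data with text row labels, kept separate from df_test
df_demo = pd.DataFrame({'A': [10, 20, 30, 40],
                        'B': [5, 6, 7, 8],
                        'C': ['w', 'x', 'y', 'z']},
                       index=['r1', 'r2', 'r3', 'r4'])

# Positions 0 and 1 for rows, positions 1 and 2 for columns (stop is excluded)
by_slice = df_demo.iloc[0:2, 1:3]

# Lists pick out specific positions: first and last row, first and last column
by_list = df_demo.iloc[[0, 3], [0, 2]]

# Two single positions return one value: second row, first column
single = df_demo.iloc[1, 0]

print(by_slice)
print(by_list)
print(single)  # 20
```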

Try to select the ID and country columns and only the third row using iloc:
#ID and country columns, and the third row
df_test.iloc[?, ?]

6.4. Creating new variables

#multiply the B column by 4 to create a new variable C
df_test['C'] = df_test['B'] * 4

#multiply the C column by A to create a new variable D
df_test['D'] = df_test['C'] * df_test['A']

#take a look
df_test
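
Arithmetic on columns is applied row by row. As a worked check, the sketch below rebuilds the same A and B values from section 6.1 under a separate name (df_demo, so nothing in df_test is touched) and prints the resulting numbers:

```python
import pandas as pd

# Same A and B values as in section 6.1, under a separate name
df_demo = pd.DataFrame(data={'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# Every value in B is multiplied by 4
df_demo['C'] = df_demo['B'] * 4

# Two columns are multiplied row by row
df_demo['D'] = df_demo['C'] * df_demo['A']

print(df_demo['C'].tolist())  # [20, 24, 28, 32]
print(df_demo['D'].tolist())  # [20, 48, 84, 128]
```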

6.5. Filtering DataFrames

Keep only the rows/columns that you’d like

#keep only the rows where A is less than 3
df_test_sub = df_test[df_test.A<3]

#keep only the columns A and B
df_test_sub = df_test_sub[['A','B']]

#take a look
df_test_sub
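
Filtering rows works in two steps: the comparison produces a column of True/False values (often called a mask), and the dataframe keeps only the rows where the mask is True. A minimal sketch on a made-up dataframe (df_demo is illustrative only):

```python
import pandas as pd

# Hypothetical demo data, kept separate from df_test
df_demo = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# Step 1: the comparison gives one True/False value per row
mask = df_demo['A'] < 3
print(mask.tolist())  # [True, True, False, False]

# Step 2: only the rows where the mask is True are kept
subset = df_demo[mask]
print(subset)

# The original dataframe still has all of its rows
print(len(df_demo))  # 4
```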

Add the C and D columns together, store the result back in D, and keep only the rows where the new D is greater than 100:
#Add the columns C and D together
df_test['D'] = ? + ?

#keep only the rows where D is greater than 100
df_test_100 = ?

#take a look
df_test_100

6.6. Dataframe functions

When we create a dataframe we are creating an object, and this dataframe object has attributes and functions (methods) built in. Let’s take a look at a few now, and we will see more as we go along in this course.

Try just typing df_test. in a code cell.

After typing the “.” (and pressing Tab, if nothing appears) you should see the list of attributes and functions available to you once you’ve created a dataframe object.


One useful attribute is dtypes. It tells us what data type is stored in each column. Because it is an attribute and not a function, it is written without parentheses.

#What data type is stored in each column
df_test.dtypes

The sum function is very useful for quickly summing up the values in a column.

#What is the sum of the A column
df_test['A'].sum()
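
Here is a short sketch of both on a made-up dataframe (df_demo and its columns are illustrative only). The exact type names printed can differ between pandas versions, so the example only checks the general kind of each numeric column:

```python
import pandas as pd

# Hypothetical demo data with an integer, a decimal and a text column
df_demo = pd.DataFrame({'A': [1, 2, 3, 4],
                        'x': [0.5, 1.5, 2.5, 3.5],
                        'name': ['a', 'b', 'c', 'd']})

# One data type per column
print(df_demo.dtypes)

# General kind of each numeric column
a_is_int = pd.api.types.is_integer_dtype(df_demo['A'])
x_is_float = pd.api.types.is_float_dtype(df_demo['x'])

# Sum of a single column
total_A = df_demo['A'].sum()
print(total_A)  # 10

# Sum of several columns at once: one total per column
col_totals = df_demo[['A', 'x']].sum()
print(col_totals)
```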

The ? is a very useful feature of Jupyter (IPython) in general: placed before or after a name, it shows help for most objects.

#find help on functions, objects, ... etc
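
The ? shortcut only works inside Jupyter or IPython. As a rough equivalent that also works in a plain Python script, the sketch below uses the built-in help function on a made-up dataframe (df_demo is illustrative only):

```python
import pandas as pd

# Hypothetical demo data, kept separate from df_test
df_demo = pd.DataFrame({'A': [1, 2, 3]})

# In Jupyter you could type:  df_demo.sum?
# help() is built into Python and prints the same documentation text
help(df_demo.sum)

# The documentation text itself is stored with the function
doc = df_demo.sum.__doc__
print(len(doc) > 0)  # True
```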

6.7. Further reading

