Intermediate Python for Data Science

This notebook covers the core intermediate Python skills used in data science. Working through it, you will learn about:

  • visualizing data with matplotlib
  • reading and manipulating data with pandas DataFrames
  • dictionaries
  • logic, control flow, and filtering
  • loops
  • a random walk case study

Matplotlib

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.69, 5.263, 6.972]

# line plot
plt.plot(year, pop)
plt.xlabel("Year")
plt.ylabel("Population (billion)")
plt.show()

# Scatter plot
plt.scatter(year, pop)
plt.xlabel("Year")
plt.ylabel("Population (billion)")
plt.show()

data = pd.read_csv("data/worldpop.csv")
data.head()
   Unnamed: 0      country  year  population      cont  life_exp       gdp_cap
0          11  Afghanistan  2007  31889923.0      Asia    43.828    974.580338
1          23      Albania  2007   3600523.0    Europe    76.423   5937.029526
2          35      Algeria  2007  33333216.0    Africa    72.301   6223.367465
3          47       Angola  2007  12420476.0    Africa    42.731   4797.231267
4          59    Argentina  2007  40301927.0  Americas    75.320  12779.379640
pop = data["population"]
gdp = data["gdp_cap"]

# Line plot
plt.plot(pop, gdp)
plt.xlabel("Population")
plt.ylabel("GDP per capita")
plt.show()

# Scatter plot
plt.scatter(pop, gdp)
plt.xlabel("Population")
plt.ylabel("GDP per capita")
plt.show()

Histogram

life_exp = data["life_exp"]

plt.title("Histogram with 15 bins")
plt.hist(life_exp, bins=15)
plt.show()

# to clean up plot
plt.clf()


plt.title("Histogram with 5 bins")
plt.hist(life_exp, bins=5)
plt.show()

# Scatter plot of life expectancy against GDP per capita
plt.scatter(gdp, life_exp)

# Use a logarithmic x-axis and label the plot
plt.xscale('log') 
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')

# Definition of tick_val and tick_lab
tick_val = [1000, 10000, 100000]
tick_lab = ['1k', '10k', '100k']

# Adapt the ticks on the x-axis
plt.xticks(tick_val, tick_lab)

# After customizing, display the plot
plt.show()

Dictionaries and Pandas

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Print out the keys in europe
print(europe.keys(), '\n')

# Print out value that belongs to key 'norway'
print(europe['norway'])
dict_keys(['spain', 'france', 'germany', 'norway']) 

oslo
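
Beyond reading keys and values, a dictionary can be changed in place. The following is a minimal sketch using the same europe dictionary; the 'italy' entry and the "unknown" fallback string are only illustrative choices.

```python
# Same dictionary as above
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}

# Add a new key-value pair (assigning to an existing key would overwrite its value)
europe['italy'] = 'rome'

# Check whether a key is present
has_italy = 'italy' in europe

# Remove a key-value pair
del europe['norway']

# .get() returns a fallback instead of raising KeyError for a missing key
norway_capital = europe.get('norway', 'unknown')

print(has_italy)
print(norway_capital)
print(list(europe.keys()))
```

Using .get() is handy when a missing key is an expected case rather than an error.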
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }  # named cars_dict to avoid shadowing the built-in dict
cars = pd.DataFrame(cars_dict)
print(cars,"\n\n")

# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)
         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45 


           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
# to read data
brics = pd.read_csv("data/brics.csv", index_col = 0)

# to print data
print("Our brics dataset....\n")
print(brics, "\n\n")

# return pandas series
print("Printing pandas series...\n")
print(brics["country"], "\n\n")

# return pandas dataframe
print("Printing pandas dataframe...\n")
print(brics[["country"]],"\n\n")


# select multiple columns (the result is still a single DataFrame)
print("Printing multiple dataframes...\n")
print(brics[["country", "population"]],"\n\n")

# slice rows by position: rows 1 up to, but not including, 4
print("Printing multiple dataframes with indexing...\n")
print(brics[1:4])
Our brics dataset....

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98 


Printing pandas series...

BR          Brazil
RU          Russia
IN           India
CH           China
SA    South Africa
Name: country, dtype: object 


Printing pandas dataframe...

         country
BR        Brazil
RU        Russia
IN         India
CH         China
SA  South Africa 


Printing multiple dataframes...

         country  population
BR        Brazil      200.40
RU        Russia      143.50
IN         India     1252.00
CH         China     1357.00
SA  South Africa       52.98 


Printing multiple dataframes with indexing...

   country    capital    area  population
RU  Russia     Moscow  17.100       143.5
IN   India  New Delhi   3.286      1252.0
CH   China    Beijing   9.597      1357.0
# Row access: loc
print("Row access pandas series\n")
print(brics.loc["IN"],"\n\n") 

print("Row access pandas dataframe\n")
print(brics.loc[["IN"]],"\n\n") 

print("Multiple Row access\n")
print(brics.loc[["IN", "CH", "SA"]],"\n\n") 

print("Rows with specific column\n")
print(brics.loc[["RU", "IN","SA"], ["country", "capital"]])

Row access pandas series

country           India
capital       New Delhi
area              3.286
population         1252
Name: IN, dtype: object 


Row access pandas dataframe

   country    capital   area  population
IN   India  New Delhi  3.286      1252.0 


Multiple Row access

         country    capital   area  population
IN         India  New Delhi  3.286     1252.00
CH         China    Beijing  9.597     1357.00
SA  South Africa   Pretoria  1.221       52.98 


Rows with specific column

         country    capital
RU        Russia     Moscow
IN         India  New Delhi
SA  South Africa   Pretoria
# Row access by position: iloc

print("Row access pandas dataframe\n")
print(brics.iloc[[1]],"\n\n") 

print("Multiple Row access\n")
print(brics.iloc[[1, 2, 4]],"\n\n") 

print("Rows with specific column\n")
print(brics.iloc[[1, 2, 4], :3])
Row access pandas dataframe

   country capital  area  population
RU  Russia  Moscow  17.1       143.5 


Multiple Row access

         country    capital    area  population
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
SA  South Africa   Pretoria   1.221       52.98 


Rows with specific column

         country    capital    area
RU        Russia     Moscow  17.100
IN         India  New Delhi   3.286
SA  South Africa   Pretoria   1.221
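
To make the difference between loc (labels) and iloc (integer positions) concrete, here is a small sketch that rebuilds the brics table from the values printed above, so it runs without the CSV file. The variable names by_label, by_position, label_slice and position_slice are introduced here only for illustration.

```python
import pandas as pd

# Rebuild the brics table from the printed output (no CSV needed)
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.100, 3.286, 9.597, 1.221],
        "population": [200.40, 143.50, 1252.00, 1357.00, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# The same rows and columns, selected once by label and once by position
by_label = brics.loc[["RU", "IN", "SA"], ["country", "capital"]]
by_position = brics.iloc[[1, 2, 4], [0, 1]]
same_selection = by_label.equals(by_position)

# Slices differ: loc includes the end label, iloc excludes the end position
label_slice = brics.loc["RU":"CH"]
position_slice = brics.iloc[1:3]

print(same_selection)
print(list(label_slice.index))
print(list(position_slice.index))
```

The slice behaviour is the detail most likely to surprise: a label slice keeps both endpoints, while a position slice follows the usual Python rule of excluding the end.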

Logic, Control Flow, and Filtering

my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])

# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11

print(np.logical_and(my_house < 11, your_house < 11))
print(my_house)
[False  True False  True]
[False False False  True]
[18.   20.   10.75  9.5 ]
cars = pd.read_csv('data/cars.csv', index_col = 0)

# Extract drives_right column as Series: dr
dr = cars["drives_right"]

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel
print(sel)
     cars_per_cap        country  drives_right
US            809  United States          True
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
car_maniac = cpc > 500
print("Cars dataset more than 500 cars ..\n\n",cars[car_maniac])
Cars dataset more than 500 cars ..

      cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
between = np.logical_and(cpc > 100, cpc < 500)
medium = cars[between]
print("Cars dataset with between 100 and 500 cars per capita..\n\n", medium)
Cars dataset with between 100 and 500 cars per capita..

     cars_per_cap country  drives_right
RU           200  Russia          True
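
The same filters can also be written with the element-wise operators & (and) and | (or) on boolean Series, as an alternative to np.logical_and and np.logical_or. The sketch below rebuilds the cars table from the values shown earlier so that it runs without the CSV file; the name right_or_many is made up for this example.

```python
import pandas as pd

# Rebuild the cars table from the values shown earlier (no CSV needed)
cars = pd.DataFrame(
    {
        "country": ["United States", "Australia", "Japan", "India",
                    "Russia", "Morocco", "Egypt"],
        "drives_right": [True, False, False, False, True, True, True],
        "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
    },
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"],
)

cpc = cars["cars_per_cap"]

# Parentheses are required: & and | bind more tightly than the comparisons
medium = cars[(cpc > 100) & (cpc < 500)]

# Rows that drive on the right OR have more than 500 cars_per_cap
right_or_many = cars[cars["drives_right"] | (cpc > 500)]

print(list(medium.index))
print(list(right_or_many.index))
```

Note that the plain Python keywords and / or do not work here, because they try to reduce a whole Series to a single True or False.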

Loops

# while loop
x = 10
while x > 0:
    print("x is now: ", x)
    x = x - 2
x is now:  10
x is now:  8
x is now:  6
x is now:  4
x is now:  2
# for loop

# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

for area in areas:
    print(area)
    
print("\nenumerate: looping with index\n")
for index, area in enumerate(areas):
    print("Area ", index, ": ", area)
11.25
18.0
20.0
10.75
9.5

enumerate: looping with index

Area  0 :  11.25
Area  1 :  18.0
Area  2 :  20.0
Area  3 :  10.75
Area  4 :  9.5
# house list of lists
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
         
# Print each room together with its area

for room in house:
    print("the " + room[0] + " is " + str(room[1]) + " sqm" )
the hallway is 11.25 sqm
the kitchen is 18.0 sqm
the living room is 20.0 sqm
the bedroom is 10.75 sqm
the bathroom is 9.5 sqm
# iterate over dictionary

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
          
# Iterate over europe
for country, capital in europe.items():
    print("The capital of ", country, " is ", capital)
The capital of  spain  is  madrid
The capital of  france  is  paris
The capital of  germany  is  berlin
The capital of  norway  is  oslo
The capital of  italy  is  rome
The capital of  poland  is  warsaw
The capital of  austria  is  vienna
# looping over n dimensional numpy array

np_height = np.array([[23, 40, 55], [55, 90, 11]])

for height in np.nditer(np_height):
    print(height)
23
40
55
55
90
11
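
np.nditer yields only the values. When the position of each element is needed as well, np.ndenumerate plays the role that enumerate plays for lists. A small sketch with the same array (the positions list is only there to collect the results):

```python
import numpy as np

np_height = np.array([[23, 40, 55], [55, 90, 11]])

# np.ndenumerate yields ((row, column), value) pairs
positions = []
for (row, col), height in np.ndenumerate(np_height):
    positions.append((row, col, int(height)))
    print("row", row, "column", col, ":", height)
```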
# looping over pandas dataframe

cars = pd.read_csv('data/cars.csv', index_col = 0)

# Iterate over rows of cars
for lab, row in cars.iterrows() :
    print(lab, ": ",row['cars_per_cap'] )
US :  809
AUS :  731
JAP :  588
IN :  18
RU :  200
MOR :  70
EG :  45
# adding a new column
cars["Name_length"] = cars['country'].apply(len)
print(cars)
     cars_per_cap        country  drives_right  Name_length
US            809  United States          True           13
AUS           731      Australia         False            9
JAP           588          Japan         False            5
IN             18          India         False            5
RU            200         Russia          True            6
MOR            70        Morocco          True            7
EG             45          Egypt          True            5
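
For comparison, here are three ways to compute the same name-length column: an explicit iterrows loop, apply(len) as used above, and the vectorized .str.len() accessor. This is a sketch on a three-row subset of the cars data, and the column names are made up for the example.

```python
import pandas as pd

# Small subset of the cars data, built inline so the example is self-contained
cars = pd.DataFrame(
    {"country": ["United States", "Australia", "Japan"],
     "cars_per_cap": [809, 731, 588]},
    index=["US", "AUS", "JAP"],
)

# 1) Explicit loop over rows: works, but slow on large tables
cars["len_loop"] = [len(row["country"]) for _, row in cars.iterrows()]

# 2) apply: calls len once per element of the column
cars["len_apply"] = cars["country"].apply(len)

# 3) Vectorized string method: no Python-level loop in user code
cars["len_str"] = cars["country"].str.len()

print(cars)
```

All three columns hold the same values. On large tables the loop is usually the slowest option, so apply or .str.len() is generally preferred.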

Case Study: Hacker Statistics

Random Walk

A random walk is a mathematical object, known as a stochastic or random process, that describes a path that consists of a succession of random steps on some mathematical space such as the integers.

np.random.seed(123)

random_walk = [0]

for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)

    if dice <= 2:
        # use max so the position can't go below 0
        step = max(0,step - 1)
    elif dice <= 5:
        step = max(0, step + 1)
    else:
        step = max(0, step + np.random.randint(1,7))

    random_walk.append(step)

print(random_walk)
[0, 3, 4, 5, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1, 0, 0, 1, 6, 5, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 8, 9, 10, 11, 12, 11, 15, 16, 15, 16, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 28, 33, 34, 38, 39, 38, 39, 40, 39, 40, 41, 43, 44, 45, 44, 43, 44, 45, 44, 43, 44, 45, 47, 46, 45, 46, 45, 46, 47, 48, 50, 49, 50, 51, 52, 53, 54, 53, 52, 53, 52, 53, 54, 53, 56, 57, 58, 59, 58, 59, 60]
# visualization random walk
plt.plot(random_walk)
plt.title("Random walk")
plt.show()

# histogram of the positions visited during the random walk
plt.hist(random_walk, bins = 15)
plt.title("Random walk")
plt.show()

# Distribution

# initialize and populate all_walks
all_walks = []
for i in range(10) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)

# Convert all_walks to Numpy array: np_aw
np_aw = np.array(all_walks)

# Plot np_aw and show: each row is one walk, but matplotlib draws one line per column (101 lines of 10 points)
plt.plot(np_aw)
plt.show()

# Clear the figure
plt.clf()

# Transpose np_aw: np_aw_t
np_aw_t = np_aw.T

# Plot np_aw_t and show: after transposing, each column is one walk (10 lines of 101 points)
plt.plot(np_aw_t)
plt.show()
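
With many simulated walks, the end positions can be summarized to estimate how likely a given outcome is. The following is a sketch under stated assumptions: the helper simulate_walk, the choice of 500 walks and the threshold of 60 are not from the notebook above, and it uses np.random.default_rng rather than np.random.seed, so the numbers will differ from the output shown earlier.

```python
import numpy as np

def simulate_walk(rng, n_steps=100):
    """Return the positions of one walk, using the same dice rules as the loop above."""
    walk = [0]
    for _ in range(n_steps):
        step = walk[-1]
        dice = rng.integers(1, 7)  # uniform integer from 1 to 6
        if dice <= 2:
            step = max(0, step - 1)  # position can't go below 0
        elif dice <= 5:
            step = step + 1
        else:
            step = step + rng.integers(1, 7)
        walk.append(int(step))
    return walk

rng = np.random.default_rng(123)  # hypothetical seed, chosen for reproducibility

# 500 walks (an arbitrary choice); each row of all_walks is one walk
all_walks = np.array([simulate_walk(rng) for _ in range(500)])

# The last column holds the end position of every walk
ends = all_walks[:, -1]

print("mean end position:", ends.mean())
print("share of walks ending at 60 or higher:", np.mean(ends >= 60))
```

A histogram of ends (plt.hist(ends)) would show the full distribution of end positions.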
