Pandas Dataframe

Basic Concepts

# let's first import all the libraries needed for this tutorial

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The primary data structures in pandas are implemented as two classes:

DataFrame, which you can imagine as a relational data table, with rows and named columns.

Series, which is a single column. A DataFrame contains one or more Series and a name for each Series.

day = ['Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday' ]
first_sell = [100, 120, 310, 400, 90, 29, 30]

# to merge these two lists we will use the zip function

flower_sell = list(zip(day,first_sell))

flower_sell

[('Friday', 100),
 ('Saturday', 120),
 ('Sunday', 310),
 ('Monday', 400),
 ('Tuesday', 90),
 ('Wednesday', 29),
 ('Thursday', 30)]

# Great, we have created our dataset. Let's use pandas to do some magic

df = pd.DataFrame(data = flower_sell, columns=['day', 'sell'] )
# df is for dataframe

df

	day	sell
0	Friday	100
1	Saturday	120
2	Sunday	310
3	Monday	400
4	Tuesday	90
5	Wednesday	29
6	Thursday	30

# we have just created a pandas DataFrame

# let's do the same with pandas Series
day = pd.Series(['Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday' ])
first_sell = pd.Series([100, 120, 310, 400, 90, 29, 30])

flower_sell = pd.DataFrame({'Day': day, 'Sell1': first_sell})

flower_sell

	Day	Sell1
0	Friday	100
1	Saturday	120
2	Sunday	310
3	Monday	400
4	Tuesday	90
5	Wednesday	29
6	Thursday	30

# let's add another column for 2nd sell

flower_sell['Sell2'] = pd.Series([128, 230, 120, 231, 901, 140, 41])

flower_sell

	Day	Sell1	Sell2
0	Friday	100	128
1	Saturday	120	230
2	Sunday	310	120
3	Monday	400	231
4	Tuesday	90	901
5	Wednesday	29	140
6	Thursday	30	41

Accessing Data

flower_sell['Sell1'] 

0    100
1    120
2    310
3    400
4     90
5     29
6     30
Name: Sell1, dtype: int64

flower_sell['Day']

0       Friday
1     Saturday
2       Sunday
3       Monday
4      Tuesday
5    Wednesday
6     Thursday
Name: Day, dtype: object

flower_sell['Sell2'][0:4]

0    128
1    230
2    120
3    231
Name: Sell2, dtype: int64

flower_sell['Day'][::-1]

6     Thursday
5    Wednesday
4      Tuesday
3       Monday
2       Sunday
1     Saturday
0       Friday
Name: Day, dtype: object
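
The bracket-and-slice pattern above can also be written with the explicit indexers .iloc (position-based) and .loc (label-based). The following is a minimal sketch that rebuilds a small copy of the table so it runs on its own. One difference worth knowing: an .iloc slice excludes its end position, while a .loc slice includes its end label.

```python
import pandas as pd

# Stand-in for the flower_sell table built earlier in the tutorial
flower_sell = pd.DataFrame({
    'Day': ['Friday', 'Saturday', 'Sunday', 'Monday',
            'Tuesday', 'Wednesday', 'Thursday'],
    'Sell1': [100, 120, 310, 400, 90, 29, 30],
    'Sell2': [128, 230, 120, 231, 901, 140, 41],
})

# Position-based: rows at positions 0, 1, 2, 3 (position 4 is excluded)
first_four = flower_sell['Sell2'].iloc[0:4]

# Label-based: rows labelled 0 through 3 (label 3 is included)
by_label = flower_sell.loc[0:3, 'Sell2']

# A single cell: row label 2, column 'Day'
sunday = flower_sell.loc[2, 'Day']

print(first_four)
print(sunday)
```

With the default RangeIndex, positions and labels coincide, so both slices select the same four rows here.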

Manipulating Data

flower_sell['Total_Sell'] = flower_sell['Sell1'] + flower_sell['Sell2']

flower_sell

	Day	Sell1	Sell2	Total_Sell
0	Friday	100	128	228
1	Saturday	120	230	350
2	Sunday	310	120	430
3	Monday	400	231	631
4	Tuesday	90	901	991
5	Wednesday	29	140	169
6	Thursday	30	41	71

# Now we can add another column as average sell

flower_sell['average_sell'] = flower_sell['Total_Sell']/2

flower_sell

	Day	Sell1	Sell2	Total_Sell	average_sell
0	Friday	100	128	228	114.0
1	Saturday	120	230	350	175.0
2	Sunday	310	120	430	215.0
3	Monday	400	231	631	315.5
4	Tuesday	90	901	991	495.5
5	Wednesday	29	140	169	84.5
6	Thursday	30	41	71	35.5
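
The same totals and averages can be computed without naming each column in the arithmetic, which helps once there are more than two sell columns. This is a small sketch using sum(axis=1) and mean(axis=1) on a rebuilt copy of the table; the sell_columns list is a helper name introduced here for illustration.

```python
import pandas as pd

# Stand-in for the flower_sell table from this tutorial
flower_sell = pd.DataFrame({
    'Day': ['Friday', 'Saturday', 'Sunday', 'Monday',
            'Tuesday', 'Wednesday', 'Thursday'],
    'Sell1': [100, 120, 310, 400, 90, 29, 30],
    'Sell2': [128, 230, 120, 231, 901, 140, 41],
})

# Hypothetical helper: the columns that hold sales figures
sell_columns = ['Sell1', 'Sell2']

# axis=1 means "work across the columns of each row"
flower_sell['Total_Sell'] = flower_sell[sell_columns].sum(axis=1)
flower_sell['average_sell'] = flower_sell[sell_columns].mean(axis=1)

print(flower_sell)
```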

# Let's save this file as a CSV
# index=False (a boolean, not the string 'False') leaves the row labels out of the file
flower_sell.to_csv('mysell.csv', index=False)
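
To check what actually lands in the file, it helps to write a table and read it straight back. The sketch below does that round trip with a three-row stand-in table. The temporary directory and the file name mysell.csv are assumptions made for the example. Note that index expects the boolean False; a non-empty string such as 'False' is truthy, so the row labels would still be written.

```python
import os
import tempfile

import pandas as pd

# Small stand-in table
flower_sell = pd.DataFrame({
    'Day': ['Friday', 'Saturday', 'Sunday'],
    'Sell1': [100, 120, 310],
})

# Hypothetical location: a file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'mysell.csv')

# Write without the row labels, then read the file back
flower_sell.to_csv(path, index=False)
restored = pd.read_csv(path)

print(restored)
```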

Indexes

Both Series and DataFrame objects also define an index property that assigns an identifier value to each Series item or DataFrame row.

By default, at construction, pandas assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered.

flower_sell.index

RangeIndex(start=0, stop=7, step=1)

flower_sell.reindex([2, 6, 4])

	Day	Sell1	Sell2	Total_Sell	average_sell
2	Sunday	310	120	430	215.0
6	Thursday	30	41	71	35.5
4	Tuesday	90	901	991	495.5
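
To see that index values stay attached to their rows when the data is reordered, here is a short sketch that sorts a rebuilt copy of the table by Sell1. The variable names by_sales and renumbered are introduced only for this example.

```python
import pandas as pd

# Stand-in for the flower_sell table from this tutorial
flower_sell = pd.DataFrame({
    'Day': ['Friday', 'Saturday', 'Sunday', 'Monday',
            'Tuesday', 'Wednesday', 'Thursday'],
    'Sell1': [100, 120, 310, 400, 90, 29, 30],
})

# Reorder the rows: highest Sell1 first. Each row keeps its original label.
by_sales = flower_sell.sort_values('Sell1', ascending=False)
print(by_sales.index.tolist())

# Label 3 still refers to Monday, wherever that row now sits
print(by_sales.loc[3, 'Day'])

# If fresh labels 0..n-1 are wanted, they have to be requested explicitly
renumbered = by_sales.reset_index(drop=True)
print(renumbered.index.tolist())
```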

Working with a Large Dataset

So far we have created a small DataFrame and performed some basic operations on it. Let’s work with a larger amount of data. You can download a dataset using Google Dataset Search. I have a CSV file that I downloaded from www.kaggle.com. We will look into it and perform some operations on it.

# First things first: we need to read the file
# let's specify its location

location = r'C:\Users\ICT_H\Desktop\Machine Learning\File\train1.csv'

home_data = pd.read_csv(location)

# to get summary statistics for the numeric columns, we can run the following command
home_data.describe()

	Id	LotArea	YearBuilt	TotalBsmtSF	BedroomAbvGr	YrSold	SalePrice
count	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000
mean	730.500000	10516.828082	1971.267808	1057.429452	2.866438	2007.815753	180921.195890
std	421.610009	9981.264932	30.202904	438.705324	0.815778	1.328095	79442.502883
min	1.000000	1300.000000	1872.000000	0.000000	0.000000	2006.000000	34900.000000
25%	365.750000	7553.500000	1954.000000	795.750000	2.000000	2007.000000	129975.000000
50%	730.500000	9478.500000	1973.000000	991.500000	3.000000	2008.000000	163000.000000
75%	1095.250000	11601.500000	2000.000000	1298.250000	3.000000	2009.000000	214000.000000
max	1460.000000	215245.000000	2010.000000	6110.000000	8.000000	2010.000000	755000.000000

# to see only the first part of the dataset
home_data.head()

	Id	LotArea	YearBuilt	TotalBsmtSF	BedroomAbvGr	YrSold	SaleType	SalePrice
0	1	8450	2003	856	3	2008	WD	208500
1	2	9600	1976	1262	3	2007	WD	181500
2	3	11250	2001	920	3	2008	WD	223500
3	4	9550	1915	756	3	2006	WD	140000
4	5	14260	2000	1145	4	2008	WD	250000

# You can specify how many rows you want to display. By default it's 5

home_data.head(10) # I want to display 10 rows

	Id	LotArea	YearBuilt	TotalBsmtSF	BedroomAbvGr	YrSold	SaleType	SalePrice
0	1	8450	2003	856	3	2008	WD	208500
1	2	9600	1976	1262	3	2007	WD	181500
2	3	11250	2001	920	3	2008	WD	223500
3	4	9550	1915	756	3	2006	WD	140000
4	5	14260	2000	1145	4	2008	WD	250000
5	6	14115	1993	796	1	2009	WD	143000
6	7	10084	2004	1686	3	2007	WD	307000
7	8	10382	1973	1107	3	2009	WD	200000
8	9	6120	1931	952	2	2008	WD	129900
9	10	7420	1939	991	2	2008	WD	118000

# How about looking at the end of our dataset? We can do so as follows

home_data.tail()

	Id	LotArea	YearBuilt	TotalBsmtSF	BedroomAbvGr	YrSold	SaleType	SalePrice
1455	1456	7917	1999	953	3	2007	WD	175000
1456	1457	13175	1978	1542	3	2010	WD	210000
1457	1458	9042	1941	1152	4	2010	WD	266500
1458	1459	9717	1950	1078	2	2010	WD	142125
1459	1460	9937	1965	1256	3	2008	WD	147500

# we can visualize a particular column as well

home_data.hist('SalePrice')

[Figure: histogram of the SalePrice column]

saleprice = home_data['SalePrice']

NumPy is a popular toolkit for scientific computing. pandas Series can be used as arguments to most NumPy functions:

np.log(saleprice) # to get the natural logarithm of saleprice

0       12.247694
1       12.109011
2       12.317167
3       11.849398
4       12.429216
5       11.870600
6       12.634603
7       12.206073
8       11.774520
9       11.678440
10      11.771436
11      12.751300
12      11.877569
13      12.540758
14      11.964001
15      11.790557
16      11.911702
17      11.407565
18      11.976659
19      11.842229
20      12.692503
21      11.845103
22      12.345835
23      11.774520
24      11.944708
25      12.454104
26      11.811547
27      12.631340
28      12.242887
29      11.134589
          ...    
1430    12.165980
1431    11.875831
1432    11.074421
1433    12.136187
1434    11.982929
1435    12.066811
1436    11.699405
1437    12.885671
1438    11.916389
1439    12.190959
1440    12.160029
1441    11.913713
1442    12.644328
1443    11.703546
1444    12.098487
1445    11.767568
1446    11.969717
1447    12.388394
1448    11.626254
1449    11.429544
1450    11.820410
1451    12.567551
1452    11.884489
1453    11.344507
1454    12.128111
1455    12.072541
1456    12.254863
1457    12.493130
1458    11.864462
1459    11.901583
Name: SalePrice, Length: 1460, dtype: float64

saleprice.apply(lambda val: val > 100000)

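0        True
1        True
2        True
3        True
4        True
5        True
6        True
7        True
8        True
9        True
10       True
11       True
12       True
13       True
14       True
15       True
16       True
17      False
18       True
19       True
20       True
21       True
22       True
23       True
24       True
25       True
26       True
27       True
28       True
29      False
        ...  
1430     True
1431     True
1432    False
1433     True
1434     True
1435     True
1436     True
1437     True
1438     True
1439     True
1440     True
1441     True
1442     True
1443     True
1444     True
1445     True
1446     True
1447     True
1448     True
1449    False
1450     True
1451     True
1452     True
1453    False
1454     True
1455     True
1456     True
1457     True
1458     True
1459     True
Name: SalePrice, Length: 1460, dtype: bool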
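
For a simple comparison like this one, the lambda is not strictly needed: comparison operators work on a whole Series at once and give the same True/False result. The sketch below uses a made-up five-value sample in place of home_data['SalePrice'] (the 90000 entry is invented so that one value falls below the threshold) and also shows the resulting boolean Series used as a filter.

```python
import pandas as pd

# Hypothetical sample standing in for home_data['SalePrice']
saleprice = pd.Series([208500, 181500, 90000, 140000, 250000],
                      name='SalePrice')

# Element by element, as in the tutorial
via_apply = saleprice.apply(lambda val: val > 100000)

# The same test applied to the whole Series in one expression
vectorized = saleprice > 100000

# A boolean Series can be used to keep only the matching rows
expensive = saleprice[vectorized]

print(vectorized)
print(expensive)
```

apply remains useful when the per-value logic cannot be expressed with built-in operators or NumPy functions.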

Dealing with missing data

Let’s create a pandas dataframe with missing data

name = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'])

price = pd.Series([10, 20, 15])

missing_data = pd.DataFrame({'Name': name, 'Price': price}) 

missing_data

	Name	Price
0	a	10.0
1	b	20.0
2	c	15.0
3	d	NaN
4	e	NaN
5	f	NaN
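
The NaN values appear because pandas lines up the Series by index label when building the DataFrame, and any label missing from one Series is filled with NaN. The column also becomes float64, since NaN is a floating-point value. Here is a minimal sketch of that alignment using shorter made-up Series; price_by_label and shifted are names introduced only for this example.

```python
import pandas as pd

name = pd.Series(['a', 'b', 'c', 'd'])   # labels 0, 1, 2, 3
price = pd.Series([10, 20])              # labels 0, 1 only

# Labels 2 and 3 have no price, so they get NaN
aligned = pd.DataFrame({'Name': name, 'Price': price})
print(aligned)

# Alignment follows labels, not position: give the prices labels 2 and 3
price_by_label = pd.Series([10, 20], index=[2, 3])
shifted = pd.DataFrame({'Name': name, 'Price': price_by_label})
print(shifted)
```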

missing_data['Price'].isna()

0    False
1    False
2    False
3     True
4     True
5     True
Name: Price, dtype: bool

# we can fill missing values with the fillna() method

missing_data['Price'].fillna(0) # to fill with 0

0    10.0
1    20.0
2    15.0
3     0.0
4     0.0
5     0.0
Name: Price, dtype: float64

missing_data['Price'].fillna('missing')

0         10
1         20
2         15
3    missing
4    missing
5    missing
Name: Price, dtype: object

We can’t build a model on data with missing values. There are several ways to deal with missing values when building a model; I will discuss them in a future post. To learn more about pandas, visit: https://pandas.pydata.org/pandas-docs/stable/cookbook.html#missing-data
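
As a small preview, two common options are to drop the rows that have a missing value or to fill the gaps with a statistic such as the column mean. The sketch below rebuilds the missing_data table from this section and tries both. Which option is appropriate depends on the data and the model, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Same content as the missing_data table above
missing_data = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Price': [10, 20, 15, np.nan, np.nan, np.nan],
})

# Option 1: keep only the rows where Price is present
complete_rows = missing_data.dropna(subset=['Price'])

# Option 2: replace missing prices with the mean of the known ones
# (mean() skips NaN by default)
mean_price = missing_data['Price'].mean()
filled = missing_data['Price'].fillna(mean_price)

print(complete_rows)
print(filled)
```

Both calls return new objects; missing_data itself is left unchanged unless the result is assigned back.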