Pandas - DataFrame Statistical Functions
Pandas has a number of statistical functions which can be used to understand and analyze the behavior of data. In this section we will discuss a few of these functions.
Function | Description |
---|---|
pct_change() | Returns percentage change between the current and a prior element. |
cov() | Computes pairwise covariance of columns, excluding NA/null values. |
corr() | Computes pairwise correlation of columns, excluding NA/null values. |
rank() | Computes numerical data ranks (1 through n) along axis. |
Let's discuss these functions in detail:
Percentage Change
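The Pandas DataFrame pct_change() function computes the change between each element and a prior element (by default, the immediately preceding one). The result is expressed as a fraction, so 0.5 means an increase of 50%; multiply by 100 to obtain a percentage. This is useful for comparing the rate of change across a time series.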
Syntax
DataFrame.pct_change(periods=1, fill_method='pad', limit=None, freq=None)
Parameters
Parameter | Description |
---|---|
periods | Optional. Number of periods to shift when computing the percentage change. Default: 1. |
fill_method | Optional. How to handle NAs before computing percentage changes. Default: 'pad'. Accepted values are 'backfill', 'bfill', 'pad', 'ffill' and None. 'pad' / 'ffill' fill a gap with the last valid observation; 'backfill' / 'bfill' fill it with the next valid observation. In pandas 2.1 and later, filling through this parameter is deprecated in favor of filling missing values explicitly (for example with ffill()) before calling pct_change(). |
limit | Optional. Maximum number of consecutive NAs to fill before stopping. Default: None. |
freq | Optional. A DateOffset, timedelta, or str giving the increment to use from the time series API (e.g. 'M' or BDay()). Default: None. |
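Because the handling of missing values changes the result, it can help to make the filling step explicit. The sketch below is an illustration rather than part of the original example: it assumes a small hypothetical Series `s` with one gap, and compares calling pct_change() with fill_method=None against forward-filling first.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing observation (2018).
s = pd.Series([2.0, 3.0, 2.0, np.nan, 2.0, 2.0],
              index=["2015", "2016", "2017", "2018", "2019", "2020"])

# No filling: every change that involves the missing value is NaN.
raw = s.pct_change(fill_method=None)

# Explicit forward fill first, then compute the change without further filling.
filled = s.ffill().pct_change(fill_method=None)

print(raw)
print(filled)
```

With no filling, both 2018 and 2019 come out as NaN, because each of them involves the missing 2018 value. After the forward fill they become 0, since the gap is replaced by the 2017 value.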
Example:
In the example below, a DataFrame df is created. The pct_change() function is used to calculate the percentage change of elements of all numerical columns.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "GDP": [1.5, 2.5, 3.5, 1.5, 2.5, -1],
        "GNP": [1, 2, 3, 3, 2, -1],
        "HPI": [2, 3, 2, np.nan, 2, 2]},
        index=["2015", "2016", "2017", "2018", "2019", "2020"]
    )

    print("The DataFrame is:")
    print(df)

    # percentage change of element with period = 1
    print("\ndf.pct_change() returns:")
    print(df.pct_change())

    # percentage change of element with period = 2
    print("\ndf.pct_change(periods=2) returns:")
    print(df.pct_change(periods=2))

The output of the above code will be as follows. The missing 2018 HPI value is forward-filled under the default fill_method='pad', which is why the HPI changes around the gap are reported as numbers rather than NaN:

    The DataFrame is:
          GDP  GNP  HPI
    2015  1.5    1  2.0
    2016  2.5    2  3.0
    2017  3.5    3  2.0
    2018  1.5    3  NaN
    2019  2.5    2  2.0
    2020 -1.0   -1  2.0

    df.pct_change() returns:
               GDP       GNP       HPI
    2015       NaN       NaN       NaN
    2016  0.666667  1.000000  0.500000
    2017  0.400000  0.500000 -0.333333
    2018 -0.571429  0.000000  0.000000
    2019  0.666667 -0.333333  0.000000
    2020 -1.400000 -1.500000  0.000000

    df.pct_change(periods=2) returns:
               GDP       GNP       HPI
    2015       NaN       NaN       NaN
    2016       NaN       NaN       NaN
    2017  1.333333  2.000000  0.000000
    2018 -0.400000  0.500000 -0.333333
    2019 -0.285714 -0.333333  0.000000
    2020 -1.666667 -1.333333  0.000000

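The calculation behind these numbers can be sketched by hand: each element is divided by the element `periods` rows earlier, and 1 is subtracted. The snippet below is a minimal illustration on a hypothetical frame without missing values, so no filling is involved. The names `manual` and `builtin` are introduced here only for the comparison.

```python
import pandas as pd

# Hypothetical data without missing values.
df = pd.DataFrame({"GDP": [1.5, 2.5, 3.5, 1.5],
                   "GNP": [1.0, 2.0, 3.0, 3.0]},
                  index=["2015", "2016", "2017", "2018"])

periods = 2

# Manual computation: divide each element by the one `periods` rows earlier.
manual = df / df.shift(periods) - 1

# Library result; fill_method=None asks pandas not to fill anything.
builtin = df.pct_change(periods=periods, fill_method=None)

print(manual)
print(builtin)
```

The first `periods` rows are NaN in both results because there is no earlier element to compare against.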
Example: using axis=1
To calculate the percentage change row-wise, pass axis=1. Consider the example below:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "2015": [1.5, 1, 2],
        "2016": [2.5, 2, 3],
        "2017": [3.5, 3, 2],
        "2018": [1.5, 3, np.nan],
        "2019": [2.5, 2, 2],
        "2020": [-1, -1, 2]},
        index=["GDP", "GNP", "HDI"]
    )

    print("The DataFrame is:")
    print(df)

    # percentage change of element with period = 1
    print("\ndf.pct_change(axis=1) returns:")
    print(df.pct_change(axis=1))

    # percentage change of element with period = 2
    print("\ndf.pct_change(axis=1, periods=2) returns:")
    print(df.pct_change(axis=1, periods=2))

The output of the above code will be:

    The DataFrame is:
         2015  2016  2017  2018  2019  2020
    GDP   1.5   2.5   3.5   1.5   2.5    -1
    GNP   1.0   2.0   3.0   3.0   2.0    -1
    HDI   2.0   3.0   2.0   NaN   2.0     2

    df.pct_change(axis=1) returns:
         2015      2016      2017      2018      2019  2020
    GDP   NaN  0.666667  0.400000 -0.571429  0.666667  -1.4
    GNP   NaN  1.000000  0.500000  0.000000 -0.333333  -1.5
    HDI   NaN  0.500000 -0.333333  0.000000  0.000000   0.0

    df.pct_change(axis=1, periods=2) returns:
         2015  2016      2017      2018      2019      2020
    GDP   NaN   NaN  1.333333 -0.400000 -0.285714 -1.666667
    GNP   NaN   NaN  2.000000  0.500000 -0.333333 -1.333333
    HDI   NaN   NaN  0.000000 -0.333333  0.000000  0.000000

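One way to read axis=1 is as the column-wise calculation applied to the transposed table. The sketch below, on a hypothetical frame named `wide`, checks that reading by transposing, computing the change down the columns, and transposing back. It is an illustration of the equivalence, not a description of how pandas implements it.

```python
import pandas as pd

# Hypothetical wide table: one row per indicator, one column per year.
wide = pd.DataFrame({"2015": [1.5, 1.0],
                     "2016": [2.5, 2.0],
                     "2017": [3.5, 3.0]},
                    index=["GDP", "GNP"])

# Change along each row, from one year column to the next.
row_wise = wide.pct_change(axis=1, fill_method=None)

# Same computation expressed by transposing, working down the columns,
# and transposing back.
via_transpose = wide.T.pct_change(fill_method=None).T

print(row_wise)
print(via_transpose)
```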
Covariance
The Pandas DataFrame cov() function computes the pairwise covariance of columns and returns the result as a DataFrame holding the covariance matrix. Missing (NA/null) values are excluded pair by pair, so each entry of the matrix uses only the rows in which both columns have data.
Syntax
DataFrame.cov(min_periods=None, ddof=1)
Parameters
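Parameter | Description |
---|---|
min_periods | Optional. An int specifying the minimum number of observations required per pair of columns to have a valid result. Default: None. |
ddof | Optional. Delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements. Default: 1. According to the pandas documentation, this argument applies only when the DataFrame contains no NaN values. |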
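The effect of both parameters can be seen in a small sketch. The frame below is a hypothetical stand-in with one missing GNP value. `strict`, `sample` and `population` are names introduced only for this illustration, and ddof is applied to a NaN-free subset, in line with the caveat about missing values.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame({"GDP": [1.02, 1.03, 1.04, 0.98],
                       "GNP": [1.05, 0.99, np.nan, 1.04],
                       "HDI": [1.02, 1.01, 1.02, 1.03]},
                      index=["Q1", "Q2", "Q3", "Q4"])

# GNP has only three observations, so requiring four per pair
# leaves NaN in every entry that involves GNP.
strict = report.cov(min_periods=4)

# ddof is demonstrated on columns without NaN values.
complete = report[["GDP", "HDI"]]
sample = complete.cov()            # divisor N - 1 (default ddof=1)
population = complete.cov(ddof=0)  # divisor N

print(strict)
print(sample)
print(population)
```

With four observations, the population covariance is the sample covariance scaled by 3/4.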
Example:
In the example below, a DataFrame report is created. The cov() function is used to create a covariance matrix using all numeric columns of the DataFrame.

    import pandas as pd
    import numpy as np

    report = pd.DataFrame({
        "GDP": [1.02, 1.03, 1.04, 0.98],
        "GNP": [1.05, 0.99, np.nan, 1.04],
        "HDI": [1.02, 1.01, 1.02, 1.03]},
        index=["Q1", "Q2", "Q3", "Q4"]
    )

    print(report, "\n")
    print(report.cov())

The output of the above code will be:

         GDP   GNP   HDI
    Q1  1.02  1.05  1.02
    Q2  1.03  0.99  1.01
    Q3  1.04   NaN  1.02
    Q4  0.98  1.04  1.03

              GDP       GNP       HDI
    GDP  0.000692 -0.000450 -0.000167
    GNP -0.000450  0.001033  0.000250
    HDI -0.000167  0.000250  0.000067

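To see where a single entry of the matrix comes from, the GDP/GNP covariance can be recomputed by hand. The sketch below rebuilds the same data, keeps only the quarters where both columns are observed, and applies the sample covariance formula. The variable names `pair`, `dev` and `manual` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame({"GDP": [1.02, 1.03, 1.04, 0.98],
                       "GNP": [1.05, 0.99, np.nan, 1.04],
                       "HDI": [1.02, 1.01, 1.02, 1.03]},
                      index=["Q1", "Q2", "Q3", "Q4"])

# Keep only the quarters where both columns are observed (Q3 is dropped).
pair = report[["GDP", "GNP"]].dropna()

# Sample covariance: sum of products of deviations, divided by N - 1.
dev = pair - pair.mean()
manual = (dev["GDP"] * dev["GNP"]).sum() / (len(pair) - 1)

builtin = report.cov().loc["GDP", "GNP"]
print(manual, builtin)
```

Both values should agree with the -0.000450 entry in the matrix above, up to floating-point rounding.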
Correlation
The Pandas DataFrame corr() function computes the pairwise correlation of columns and returns the result as a DataFrame holding the correlation matrix. Missing (NA/null) values are excluded pair by pair, so each entry of the matrix uses only the rows in which both columns have data.
Syntax
DataFrame.corr(method='pearson', min_periods=1)
Parameters
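Parameter | Description |
---|---|
method | Optional. Method of correlation. Default: 'pearson'. Possible values are 'pearson' (standard correlation coefficient), 'kendall' (Kendall Tau correlation coefficient), 'spearman' (Spearman rank correlation), or a callable that takes two 1-D arrays and returns a float. |
min_periods | Optional. An int specifying the minimum number of observations required per pair of columns to have a valid result. Default: 1. |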
Example:
In the example below, a DataFrame report is created. The corr() function is used to create a correlation matrix using all numeric columns of the DataFrame.

    import pandas as pd
    import numpy as np

    report = pd.DataFrame({
        "GDP": [1.02, 1.03, 1.04, 0.98],
        "GNP": [1.05, 0.99, np.nan, 1.04],
        "HDI": [1.02, 1.01, 1.02, 1.03]},
        index=["Q1", "Q2", "Q3", "Q4"]
    )

    print(report, "\n")
    print(report.corr())

The output of the above code will be:

         GDP   GNP   HDI
    Q1  1.02  1.05  1.02
    Q2  1.03  0.99  1.01
    Q3  1.04   NaN  1.02
    Q4  0.98  1.04  1.03

              GDP       GNP       HDI
    GDP  1.000000 -0.529107 -0.776151
    GNP -0.529107  1.000000  0.777714
    HDI -0.776151  0.777714  1.000000

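The Pearson correlation is the covariance of two columns divided by the product of their standard deviations, all computed on the rows where both columns are present. The sketch below checks that relationship for the GDP/GNP pair. `pair`, `manual` and `builtin` are illustrative names, not part of the original example.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame({"GDP": [1.02, 1.03, 1.04, 0.98],
                       "GNP": [1.05, 0.99, np.nan, 1.04],
                       "HDI": [1.02, 1.01, 1.02, 1.03]},
                      index=["Q1", "Q2", "Q3", "Q4"])

# Restrict to the quarters where both GDP and GNP are observed.
pair = report[["GDP", "GNP"]].dropna()

# Pearson correlation = covariance / (std of x * std of y).
manual = pair["GDP"].cov(pair["GNP"]) / (pair["GDP"].std() * pair["GNP"].std())

builtin = report.corr().loc["GDP", "GNP"]
print(manual, builtin)
```

Both values should match the -0.529107 entry in the matrix above, up to rounding.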
Data Ranking
The Pandas DataFrame rank() function computes numerical data ranks (1 through n) along the specified axis. By default, equal values are assigned the average of the ranks they would otherwise occupy.
Syntax
DataFrame.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
Parameters
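Parameter | Description |
---|---|
axis | Optional. Axis along which to rank. It can be {0 or 'index', 1 or 'columns'}. Default: 0. |
method | Optional. How to rank a group of records that have the same value (a tie). Default: 'average'. Possible values are 'average' (average rank of the group), 'min' (lowest rank in the group), 'max' (highest rank in the group), 'first' (ranks assigned in the order the values appear) and 'dense' (like 'min', but the rank always increases by 1 between groups). |
numeric_only | Optional. Specify True to rank only numeric columns. |
na_option | Optional. How to rank NaN values. Default: 'keep'. Possible values are 'keep' (NaN values receive a NaN rank), 'top' (NaN values receive the lowest rank) and 'bottom' (NaN values receive the highest rank). |
ascending | Optional. Whether the elements should be ranked in ascending order. Default: True. |
pct | Optional. Whether to display the returned rankings in percentile form. Default: False. |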
Example:
The example below demonstrates how this function behaves with the above parameters:
- default_rank: the default behavior, obtained without passing any parameter. Tied values receive the average of their ranks.
- max_rank: obtained with method='max'. Tied values all receive the highest rank in the group (for example, 'x2' and 'x3' occupy the first and second positions, so both are assigned rank 2).
- NA_bottom: obtained with na_option='bottom'. NaN values are assigned the largest rank, which places them at the bottom of the ranking.
- pct_rank: obtained with pct=True. The ranking is expressed as a percentile rank.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "values": [20, 10, 10, np.nan, 30]},
        index=["x1", "x2", "x3", "x4", "x5"]
    )

    print(df, "\n")

    df['default_rank'] = df['values'].rank()
    df['max_rank'] = df['values'].rank(method='max')
    df['NA_bottom'] = df['values'].rank(na_option='bottom')
    df['pct_rank'] = df['values'].rank(pct=True)

    print(df, "\n")

The output of the above code will be:

        values
    x1    20.0
    x2    10.0
    x3    10.0
    x4     NaN
    x5    30.0

        values  default_rank  max_rank  NA_bottom  pct_rank
    x1    20.0           3.0       3.0        3.0     0.750
    x2    10.0           1.5       2.0        1.5     0.375
    x3    10.0           1.5       2.0        1.5     0.375
    x4     NaN           NaN       NaN        5.0       NaN
    x5    30.0           4.0       4.0        4.0     1.000

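The remaining options can be compared in the same way. The sketch below reuses the same five values in a Series and adds columns for method='min', 'dense' and 'first', for na_option='top', and for ascending=False. The column names are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([20, 10, 10, np.nan, 30],
                   index=["x1", "x2", "x3", "x4", "x5"])

ranks = pd.DataFrame({
    "values": values,
    "min_rank": values.rank(method="min"),       # ties share the lowest rank
    "dense_rank": values.rank(method="dense"),   # like 'min', but no gap after ties
    "first_rank": values.rank(method="first"),   # ties ranked in order of appearance
    "NA_top": values.rank(na_option="top"),      # NaN receives the smallest rank
    "descending": values.rank(ascending=False),  # largest value gets rank 1
})

print(ranks)
```

With method='min' the value 20 still gets rank 3, because the two tied values use up ranks 1 and 2. With method='dense' it gets rank 2, since ranks increase by exactly 1 between groups.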