Pandas DataFrame - drop_duplicates() function
The Pandas DataFrame drop_duplicates() function returns a DataFrame with duplicate rows removed.
Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Parameters
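subset | Optional. Column label or sequence of labels used to identify duplicates. By default, all columns are used.
keep | Optional. Determines which duplicates (if any) to keep. Possible values are 'first' (drop duplicates except for the first occurrence), 'last' (drop duplicates except for the last occurrence), and False (drop all duplicates). Default is 'first'.
inplace | Optional. If True, modifies the DataFrame in place and returns None. If False, returns a new DataFrame and leaves the original unchanged. Default is False.
ignore_index | Optional. If True, the resulting axis will be labeled 0, 1, …, n - 1. Default is False.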
Return Value
Returns a DataFrame with duplicate rows removed, or None if inplace=True.
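The difference between the two return values can be sketched as follows. This is a minimal illustration on a small made-up DataFrame; the variable names deduped and result are chosen here for the example only.

```python
import pandas as pd

# Small illustrative DataFrame; rows 0 and 1 are identical
df = pd.DataFrame({
    "Name": ["John", "John", "Kim"],
    "Age": [25, 25, 30],
})

# Default (inplace=False): a new DataFrame is returned, df is unchanged
deduped = df.drop_duplicates()
print(len(df), len(deduped))   # 3 2

# inplace=True: df itself is modified and the call returns None
result = df.drop_duplicates(inplace=True)
print(result)                  # None
print(len(df))                 # 2
```

Because the in-place call returns None, assigning its result back to df (df = df.drop_duplicates(inplace=True)) would discard the DataFrame.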
Example: using drop_duplicates()
In the example below, a DataFrame df is created. The drop_duplicates() function is used to drop duplicate rows from this DataFrame.
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Kim", "Kim", "Kim"],
    "Age": [25, 25, 25, 30, 30],
    "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# displaying the dataframe
print(df, "\n")

# removes duplicate rows based on all columns
print("df.drop_duplicates() returns:")
print(df.drop_duplicates(), "\n")
The output of the above code will be:
   Name  Age Country
0  John   25      UK
1  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN

df.drop_duplicates() returns:
   Name  Age Country
0  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN
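Example: using the subset parameter
By using the subset parameter, we can specify which columns are compared when identifying duplicate rows. Rows that match on those columns are treated as duplicates even if they differ in other columns. Consider the example below: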
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Kim", "Kim", "Kim"],
    "Age": [25, 25, 25, 30, 30],
    "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# displaying the dataframe
print(df, "\n")

# using the 'Name' and 'Age' columns
# to identify duplicate rows
print("df.drop_duplicates(subset=['Name', 'Age']) returns:")
print(df.drop_duplicates(subset=['Name', 'Age']), "\n")
The output of the above code will be:
   Name  Age Country
0  John   25      UK
1  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN

df.drop_duplicates(subset=['Name', 'Age']) returns:
   Name  Age Country
0  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
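The subset parameter also accepts a single column label instead of a list. The following sketch reuses the same DataFrame and compares rows on 'Name' only; the variable name by_name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Kim", "Kim", "Kim"],
    "Age": [25, 25, 25, 30, 30],
    "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# a single column label: rows are compared on 'Name' only,
# so the first row for each name is kept (index 0 and index 2)
by_name = df.drop_duplicates(subset="Name")
print(by_name)
```

With the default keep='first', only the first John row and the first Kim row remain.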
Example: using the keep parameter
By using the keep parameter, we can specify which row of each group of duplicates to keep. Consider the example below:
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Kim", "Kim", "Kim"],
    "Age": [25, 25, 25, 30, 30],
    "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# displaying the dataframe
print(df, "\n")

# keeping first duplicate row
print("df.drop_duplicates(subset=['Name', 'Age'], keep='first') returns:")
print(df.drop_duplicates(subset=['Name', 'Age'], keep='first'), "\n")

# keeping last duplicate row
print("df.drop_duplicates(subset=['Name', 'Age'], keep='last') returns:")
print(df.drop_duplicates(subset=['Name', 'Age'], keep='last'), "\n")
The output of the above code will be:
   Name  Age Country
0  John   25      UK
1  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA
4   Kim   30     JPN

df.drop_duplicates(subset=['Name', 'Age'], keep='first') returns:
   Name  Age Country
0  John   25      UK
2   Kim   25     USA
3   Kim   30     FRA

df.drop_duplicates(subset=['Name', 'Age'], keep='last') returns:
   Name  Age Country
1  John   25      UK
2   Kim   25     USA
4   Kim   30     JPN
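Two options are not exercised by the examples above: keep=False, which drops every row that has a duplicate, and ignore_index=True, which relabels the result from 0. A minimal sketch on the same DataFrame follows; the variable names no_repeats and relabeled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "John", "Kim", "Kim", "Kim"],
    "Age": [25, 25, 25, 30, 30],
    "Country": ["UK", "UK", "USA", "FRA", "JPN"]
})

# keep=False: drop all rows that share 'Name' and 'Age' with another row;
# only index 2 (Kim, 25, USA) is unique on those columns
no_repeats = df.drop_duplicates(subset=["Name", "Age"], keep=False)
print(no_repeats, "\n")

# ignore_index=True: the surviving rows (originally 1, 2 and 4)
# are relabeled 0, 1, 2
relabeled = df.drop_duplicates(subset=["Name", "Age"], keep="last",
                               ignore_index=True)
print(relabeled)
```

Relabeling is convenient when the original index carries no meaning, since later positional and label-based lookups then agree.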