- Python : 10 Ways to Filter Pandas DataFrame
- Filter pandas dataframe by column value
- Filter Pandas Dataframe by Row and Column Position
- Filter pandas dataframe by rows position and column names
- Selecting multiple values of a column
- Select rows whose column value does not equal a specific value
- How to negate the whole condition
- Select Non-Missing Data in Pandas Dataframe
- Filtering String in Pandas Dataframe
- Handle space in column name while filtering
- How to filter data without using pandas package
- Pandas: How to Use LIKE inside query()
- Example 1: Find Rows that Contain One Pattern
- Example 2: Find Rows that Contain One of Several Patterns
Python : 10 Ways to Filter Pandas DataFrame
In this article, we will cover various methods to filter a pandas dataframe in Python. Data filtering is one of the most frequent data manipulation operations. It is similar to the WHERE clause in SQL, or to the filters you may have used in MS Excel to select specific rows based on conditions. Python performs filtering and aggregation efficiently thanks to its excellent pandas package for data wrangling. Pandas is built on top of the numpy package, which is written in C, a low-level language. Hence data manipulation with pandas is a fast and smart way to handle large datasets.
- Select all the active customers whose accounts were opened after 1st January 2019
- Extract details of all the customers who made more than 3 transactions in the last 6 months
- Fetch information of employees who spent more than 3 years in the organization and received the highest rating in the past 2 years
- Analyze complaints data and identify customers who filed more than 5 complaints in the last year
- Extract details of metro cities where per capita income is greater than 40K dollars
Make sure the pandas package is already installed before running the following code. You can check it by running `!pip show pandas` in the IPython console. If it is not installed, you can install it with the command `!pip install pandas`.
We are going to use a dataset containing details of flights departing from NYC in 2013. This dataset has 336776 rows and 16 columns (see column names below). To import the dataset, we use the read_csv( ) function from the pandas package.
```
['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time', 'arr_delay',
 'carrier', 'tailnum', 'flight', 'origin', 'dest', 'air_time', 'distance',
 'hour', 'minute']
```
```python
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/JackyP/testing/master/datasets/nycflights.csv",
                 usecols=range(1, 17))
```
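As a quick sanity check (assuming the download succeeded), the dimensions should match the counts quoted above:

```python
# Verify the dimensions quoted above: 336776 rows, 16 columns
print(df.shape)   # (336776, 16)
```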
Filter pandas dataframe by column value
Select flight details of JetBlue Airways, which has the two-letter carrier code B6, with origin from JFK airport.
```python
newdf = df[(df.origin == "JFK") & (df.carrier == "B6")]
```
```
newdf.head()
Out[23]:
    year  month  day  dep_time  ...  air_time  distance  hour  minute
3   2013      1    1     544.0  ...     183.0      1576   5.0    44.0
8   2013      1    1     557.0  ...     140.0       944   5.0    57.0
10  2013      1    1     558.0  ...     149.0      1028   5.0    58.0
11  2013      1    1     558.0  ...     158.0      1005   5.0    58.0
15  2013      1    1     559.0  ...      44.0       187   5.0    59.0

[5 rows x 16 columns]
```
- The filtered data (after subsetting) is stored in a new dataframe called newdf.
- The symbol & refers to an AND condition, which means meeting both criteria.
- The expression (df.origin == "JFK") & (df.carrier == "B6") returns True/False: True where the condition matches and False where it does not. It is then passed inside df[...] and returns all the rows corresponding to True. Here it returns 4166 rows, as shown in the sketch below.
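To make the boolean-mask mechanics explicit, here is a minimal sketch (the intermediate variable name mask is ours, and df is assumed to be the flights dataframe loaded above):

```python
# Build the boolean mask first, then use it to subset the dataframe
mask = (df.origin == "JFK") & (df.carrier == "B6")
print(mask.sum())   # number of True values; 4166 per the text
newdf = df[mask]
```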
In the pandas package, there are multiple ways to perform filtering. The above code can also be written as shown below. This method is elegant and more readable, and you don't need to mention the dataframe name every time you specify columns (variables).
```python
newdf = df.query('origin == "JFK" & carrier == "B6"')
```
loc is an abbreviation of the term location. All three methods return the same output; they are just different ways of filtering rows.
```python
newdf = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]
```
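If you want to confirm that the three approaches are interchangeable, a quick check with DataFrame.equals works (a sketch, assuming df is loaded as above):

```python
# All three filtering styles should produce identical dataframes
newdf1 = df[(df.origin == "JFK") & (df.carrier == "B6")]
newdf2 = df.query('origin == "JFK" & carrier == "B6"')
newdf3 = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]
print(newdf1.equals(newdf2) and newdf2.equals(newdf3))   # True
```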
Filter Pandas Dataframe by Row and Column Position
Suppose you want to select specific rows by their position (let's say from the second through the fifth row). We can use df.iloc[ ] for this.
Indexing in Python starts from zero. df.iloc[0:5,] refers to the first through fifth rows (the endpoint, the sixth row at position 5, is excluded). df.iloc[0:5,] is equivalent to df.iloc[:5,].
```python
df.iloc[:5,]      # First 5 rows
df.iloc[1:5,]     # Second to fifth row
df.iloc[5,0]      # Sixth row, 1st column
df.iloc[1:5,0]    # Second to fifth row, first column
df.iloc[1:5,:5]   # Second to fifth row, first 5 columns
df.iloc[2:7,1:3]  # Third to seventh row, 2nd and 3rd columns
```
loc selects rows based on index labels, whereas iloc selects rows based on position in the index, so it only takes integers. Let's create sample data for illustration.
```python
import numpy as np

# Column values reconstructed from the output shown below
x = pd.DataFrame({"col1": np.arange(1, 20, 2)},
                 index=[9, 8, 7, 6, 0, 1, 2, 3, 4, 5])
```
```
   col1
9     1
8     3
7     5
6     7
0     9
1    11
2    13
3    15
4    17
5    19
```
```python
x.iloc[0:5]
```
Output:
```
   col1
9     1
8     3
7     5
6     7
0     9
```
```python
x.loc[0:5]
```
Output:
```
   col1
0     9
1    11
2    13
3    15
4    17
5    19
```
This is because loc does not select rows by index position. It considers only the labels of the index, which can be letters as well, and it includes both the start and end points. Refer to the example below.
```python
# Data reconstructed from the output shown below (assuming 'd' maps to 4)
x = pd.DataFrame({"col1": [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])
x.loc['a':'c']   # equivalent to x.iloc[0:3]
```
Output:
```
   col1
a     1
b     2
c     3
```
Filter pandas dataframe by rows position and column names
df.index returns the index labels. df.index[0:5] is required instead of 0:5 (without df.index) because index labels are not always in sequence starting from 0; they can start from any number or even be letters, as shown in the iloc vs. loc comparison above.
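The original snippet for this section is not shown above, so here is a minimal sketch of the idea (assuming the flights dataframe df; the origin and dest columns are chosen just for illustration):

```python
# Select the first five rows by position, but the columns by name
newdf = df.loc[df.index[0:5], ["origin", "dest"]]
print(newdf)
```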
Selecting multiple values of a column
```python
# Long way
newdf = df.loc[(df.origin == "JFK") | (df.origin == "LGA")]

# Smart way
newdf = df[df.origin.isin(["JFK", "LGA"])]
```
| implies an OR condition, meaning any of the conditions holds True. isin( ) is similar to the IN operator in SAS and R: it can take many values and apply an OR condition. Make sure you specify the values in a list [ ].
Select rows whose column value does not equal a specific value
In this example, we exclude all flight details where the origin is JFK. != means NOT EQUAL TO.
```python
newdf = df.loc[(df.origin != "JFK") & (df.carrier == "B6")]
```
Let's check whether the above line of code works correctly by looking at the unique values of the column origin in newdf.
```python
pd.unique(newdf.origin)
```
Output:
```
['LGA', 'EWR']
```
How to negate the whole condition
The tilde ~ negates the whole condition, returning the rows where the combined condition is False:

```python
newdf = df[~((df.origin == "JFK") & (df.carrier == "B6"))]
```
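As a quick sanity check (a sketch, assuming df is loaded as above), the negated filter should leave no rows that satisfy both original conditions:

```python
# No remaining row should be both a JFK departure and a B6 flight
assert ((newdf.origin == "JFK") & (newdf.carrier == "B6")).sum() == 0
```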
Select Non-Missing Data in Pandas Dataframe
With the notnull() function, you can exclude or remove NA and NaN values. In the example below, we remove missing values from the origin column. Since this dataframe does not contain any blank values in that column, you would find the same number of rows in newdf.
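The code for this step is missing above; here is a minimal sketch of the idea (assuming the flights dataframe df):

```python
# Keep only the rows where origin is not missing
newdf = df[df.origin.notnull()]
print(len(newdf) == len(df))   # True here, since origin has no missing values per the text
```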
Filtering String in Pandas Dataframe
Text data is generally considered tricky to handle, but Python makes it easier when dealing with character or string columns. Let's prepare some fake data as an example.
```python
import pandas as pd

# Data reconstructed from the output shown below
df = pd.DataFrame({"var1": ["AA_2", "B_1", "C_2", "A_2"]})
```
Output:
```
   var1
0  AA_2
1   B_1
2   C_2
3   A_2
```
By using .str, you can apply string functions to a column of a pandas dataframe. str[0] means the first letter.
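The accompanying snippet is missing above; here is a minimal sketch of filtering on the first letter (assuming the fake df just created):

```python
# Keep rows where the first letter of var1 is 'A'
df[df['var1'].str[0] == 'A']
#    var1
# 0  AA_2
# 3   A_2
```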
The contains( ) function is similar to the LIKE statement in SQL and SAS. You can subset data by passing a pattern to contains( ).
```python
df[df['var1'].str.contains('A|B')]
```
Output:
```
   var1
0  AA_2
1   B_1
3   A_2
```
Handle space in column name while filtering
Let's rename the column var1 to var 1 (with a space in between). We can rename it by using the rename function.
```python
df.rename(columns={"var1": "var 1"}, inplace=True)
```
```python
newdf = df.query("`var 1` == 'AA_2'")
```
Backticks are supported from version 0.25 of the pandas package. Run `!pip show pandas` in the console to check your pandas version. If you have a version prior to 0.25, you can upgrade it with `!pip install --upgrade pandas --user`.
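If you cannot upgrade, plain boolean indexing handles column names with spaces on any pandas version (a sketch, assuming the renamed df above):

```python
# Bracket notation works with spaces in column names, no backticks needed
newdf = df[df["var 1"] == "AA_2"]
```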
How to filter data without using pandas package
Warning: The methods shown below for filtering are not efficient. The main objective is to show how to do subsetting without the pandas package. In a live project, you should use pandas' built-in functions (query( ), loc[ ], iloc[ ]), which are explained above.
We don't need to create a dataframe to store the data. We can keep it in a list data structure. lst_df contains the flights data imported from the CSV file.
```python
import csv
import requests

response = requests.get('https://dyurovsky.github.io/psyc201/data/lab2/nycflights.csv').text
lines = response.splitlines()
d = csv.DictReader(lines)
lst_df = list(d)
```
Lambda is an alternative way of defining a user-defined function: it lets you define a function in a single line of code.
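For instance, a minimal illustration of the lambda syntax before applying it to the flights data (the name is_jfk_b6 is hypothetical, chosen for this example):

```python
# A lambda is an anonymous one-line function
is_jfk_b6 = lambda x: x["origin"] == "JFK" and x["carrier"] == "B6"
print(is_jfk_b6({"origin": "JFK", "carrier": "B6"}))   # True
```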
```python
l1 = list(filter(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', lst_df))
```
```python
# The pandas equivalent, applying the same lambda row-wise
newdf = df[df.apply(lambda x: x["origin"] == 'JFK' and x["carrier"] == 'B6', axis=1)]
```
A list comprehension is an alternative to the lambda function and makes the code more readable. Detailed tutorial: List Comprehension.
```python
l2 = [x for x in lst_df if x["origin"] == 'JFK' and x["carrier"] == 'B6']
```
```python
# The pandas equivalent, collecting matching row positions with iterrows()
newdf = df.iloc[[index for index, row in df.iterrows()
                 if row['origin'] == 'JFK' and row['carrier'] == 'B6']]
```
```python
# A tiny class-based filter: evaluates a query string against each record.
# Note: this shadows Python's built-in filter and relies on eval, so it is
# for illustration only.
class filter:
    def __init__(self, l, query):
        self.output = []
        for data in l:
            if eval(query):
                self.output.append(data)

l3 = filter(lst_df, 'data["origin"] == "JFK" and data["carrier"] == "B6"').output
```
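All three non-pandas approaches should select the same rows (a sketch, assuming the snippets above have run; 4166 matches per the pandas result earlier in the article):

```python
# Each approach should find the same number of matching flights
print(len(l1), len(l2), len(l3))
```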
Pandas: How to Use LIKE inside query()
You can use the following methods to use LIKE (similar to SQL) inside a pandas query() function to find rows that contain a particular pattern:
Method 1: Find Rows that Contain One Pattern
```python
df.query('my_column.str.contains("pattern1")')
```
Method 2: Find Rows that Contain One of Several Patterns
```python
df.query('my_column.str.contains("pattern1|pattern2")')
```
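One caveat: depending on your setup, the default numexpr engine may reject .str methods inside query(). If you hit an error, passing engine='python' (a documented query() parameter) is a common workaround:

```python
# Fall back to the Python engine if the default engine rejects .str methods
df.query('my_column.str.contains("pattern1")', engine='python')
```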
The following examples show how to use each method in practice with the following pandas DataFrame:
```python
import pandas as pd

# create DataFrame
df = pd.DataFrame({'team': ['Cavs', 'Heat', 'Mavs', 'Mavs', 'Nets',
                            'Heat', 'Cavs', 'Jazz', 'Jazz', 'Hawks'],
                   'points': [3, 3, 4, 5, 4, 7, 8, 7, 12, 14],
                   'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4]})

# view DataFrame
print(df)
```
```
    team  points  rebounds
0   Cavs       3        15
1   Heat       3        14
2   Mavs       4        14
3   Mavs       5        10
4   Nets       4         8
5   Heat       7        14
6   Cavs       8        13
7   Jazz       7         9
8   Jazz      12         5
9  Hawks      14         4
```
Example 1: Find Rows that Contain One Pattern
The following code shows how to use the query() function to find all rows in the DataFrame that contain “avs” in the team column:
```python
df.query('team.str.contains("avs")')
```
```
   team  points  rebounds
0  Cavs       3        15
2  Mavs       4        14
3  Mavs       5        10
6  Cavs       8        13
```
Each row that is returned contains “avs” somewhere in the team column.
Also note that this syntax is case-sensitive. Thus, if we used "AVS" instead, we would receive no results, because no row contains uppercase "AVS" in the team column.
Example 2: Find Rows that Contain One of Several Patterns
The following code shows how to use the query() function to find all rows in the DataFrame that contain “avs” or “eat” in the team column:
```python
df.query('team.str.contains("avs|eat")')
```
```
   team  points  rebounds
0  Cavs       3        15
1  Heat       3        14
2  Mavs       4        14
3  Mavs       5        10
5  Heat       7        14
6  Cavs       8        13
```
Each row that is returned contains either “avs” or “eat” somewhere in the team column.
Note: The | operator stands for "or" in pandas. Feel free to use as many of these operators as you'd like to search for even more string patterns.