Welcome back to the Polars Series, in this blog post, we'll learn how to apply polars for data analysis using different Joins. This is Part III of the series, and in the previous two posts, we discussed the following functions groupby, drop, null, schema, describe, reading the file, and so on.
Check out the previous two blog post here Part I and Part II
In this blog post, we'll get into a more crucial function for data analysis i.e. Joins and GroupBy with Aggregation.
Let's jump into the blog post.
Importing Polars and creating simple Polars dataframe
import polars as pl
person = pl.DataFrame(
{
"id" : [1, 2, 3, 4, 5, 6, 7, 8],
"name": ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
"city": ['CH', 'BG', 'CH', 'CH', 'BG', 'BG', 'BG', 'CH'],
"age": [23, 27, 25, 25, 26, 23, 31, 29]
}
)
department = pl.DataFrame(
{
"id" : [1, 2, 3, 4, 5, 8, 9, 10],
"dpt_name": ['PA', 'PB', 'PA', 'PA', 'PB', 'PB', 'PC', 'PC'],
"dpt_salary": [4500, 7000, 4500, 5000, 9000, 8000, 1000, 1500]
}
)From the above dataframes, we notice some missing ids in both the dataframes.


There are various types of join as follows:
Simple Join includes (inner, left, right, outer, full)
Cross Join
Semi Join
Anti Join
Simple Join
Default join is Inner Join
"""Simple Join on Two Dataframes"""
person.join(department, on='id')
"""Simple Join on Two Dataframes with parameters as inner"""
person.join(department, on='id', how='inner')
Left Join
"""Left Join to return all data"""
person.join(department, on='id', how='left')
In the left join, the 'id' in the left dataframe (person in this case) matches the id of the right dataframe if the id doesn't match from left to right, like id 6 and 7. We then add null values for the corresponding 'id' in the right dataframe.
Right Join
"""Right Join to return all data"""
person.join(department, on='id', how='right')
Now, similar to left join. Here, we perform the right join with the right dataframe as the department. If the id doesn't match from right to left, like id 9 and 10. We then add null values for the corresponding 'id' in the left dataframe.
Full Join
"""Get all rows"""
person.join(department, on='id', how='full')Now consider if you want to join both dataframes even if the id doesn't match and also not missing out any rows after the join, then we can perform a full join.

Now, we can notice in full join, all rows from both data frames are presented.
Cross Join
Consider this simple dataframe.
df_colors = pl.DataFrame(
{
"color": ["red", "blue", "green"],
}
)
df_sizes = pl.DataFrame(
{
"size": ["S", "M", "L"],
}
)
df_colors.join(df_sizes, how='cross')
It is similar to performing all the combinations of two dataframes. Here, we had two dataframes, one containing the colors & other containing the size. Now, we can join them, to get all the combinations of colors and sizes.
Semi Join
Consider two dataframes, one containing the cars & other containing the repairs.
# Car and Its Repair Dataframe
df_cars = pl.DataFrame(
{
"id": ["a", "b", "c"],
"make": ["ford", "toyota", "bmw"],
}
)
df_repairs = pl.DataFrame(
{
"id": ["c", "c"],
"cost": [100, 200],
}
)
"""Which car got repaired?"""
df_cars.join(df_repairs, on='id' ,how='semi')
Though we have two records in df_repairs belonging to BMW, semi-join acts like a filter based on the second dataframe. The 'id=c' in df_cars gets matched with the first row of df_repairs. For
When semi-join is applied, it returns the rows from one table for which there are matching records in another related table.
Incorrect but an example where semi-join tries to match value to value between dataframes.
df_cars = pl.DataFrame(
{
"id": ["a", "b", "c", "c"],
"make": ["ford", "toyota", "bmw", "tesla"],
}
)
df_repairs = pl.DataFrame(
{
"id": ["c", "c"],
"cost": [100, 200],
}
)
"""Which car got repaired?"""
df_cars.join(df_repairs, on='id' ,how='semi')
Now, two 'c' values in the left dataframe matched with the two 'c' in the right dataframe. But it is incorrect because the id value must be unique.
Anti Join
"""Which cars didn't get repaired?"""
df_cars.join(df_repairs, on='id' ,how='anti')
It is the opposite of semi-join, where all unmatched id are represented after the join.
So we are at the end of the blog post for part III. In the next blog post, we'll continue with Polars for more complex queries including Join, filter, and groupby.