Welcome back to the Polars Series, in this blog post, we'll learn how to apply polars for data analysis using different Joins. This is Part III of the series, and in the previous two posts, we discussed the following functions groupby, drop, null, schema, describe, reading the file, and so on.

Check out the previous two blog post here Part I and Part II

In this blog post, we'll get into a more crucial function for data analysis i.e. Joins and GroupBy with Aggregation.

Let's jump into the blog post.

Importing Polars and creating simple Polars dataframe

import polars as pl

person = pl.DataFrame(
    {
        "id" : [1, 2, 3, 4, 5, 6, 7, 8],
        "name": ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
        "city": ['CH', 'BG', 'CH', 'CH', 'BG', 'BG', 'BG', 'CH'],
        "age": [23, 27, 25, 25, 26, 23, 31, 29]
        
    }
)

department = pl.DataFrame(
    {
        "id" : [1, 2, 3, 4, 5, 8, 9, 10],
        "dpt_name": ['PA', 'PB', 'PA', 'PA', 'PB', 'PB', 'PC', 'PC'],
        "dpt_salary": [4500, 7000, 4500, 5000, 9000, 8000, 1000, 1500]
        
    }
)

From the above dataframes, we notice some missing ids in both the dataframes.

None
Person Dataframe .
None
Department Dataframe.

There are various types of join as follows:

Simple Join includes (inner, left, right, outer, full)

Cross Join

Semi Join

Anti Join

Simple Join

Default join is Inner Join

"""Simple Join on Two Dataframes"""

person.join(department, on='id')
None
By default, Polars perform inner join
"""Simple Join on Two Dataframes with parameters as inner"""

person.join(department, on='id', how='inner')
None
Polars Dataframe with inner join

Left Join

"""Left Join to return all data"""

person.join(department, on='id', how='left')
None
Left Join On Id.

In the left join, the 'id' in the left dataframe (person in this case) matches the id of the right dataframe if the id doesn't match from left to right, like id 6 and 7. We then add null values for the corresponding 'id' in the right dataframe.

Right Join

"""Right Join to return all data"""

person.join(department, on='id', how='right')
None
Right Join On id.

Now, similar to left join. Here, we perform the right join with the right dataframe as the department. If the id doesn't match from right to left, like id 9 and 10. We then add null values for the corresponding 'id' in the left dataframe.

Full Join

"""Get all rows"""

person.join(department, on='id', how='full')

Now consider if you want to join both dataframes even if the id doesn't match and also not missing out any rows after the join, then we can perform a full join.

None
Full Join

Now, we can notice in full join, all rows from both data frames are presented.

Cross Join

Consider this simple dataframe.

df_colors = pl.DataFrame(
    {
        "color": ["red", "blue", "green"],
    }
)
df_sizes = pl.DataFrame(
    {
        "size": ["S", "M", "L"],
    }
)
df_colors.join(df_sizes, how='cross')
None
All Cross Rows with Cross Joins.

It is similar to performing all the combinations of two dataframes. Here, we had two dataframes, one containing the colors & other containing the size. Now, we can join them, to get all the combinations of colors and sizes.

Semi Join

Consider two dataframes, one containing the cars & other containing the repairs.

# Car and Its Repair Dataframe

df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "make": ["ford", "toyota", "bmw"],
    }
)

df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
"""Which car got repaired?"""

df_cars.join(df_repairs, on='id' ,how='semi')
None
Semi Join

Though we have two records in df_repairs belonging to BMW, semi-join acts like a filter based on the second dataframe. The 'id=c' in df_cars gets matched with the first row of df_repairs. For

When semi-join is applied, it returns the rows from one table for which there are matching records in another related table.

Incorrect but an example where semi-join tries to match value to value between dataframes.

df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c", "c"],
        "make": ["ford", "toyota", "bmw", "tesla"],
    }
)

df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
"""Which car got repaired?"""

df_cars.join(df_repairs, on='id' ,how='semi')
None

Now, two 'c' values in the left dataframe matched with the two 'c' in the right dataframe. But it is incorrect because the id value must be unique.

Anti Join

"""Which cars didn't get repaired?"""

df_cars.join(df_repairs, on='id' ,how='anti')
None

It is the opposite of semi-join, where all unmatched id are represented after the join.

So we are at the end of the blog post for part III. In the next blog post, we'll continue with Polars for more complex queries including Join, filter, and groupby.

Watch the video tutorial for the polars series here!