Pandas is a library in Python that makes it easy to work with datasets. It provides tools to organize, analyze and clean up datasets. It helps us to sort, filter and make calculations on that data so that we can get the information we need.

Pandas is widely used in data analysis and data science projects. It is powerful yet easy to use, which makes it a popular choice among data scientists and analysts.

In this article, we will go through some widely used functions in the pandas library for data analysis and data manipulation.

Let's get started!

1. read_csv()

This function reads a CSV file and converts it into a pandas data frame. It takes the file path as an argument and returns a data frame.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

The data frame represents the data in the CSV file in a tabular format, with rows and columns.
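If we don't have a data.csv file at hand, we can still try read_csv() by passing it a file-like object instead of a path. The snippet below is a minimal sketch: the CSV text and its column names (name, age, city) are made up for illustration and are not part of any real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """name,age,city
Alice,30,Paris
Bob,25,Berlin
Cara,35,Madrid
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (3, 3): three rows, three columns
print(list(df.columns))  # ['name', 'age', 'city']
```

The first line of the CSV is used as the header row, so it becomes the column names rather than a row of data.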

2. head()

This function returns the first n rows of a data frame. It takes an optional argument n that specifies the number of rows to return; the default value is 5.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.head()

This function is useful when we want to take a quick look at the data without having to display the entire data frame.
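To see the effect of the n argument, here is a small sketch that uses a made-up ten-row data frame (the column name value is only for illustration).

```python
import pandas as pd

# Hypothetical data frame with the values 0 to 9
df = pd.DataFrame({"value": range(10)})

default_rows = df.head()   # no argument, so the first 5 rows
first_three = df.head(3)   # only the first 3 rows

print(len(default_rows))              # 5
print(first_three["value"].tolist())  # [0, 1, 2]
```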

3. tail()

This function returns the last n rows of a data frame. It also takes an optional argument n that specifies the number of rows to return; the default value is 5.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.tail()
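As with head(), we can pass n explicitly. A minimal sketch on a made-up ten-row data frame (the column name value is hypothetical):

```python
import pandas as pd

# Hypothetical data frame with the values 0 to 9
df = pd.DataFrame({"value": range(10)})

last_two = df.tail(2)  # only the last 2 rows

print(last_two["value"].tolist())  # [8, 9]
print(len(df.tail()))              # 5, the default
```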

4. describe()

The describe() function generates descriptive statistics for the numerical columns in a data frame.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.describe()

It returns a new data frame that contains summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for all the numerical columns in the original data frame.
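The sketch below shows what the result looks like on a tiny made-up data frame with one numeric column (score) and one text column (label). Both column names are assumptions used only for this example.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column
df = pd.DataFrame({
    "score": [10, 20, 30, 40],
    "label": ["a", "b", "c", "d"],
})

stats = df.describe()

print(list(stats.columns))         # ['score'], the text column is skipped
print(list(stats.index))           # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(stats.loc["mean", "score"])  # 25.0
```

Since the result is itself a data frame, we can pick out a single statistic with loc[], as in the last line.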

5. columns

The columns attribute holds the names of the columns in a data frame.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.columns

This attribute returns an Index object which can be used to retrieve the column names of the data frame.
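Here is a short sketch of two common things we can do with that Index object: turn it into a plain Python list and check whether a column exists. The column names name and age are made up for illustration.

```python
import pandas as pd

# Hypothetical data frame with two columns
df = pd.DataFrame({"name": ["Alice"], "age": [30]})

column_names = df.columns.tolist()  # convert the Index to a plain list
has_age = "age" in df.columns       # membership test on the Index

print(column_names)  # ['name', 'age']
print(has_age)       # True
```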

6. info()

The info() function prints a quick overview of the data frame, including the number of rows and columns and the data type of each column.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.info()

This function prints the number of non-null entries in each column and the data type of each column; it does not return a value. It can help us understand the structure of the data and check if the data types of columns are correct.
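Because info() writes its summary out instead of returning it, one way to keep the text is to pass a buffer through the buf parameter. This is a minimal sketch on a made-up data frame with one missing value; the column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical data frame with one missing name
df = pd.DataFrame({
    "name": ["Alice", "Bob", None],
    "age": [30, 25, 35],
})

buffer = io.StringIO()
result = df.info(buf=buffer)  # writes the summary into the buffer
summary = buffer.getvalue()

print(result)   # None, since info() has no return value
print(summary)  # the same text info() would normally print
```

In the summary, the name column shows 2 non-null entries while age shows 3, which is a quick way to spot missing data.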

7. loc[] and iloc[]

These are used for data frame indexing. loc[] accesses a group of rows and columns by labels or a boolean array.

iloc[] accesses a group of rows and columns by integer position.

Example:

# Example of using loc
df.loc[:, 'column_name']  # returns all rows of a specific column
df.loc[:, ['column1', 'column2']]  # returns all rows of multiple specific columns

# Example of using iloc
df.iloc[:, 0]  # returns all rows of the first column
df.iloc[:, 0:2]  # returns all rows of the first two columns

Both of these indexers are useful for selecting specific rows and columns from a data frame: loc[] works with row and column labels, while iloc[] works with integer positions.
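The difference is easier to see when the row labels are not plain numbers. The sketch below uses a made-up data frame with the row labels a, b and c (all names and values are assumptions for illustration) and also shows row selection and a boolean filter.

```python
import pandas as pd

# Hypothetical data frame with string row labels
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Cara"], "age": [30, 25, 35]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "age"]  # row labelled 'b', column 'age'
by_position = df.iloc[1, 1]    # second row, second column

label_slice = df.loc["a":"b"]  # label slices include both ends: 2 rows
position_slice = df.iloc[0:1]  # position slices exclude the end: 1 row

# A boolean array selects the rows where the condition is True
older = df.loc[df["age"] > 28, "name"].tolist()

print(by_label, by_position)                  # 25 25
print(len(label_slice), len(position_slice))  # 2 1
print(older)                                  # ['Alice', 'Cara']
```

Note that slicing behaves differently in the two indexers: loc[] includes the end label, while iloc[] follows the usual Python rule and excludes the end position.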

8. sort_values()

This function sorts a data frame by one or more columns. It takes a required argument by, which specifies the column(s) to sort by, and an optional boolean argument ascending, which specifies whether the sort order should be ascending (the default) or descending.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.sort_values(by='column_name')

This function returns a new data frame sorted by the specified column(s).
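Here is a minimal sketch of a descending sort on a made-up data frame (the columns name and age are hypothetical). It also confirms that the original data frame keeps its order.

```python
import pandas as pd

# Hypothetical data frame
df = pd.DataFrame({"name": ["Alice", "Bob", "Cara"], "age": [30, 25, 35]})

# Sort by age, largest first
sorted_df = df.sort_values(by="age", ascending=False)

print(sorted_df["name"].tolist())  # ['Cara', 'Alice', 'Bob']
print(df["name"].tolist())         # ['Alice', 'Bob', 'Cara'], unchanged
```

To sort by several columns, we can pass a list to by, and optionally a list of booleans of the same length to ascending.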

9. groupby()

The groupby() function groups the rows of a data frame by one or more columns so that we can perform various operations on the grouped data.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.groupby('column_name').mean(numeric_only=True)  # numeric_only=True skips non-numeric columns

This function takes one or more column names as arguments and groups the data frame by the specified column(s). The grouped result can then be combined with aggregation functions such as mean, sum, count, etc.
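To make the grouping concrete, the sketch below uses a small made-up sales table (the columns city and sales are assumptions for illustration). It selects one column after grouping and then applies one or several aggregations.

```python
import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Berlin"],
    "sales": [10, 20, 5],
})

# Total sales per city
totals = df.groupby("city")["sales"].sum()

# Several aggregations at once with agg()
summary = df.groupby("city")["sales"].agg(["mean", "count"])

print(totals["Paris"])               # 30
print(totals["Berlin"])              # 5
print(summary.loc["Paris", "mean"])  # 15.0
```

Selecting the column before aggregating keeps the result focused on the values we care about and avoids trying to average text columns.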

10. drop()

The drop() function removes one or more rows or columns from a data frame.

Example:

import pandas as pd
df = pd.read_csv('data.csv')

df.drop(columns=['column_name'])  # drops a specific column from the DataFrame

This function takes a columns or index argument that specifies the columns or rows to drop. It returns a new data frame with the specified columns or rows removed.
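The sketch below shows both uses on a made-up data frame (column names and values are hypothetical): dropping a column with columns and dropping a row with index.

```python
import pandas as pd

# Hypothetical data frame with the default row labels 0, 1, 2
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara"],
    "age": [30, 25, 35],
    "city": ["Paris", "Berlin", "Madrid"],
})

without_city = df.drop(columns=["city"])  # remove a column
without_first = df.drop(index=[0])        # remove the row labelled 0

print(list(without_city.columns))      # ['name', 'age']
print(without_first["name"].tolist())  # ['Bob', 'Cara']
print(df.shape)                        # (3, 3), the original is unchanged
```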

Conclusion

That's all for this article. We discussed some of the most useful functions in the pandas library for data manipulation and analysis.

Hope you liked it. Thanks for reading!
