Pandas is a Python library that provides high-performance, easy-to-use data structures and data manipulation tools designed to make data cleaning and analysis fast and convenient. It is often used together with other libraries such as NumPy, scikit-learn, and matplotlib.
Data structures
Series
A Series is a one-dimensional, array-like object in Pandas, similar to a list or 1D array but with a labeled axis (the index). It can hold any data type and is commonly used to work with a single column of data. You can access, modify, and perform operations on individual elements or subsets of the data.
An index labels each element and can be customized. By default, the index consists of the integers 0 through N - 1, where N is the length of the data. Example:
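As a rough illustration of accessing, modifying, and operating on a Series, the sketch below uses made-up values; the variable names are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical sample data with custom labels
prices = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

second = prices["b"]            # access one element by label
first = prices.iloc[0]          # access one element by position
subset = prices[["a", "c"]]     # select a subset by a list of labels

prices["d"] = 45                # modify one element in place
doubled = prices * 2            # element-wise (vectorized) operation
total = prices.sum()            # aggregate over all elements

print(second, first, total)
print(subset)
print(doubled)
```

Label-based access with square brackets and position-based access with .iloc are kept separate here so it stays clear which lookup is being used.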
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)
# output:
# a    10
# b    20
# c    30
# d    40
# dtype: int64
DataFrame
A DataFrame (df) is a two-dimensional, labeled data structure that can hold data of different types in different columns. It is similar to a table or spreadsheet. A DataFrame consists of rows and columns, where each column is a Series.
A DataFrame can be created in several ways, but one of the most common methods is using a dictionary containing lists or NumPy arrays of the same length. Example:
data = {
    "student": ["Lia", "Jack", "Alina", "Amanda", "Anton"],
    "age": [19, 20, 20, 21, 22],
    "grade": [78, 89, 86, 98, 91],
}
df = pd.DataFrame(data)
print(df)
# output:
#   student  age  grade
# 0     Lia   19     78
# 1    Jack   20     89
# 2   Alina   20     86
# 3  Amanda   21     98
# 4   Anton   22     91
DataFrame attributes
Attributes are properties of a DataFrame that expose its data or information about it. The main attributes are:
- .shape returns the dimensions of the df as a tuple: (number of rows, number of columns).
- .columns returns the column labels (names) of the df as an Index object.
- .index returns the row labels of the df.
- .dtypes returns the data type of each column in the df.
- .size returns the total number of elements in the df (rows * columns).
- .values returns the data in the df as a NumPy array, without the row and column labels.
- .T returns the transposed df (rows and columns swapped).
- .empty returns a boolean indicating whether the df is empty.
- .axes returns a list containing the row labels and the column labels.
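The attributes listed above can be tried on a small DataFrame. This is a minimal sketch that reuses the student data from the earlier example; the printed form of some attributes may vary between Pandas versions, so only the stable results are noted in comments.

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Lia", "Jack", "Alina", "Amanda", "Anton"],
    "age": [19, 20, 20, 21, 22],
    "grade": [78, 89, 86, 98, 91],
})

print(df.shape)           # (5, 3): 5 rows, 3 columns
print(df.size)            # 15 elements in total
print(list(df.columns))   # ['student', 'age', 'grade']
print(df.index)           # default integer row labels 0..4
print(df.dtypes)          # one data type per column
print(df.values.shape)    # the underlying array has the same shape as the df
print(df.T.shape)         # (3, 5) after transposing
print(df.empty)           # False, because the df has data
print(len(df.axes))       # 2: row labels and column labels
```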
DataFrame methods
Methods operate on the object's data and usually return a result. The main methods and related functions are:
- head() returns the first n rows of the df (by default n=5).
- tail() returns the last n rows of the df (by default n=5).
- info() prints a summary of the df, including the number of non-null entries, data types, and memory usage.
- describe() generates descriptive statistics for numeric columns, such as mean, standard deviation, and percentiles.
- sort_values() sorts the df by the values of one or more columns.
- sort_index() sorts the df by its row labels.
- drop() removes the specified columns or index labels.
- groupby() groups the df by one or more columns or index levels so that aggregate functions (e.g., sum, mean) can be applied to each group.
- agg() applies one or more aggregate functions to one or more columns.
- apply() applies a function along a specified axis (rows or columns) of the df.
- merge() combines two dfs based on a common column or index (similar to SQL joins).
- pd.concat() combines two or more dfs along rows or columns. It is a top-level Pandas function rather than a DataFrame method.
- pd.read_csv() reads data from a CSV file into a df. Like the other read functions, it is a top-level Pandas function.
- pd.read_excel() reads data from an Excel file into a df.
- pd.read_sql() reads the result of a SQL query or a database table into a df.
- to_csv() writes the df to a CSV file.
- to_excel() writes the df to an Excel file.
- to_sql() writes the df to a table in a SQL database.
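A few of these methods are sketched below on the student data used earlier. The second table (clubs) and all variable names are hypothetical and exist only for this illustration; file and database I/O is left out because it depends on external resources.

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Lia", "Jack", "Alina", "Amanda", "Anton"],
    "age": [19, 20, 20, 21, 22],
    "grade": [78, 89, 86, 98, 91],
})

# Sort by grade (highest first) and keep the first two rows
top2 = df.sort_values("grade", ascending=False).head(2)

# Group by age and aggregate the grades of each group
mean_by_age = df.groupby("age")["grade"].mean()
stats_by_age = df.groupby("age")["grade"].agg(["min", "max"])

# Remove a column (returns a new df; the original is unchanged)
no_age = df.drop(columns=["age"])

# Apply a function to each numeric column: range = max - min
ranges = df[["age", "grade"]].apply(lambda col: col.max() - col.min())

# Hypothetical second table for a SQL-style left join on "student"
clubs = pd.DataFrame({"student": ["Lia", "Anton"], "club": ["chess", "robotics"]})
merged = df.merge(clubs, on="student", how="left")

# Stack two dfs on top of each other with the top-level concat function
stacked = pd.concat([df, df], ignore_index=True)

print(top2)
print(mean_by_age)
print(merged)
```

The left join keeps every student and leaves the club missing where there is no match, which is one common choice; an inner join would keep only the matching rows.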
Functionality
Slicing
Slicing in Pandas DataFrames is similar to slicing Python lists or NumPy arrays, but it offers more functionality: you can extract specific rows, columns, or subsets of the df based on index positions, labels, or conditions.
- Row slicing: by index positions with .iloc[] and by labels with .loc[]. Examples: df.iloc[2:5], df.loc["row1":"row2"]. Note that .iloc[] excludes the end position, while .loc[] includes the end label.
- Column slicing: columns can also be sliced using .iloc[] and .loc[], or by directly accessing column names. Examples: df.iloc[:, 2:5], df.loc[:, "col1":"col2"], df[["col1", "col2"]].
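The slicing forms above, including selection by condition, might look like this in practice. The sketch assumes a small df indexed by student name; the "year" column is a hypothetical addition so that column ranges have something to span.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [19, 20, 20, 21, 22],
        "grade": [78, 89, 86, 98, 91],
        "year": [1, 2, 2, 3, 4],   # hypothetical extra column
    },
    index=["Lia", "Jack", "Alina", "Amanda", "Anton"],
)

rows_by_position = df.iloc[1:3]              # positions 1 and 2 (end excluded)
rows_by_label = df.loc["Jack":"Amanda"]      # Jack through Amanda (end included)

cols_by_position = df.iloc[:, 0:2]           # first two columns
cols_by_label = df.loc[:, "grade":"year"]    # grade through year (end included)
cols_by_name = df[["age", "year"]]           # explicit list of column names

high_grades = df[df["grade"] > 88]           # rows that satisfy a condition

print(rows_by_position)
print(rows_by_label)
print(high_grades)
```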

Missing values
Missing data refers to a value that is not recorded for a particular variable in an observation. Methods to identify missing values:
- .isna() returns True / False for each value, with True marking a missing value.
- .isnull() is an alias of .isna().
- .isna().sum() counts the missing values per column.
- .isna().mean() * 100 gives the percentage of missing values per column.
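A short sketch of these checks on made-up data, where np.nan stands in for values that were not recorded:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
df = pd.DataFrame({
    "age": [19, np.nan, 20, 21],
    "grade": [78, 89, np.nan, np.nan],
})

mask = df.isna()                        # True where a value is missing
missing_count = df.isna().sum()         # missing values per column
missing_pct = df.isna().mean() * 100    # share of missing values per column, in percent

print(mask)
print(missing_count)
print(missing_pct)
```

Taking the mean of the boolean mask works because True counts as 1 and False as 0.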
Dropping rows or columns with any missing values:
- .dropna() drops rows that contain any missing value.
- .dropna(axis=1) drops columns that contain any missing value.
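The difference between the two calls can be seen on a small example with a single gap. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only Jack's age is missing
df = pd.DataFrame({
    "student": ["Lia", "Jack", "Alina"],
    "age": [19, np.nan, 20],
    "grade": [78, 89, 86],
})

rows_kept = df.dropna()          # removes Jack's row
cols_kept = df.dropna(axis=1)    # removes the whole "age" column

print(rows_kept)
print(cols_kept)
```

Both calls return a new df and leave the original untouched, so the result has to be assigned to a name to be kept.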
Filling missing values:
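- .fillna(0) fills missing values with a constant.
- .fillna(method="ffill") forward fills, propagating the last valid value forward. Recent Pandas versions deprecate the method argument in favor of .ffill().
- .fillna(method="bfill") backward fills, propagating the next valid value backward. The equivalent newer form is .bfill().
- df[column].fillna(df[column].mean()) fills the missing values of a column with that column's mean.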
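The filling strategies can be compared side by side on one column. This sketch uses made-up grades and the .ffill() / .bfill() methods, which do the same as fillna(method="ffill") and fillna(method="bfill") without relying on the method argument that recent Pandas versions deprecate.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two gaps
df = pd.DataFrame({"grade": [78.0, np.nan, 86.0, np.nan, 91.0]})

constant = df.fillna(0)       # gaps become 0
forward = df.ffill()          # gaps take the previous valid value
backward = df.bfill()         # gaps take the next valid value

# Fill one column with its own mean (missing values are skipped when averaging)
mean_filled = df["grade"].fillna(df["grade"].mean())

print(forward)
print(backward)
print(mean_filled)
```

Which strategy fits depends on the data: forward and backward fill suit ordered data such as time series, while a constant or the mean is a simpler default for unordered rows.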
