Pandas is a Python library that provides high-performance, easy-to-use data structures and data manipulation tools that make data cleaning and analysis fast and convenient. It is often used together with other libraries such as NumPy, scikit-learn, and matplotlib.

Data structures

Series

A Series is a one-dimensional array-like object in Pandas, similar to a list or 1D array, but with a labeled axis (the index). It can hold any data type and is commonly used for working with a single column of data. You can access, modify, and perform operations on individual elements or subsets of the data.

An index labels each element and can be customized. By default, the index consists of the integers 0 through N - 1, where N is the length of the data. Example:

import pandas as pd

# Create a Series with custom string labels instead of the default 0..N-1
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)

#output:
# a    10
# b    20
# c    30
# d    40
# dtype: int64
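
To illustrate the default index and the access, modify, and operate steps described above, here is a minimal sketch. The variable names (default_s, first, subset, doubled) are made up for this illustration.

```python
import pandas as pd

# Without an explicit index, labels default to the integers 0..N-1
default_s = pd.Series([10, 20, 30, 40])
print(list(default_s.index))  # [0, 1, 2, 3]

# With custom labels, elements can be read and changed by label
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
first = s["a"]          # access a single element by label
s["b"] = 25             # modify an element in place
subset = s[["a", "c"]]  # select a subset with a list of labels
doubled = s * 2         # element-wise arithmetic keeps the labels
print(doubled)
```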

DataFrame

A DataFrame (df) is a two-dimensional, labeled data structure that can hold data of various types across different columns. It's similar to a table or spreadsheet. A DataFrame consists of rows and columns, where each column is a Series.

A DataFrame can be created in several ways, but one of the most common methods is using a dictionary containing lists or NumPy arrays of the same length. Example:

data = {
  "student": ["Lia", "Jack", "Alina", "Amanda", "Anton"],
  "age": [19, 20, 20, 21, 22],
  "grade": [78, 89, 86, 98, 91]
}
df = pd.DataFrame(data)
print(df)

#output:
#    student  age  grade
# 0      Lia   19     78
# 1     Jack   20     89
# 2    Alina   20     86
# 3   Amanda   21     98
# 4    Anton   22     91
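
The same constructor also accepts NumPy arrays, and selecting one column gives back a Series. A small sketch, with made-up column names x and y:

```python
import numpy as np
import pandas as pd

# Build a DataFrame from a dictionary of equal-length NumPy arrays
df = pd.DataFrame({
    "x": np.arange(3),                # 0, 1, 2
    "y": np.array([1.5, 2.5, 3.5]),
})

# Selecting a single column returns a Series named after the column
col = df["y"]
print(type(col).__name__)  # Series
print(col.name)            # y
```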

DataFrame attributes

Attributes are properties of a DataFrame that expose its data or information about it. They are accessed without parentheses. The main attributes are:

  • .shape returns the dimensions of the df as a tuple: (number of rows, number of columns).
  • .columns returns the column labels (names) of the df as an index object.
  • .index returns the row labels (indexes) of the df.
  • .dtypes returns the data types of each column in the df.
  • .size returns the total number of elements in the df (rows*columns).
  • .values returns the data in the df as a NumPy array, excluding the row and column labels.
  • .T returns the transposed df (swaps rows and columns).
  • .empty returns a boolean indicating whether the df is empty.
  • .axes returns a list of the row and column labels.
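
As a quick check of a few of these attributes, the sketch below rebuilds the student df from the earlier example and prints them:

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Lia", "Jack", "Alina", "Amanda", "Anton"],
    "age": [19, 20, 20, 21, 22],
    "grade": [78, 89, 86, 98, 91],
})

print(df.shape)          # (5, 3)
print(list(df.columns))  # ['student', 'age', 'grade']
print(df.size)           # 15
print(df.empty)          # False
print(df.T.shape)        # (3, 5)
print(df.dtypes)         # one dtype per column
```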

DataFrame methods

Methods operate on the object's data and usually return a result. The main methods are:

  • head() returns the first n rows of the df (by default n=5).
  • tail() returns the last n rows of the df (by default n=5).
  • info() provides a summary of the df, including the number of non-null entries, data types, and memory usage.
  • describe() generates descriptive statistics for numeric columns, such as mean, standard deviation, and percentiles.
  • sort_values() sorts the df by the values of one or more columns.
  • sort_index() sorts the df by its row labels.
  • drop() removes specified columns or index labels.
  • groupby() groups the df by one or more columns or index levels so that aggregate functions (e.g., sum, mean) can be applied per group.
  • agg() applies one or more aggregate functions to one or more columns.
  • apply() applies a function along a specified axis (rows or columns) of the df.
  • merge() combines two dfs based on a common column or index (similar to SQL joins).
  • concat() combines two or more dfs along rows or columns; it is a top-level function, called as pd.concat([df1, df2]).
  • read_csv() reads data from a CSV file into a df; like the other read_* functions, it is called on the module, e.g. pd.read_csv("file.csv").
  • read_excel() reads data from an Excel file into a df.
  • read_sql() reads the result of a SQL query or a database table into a df.
  • to_csv() writes the df to a CSV file.
  • to_excel() writes the df to an Excel file.
  • to_sql() writes the df to a table in a SQL database.
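
A few of these methods in action, as a minimal sketch on the student df from earlier. The cities and extra tables, and the values in them, are invented purely for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Lia", "Jack", "Alina", "Amanda", "Anton"],
    "age": [19, 20, 20, 21, 22],
    "grade": [78, 89, 86, 98, 91],
})

# Two best grades: sort descending, then take the first two rows
top = df.sort_values("grade", ascending=False).head(2)

# Mean and max grade for each age group
by_age = df.groupby("age")["grade"].agg(["mean", "max"])

# Hypothetical lookup table, joined on the shared "student" column
cities = pd.DataFrame({"student": ["Lia", "Jack"], "city": ["Oslo", "Riga"]})
merged = df.merge(cities, on="student", how="left")

# Hypothetical extra row, appended with the top-level pd.concat function
extra = pd.DataFrame({"student": ["Mia"], "age": [23], "grade": [84]})
combined = pd.concat([df, extra], ignore_index=True)

# Remove a column; the original df is left unchanged
no_age = df.drop(columns=["age"])

print(top)
print(by_age)
```

Most of these calls return a new object rather than changing df in place, so the result has to be assigned to a name to be kept.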

Functionality

Slicing

Slicing in Pandas DataFrames is similar to slicing Python lists or NumPy arrays, but it offers much more functionality. It allows you to extract specific rows, columns, or even subsets of the df based on index positions, labels, or conditions.

  • Row slicing: use .iloc[] for index positions and .loc[] for labels. Examples: df.iloc[2:5], df.loc["row1":"row2"]. Note that .iloc[] excludes the end position, while .loc[] includes the end label.
  • Column slicing: columns can also be sliced with .iloc[] and .loc[], or selected directly by name. Examples: df.iloc[:, 2:5], df.loc[:, "col1":"col2"], df[["col1", "col2"]].
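
The slicing options can be sketched on a small df. Here the student names are used as row labels, which is an assumption made only so that label-based slicing has something to work with.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [19, 20, 20, 21, 22], "grade": [78, 89, 86, 98, 91]},
    index=["Lia", "Jack", "Alina", "Amanda", "Anton"],
)

by_position = df.iloc[1:3]          # positions 1 and 2 (end position excluded)
by_label = df.loc["Jack":"Amanda"]  # Jack, Alina, Amanda (end label included)
cols_pos = df.iloc[:, 0:1]          # first column only, by position
cols_named = df[["grade", "age"]]   # columns picked (and reordered) by name
older = df[df["age"] > 20]          # rows that satisfy a condition

print(by_label)
print(older)
```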
[Figure: slicing options]

Missing values

Missing data refers to a value that is not recorded for a particular variable in an observation. Methods to identify missing values:

  • .isna() returns True / False for each value, with True marking a missing one,
  • .isnull() is an alias for .isna(),
  • .isna().sum() counts missing values per column,
  • .isna().mean()*100 gives the percentage of missing values per column.
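
A minimal sketch of these checks, on a small made-up df with a few NaN values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [19, np.nan, 20, 21],
    "grade": [78, 89, np.nan, np.nan],
})

mask = df.isna()                      # True wherever a value is missing
missing_counts = df.isna().sum()      # age: 1, grade: 2
missing_pct = df.isna().mean() * 100  # age: 25.0, grade: 50.0

print(missing_counts)
print(missing_pct)
```

The percentage works because True counts as 1 and False as 0, so the mean of the mask is the fraction of missing values.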

Dropping rows or columns with any missing values:

  • .dropna() for rows,
  • .dropna(axis=1) for columns.
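
For example, on a small made-up df where two rows and two columns contain a gap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "student": ["Lia", "Jack", "Alina", "Amanda"],
    "age": [19, np.nan, 20, 21],
    "grade": [78, 89, np.nan, 91],
})

rows_kept = df.dropna()        # keeps only rows 0 and 3
cols_kept = df.dropna(axis=1)  # keeps only the "student" column

print(rows_kept)
print(cols_kept)
```

Both calls return a new df and leave the original untouched.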

Filling missing values:

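  • .fillna(0) fills with a constant value,
  • .ffill() forward fill (the older .fillna(method="ffill") spelling is deprecated in recent pandas versions),
  • .bfill() backward fill (likewise replaces .fillna(method="bfill")),
  • df[column].fillna(df[column].mean()) fills one column with its own mean.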
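
The filling strategies can be compared side by side. This sketch uses a made-up grade column and calls .ffill() and .bfill() directly, since the method= argument of fillna() is deprecated in newer pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"grade": [78.0, np.nan, 86.0, np.nan]})

zero_filled = df.fillna(0)  # 78, 0, 86, 0
forward = df.ffill()        # 78, 78, 86, 86
backward = df.bfill()       # 78, 86, 86, NaN (nothing after the last row)
mean_filled = df["grade"].fillna(df["grade"].mean())  # 78, 82, 86, 82

print(forward)
print(mean_filled)
```

Note that a backward fill cannot fill a gap in the last row, and a forward fill cannot fill one in the first row, because there is no neighboring value to copy from.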