When we collect data from a source, it is called primary data or raw data. All raw data has the potential to become information. You can read my article to learn more about information and knowledge.

When we collect data from data exhaust, we get an enormous amount of raw data. Data exhaust refers to unconventional data, the trail of data left by Internet or computer users during their online activity and transactions. This includes geospatial data, network data, time series, website visits, clicked links, and even mouse hovers, stored in the form of cookies, temporary files, log files, saved preferences, and more. After processing, some of this data can be used for predictive analytics, improving the user interface and layout design, and so on.

This raw data may contain a lot of noise and duplicates, may be unformatted or arrive in multiple formats, often has no pattern, and lacks consistency and completeness. To extract information or insights from it for business decisions, we have to make sure the data is clean and that data quality checks are performed. Otherwise, it can lead to incorrect insights, which eventually lead to incorrect business decisions. Data is only valuable once it has been cleaned and its quality verified.

Data cleaning

Data cleaning, also known as data cleansing or data scrubbing, is the process of converting dirty or messy data into clean data that can be fed to machine learning models or used for data analysis. Not every dataset requires every cleaning step. The essential data cleaning steps are:

Removing Duplicates:

Duplicate data occurs when data is scraped or collected via different means, when a customer submits redundant forms, when data is blended from different sources, or when data is transferred between systems. Duplicate records increase storage costs. But before labeling a record as a duplicate, it should be a complete carbon copy, because partial duplicates may have business value. For example, if we are analyzing customers based on the number of transactions per month, the customer details will be the same; only the transaction, transaction amount, and other transaction-related details will differ. So partial duplicates should be analyzed against the business objective.

To demonstrate this in Python, I am using scraped YouTube data for Data Science videos; the dataset can be accessed from here. In pandas, the duplicated() function returns a Series of True and False values describing which rows in the DataFrame are duplicated.
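A minimal sketch of this check, using a small synthetic frame in place of the scraped dataset (the column names Channel_Name and Video_Title follow the ones used later in this article):

```python
import pandas as pd

# A small stand-in for the scraped YouTube dataset.
df = pd.DataFrame({
    "Channel_Name": ["A", "B", "A", "B", "C"],
    "Video_Title": ["Intro", "EDA", "Intro", "EDA", "Stats"],
})

# duplicated() returns a boolean Series: True for every row that is
# an exact copy of an earlier row.
mask = df.duplicated()
print(mask.sum())   # number of fully duplicated rows (here: 2)
print(df[mask])     # the duplicated rows themselves
```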


Here duplicated() has returned True for two rows out of 173, which indicates that each of these two rows has one other carbon copy.

Partial duplicates can be identified by passing a subset to duplicated(). To find partial duplicates on Channel_Name, we can use df[df.duplicated(subset="Channel_Name")].
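A short sketch of the subset check on a toy frame, assuming the same Channel_Name column:

```python
import pandas as pd

df = pd.DataFrame({
    "Channel_Name": ["A", "B", "A", "C"],
    "Video_Title": ["t1", "t2", "t3", "t4"],
})

# With subset=, only the listed column(s) are compared, so rows that share
# a Channel_Name count as partial duplicates even if other columns differ.
partial = df[df.duplicated(subset="Channel_Name")]
print(len(partial))   # 1: the second "A" row
```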


Here in this dataset, we have 67 partial duplicates for Channel_Name.

To remove duplicates, we can use drop_duplicates() with keep set to "first", "last", or False, which keeps the first occurrence and drops the rest, keeps the last occurrence and drops the rest, or drops all duplicated rows entirely, respectively. The default for keep is "first".
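The three keep options can be compared on a tiny example:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2, 3]})

print(len(df.drop_duplicates(keep="first")))  # 3: first of each id kept
print(len(df.drop_duplicates(keep="last")))   # 3: last of each id kept
print(len(df.drop_duplicates(keep=False)))    # 1: every duplicated id dropped
```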


Here result_df has only 171 rows, which means the two duplicate records were removed.

Remove Irrelevant Observations/Data:

If the business objective is to analyze data within a timeframe, data outside that timeframe can be removed. Likewise, if our data has columns like full name, first name, last name, middle name, and user ID, we can keep the full name and remove the other three name columns because they add no significance to the target. By removing irrelevant observations and/or columns, the resulting data is more efficient to work with and carries minimal distraction from the business target.

Fix Structural Errors:

Structural errors come in different forms: scraped data can contain '\n' characters; a form field for city may contain an abbreviation, the full name, or an old name of the city; and Not Applicable may be written as NA, N/A, or No. These kinds of structural errors should be addressed in the data cleaning process.

Here in the dataset, we can see '\n' in Video_Title. We use replace() to clean Video_Title. We can also check whether any other column contains '\n' and, if so, clean that column too.


It is good practice to apply strip() after removing the '\n' characters so that the extra spaces are trimmed. Here we can see only Video_Title contains '\n's.
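A sketch of this cleanup on hypothetical titles:

```python
import pandas as pd

df = pd.DataFrame({"Video_Title": ["Intro to Pandas\n ", "EDA basics\n"]})

# Remove the newline characters, then strip the leftover whitespace.
df["Video_Title"] = df["Video_Title"].str.replace("\n", "", regex=False).str.strip()

# Check every text column for any remaining '\n' characters.
for col in df.select_dtypes(include="object"):
    print(col, df[col].str.contains("\n").any())
```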

Full Python code can be accessed from here.

Formatting:

A date can appear in different formats when data is collected from different sources, such as 9-12-2021, 9th December 2021, 9th Dec 2021, 9/12/2021, or 9/12/21. Converting them all to a single format is a mandatory step before analyzing the data or feeding it to ML models.
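One way to sketch this normalization, assuming day-first numeric dates like the ones above (the separator and two-digit-year handling here are illustrative choices, not the only approach):

```python
import pandas as pd

# Hypothetical raw values in a few of the formats mentioned above.
raw = pd.Series(["9-12-2021", "9/12/2021", "9/12/21"])

# Normalize the separator, then try day-first formats with 4- and 2-digit years.
cleaned = raw.str.replace("-", "/", regex=False)

def parse(value):
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return pd.to_datetime(value, format=fmt)
        except ValueError:
            continue
    return pd.NaT  # leave unparseable values as missing

dates = cleaned.map(parse)
print(dates.dt.strftime("%Y-%m-%d").tolist())  # all three become 2021-12-09
```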

Filter unwanted outliers:

A common rule of thumb treats an observation as an outlier when it lies more than two standard deviations from the mean. But if we have an outlier, we should not discard it without proper analysis. If the file sizes of a few files are outliers, chances are these files contain duplicate values or irrelevant observations that were not handled. So the first step is to check and analyze the file content rather than discarding it without validation. Sometimes outliers are valid, and they can support our business objective.
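The two-standard-deviation rule can be sketched like this, on hypothetical file sizes:

```python
import pandas as pd

# Hypothetical file sizes in MB; 95 is far from the rest.
sizes = pd.Series([10, 12, 11, 13, 12, 11, 95])

mean, std = sizes.mean(), sizes.std()
# Flag values more than two standard deviations from the mean.
outliers = sizes[(sizes < mean - 2 * std) | (sizes > mean + 2 * std)]
print(outliers.tolist())  # [95]
```

Remember that per the text above, a flagged value like this should be investigated, not automatically dropped.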

Handling Missing Values

Before handling missing values, the rest of the data cleaning steps should be performed, especially removing duplicates and handling irrelevant observations and outliers. Otherwise, when imputing with the mean or median, you will be computing it over the uncleaned data, which gives wrong results. A few methods for handling missing values are:

If a column has, say, 85% of its values missing, we do not have enough data to fill in those gaps, so we can drop the column.

When a categorical value is missing, we can replace it with a default value or with "Unknown". When a numerical value is missing, we can impute it with the mean, the median, or KNN imputation. Or, depending on the business objective, we can use a prediction algorithm to predict the missing values, which will give better accuracy unless the missing value is expected to have very high variance.
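The steps above can be sketched on a toy frame (the column names and the 75% drop threshold are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0, np.nan],  # 80% missing
    "category": ["red", None, "blue", "red", None],
    "amount": [10.0, np.nan, 30.0, 20.0, np.nan],
})

# Drop columns whose missing ratio exceeds a threshold (here 75%).
df = df.drop(columns=df.columns[df.isna().mean() > 0.75])

# Categorical: fill with a default label; numerical: fill with the median.
df["category"] = df["category"].fillna("Unknown")
df["amount"] = df["amount"].fillna(df["amount"].median())
print(df)
```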

Once data cleaning is done, data quality checks should be performed before using the data for analysis or feeding it to ML models. In my next article, I will cover data quality checks and how they can be done.

Thanks for reading! If you found this useful, follow me here on Medium. I regularly write about Python, ML, AI tools, and developer productivity.

You can also find me on Twitter : @PythonDataStats, and on YouTube @pythonDataStats where I share quick tips, tools, and tutorials for developers.