If you're like me, your very first interaction with data was likely through one command.

read_csv.

This simple function, universal for data scientists working in every language from Python to R, is fundamental to processing and manipulating data. But there's a catch.

The added layer of abstraction that Pandas offers in its data frame-based commands can make for lazy and redundant operations.

Your JSON (And Final Data Format) Doesn't Need To Be Pretty

One of the bad habits I picked up as a first-year data science student was thinking that JSON was somehow "raw" data that needed to be immediately "prettified" in a data frame.

This approach meant that I was so concerned with preprocessing the data and converting it to a data frame that I didn't take time to understand its structure or the complexity of the data types embedded in it.

So, when I would convert JSON to a Pandas data frame, I'd recoil at "ugly" data like nested arrays or dictionaries.

Note: Both forms, by the way, are perfectly acceptable when stored in a database. Just be aware that CSVs generally don't support complex data types like nested arrays or dictionaries.

Using a ready-made function like Pandas' read_json or a method like .explode() to flatten your data suggests that the data frame is somehow the ideal, the ultimate form every data payload should take.
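To be fair, flattening is sometimes exactly what you want; the problem is reaching for it before you understand the structure. As a minimal sketch (with invented field names), here is what that one-line flattening actually does to a nested payload:

```python
import pandas as pd

# Hypothetical API-style payload: each record carries a nested list of orders
records = [
    {"user": "ana", "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 3.0}]},
    {"user": "ben", "orders": [{"id": 3, "total": 7.25}]},
]

# json_normalize flattens each nested order into its own row,
# carrying the parent "user" field along for the ride
df = pd.json_normalize(records, record_path="orders", meta=["user"])
print(df.shape)  # (3, 3): one row per order, columns id / total / user
```

Notice that the nesting encoded a one-to-many relationship; flattening it is a modeling decision, not just a cosmetic one.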

Nearly every college course, guided coding tutorial or paint-by-numbers data science project seemed to guide me, a data science student, toward getting my data into a data frame ASAP.

For years, I never questioned why this was the case. Looking back, it was presumably because Pandas is so universally used across every job on the data science spectrum.

And, frankly, Pandas data frames make for a more readable output when creating visual tutorials.

But this doesn't mean it's always the absolute best choice for processing, loading and storing data long-term.

It took me until my first job to realize the utility and elegance of JSON. My hope is that I can explain the role of JSON in the data science process and why you, the first-year data science student, should use this data type more often.

Why JSON?

You Can Naturally Read JSON

I recently received a message asking me a broad question:

"Why exactly should I use JSON?"

The easy answer to the question is you should use JSON because it's a human readable form of raw data that you can easily manipulate, process and store.

The more complex response to this would delve into the computational cost of a Pandas operation vs. working directly with JSON data.

Even though a raw JSON string looks intimidating, JSON is incredibly intuitive, even for beginners.

JSON is the perfect first "real world" example of so many abstract programming and data structure concepts.

You'll Learn Data Structures Faster

JSON provides us with common data structures: its objects map to dictionaries and its arrays map to lists. It also makes it easy to experience the messiness of multi-level nested fields.

Since accessing each field is as easy as referencing a key, working with JSON mirrors Pandas' bracket notation with the (slightly) added complexity of having to loop through nested values.

Although, admittedly, looping through a list of values is extra work, the trade-off is that JSON's preference for object types means you don't need to worry about the pesky type conversions you would encounter in Pandas, unless you're adding a column of a defined type like a TIMESTAMP.
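Here is a quick sketch of that key access and nested looping, using Python's standard json module and a made-up payload:

```python
import json

# A small, invented JSON string with a nested array
raw = '{"name": "sensor-7", "readings": [{"t": 0, "v": 2.5}, {"t": 1, "v": 3.1}]}'

payload = json.loads(raw)   # str -> dict, no data frame required
print(payload["name"])      # field access is just key access

# Nested values only need a loop (or a comprehension)
values = [r["v"] for r in payload["readings"]]
print(values)               # [2.5, 3.1]
```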

You'll Find JSON in Every Schema

Whether you are a data analyst, data scientist or data engineer, you will encounter JSON in one fundamental area of your discipline: Schemas.

Although cloud platforms like BigQuery let you build custom schemas programmatically (for example, via the SchemaField class in the Python client library), the most straightforward and universal way to create a table schema remains the JSON dictionary.

There's a reason you can view the JSON output of a completed BigQuery query. It is a universal data format for everyone working in the data science field.

Creating a schema in JSON also allows you to have a "backup" or "draft" of your schema in case you need to make later changes or recreate a table.
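As a sketch, a BigQuery-style JSON schema is just a list of field dictionaries; the field names below are invented, but the name/type/mode shape is the standard one. Writing it to disk gives you that "backup" copy:

```python
import json

# Hypothetical table schema in BigQuery's JSON format:
# every field declares a name, a type and a mode
schema = [
    {"name": "event_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
    {"name": "tags", "type": "STRING", "mode": "REPEATED"},  # REPEATED = array
]

# Save the schema so you can recreate or alter the table later
with open("my_table_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```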

Defining and ultimately loading a JSON schema will arguably teach you more about data types and data type compatibility than the built-in Pandas data types, which, themselves, are problematic (looking at you, NaN, classified as a float).

3 Actionable Ways To Incorporate JSON Into Your First (Or Next) Data Science Project

Show Off A Schema Design

I'm not (yet) a hiring manager, but if any part of your data science project requires a load to an RDBMS and you can write a coherent and CLEAN schema then, in my book, you're hired.

Being able to intuit and represent a schema (and not rely on auto detect) is an undervalued skill in this industry. It's not enough just to model or analyze data. You need to demonstrate an understanding of its ingestion, even if you're not the one building a pipeline.

Generate API Payloads

Despite the abundance of static CSVs available on Kaggle, pre-cleaned CSVs do not teach you the skills required to work with real-world API data, most of which is delivered as a long and possibly intimidating JSON object.

There are a ton of free APIs to use for practicing JSON retrieval and manipulation.

Getting comfortable with basic methods like .json() will help you better extract your data.
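In a live project that pattern is requests.get(url).json(). The sketch below parses a saved sample response (with invented fields) instead, so it runs offline; under the hood, .json() is doing exactly this deserialization:

```python
import json

# In a live project you would write: data = requests.get(url).json()
# Here we parse a saved sample response (invented fields) instead
sample_response = '{"count": 1, "results": [{"city": "Oslo", "temp_c": 4.2}]}'

data = json.loads(sample_response)  # what .json() does to the response body
first = data["results"][0]
print(first["city"], first["temp_c"])  # Oslo 4.2
```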

Plus, relying on APIs instead of CSVs opens you up to all kinds of fascinating data sources that will not just be interesting to work with, but also impress your interviewers.

Use JSON As A Backup

Maybe your API-based application or pipeline takes a ton of time to return an output. Maybe you're concerned about data accuracy.

Even as a beginner, a great fail-safe for your data, especially if your input is JSON-based, is to save a file either locally or (even better) in a cloud-based storage bucket.

I recently implemented this strategy when building a pipeline at work in which I needed to store "chunks" of JSON streaming data.

That stored file also gives you a static copy to explore and manipulate without having to worry about calling an API or running a script to produce an output.
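A minimal version of that local backup pattern (file name and payload invented for the example):

```python
import json
from pathlib import Path

# Pretend this chunk just came back from a slow or rate-limited API
chunk = {"batch": 1, "rows": [{"id": 1}, {"id": 2}]}

# Write it once...
backup = Path("backup_batch_1.json")
backup.write_text(json.dumps(chunk, indent=2))

# ...then reload it as often as you like without re-calling the API
restored = json.loads(backup.read_text())
print(restored == chunk)  # True
```

Swapping the local path for a cloud storage bucket changes the write call, not the idea.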

The Bottom Line

When you graduate from data science student to working in a data science-based discipline, your greatest challenge will be quickly adapting to "messy" data by incorporating it into a pipeline, dashboard or model. You will rarely work with a CSV, so get out of that habit now and embrace the output you will work with regardless of tech stack: JSON.

Create a job-worthy data portfolio. Learn how with my free project guide.