Pandas NaN – missing data handling

Summary: in this tutorial, you'll learn how Pandas handles missing data, what the NaN value is, and which built-in functions let you quickly manipulate missing values.

Gathering or collecting data usually produces inconsistencies. Many potential problems can arise, including invalid, ambiguous, or missing values, and out-of-range data.

The Pandas development team has acknowledged the problem and built in measures to make working with missing data as painless as possible.

NaN value in Pandas

For numerical values, Pandas uses NaN – Not a Number, a floating-point value to represent missing data. This is far from perfect, but it is functional, simple and works for most people.

Starting with pandas 1.0, some optional data types experimentally support a native NA scalar (pd.NA), which is implemented with a mask-based approach.
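As a minimal sketch of that mask-based approach, the snippet below assumes the nullable integer dtype "Int64" (capital I), one of the optional data types that uses pd.NA. The sample values are made up for illustration.

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer dtype; the missing entry
# is stored as pd.NA instead of forcing the column to float64.
s = pd.Series([1, None, 3], dtype="Int64")

print(s.dtype)                # Int64
print(s.iloc[1] is pd.NA)     # True
print(s.isna().tolist())      # [False, True, False]
```

With the default integer handling, the same list would be upcast to float64 so that NaN can be stored.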

NumPy's np.nan value in a Pandas data type will be marked as NaN and can be quickly verified using isnull() or notnull().
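A short sketch of that check, using a made-up three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

missing = s.isnull()    # True where the value is missing
present = s.notnull()   # the element-wise inverse of isnull()

print(missing.tolist())    # [False, True, False]
print(present.tolist())    # [True, False, True]
print(int(missing.sum()))  # 1 missing value in total
```

isna() and notna() are aliases of these two methods and behave the same way.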

None vs NaN

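Python None is converted to NaN when the other values in the column are numeric. If the column holds strings or other objects and is stored with the object dtype, Pandas keeps None as the Python None object rather than converting it. In both cases, isnull() reports the value as missing.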

While None is a native Python object, NaN is a special floating-point value, which NumPy exposes as np.nan.

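import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "reputation": [None, "Gotham", np.nan],  # Not numbers, so None is not converted (object dtype)
                   "movie": [np.nan, 3, None]})  # Numeric column, so this None becomes NaN
print(df)
print(df.isnull())  # Both None and NaN are reported as missing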
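To see the difference between the two cases in isolation, here is a small sketch with two made-up Series. The object dtype is requested explicitly for the text column, because newer pandas versions may otherwise infer a dedicated string dtype.

```python
import numpy as np
import pandas as pd

# Numeric column: None is converted to NaN and the dtype becomes float64.
numeric = pd.Series([1, None, 3])

# Object column (requested explicitly): None is kept as the Python object.
text = pd.Series(["a", None, "c"], dtype=object)

print(numeric.dtype)                # float64
print(np.isnan(numeric.iloc[1]))    # True
print(text.iloc[1] is None)         # True
print(text.isnull().tolist())       # [False, True, False]
```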

Different types of missing values compared

Below is an overview of all the popular values that should be treated as missing in Pandas.

  • NaN is a floating-point value that Pandas uses as its default placeholder for missing data, mainly in numeric columns. NaN can be manually created using numpy.nan.
  • NA: Most of the time, NA comes from R code, where NA is an identifier for a missing value.
  • NaT (Not a Time) is equivalent to NaN, but for timestamp data points. NaT can be created using pandas.NaT or numpy.datetime64("NaT").
  • None: This represents missing values of data types other than numeric.
  • null: This originates when a function doesn’t return a value or if the value is undefined.
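  • inf means infinity. It is a floating-point placeholder, available as numpy.inf, used when a calculation returns a value too large (inf) or too small (-inf) to represent. By default inf is not treated as missing. Often, we need to treat inf as a missing value, which older Pandas versions allow by setting pandas.options.mode.use_inf_as_na = True. That option is deprecated in recent releases, where replacing inf with NaN explicitly is the usual alternative.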
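The values in the list above can be compared with pd.isna(). The following is a sketch with made-up data. It replaces infinities explicitly with replace() rather than relying on the use_inf_as_na option, which is one possible approach and not the only one.

```python
import numpy as np
import pandas as pd

# pd.isna() recognises NaN, None, NaT and NA, but not infinity.
for value in (np.nan, None, pd.NaT, pd.NA, np.inf):
    print(repr(value), pd.isna(value))

s = pd.Series([1.0, np.inf, -np.inf, np.nan])
print(s.isna().tolist())        # [False, False, False, True]

# Turn both infinities into NaN so they count as missing.
cleaned = s.replace([np.inf, -np.inf], np.nan)
print(cleaned.isna().tolist())  # [False, True, True, True]
```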
