Read Excel files into DataFrame

Summary: in this tutorial, you’ll learn how to create a DataFrame directly from Excel files.

DataFrame can represent spreadsheet data, so reading an Excel file seems like a must-have feature for a all-in-one data analysis library like Pandas.

Supported file formats include xls, xlsx, xlsm, xlsb, odf, ods and odt. You can either read from a local filesystem or from a remote location using an URL.

Install xlrd and openpyxl

Pandas uses xlrd and openpyxl internally as the engine to parse and read data from Excel files.

While xlrd has done a great job processing Excel files, it phased out support for XLSX from version 2.0. OpenPyXL is now the recommended package for reading and writing Excel 2010 files, according to python-excel.org.

To sum up : opt for OpenPyXL if you read XLSX, otherwise xlrd is enough.

Before you can perform any operation with XLS/XLSX, you have to install either xlrd and openpyxl first by running the following command in the terminal.

pip install xlrd
pip install openpyxl

Or if you use Anaconda, install xlrd with conda.

conda install -c anaconda xlrd
conda install -c anaconda openpyxl

Trying to read Excel files without xlrd, you will encounter the following error message.

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.

Read Excel files stored locally

In this example, we’ll use the Storage Medium Security Evaluation XLSX file from Auburn University.

We will use the built-in read_excel method of pandas.

import pandas as pd
df = pd.read_excel("data-storage-matrix.xlsx")
df.head(3) # Get the first 3 rows

Output:

Unnamed: 0Storage MediumConfidential (Credit Cards)Confidential (SSN, Financial Accounting Numbers, Driver License Number, Health Insurance Policy ID NumbersConfidential (HIPAA-Protected Health Information, Passport and Visa Numbers, Export Controlled Information)Confidential (FERPA -Student Information , GLBA)Operational (Unpublished Research Data, Faculty Staff employment applications, personnel files, benefits information, birth date, and personal contact information, Admissions applications, Donor contact information and non-public gift amounts.Public (Username, Published Research Data, Certain Policy and Procedure Manuals designated by the owner as public, Campus Maps, Job Postings, Certain University Contact Information not designated by the individual as “private”, Information in the public domain.
NaNAdvise AssistNONONOYESYESYes
1NaNBOX – Auburn Approved AccountNONOHIPAA/Health Information With ApprovalYesYesYes
2NaNCanvasNONONOYESYESYes

Please note that we don’t need to import OpenPyXL as it is handled automatically by Pandas.

See also  Pandas dropna() - filter out empty rows and columns

Read Excel files from URL

read_excel also supports reading Excel files directly from a remote location.

A handful of URL schemes are supported: http, ftp, s3, file. A local file could be read using its URL : file://localhost/path/to/file.xls.

import pandas as pd
url = 'http://www.auburn.edu/oit/security/data-storage-matrix.xlsx'
df = pd.read_excel(url)
df.head(3)

You can also read XLS/XLSX files directly from Amazon S3 using s3:// URLs (you’ll have to install s3fs first).

import pandas as pd

df = pd.read_excel("s3://bucket....url")
print(df)

Summary

  • xlrd or openpyxl have to be installed before reading Excel files
  • pandas can create a DataFrame from both local XLS/XLSX files as well as remotely located ones.
Author: Thijmen I’m currently a SysAdmin located in the Netherlands. Every day I try to keep around a hundred users happy with their network connections and overall, tech-related issues. I also spend my spare time fiddling with web-based applications.

Leave a Comment