Read Excel files into DataFrame

Summary: in this tutorial, you’ll learn how to create a DataFrame directly from Excel files.

DataFrame can represent spreadsheet data, so reading an Excel file seems like a must-have feature for a all-in-one data analysis library like Pandas.

Supported file formats include xls, xlsx, xlsm, xlsb, odf, ods and odt. You can either read from a local filesystem or from a remote location using an URL.

Install xlrd and openpyxl

Pandas uses xlrd and openpyxl internally as the engine to parse and read data from Excel files.

While xlrd has done a great job processing Excel files, it phased out support for XLSX from version 2.0. OpenPyXL is now the recommended package for reading and writing Excel 2010 files, according to python-excel.org.

To sum up : opt for OpenPyXL if you read XLSX, otherwise xlrd is enough.

Before you can perform any operation with XLS/XLSX, you have to install either xlrd and openpyxl first by running the following command in the terminal.

pip install xlrd pip install openpyxl
Code language: Python (python)

Or if you use Anaconda, install xlrd with conda.

conda install -c anaconda xlrd conda install -c anaconda openpyxl
Code language: Python (python)

Trying to read Excel files without xlrd, you will encounter the following error message.

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
Code language: Python (python)

Read Excel files stored locally

In this example, we’ll use the Storage Medium Security Evaluation XLSX file from Auburn University.

We will use the built-in read_excel method of pandas.

import pandas as pd df = pd.read_excel("data-storage-matrix.xlsx") df.head(3) # Get the first 3 rows
Code language: Python (python)

Output:

Unnamed: 0Storage MediumConfidential (Credit Cards)Confidential (SSN, Financial Accounting Numbers, Driver License Number, Health Insurance Policy ID NumbersConfidential (HIPAA-Protected Health Information, Passport and Visa Numbers, Export Controlled Information)Confidential (FERPA -Student Information , GLBA)Operational (Unpublished Research Data, Faculty Staff employment applications, personnel files, benefits information, birth date, and personal contact information, Admissions applications, Donor contact information and non-public gift amounts.Public (Username, Published Research Data, Certain Policy and Procedure Manuals designated by the owner as public, Campus Maps, Job Postings, Certain University Contact Information not designated by the individual as “private”, Information in the public domain.
0NaNAdvise AssistNONONOYESYESYes
1NaNBOX – Auburn Approved AccountNONOHIPAA/Health Information With ApprovalYesYesYes
2NaNCanvasNONONOYESYESYes

Please note that we don’t need to import OpenPyXL as it is handled automatically by Pandas.

Read Excel files from URL

read_excel also supports reading Excel files directly from a remote location.

A handful of URL schemes are supported: http, ftp, s3, file. A local file could be read using its URL : file://localhost/path/to/file.xls.

import pandas as pd url = 'http://www.auburn.edu/oit/security/data-storage-matrix.xlsx' df = pd.read_excel(url) df.head(3)
Code language: Python (python)

You can also read XLS/XLSX files directly from Amazon S3 using s3:// URLs (you’ll have to install s3fs first).

import pandas as pd df = pd.read_excel("s3://bucket....url") print(df)
Code language: Python (python)

Summary

  • xlrd or openpyxl have to be installed before reading Excel files
  • pandas can create a DataFrame from both local XLS/XLSX files as well as remotely located ones.

Leave a Comment