Summary: in this tutorial, you’ll learn how to create a
DataFrame directly from Excel files.
DataFrame can represent spreadsheet data, so reading an Excel file seems like a must-have feature for a all-in-one data analysis library like Pandas.
Supported file formats include xls, xlsx, xlsm, xlsb, odf, ods and odt. You can either read from a local filesystem or from a remote location using an URL.
Install xlrd and openpyxl
While xlrd has done a great job processing Excel files, it phased out support for XLSX from version 2.0. OpenPyXL is now the recommended package for reading and writing Excel 2010 files, according to python-excel.org.
To sum up : opt for OpenPyXL if you read XLSX, otherwise xlrd is enough.
Before you can perform any operation with XLS/XLSX, you have to install either xlrd and openpyxl first by running the following command in the terminal.
pip install xlrd pip install openpyxl
Or if you use Anaconda, install xlrd with conda.
conda install -c anaconda xlrd conda install -c anaconda openpyxl
Trying to read Excel files without xlrd, you will encounter the following error message.
ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
Read Excel files stored locally
In this example, we’ll use the Storage Medium Security Evaluation XLSX file from Auburn University.
We will use the built-in
read_excel method of pandas.
import pandas as pd df = pd.read_excel("data-storage-matrix.xlsx") df.head(3) # Get the first 3 rows
|Unnamed: 0||Storage Medium||Confidential (Credit Cards)||Confidential (SSN, Financial Accounting Numbers, Driver License Number, Health Insurance Policy ID Numbers||Confidential (HIPAA-Protected Health Information, Passport and Visa Numbers, Export Controlled Information)||Confidential (FERPA -Student Information , GLBA)||Operational (Unpublished Research Data, Faculty Staff employment applications, personnel files, benefits information, birth date, and personal contact information, Admissions applications, Donor contact information and non-public gift amounts.||Public (Username, Published Research Data, Certain Policy and Procedure Manuals designated by the owner as public, Campus Maps, Job Postings, Certain University Contact Information not designated by the individual as “private”, Information in the public domain.|
|1||NaN||BOX – Auburn Approved Account||NO||NO||HIPAA/Health Information With Approval||Yes||Yes||Yes|
Please note that we don’t need to import OpenPyXL as it is handled automatically by Pandas.
Read Excel files from URL
read_excel also supports reading Excel files directly from a remote location.
A handful of URL schemes are supported: http, ftp, s3, file. A local file could be read using its URL :
import pandas as pd url = 'http://www.auburn.edu/oit/security/data-storage-matrix.xlsx' df = pd.read_excel(url) df.head(3)
import pandas as pd df = pd.read_excel("s3://bucket....url") print(df)
- xlrd or openpyxl have to be installed before reading Excel files
- pandas can create a DataFrame from both local XLS/XLSX files as well as remotely located ones.