Import HTML tables into DataFrame

Summary: in this tutorial, you’ll learn how to parse HTML webpages and import its data into a DataFrame.

HTML is the language of the web, created for wrapping web elements into a page.

Sometimes we want to bring a HTML table into Pandas quickly for analysis. In such cases, Pandas offer the read_html method that can both fetch the webpage and parse HTML tables, then create a DataFrame object out of that data.

Supported URL schemes include http, ftp and file (sadly no https support). The reason for the lack of HTTPS support is because Pandas relies on lxml library to perform its HTML parsing tasks, and lxml cannot access HTTPS URLs (yet).

Install lxml

Before you can perform any operation with HTML tables, you have to install lxml first by running the following command in the terminal.

pip install lxml
Code language: Python (python)

If you use Anaconda, lxml is already installed, no manual action required.

Read HTML tables with Pandas

In this example, we’re going to get the FDIC Failed Bank List into Pandas for further analysis.

We will use the built-in read_html method to fetch and parse the page.

import pandas as pd url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/' df = pd.read_html(url) df[0]
Code language: Python (python)

Output:

Bank NameBankCityCityStateStCertCertAcquiring InstitutionAIClosing DateClosingFundFund
0Almena State BankAlmenaKS15426Equity BankOctober 23, 202010538
1First City Bank of FloridaFort Walton BeachFL16748United Fidelity Bank, fsbOctober 16, 202010537
2The First State BankBarboursvilleWV14361MVB Bank, Inc.April 3, 202010536
3Ericson State BankEricsonNE18265Farmers and Merchants BankFebruary 14, 202010535
4City National Bank of New JerseyNewarkNJ21111Industrial BankNovember 1, 201910534
558Superior Bank, FSBHinsdaleIL32646Superior Federal, FSBJuly 27, 20016004
559Malta National BankMaltaOH6629North Valley BankMay 3, 20014648
560First Alliance Bank & Trust Co.ManchesterNH34264Southern New Hampshire Bank & TrustFebruary 2, 20014647
561National State Bank of MetropolisMetropolisIL3815Banterra Bank of MarionDecember 14, 20004646
562Bank of HonoluluHonoluluHI21029Bank of the OrientOctober 13, 20004645

Please note that read_html automatically look for <table> elements inside the webpage and returns a list of DataFrames, even when there’s only one table found.

In the example above, the long DataFrame has been automatically truncated by Pandas for easier viewing. You will also see a note in the lower part of the output that says “563 rows × 7 columns”.

Summary

  • Pandas read_html can automatically look for HTML tables parse all of them into DataFrames.
  • read_html always return a list of DataFrames, when there’s no HTML table, it returns an empty list.
  • read_html cannot fetch from HTTPS URLs, you need urllib3, Requests or HTTPX for that.

Leave a Comment