Summary: in this tutorial, you’ll learn how to parse HTML webpages and import its data into a DataFrame
.
HTML is the language of the web, created for wrapping web elements into a page.
Sometimes we want to bring a HTML table into Pandas quickly for analysis. In such cases, Pandas offer the read_html
method that can both fetch the webpage and parse HTML tables, then create a DataFrame object out of that data.
Supported URL schemes include http, ftp and file (sadly no https support). The reason for the lack of HTTPS support is because Pandas relies on lxml library to perform its HTML parsing tasks, and lxml cannot access HTTPS URLs (yet).
Install lxml
Before you can perform any operation with HTML tables, you have to install lxml first by running the following command in the terminal.
pip install lxml
If you use Anaconda, lxml is already installed, no manual action required.
Read HTML tables with Pandas
In this example, we’re going to get the FDIC Failed Bank List into Pandas for further analysis.
We will use the built-in read_html
method to fetch and parse the page.
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df = pd.read_html(url)
df[0]
Output:
Bank NameBank | CityCity | StateSt | CertCert | Acquiring InstitutionAI | Closing DateClosing | FundFund | |
---|---|---|---|---|---|---|---|
Almena State Bank | Almena | KS | 15426 | Equity Bank | October 23, 2020 | 10538 | |
1 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | October 16, 2020 | 10537 |
2 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | April 3, 2020 | 10536 |
3 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | February 14, 2020 | 10535 |
4 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | November 1, 2019 | 10534 |
… | … | … | … | … | … | … | … |
558 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | July 27, 2001 | 6004 |
559 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | May 3, 2001 | 4648 |
560 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | February 2, 2001 | 4647 |
561 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | December 14, 2000 | 4646 |
562 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | October 13, 2000 | 4645 |
Please note that read_html
automatically look for <table>
elements inside the webpage and returns a list of DataFrames, even when there’s only one table found.
In the example above, the long DataFrame has been automatically truncated by Pandas for easier viewing. You will also see a note in the lower part of the output that says “563 rows × 7 columns”.
Summary
- Pandas
read_html
can automatically look for HTML tables parse all of them into DataFrames. read_html
always return a list of DataFrames, when there’s no HTML table, it returns an empty list.read_html
cannot fetch from HTTPS URLs, you need urllib3, Requests or HTTPX for that.
