Quickly clean up citations from Wikipedia tables with pandas

Summary: In this tutorial, you’ll learn how to quickly remove citations from Wikipedia tables so that you can read them into a clean DataFrame.

One of the most annoying things about Wikipedia is the citation markers in its tables. Sure, it’s good to see that the data is backed up by references, but if you run pandas read_html on such a table, the markers are included in the cell values, and you end up with a messy DataFrame.

 | City | Country | Name | Year opened | Year of last expansion | Stations | System length | Annual ridership (millions)
0 | Algiers | Algeria | Algiers Metro | 2011[13] | 2018[14] | 19[14] | 18.5 km (11.5 mi)[15] | 45.3 (2019)[R 1]
1 | Buenos Aires | Argentina | Buenos Aires Underground | 1927[Nb 1] | 2019[18] | 90[19] | 56.7 km (35.2 mi)[19] | 321.3 (2019)[R 2]
2 | Yerevan | Armenia | Yerevan Metro | 1981[20] | 1996[21] | 10[20] | 13.4 km (8.3 mi)[20] | 20.2 (2019)[R 3]
3 | Sydney | Australia | Sydney Metro | 2019[22] | – | 13[22] | 36 km (22 mi)[22][23] | 12.9 (2020)[R 4][R Nb 1]
4 | Vienna | Austria | Vienna U-Bahn | 1976[24][Nb 2] | 2017[25] | 98[26] | 83.3 km (51.8 mi)[24] | 459.8 (2019)[R 6]
189 | San Francisco | United States | BART[Nb 77] | 1972[380] | 2020[381] | 47[380][Nb 78] | 186.8 km (116.1 mi)[380][Nb 79] | 34.1 (2020)[R 16][R Nb 2]
190 | San Juan | United States | Tren Urbano | 2004[356] | 2005 | 16 | 17.2 km (10.7 mi) | 1.1 (2020)[R 16][R Nb 2]
191 | Washington, D.C. | United States | Washington Metro | 1976[382] | 2014[383] | 91[382] | 188 km (117 mi)[382] | 68.1 (2020)[R 16][R Nb 2]
192 | Tashkent | Uzbekistan | Tashkent Metro | 1977 | 2020[Nb 80] | 39[384] | 57.1 km (35.5 mi)[384] | 71.2 (2019)[R 3]
193 | Caracas | Venezuela | Caracas Metro[Nb 81] | 1983[385] | 2015[386] | 52[Nb 82] | 67.2 km (41.8 mi)[Nb 82] | 358 (2017)[R 100][R 101]
Messy table with citations included

This article will present a simple and quick way to get rid of all citations.


Step 1: Open DevTools

With the Wikipedia page open, right-click on the table and select Inspect (or Inspect Element, depending on the browser) to open DevTools.

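Step 2: Remove all reference tags

Switch to the Console tab, paste the following JavaScript into the console, and press Enter to run it. This snippet removes every <sup> element from the current page, which is the tag Wikipedia uses for citation markers. Note that it also removes any other superscript text on the page.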

document.body.innerHTML=document.body.innerHTML.replace(/<sup\b[^>]*>(.*?)<\/sup>/gi, "" );

The page should now be free of citation markers.
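
If you would rather skip the browser console, the same substitution can be sketched in Python with the standard re module. This is only a rough equivalent of the JavaScript above, not part of the original workflow: the function name strip_sup_tags and the sample HTML string are made up for illustration, and the DOTALL flag is an addition so that a tag spanning several lines is still matched.

```python
import re

# Mirrors the JavaScript regex from Step 2: match every <sup>...</sup> element.
# re.DOTALL (an addition, not in the JavaScript) lets ".*?" cross line breaks.
SUP_TAG = re.compile(r"<sup\b[^>]*>.*?</sup>", re.IGNORECASE | re.DOTALL)


def strip_sup_tags(html: str) -> str:
    """Return html with all <sup> elements and their contents removed."""
    return SUP_TAG.sub("", html)


# Hypothetical cell markup, shaped like a Wikipedia citation marker.
sample = (
    '<td>2011<sup id="cite_ref-13" class="reference">'
    '<a href="#cite_note-13">[13]</a></sup></td>'
)
print(strip_sup_tags(sample))  # <td>2011</td>
```

The cleaned string could then be wrapped in io.StringIO and handed to pd.read_html. Like the JavaScript version, this removes every superscript, not only citations.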

Step 3: Save the page

Now you can save the page locally and load it into your notebook as usual. In this example, the file is named wiki.html.

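import pandas as pd

# read_html returns a list with one DataFrame per <table> found in the file
page_tables = pd.read_html("wiki.html")
# index 0 is the first table on the page; adjust if you need a different one
df = page_tables[0]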

You can see that the data is now free from any unnecessary information.

 | City | Country | Name | Year opened | Year of last expansion | Stations | System length | Annual ridership (millions)
0 | Algiers | Algeria | Algiers Metro | 2011 | 2018 | 19 | 18.5 km (11.5 mi) | 45.3 (2019)
1 | Buenos Aires | Argentina | Buenos Aires Underground | 1927 | 2019 | 90 | 56.7 km (35.2 mi) | 321.3 (2019)
2 | Yerevan | Armenia | Yerevan Metro | 1981 | 1996 | 10 | 13.4 km (8.3 mi) | 20.2 (2019)
3 | Sydney | Australia | Sydney Metro | 2019 | – | 13 | 36 km (22 mi) | 12.9 (2020)
4 | Vienna | Austria | Vienna U-Bahn | 1976 | 2017 | 98 | 83.3 km (51.8 mi) | 459.8 (2019)
189 | San Francisco | United States | BART | 1972 | 2020 | 47 | 186.8 km (116.1 mi) | 34.1 (2020)
190 | San Juan | United States | Tren Urbano | 2004 | 2005 | 16 | 17.2 km (10.7 mi) | 1.1 (2020)
191 | Washington, D.C. | United States | Washington Metro | 1976 | 2014 | 91 | 188 km (117 mi) | 68.1 (2020)
192 | Tashkent | Uzbekistan | Tashkent Metro | 1977 | 2020 | 39 | 57.1 km (35.5 mi) | 71.2 (2019)
193 | Caracas | Venezuela | Caracas Metro | 1983 | 2015 | 52 | 67.2 km (41.8 mi) | 358 (2017)
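
If editing the page by hand is not practical, a possible fallback is to strip the markers from the cell text after loading. The sketch below is an alternative to the browser method, not the approach this tutorial uses. The helper name strip_citations is hypothetical, and the pattern assumes markers only take the shapes seen in the messy table ([13], [Nb 1], [R 4], [R Nb 1]).

```python
import re

# Matches bracketed markers such as [13], [Nb 1], [R 4] or [R Nb 1].
# Assumption: these are the only marker shapes present in the table.
CITATION = re.compile(r"\[(?:R\s+)?(?:Nb\s+)?\d+\]")


def strip_citations(value):
    """Remove citation markers from a string; return other types unchanged."""
    if not isinstance(value, str):
        return value
    return CITATION.sub("", value).strip()


print(strip_citations("12.9 (2020)[R 4][R Nb 1]"))  # 12.9 (2020)
print(strip_citations("BART[Nb 77]"))               # BART
```

On a DataFrame this could be applied cell by cell, for example with df.map(strip_citations) on recent pandas versions (df.applymap on older ones). Column names containing markers would need the same treatment separately.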
