Select DataFrame columns

Summary: in this tutorial, you’ll learn how to select columns from a DataFrame object.

In this example, we are going to use the data from List of metro systems on Wikipedia and parse its HTML table into a DataFrame object to serve as the main data source with the following code snippet.

# Create a DataFrame from Wiki table import pandas as pd url = 'https://en.wikipedia.org/wiki/List_of_metro_systems' df_list = pd.read_html(url) df = df_list[0]
Code language: Python (python)

Get column list

In order to select columns from a DataFrame, first we need to know how many column it contains.

The columns attribute of DataFrame returns an Index object, basically represents the list of all column names.

# Get column names from DataFrame df.columns #Output Index(['City', 'Country', 'Name', 'Year opened', 'Year of last expansion', 'Stations', 'System length', 'Annual ridership(millions)'], dtype='object')
Code language: Python (python)

Select one column

To get the individual column data we can either use df.column_name. For example.

# Select column using name attribute df.Country # Output 0 Algeria 1 Argentina 2 Armenia 3 Australia 4 Austria ... 188 United States 189 United States 190 United States 191 Uzbekistan 192 Venezuela Name: Country, Length: 193, dtype: object
Code language: Python (python)

Alternatively, use df['column_name'] syntax achieves the same result.

# Select column using index operator df['Country'] # Output 0 Algeria 1 Argentina 2 Armenia 3 Australia 4 Austria ... 188 United States 189 United States 190 United States 191 Uzbekistan 192 Venezuela Name: Country, Length: 193, dtype: object
Code language: Python (python)

Select multiple columns

In order to select multiple columns, we have to use the square brackets syntax, but this time, a list of column names will be passed : df[["column1", "column2"]].

# Select multiple columns col_names = ['Country', 'Name'] df[col_names]
Code language: Python (python)

We’ll get another DataFrame as the result.

CountryName
0AlgeriaAlgiers Metro
1ArgentinaBuenos Aires Underground
2ArmeniaYerevan Metro
3AustraliaSydney Metro
4AustriaVienna U-Bahn
188United StatesBART[Nb 77]
189United StatesTren Urbano
190United StatesWashington Metro
191UzbekistanTashkent Metro
192VenezuelaCaracas Metro[Nb 81]
193 rows × 2 columns

Select columns by index number

If you want to select a column by its index number, you have to use DataFrame’s iloc property. More details on the method will be covered more thoroughly in the future.

Use df.iloc[:, n] to access the column at index number n, like the example below.

# Select the first column data (index 0) df.iloc[:, 0] # Output 0 Algiers 1 Buenos Aires 2 Yerevan 3 Sydney 4 Vienna ... 188 San Francisco 189 San Juan 190 Washington, D.C. 191 Tashkent 192 Caracas Name: City, Length: 193, dtype: object
Code language: Python (python)

Similar to how we pass a list of column names to select multiple columns, you can also pass a list of column index numbers to create a subset of a DataFrame out of multiple columns.

The data returned will be wrapped in a DataFrame.

# Select columns with index 0 and 3 col_index = [0,3] df.iloc[:, col_index] # or shorter df.iloc[:, [0,3]]
Code language: Python (python)

Result :

CityYear opened
0Algiers2011[13]
1Buenos Aires1927[Nb 1]
2Yerevan1981[20]
3Sydney2019[22]
4Vienna1976[24][Nb 2]
188San Francisco1972[378]
189San Juan2004[354]
190Washington, D.C.1976[380]
191Tashkent1977
192Caracas1983[383]
193 rows × 2 columns

Select columns by condition

Just like how we can select rows based on multiple conditions using boolean index, we can also do the same thing with columns.

# Select columns by multiple conditions condition_1 = df["Country"] == "China" condition_2 = df["Year opened"] == "2012" df[condition_1 & condition_2]
Code language: Python (python)
CityCountryNameYear openedYear of last expansionStationsSystem lengthAnnual ridership(millions)
32HangzhouChinaHangzhou Metro[86]20122020[87]165306 km (190 mi)582.41 (2020)[R 18]
38KunmingChinaKunming Metro20122020[99]84139.4 km (86.6 mi)[100]159.26 (2020)[R 18]
50SuzhouChinaSuzhou Rail Transit20122019[118]131166.2 km (103.3 mi)[118]308.57 (2020)[R 18]

Please note that we did not clean up citations and footnotes from our pandas DataFrame, so any “greater than” or “lower than” operation should not be evaluated to avoid strange errors.

Columns with duplicate names

Let’s say you have a CSV which contains multiple columns that share the same name and want to import it into a DataFrame for easier querying.

Don’t worry, Pandas read_csv method has a parameter built to handle duplicate columns. By default, mangle_dupe_cols is set to True, meaning that duplicate columns will be automatically renamed to X, X.1, X.2 and so on, once they are imported to the DataFrame.

Let’s see how this works in action.

from io import StringIO import pandas as pd # Two duplicate id columns txt = """success, created, id, errors, id 1, 2, 3, 4, 5, 6 7, 8, 9, 10, 11, 12""" # Create a DataFrame out of the above data df = pd.read_csv(StringIO(txt), skipinitialspace=True) df
Code language: Python (python)

We’ll get back the DataFrame with no duplicate names.

successcreatediderrorsid.1
123456
789101112

If you don’t like this way of renaming column names, you can always overwrite them with your own by setting df.columns to a new list of strings.

# Set new column names df.columns = ['success', 'created', 'id_created', 'errors', 'id_error'] df
Code language: Python (python)

The original DataFrame will be updated instantly.

successcreatedid_createderrorsid_error
123456
789101112

Summary

  • Select one column with either df.column_name or df['column_name']
  • Select multiple columns using index number with df.iloc[:, [id1,id2, id3]]
  • Select multiple columns using conditions by passing boolean variables : df[df['col1'] == 'val1' & df['col2'] >= 'val2']
  • Duplicate column names are renamed automatically upon reading CSV.
  • Batch column name renaming can be done by setting df.columns to a Python list.

Leave a Comment