Select DataFrame columns

Summary: in this tutorial, you’ll learn how to select columns from a DataFrame object.

In this example, we are going to use the data from List of metro systems on Wikipedia and parse its HTML table into a DataFrame object to serve as the main data source with the following code snippet.

# Create a DataFrame from Wiki table
import pandas as pd
url = ''
df_list = pd.read_html(url)
df = df_list[0]

Get column list

In order to select columns from a DataFrame, first we need to know how many column it contains.

The columns attribute of DataFrame returns an Index object, basically represents the list of all column names.

# Get column names from DataFrame

Index(['City', 'Country', 'Name', 'Year opened', 'Year of last expansion',
       'Stations', 'System length', 'Annual ridership(millions)'],

Select one column

To get the individual column data we can either use df.column_name. For example.

# Select column using name attribute

# Output
0            Algeria
1          Argentina
2            Armenia
3          Australia
4            Austria
188    United States
189    United States
190    United States
191       Uzbekistan
192        Venezuela
Name: Country, Length: 193, dtype: object

Alternatively, use df['column_name'] syntax achieves the same result.

# Select column using index operator

# Output
0            Algeria
1          Argentina
2            Armenia
3          Australia
4            Austria
188    United States
189    United States
190    United States
191       Uzbekistan
192        Venezuela
Name: Country, Length: 193, dtype: object

Select multiple columns

In order to select multiple columns, we have to use the square brackets syntax, but this time, a list of column names will be passed : df[["column1", "column2"]].

# Select multiple columns
col_names = ['Country', 'Name']

We’ll get another DataFrame as the result.

AlgeriaAlgiers Metro
1ArgentinaBuenos Aires Underground
2ArmeniaYerevan Metro
3AustraliaSydney Metro
4AustriaVienna U-Bahn
188United StatesBART[Nb 77]
189United StatesTren Urbano
190United StatesWashington Metro
191UzbekistanTashkent Metro
192VenezuelaCaracas Metro[Nb 81]
193 rows × 2 columns

Select columns by index number

If you want to select a column by its index number, you have to use DataFrame’s iloc property. More details on the method will be covered more thoroughly in the future.

See also  Pandas reindexing

Use df.iloc[:, n] to access the column at index number n, like the example below.

# Select the first column data (index 0)
df.iloc[:, 0]

# Output
0               Algiers
1          Buenos Aires
2               Yerevan
3                Sydney
4                Vienna
188       San Francisco
189            San Juan
190    Washington, D.C.
191            Tashkent
192             Caracas
Name: City, Length: 193, dtype: object

Similar to how we pass a list of column names to select multiple columns, you can also pass a list of column index numbers to create a subset of a DataFrame out of multiple columns.

The data returned will be wrapped in a DataFrame.

# Select columns with index 0 and 3
col_index = [0,3]
df.iloc[:, col_index]
# or shorter
df.iloc[:, [0,3]] 

Result :

CityYear opened
1Buenos Aires1927[Nb 1]
4Vienna1976[24][Nb 2]
188San Francisco1972[378]
189San Juan2004[354]
190Washington, D.C.1976[380]
193 rows × 2 columns

Select columns by condition

Just like how we can select rows based on multiple conditions using boolean index, we can also do the same thing with columns.

# Select columns by multiple conditions
condition_1 = df["Country"] == "China"
condition_2 = df["Year opened"] == "2012"
df[condition_1 & condition_2]
CityCountryNameYear openedYear of last expansionStationsSystem lengthAnnual ridership(millions)
32HangzhouChinaHangzhou Metro[86]20122020[87]165306 km (190 mi)582.41 (2020)[R 18]
38KunmingChinaKunming Metro20122020[99]84139.4 km (86.6 mi)[100]159.26 (2020)[R 18]
50SuzhouChinaSuzhou Rail Transit20122019[118]131166.2 km (103.3 mi)[118]308.57 (2020)[R 18]

Please note that we did not clean up citations and footnotes from our pandas DataFrame, so any “greater than” or “lower than” operation should not be evaluated to avoid strange errors.

See also  Select DataFrame rows

Columns with duplicate names

Let’s say you have a CSV which contains multiple columns that share the same name and want to import it into a DataFrame for easier querying.

Don’t worry, Pandas read_csv method has a parameter built to handle duplicate columns. By default, mangle_dupe_cols is set to True, meaning that duplicate columns will be automatically renamed to X, X.1, X.2 and so on, once they are imported to the DataFrame.

Let’s see how this works in action.

from io import StringIO
import pandas as pd

# Two duplicate id columns
txt = """success, created, id, errors, id
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

# Create a DataFrame out of the above data
df = pd.read_csv(StringIO(txt), skipinitialspace=True)

We’ll get back the DataFrame with no duplicate names.


If you don’t like this way of renaming column names, you can always overwrite them with your own by setting df.columns to a new list of strings.

# Set new column names
df.columns = ['success', 'created', 'id_created', 'errors', 'id_error']

The original DataFrame will be updated instantly.



  • Select one column with either df.column_name or df['column_name']
  • Select multiple columns using index number with df.iloc[:, [id1,id2, id3]]
  • Select multiple columns using conditions by passing boolean variables : df[df['col1'] == 'val1' & df['col2'] >= 'val2']
  • Duplicate column names are renamed automatically upon reading CSV.
  • Batch column name renaming can be done by setting df.columns to a Python list.

Leave a Comment