Summary: in this tutorial, you’ll learn how to select columns from a DataFrame object.
In this example, we are going to use the data from List of metro systems on Wikipedia and parse its HTML table into a DataFrame object to serve as the main data source with the following code snippet.
# Create a DataFrame from Wiki table
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_metro_systems'
df_list = pd.read_html(url)
df = df_list[0]
Get column list
In order to select columns from a DataFrame, first we need to know how many column it contains.
The columns
attribute of DataFrame returns an Index
object, basically represents the list of all column names.
# Get column names from DataFrame
df.columns
#Output
Index(['City', 'Country', 'Name', 'Year opened', 'Year of last expansion',
'Stations', 'System length', 'Annual ridership(millions)'],
dtype='object')
Select one column
To get the individual column data we can either use df.column_name
. For example.
# Select column using name attribute
df.Country
# Output
0 Algeria
1 Argentina
2 Armenia
3 Australia
4 Austria
...
188 United States
189 United States
190 United States
191 Uzbekistan
192 Venezuela
Name: Country, Length: 193, dtype: object
Alternatively, use df['column_name']
syntax achieves the same result.
# Select column using index operator
df['Country']
# Output
0 Algeria
1 Argentina
2 Armenia
3 Australia
4 Austria
...
188 United States
189 United States
190 United States
191 Uzbekistan
192 Venezuela
Name: Country, Length: 193, dtype: object
Select multiple columns
In order to select multiple columns, we have to use the square brackets syntax, but this time, a list of column names will be passed : df[["column1", "column2"]]
.
# Select multiple columns
col_names = ['Country', 'Name']
df[col_names]
We’ll get another DataFrame as the result.
Country | Name | |
---|---|---|
Algeria | Algiers Metro | |
1 | Argentina | Buenos Aires Underground |
2 | Armenia | Yerevan Metro |
3 | Australia | Sydney Metro |
4 | Austria | Vienna U-Bahn |
… | … | … |
188 | United States | BART[Nb 77] |
189 | United States | Tren Urbano |
190 | United States | Washington Metro |
191 | Uzbekistan | Tashkent Metro |
192 | Venezuela | Caracas Metro[Nb 81] |
Select columns by index number
If you want to select a column by its index number, you have to use DataFrame’s iloc
property. More details on the method will be covered more thoroughly in the future.
Use df.iloc[:, n]
to access the column at index number n
, like the example below.
# Select the first column data (index 0)
df.iloc[:, 0]
# Output
0 Algiers
1 Buenos Aires
2 Yerevan
3 Sydney
4 Vienna
...
188 San Francisco
189 San Juan
190 Washington, D.C.
191 Tashkent
192 Caracas
Name: City, Length: 193, dtype: object
Similar to how we pass a list of column names to select multiple columns, you can also pass a list of column index numbers to create a subset of a DataFrame out of multiple columns.
The data returned will be wrapped in a DataFrame.
# Select columns with index 0 and 3
col_index = [0,3]
df.iloc[:, col_index]
# or shorter
df.iloc[:, [0,3]]
Result :
City | Year opened | |
---|---|---|
Algiers | 2011[13] | |
1 | Buenos Aires | 1927[Nb 1] |
2 | Yerevan | 1981[20] |
3 | Sydney | 2019[22] |
4 | Vienna | 1976[24][Nb 2] |
… | … | … |
188 | San Francisco | 1972[378] |
189 | San Juan | 2004[354] |
190 | Washington, D.C. | 1976[380] |
191 | Tashkent | 1977 |
192 | Caracas | 1983[383] |
Select columns by condition
Just like how we can select rows based on multiple conditions using boolean index, we can also do the same thing with columns.
# Select columns by multiple conditions
condition_1 = df["Country"] == "China"
condition_2 = df["Year opened"] == "2012"
df[condition_1 & condition_2]
City | Country | Name | Year opened | Year of last expansion | Stations | System length | Annual ridership(millions) | |
---|---|---|---|---|---|---|---|---|
32 | Hangzhou | China | Hangzhou Metro[86] | 2012 | 2020[87] | 165 | 306 km (190 mi) | 582.41 (2020)[R 18] |
38 | Kunming | China | Kunming Metro | 2012 | 2020[99] | 84 | 139.4 km (86.6 mi)[100] | 159.26 (2020)[R 18] |
50 | Suzhou | China | Suzhou Rail Transit | 2012 | 2019[118] | 131 | 166.2 km (103.3 mi)[118] | 308.57 (2020)[R 18] |
Please note that we did not clean up citations and footnotes from our pandas DataFrame, so any “greater than” or “lower than” operation should not be evaluated to avoid strange errors.
Columns with duplicate names
Let’s say you have a CSV which contains multiple columns that share the same name and want to import it into a DataFrame for easier querying.
Don’t worry, Pandas read_csv
method has a parameter built to handle duplicate columns. By default, mangle_dupe_cols
is set to True
, meaning that duplicate columns will be automatically renamed to X, X.1, X.2 and so on, once they are imported to the DataFrame.
Let’s see how this works in action.
from io import StringIO
import pandas as pd
# Two duplicate id columns
txt = """success, created, id, errors, id
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
# Create a DataFrame out of the above data
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
We’ll get back the DataFrame with no duplicate names.
success | created | id | errors | id.1 | |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 |
7 | 8 | 9 | 10 | 11 | 12 |
If you don’t like this way of renaming column names, you can always overwrite them with your own by setting df.columns
to a new list of strings.
# Set new column names
df.columns = ['success', 'created', 'id_created', 'errors', 'id_error']
df
The original DataFrame will be updated instantly.
success | created | id_created | errors | id_error | |
---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 |
7 | 8 | 9 | 10 | 11 | 12 |
Summary
- Select one column with either
df.column_name
ordf['column_name']
- Select multiple columns using index number with
df.iloc[:, [id1,id2, id3]]
- Select multiple columns using conditions by passing boolean variables :
df[df['col1'] == 'val1' & df['col2'] >= 'val2']
- Duplicate column names are renamed automatically upon reading CSV.
- Batch column name renaming can be done by setting
df.columns
to a Python list.
