Summary: in this tutorial, you’ll learn about .iloc, .loc and .ix – the three ways of selecting row and column data in Pandas.
There are three basic ways of selecting data from rows and columns in Pandas.
.iloc – select by index number
.iloc selects rows and columns by their index numbers. The name is short for “integer location” or “index location”.
Each row has a number from 0 to the total number of rows minus one, called its “index” (its integer position). Similarly, each column has its own index number.
The syntax is very simple:
df.iloc[<row selection>, <column selection>]
# iloc usage in Pandas
# Select the first five rows of the DataFrame
data.iloc[0:5]
# Select the first two columns of the DataFrame with all rows
data.iloc[:, 0:2]
# Select 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns. Remember the index is counted from 0.
data.iloc[[0, 3, 6, 24], [0, 5, 6]]
# Select the first 5 rows and the 6th, 7th, 8th columns of the DataFrame.
data.iloc[0:5, 5:8]
Important notes:
- The last row or column in the range is never selected. For example, [3:9] selects rows 3 to 8, but not row 9.
- If you pass a single number into .iloc, the data returned will be a Series (which makes sense because it contains only one row). Selecting multiple rows returns a DataFrame instead of a Series. To ensure the result is always a DataFrame, pass a single-valued list into .iloc instead of just a single number.
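The two notes above can be checked on a small, made-up DataFrame. This is only an illustrative sketch: the `data` frame and its column names are assumptions, not the dataset used elsewhere in this tutorial.

```python
import pandas as pd

# Hypothetical 10-row frame, used only for illustration
data = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

# The end of a range is excluded: rows 3 to 8 are returned, row 9 is not
subset = data.iloc[3:9]
print(list(subset.index))        # [3, 4, 5, 6, 7, 8]

# A single number returns a Series
single = data.iloc[3]
print(type(single).__name__)     # Series

# A single-valued list returns a DataFrame with one row
single_df = data.iloc[[3]]
print(type(single_df).__name__)  # DataFrame
print(single_df.shape)           # (1, 3)
```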
.loc – select by name or boolean vector
.loc is a label indexer that allows us to select rows and columns either by their names or boolean vectors.
.loc syntax is df.loc[]. Inside the brackets are the inputs, which could be either row_selection, column_selection or a boolean vector.
The boolean vector syntax is the most useful one because it can quickly filter through a DataFrame to find what you need.
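As a rough illustration of boolean-vector filtering, the sketch below builds a tiny frame by hand. The three rows borrow a few values from the metro table used later in this tutorial, but the frame itself and the variable names (`is_brazil`, `big_brazil`) are assumptions made for the example.

```python
import pandas as pd

# Small hand-built frame, for illustration only
df = pd.DataFrame({
    "City": ["Minsk", "Belo Horizonte", "Porto Alegre"],
    "Country": ["Belarus", "Brazil", "Brazil"],
    "Stations": [33, 19, 22],
})

# A boolean vector: one True/False value per row
is_brazil = df["Country"] == "Brazil"

# Keep only the rows where the vector is True
brazil = df.loc[is_brazil]
print(brazil["City"].tolist())   # ['Belo Horizonte', 'Porto Alegre']

# Conditions can be combined with & and |, and a column selection can be added
big_brazil = df.loc[is_brazil & (df["Stations"] > 20), ["City", "Stations"]]
print(big_brazil["City"].tolist())   # ['Porto Alegre']
```

Each condition needs its own parentheses when combined, because `&` and `|` bind more tightly than comparison operators in Python.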
.loc by index number
.loc relies on the index of the DataFrame (if one has been set) to perform selection.
On a DataFrame with the default number-based index, you can select rows using their index number with .loc.
The example below reads a table (citations removed) into a DataFrame and selects a few rows from that data.
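import pandas as pd
# Import data from cleaned HTML file
# read_html returns a list of DataFrames; taking the first table is an assumption about wiki.html
df = pd.read_html("wiki.html")[0]
# .loc can select row/rows using index number
# Select one row (6 is an example index number)
df.loc[6]
# Select multiple rows
df.loc[[6, 8, 10]]
This results in another DataFrame which contains only our desired rows.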
| | City | Country | Name | Year opened | Year of last expansion | Stations | System length | Annual ridership (millions) |
|---|---|---|---|---|---|---|---|---|
| 6 | Minsk | Belarus | Minsk Metro | 1984 | 2020 | 33 | 40.8 km (25.4 mi) | 293.7 (2019) |
| 8 | Belo Horizonte | Brazil | Belo Horizonte Metro | 1986 | 2002 | 19 | 28.1 km (17.5 mi) | 58.4 (2018) |
| 10 | Porto Alegre | Brazil | Porto Alegre Metro | 1985 | 2014 | 22 | 43.8 km (27.2 mi) | 51.7 (2018) |
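With the default index, labels and positions happen to coincide, so .loc and .iloc can look interchangeable. The sketch below, using a made-up frame whose index labels are 10, 20 and 30 (an assumption for illustration), shows that .loc always works on labels.

```python
import pandas as pd

# Hypothetical frame whose index labels are 10, 20, 30 rather than 0, 1, 2
df = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = df.loc[10]       # row whose index label is 10
by_position = df.iloc[0]    # first row, whatever its label is
print(by_label["value"], by_position["value"])   # a a

# Label slices with .loc include the end label, unlike .iloc
label_slice = df.loc[10:20]["value"].tolist()
position_slice = df.iloc[0:1]["value"].tolist()
print(label_slice)      # ['a', 'b']
print(position_slice)   # ['a']
```

On this frame, `df.loc[0]` raises a `KeyError`, because no row carries the label 0.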
.loc by labels
On a DataFrame with an index set, .loc allows selecting rows directly by their index values.
In the following example, we’ve set the City column to be the index, then selected all metro systems based in New York City.
import pandas as pd
# Import data from cleaned HTML file
# read_html returns a list of DataFrames; taking the first table is an assumption about wiki.html
df = pd.read_html("wiki.html")[0]
# Use the City column as the index
df.set_index("City", inplace=True)
# Select using index label
df.loc['New York City']
We’ll get a fresh DataFrame with what we needed.
| City | Country | Name | Year opened | Year of last expansion | Stations | System length | Annual ridership (millions) |
|---|---|---|---|---|---|---|---|
| New York City | United States | New York City Subway | 1904 | 2017 | 424 | 399 km (248 mi) | 1697.8 (2019) |
| New York City | United States | Staten Island Railway | 1925 | 2017 | 21 | 22.5 km (14.0 mi) | 2.7 (2020) |
| New York City | United States | PATH | 1908 | 1937 | 13 | 22.2 km (13.8 mi) | 29.7 (2020) |
But be aware that if only one row is found, a Series will be returned instead of a DataFrame.
In order to avoid this behaviour, pass a one-element list instead of just a string.
# df.loc['Boston'] returns a Series
df.loc[['Boston']]  # returns a DataFrame
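The return-type rule can be seen side by side on a small stand-in frame. The sketch below does not load the real table: the three rows and the single `Name` column are assumptions chosen so that one label matches two rows and another matches only one.

```python
import pandas as pd

# Hypothetical frame indexed by city; "New York City" appears twice
df = pd.DataFrame(
    {"Name": ["New York City Subway", "PATH", "MBTA subway"]},
    index=pd.Index(["New York City", "New York City", "Boston"], name="City"),
)

several = df.loc["New York City"]   # label matches two rows -> DataFrame
one = df.loc["Boston"]              # label matches one row  -> Series
one_df = df.loc[["Boston"]]         # one-element list       -> DataFrame

print(type(several).__name__, type(one).__name__, type(one_df).__name__)
# DataFrame Series DataFrame
```

Passing a list is the safer choice when later code expects a DataFrame regardless of how many rows match.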
.ix is .loc and .iloc combined, but deprecated
.ix is a combination of the two indexers .loc and .iloc above.
Depending on the input, it performs the appropriate operation. If the input is a non-integer label, it behaves like .loc; if it is an integer, it behaves like .iloc, unless the index itself holds integers, in which case the integer is treated as a label.
The .ix indexer has been deprecated since Pandas 0.20.0 and was removed in Pandas 1.0.
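In Pandas 1.0 and later, .ix is no longer available, so older code has to be rewritten with .loc or .iloc. The sketch below shows one possible translation on a hand-built frame. The data and variable names are assumptions for illustration, and the `.ix` calls appear only in comments.

```python
import pandas as pd

# Small hand-built frame with city names as index labels
df = pd.DataFrame(
    {"Stations": [33, 19, 22]},
    index=["Minsk", "Belo Horizonte", "Porto Alegre"],
)

# Old style, no longer available: df.ix["Minsk"] and df.ix[0]
by_label = df.loc["Minsk"]      # label-based replacement
by_position = df.iloc[0]        # position-based replacement
print(by_label.equals(by_position))   # True

# Mixing a row position with a column label, which .ix used to allow:
# look up the label at position 0 first, then use .loc
value = df.loc[df.index[0], "Stations"]
print(value)   # 33
```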