Summary: in this tutorial, you’ll learn how to select rows from a DataFrame object.
Select the first n rows
head
is the built-in method to quickly select the first n rows from a DataFrame object. This is particularly useful if you’re wondering whether you have the right type of data inside the object.
By default, head
returns the first 5 rows from the DataFrame.
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df = pd.read_html(url)
df[0].head()
Output :
Bank NameBank | CityCity | StateSt | CertCert | Acquiring InstitutionAI | Closing DateClosing | FundFund | |
---|---|---|---|---|---|---|---|
Almena State Bank | Almena | KS | 15426 | Equity Bank | October 23, 2020 | 10538 | |
1 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | October 16, 2020 | 10537 |
2 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | April 3, 2020 | 10536 |
3 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | February 14, 2020 | 10535 |
4 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | November 1, 2019 | 10534 |
If you want more than the first 5 rows, pass a number to head
.
df.head(10) # Get first 10 rows from a DataFrame
Interestingly, you can pass a negative number to head
. In this case, the method trims the last n rows and return the remaining part of the DataFrame, equivalent to df[:-n]
.
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df = pd.read_html(url)
# Trim 555 rows from the bottom of the object
# and return the remaining rows
df[0].head(-555)
We’ll get the first 7 rows as the output, 555 rows from the last rows has been trimmed out.
Bank NameBank | CityCity | StateSt | CertCert | Acquiring InstitutionAI | Closing DateClosing | FundFund | |
---|---|---|---|---|---|---|---|
Almena State Bank | Almena | KS | 15426 | Equity Bank | October 23, 2020 | 10538 | |
1 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | October 16, 2020 | 10537 |
2 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | April 3, 2020 | 10536 |
3 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | February 14, 2020 | 10535 |
4 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | November 1, 2019 | 10534 |
5 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | October 25, 2019 | 10533 |
6 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | October 25, 2019 | 10532 |
7 | The Enloe State Bank | Cooper | TX | 10716 | Legend Bank, N. A. | May 31, 2019 | 10531 |
Select the last n rows
Contrary to the head
to get rows from the top, Pandas has a built-in tail
method to select from the last rows upwards.
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df = pd.read_html(url)
df[0].tail()
Bank NameBank | CityCity | StateSt | CertCert | Acquiring InstitutionAI | Closing DateClosing | FundFund | |
---|---|---|---|---|---|---|---|
558 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | July 27, 2001 | 6004 |
559 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | May 3, 2001 | 4648 |
560 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | February 2, 2001 | 4647 |
561 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | December 14, 2000 | 4646 |
562 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | October 13, 2000 | 4645 |
Similarly, passing a number n
to the method gets us the last n
elements from the DataFrame object. If a negative number is used, this method returns all rows except the first n rows.
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df = pd.read_html(url)
# Get the last 10 rows
df[0].tail(10)
Select rows by index
Just like how Python list slicing works, you can extract a range of rows from anywhere in the middle of the DataFrame using row indexes. For example.
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df_list = pd.read_html(url)
df = df_list[0]
# Get 5 rows from rows number 10
df[10:15]
Bank NameBank | CityCity | StateSt | CertCert | Acquiring InstitutionAI | Closing DateClosing | FundFund | |
---|---|---|---|---|---|---|---|
10 | Fayette County Bank | Saint Elmo | IL | 1802 | United Fidelity Bank, fsb | May 26, 2017 | 10528 |
11 | Guaranty Bank, (d/b/a BestBank in Georgia & Mi… | Milwaukee | WI | 30003 | First-Citizens Bank & Trust Company | May 5, 2017 | 10527 |
12 | First NBC Bank | New Orleans | LA | 58302 | Whitney Bank | April 28, 2017 | 10526 |
13 | Proficio Bank | Cottonwood Heights | UT | 35495 | Cache Valley Bank | March 3, 2017 | 10525 |
14 | Seaway Bank and Trust Company | Chicago | IL | 19328 | State Bank of Texas | January 27, 2017 | 10524 |
An important thing to keep in mind that the method is that this will include row #10 and exclude row #15.
Similarly, df[:]
will returns all rows in the DataFrame.
Select rows based on one condition
In order to select all rows that meet a specific condition, we’ll make use of Pandas Boolean Index.
If reading through the documentation doesn’t make sense to you, don’t worry, an example will clarify the concept.
In the example below, we need to select all rows that have Cert lower than 500.
# Create a DataFrame from FDIC data
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df_list = pd.read_html(url)
df = df_list[0]
# Select based on condition
low_cert = df['CertCert'] <= 500
df[low_cert]
We’ll get the output we wanted:
Bank NameBank | CityCity | StateSt | CertCert | Acquiring InstitutionAI | Closing DateClosing | FundFund | |
---|---|---|---|---|---|---|---|
16 | Allied Bank | Mulberry | AR | 91 | Today’s Bank | September 23, 2016 | 10522 |
97 | First Capital Bank | Kingfisher | OK | 416 | F & M Bank | June 8, 2012 | 10443 |
195 | Habersham Bank | Clarkesville | GA | 151 | SCBT National Association | February 18, 2011 | 10345 |
248 | The Peoples Bank | Winder | GA | 182 | Community & Southern Bank | September 17, 2010 | 10292 |
Select rows based on multiple condition
We can combine unlimited number of boolean indexes to filter data from DataFrame quickly and easily.
Let’s suppose we want to get all failed bank that meet two criteria : has Cert higher than 5000 and based in New York.
# Create a DataFrame from FDIC data
import pandas as pd
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
df_list = pd.read_html(url)
df = df_list[0]
# Select based on multiple condition
low_cert = df['CertCert'] > 5000
ny_based = df['CityCity'] == "New York"
df[low_cert & ny_based]
And voila, the result comes.
Bank NameBank | CityCity | StateSt | CertCert | Acquiring InstitutionAI | Closing DateClosing | FundFund | |
---|---|---|---|---|---|---|---|
343 | The Park Avenue Bank | New York | NY | 27096 | Valley National Bank | March 12, 2010 | 10195 |
344 | LibertyPointe Bank | New York | NY | 58071 | Valley National Bank | March 11, 2010 | 10194 |
Summary
df.head(n)
returns first n rows from a DataFrame.df.tail(n)
returns last n rows from a DataFrame.df[x:y]
returns rows from #x to #(y-1) (row number y is not included).- You can select rows based on multiple condition using boolean indexes.
df[criteria]
withcriteria
being a boolean variable returns rows that met the criteria.
