pandas.DataFrame.sort_values – How To Sort Values in Pandas

When working with data in Pandas, sorting is an essential task that lets you organize and analyze your data effectively. The pandas.DataFrame.sort_values method provides a flexible way to sort a DataFrame by one or more columns. In this guide, we'll take an in-depth look at how to use sort_values to sort your data in Pandas.

Why Sort Values in Pandas?

Sorting is a fundamental operation in data analysis and manipulation. Here are a few reasons why you might want to sort your data:

  1. Making data more readable and understandable by putting it in a logical order
  2. Finding the top or bottom values in a dataset
  3. Preparing data for further analysis or visualization
  4. Grouping related data together

Pandas provides powerful methods like sort_values to make sorting quick and easy. Let's dive into the details of how it works.

The sort_values Method

The basic syntax for sorting a DataFrame with sort_values is:

df.sort_values(by, ascending=True, inplace=False)

The key parameters are:

  • by: The column or list of columns to sort by
  • ascending: Whether to sort in ascending (True) or descending (False) order
  • inplace: Whether to modify the original DataFrame (True, in which case the method returns None) or return a new sorted DataFrame (False)

By default, sort_values will:

  • Sort in ascending order
  • Return a new sorted DataFrame rather than modifying the original
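
As a quick check of these defaults, here is a minimal sketch. The three-row DataFrame is made up for illustration; it compares the default call with inplace=True:

```python
import pandas as pd

# Small illustrative DataFrame (values are made up for this sketch)
df = pd.DataFrame({"name": ["John", "Alice", "Bob"], "age": [25, 35, 30]})

# Default call: ascending order, returned as a new DataFrame
sorted_copy = df.sort_values("age")
print(df["age"].tolist())           # the original row order is untouched
print(sorted_copy["age"].tolist())  # the copy is in ascending order

# inplace=True reorders df itself; the call returns None
result = df.sort_values("age", inplace=True)
print(result)
print(df["age"].tolist())
```

Because the in-place call returns None, writing df = df.sort_values(..., inplace=True) would replace your DataFrame with None, so use one style or the other.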

Sorting by a Single Column

The most basic use of sort_values is to sort a DataFrame by a single column. Here's an example:

import pandas as pd

data = {'name': ['John', 'Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35, 28],
        'income': [50000, 60000, 80000, 70000]}

df = pd.DataFrame(data)

print(df)

Output:

      name  age  income
0     John   25   50000
1    Alice   30   60000
2      Bob   35   80000
3  Charlie   28   70000

To sort by the 'age' column in ascending order:

df_sorted = df.sort_values('age')

print(df_sorted)

Output:

      name  age  income
0     John   25   50000
3  Charlie   28   70000
1    Alice   30   60000
2      Bob   35   80000

The DataFrame is now sorted from youngest to oldest.

To sort in descending order, set ascending=False:

df_sorted = df.sort_values('age', ascending=False)

print(df_sorted)

Output:

      name  age  income
2      Bob   35   80000
1    Alice   30   60000
3  Charlie   28   70000
0     John   25   50000

Now the DataFrame is sorted from oldest to youngest.
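
Notice that the index labels in the outputs above travel with their rows (2, 1, 3, 0) rather than being renumbered. If you would rather have a fresh 0..n-1 index, one option is the ignore_index parameter of sort_values. Here is a small sketch using the same names and ages:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Alice", "Bob", "Charlie"],
                   "age": [25, 30, 35, 28]})

# By default each row keeps its original index label
by_age = df.sort_values("age", ascending=False)
print(by_age.index.tolist())

# ignore_index=True relabels the sorted rows 0, 1, 2, ...
renumbered = df.sort_values("age", ascending=False, ignore_index=True)
print(renumbered.index.tolist())
```

Calling reset_index(drop=True) on the sorted result has the same effect if you prefer to do it as a separate step.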

Sorting by Multiple Columns

You can also sort by multiple columns by passing a list of column names to the by parameter. The sorting is done in order, so the first column is used first, then rows with matching values in the first column are sorted by the next column, and so on.

For example, to sort by 'age' and then by 'income':

df_sorted = df.sort_values(['age', 'income'])

print(df_sorted)

Output:

      name  age  income
0     John   25   50000
3  Charlie   28   70000
1    Alice   30   60000
2      Bob   35   80000

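The DataFrame is first sorted by 'age'; 'income' only breaks ties between rows that share the same age. Every age in this sample is unique, so the result is the same as sorting by 'age' alone.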

You can specify ascending/descending order for each column individually by passing a list to ascending:

df_sorted = df.sort_values(['age', 'income'], ascending=[True, False])

print(df_sorted)

Output:

      name  age  income
0     John   25   50000
3  Charlie   28   70000
1    Alice   30   60000
2      Bob   35   80000

Here the DataFrame is sorted by ascending age and, within each age group, by descending income. With no repeated ages in this sample, the output does not change.
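
The secondary column only matters when the first column has ties. To see it in action, here is a sketch with a hypothetical DataFrame (the names and numbers are invented) in which ages repeat:

```python
import pandas as pd

# Hypothetical data in which two people share each age
people = pd.DataFrame({
    "name": ["Dana", "Eli", "Fay", "Gus"],
    "age": [30, 25, 30, 25],
    "income": [60000, 52000, 75000, 48000],
})

# Ascending age, ties broken by ascending income
asc = people.sort_values(["age", "income"])
print(asc["name"].tolist())    # ['Gus', 'Eli', 'Dana', 'Fay']

# Ascending age, ties broken by descending income
mixed = people.sort_values(["age", "income"], ascending=[True, False])
print(mixed["name"].tolist())  # ['Eli', 'Gus', 'Fay', 'Dana']
```

The age groups stay in the same order in both results; only the order inside each group flips.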

Sorting Different Data Types

sort_values can handle different data types like strings, numbers, and dates. Here are a few examples:

Strings:

data = {'fruit': ['apple', 'banana', 'orange', 'grape']}
df = pd.DataFrame(data)

df_sorted = df.sort_values('fruit')

print(df_sorted)

Output:

    fruit
0   apple
1  banana
3   grape
2  orange
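
One thing to watch with strings: the comparison is case-sensitive, so uppercase letters sort before lowercase ones. If that is not what you want, the key parameter of sort_values (available in pandas 1.1 and later) lets you transform the values used for comparison. A minimal sketch with invented fruit names of mixed case:

```python
import pandas as pd

# Mixed-case values, invented for this sketch
fruits = pd.DataFrame({"fruit": ["banana", "Cherry", "apple", "Date"]})

# Plain string sort: capitalized entries come first
default_order = fruits.sort_values("fruit")
print(default_order["fruit"].tolist())  # ['Cherry', 'Date', 'apple', 'banana']

# key receives the whole column as a Series and must return a Series
folded = fruits.sort_values("fruit", key=lambda s: s.str.lower())
print(folded["fruit"].tolist())         # ['apple', 'banana', 'Cherry', 'Date']
```

The key function only affects the comparison; the values stored in the DataFrame keep their original casing.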

Dates:

data = {'date': ['2023-01-15', '2023-02-10', '2023-01-01', '2023-03-05']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])

df_sorted = df.sort_values('date')

print(df_sorted)

Output:

        date
2 2023-01-01
0 2023-01-15
1 2023-02-10
3 2023-03-05

Sorting Missing Values (NaNs)

By default, missing values are sorted to the end of the DataFrame regardless of the ascending/descending order. You can change this behavior with the na_position parameter:

data = {'name': ['John', 'Alice', 'Bob', None],
        'age': [25, None, 35, 28]}

df = pd.DataFrame(data)

df_sorted = df.sort_values('age', na_position='first')

print(df_sorted)

Output:

    name   age
1  Alice   NaN
0   John  25.0
3   None  28.0
2    Bob  35.0

Setting na_position='first' puts the missing values at the beginning instead of the end.
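
To confirm that the placement of missing values does not depend on the sort direction, here is a short sketch that sorts in descending order with and without na_position. The data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Alice", "Bob", "Charlie"],
                   "age": [25, None, 35, 28]})

# Descending sort: the NaN row still lands at the end
desc = df.sort_values("age", ascending=False)
print(desc["name"].tolist())        # ['Bob', 'Charlie', 'John', 'Alice']

# Same sort with the NaN row moved to the front
desc_first = df.sort_values("age", ascending=False, na_position="first")
print(desc_first["name"].tolist())  # ['Alice', 'Bob', 'Charlie', 'John']
```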

Performance Considerations

For small to medium sized DataFrames, sort_values is quite fast. But for very large DataFrames with millions of rows, sorting can be a costly operation in terms of time and memory.

If you only need the top or bottom N rows, consider using nlargest or nsmallest instead of sorting the entire DataFrame:

top5 = df.nlargest(5, 'age')
bottom5 = df.nsmallest(5, 'income')

When N is small relative to the number of rows, this is typically faster than sorting the entire DataFrame and then taking a slice.
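
On a small DataFrame you can check that the two approaches select the same rows. The sketch below reuses the four-person example from earlier; it only shows that the results match and does not measure speed:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Alice", "Bob", "Charlie"],
                   "age": [25, 30, 35, 28],
                   "income": [50000, 60000, 80000, 70000]})

# Two oldest people, selected directly
top2 = df.nlargest(2, "age")

# The same rows via a full sort followed by a slice
via_sort = df.sort_values("age", ascending=False).head(2)

print(top2["name"].tolist())      # ['Bob', 'Alice']
print(via_sort["name"].tolist())  # ['Bob', 'Alice']

# Two lowest incomes
bottom2 = df.nsmallest(2, "income")
print(bottom2["name"].tolist())   # ['John', 'Alice']
```

To see the speed difference on your own data, time both versions (for example with timeit) on a DataFrame of realistic size.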

How Sorting Works in Pandas

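Under the hood, Pandas relies on NumPy's sorting routines. Which algorithm is used is controlled by the kind parameter of sort_values:

  • kind='quicksort' (the default) is not stable; NumPy implements it as an introsort
  • kind='mergesort' or kind='stable' is stable; NumPy picks timsort or radix sort depending on the data type
  • kind='heapsort' is also available and is not stable

The kind parameter is only applied when sorting by a single column. These routines are highly optimized for performance in NumPy.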
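
The choice of algorithm matters most when you care about ties. sort_values accepts a kind argument, and a stable choice such as kind='mergesort' keeps tied rows in their original relative order. A minimal sketch with made-up scores:

```python
import pandas as pd

# Made-up scores with repeated values
scores = pd.DataFrame({"player": ["a", "b", "c", "d", "e"],
                       "score": [2, 1, 2, 1, 2]})

# A stable sort keeps tied rows in their original relative order
stable = scores.sort_values("score", kind="mergesort")
print(stable["player"].tolist())  # ['b', 'd', 'a', 'c', 'e']
```

With the default kind the tied rows may come out in the same order, but that is not guaranteed, so ask for a stable sort whenever the order of ties is meaningful.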

When you call sort_values on a DataFrame, Pandas performs these high-level steps:

  1. Extract the columns to sort by into NumPy arrays
  2. Compute the row order that sorts those arrays (an argsort) using NumPy's sorting algorithms
  3. Use that order to rearrange the rows of the DataFrame, index labels included

This is a simplification, but it captures the key ideas. By leveraging NumPy, Pandas is able to sort large amounts of data very efficiently.
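
The steps above can be imitated by hand for a single column. This is only a conceptual sketch built from np.argsort and iloc, not the code Pandas actually runs internally:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["John", "Alice", "Bob", "Charlie"],
                   "age": [25, 30, 35, 28]})

# Step 1: pull the sort column out as a NumPy array
keys = df["age"].to_numpy()

# Step 2: compute the positions that would put the array in order
order = np.argsort(keys, kind="stable")
print(order.tolist())  # [0, 3, 1, 2]

# Step 3: reorder the rows by position
manual = df.iloc[order]

# The hand-rolled result matches sort_values with a stable algorithm
print(manual.equals(df.sort_values("age", kind="mergesort")))
```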

Alternatives to sort_values

While sort_values is the main way to sort a DataFrame, there are a few alternatives worth mentioning.

To sort a DataFrame by its index instead of its columns, you can use sort_index:

df_sorted = df.sort_index()

This is useful if your DataFrame has a meaningful index, like a DatetimeIndex for time series data.
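
As an illustration, here is a sketch of sort_index on a small time series. The dates and closing prices are hypothetical values chosen for the example:

```python
import pandas as pd

# Hypothetical closing prices, deliberately out of date order
prices = pd.DataFrame(
    {"close": [16.45, 16.13, 16.50]},
    index=pd.to_datetime(["2023-01-04", "2023-01-05", "2023-01-03"]),
)

# Oldest date first
chronological = prices.sort_index()
print(chronological["close"].tolist())  # [16.5, 16.45, 16.13]

# Newest date first
newest_first = prices.sort_index(ascending=False)
print(newest_first["close"].tolist())   # [16.13, 16.45, 16.5]
```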

If you just need the underlying data sorted, you can access the NumPy arrays directly and use np.sort:

import numpy as np

sorted_data = np.sort(df['age'].values)

This can be slightly faster than using sort_values, but you get back a bare NumPy array: the index labels and the other columns are lost.

Real-World Examples

Let's look at a couple of real-world datasets to see sorting in action.

Example 1: Stock Market Data

df = pd.read_csv('stock_prices.csv')
print(df.head())

Output:

         Date       Open       High        Low      Close  Adj Close    Volume
0  2023-01-03  16.389999  16.520000  16.360001  16.500000  16.500000  64539500
1  2023-01-04  16.420000  16.549999  16.370001  16.450001  16.450001  64245000
2  2023-01-05  16.299999  16.459999  16.100000  16.129999  16.129999  94519600
3  2023-01-06  16.299999  16.420000  16.180000  16.350000  16.350000  74002500
4  2023-01-09  16.610001  16.690001  16.500000  16.540001  16.540001  62439400

To sort by 'Date' in ascending order (the file is already in chronological order, so the first rows do not change):

df_sorted = df.sort_values('Date')
print(df_sorted.head())

Output:

         Date       Open       High        Low      Close  Adj Close    Volume
0  2023-01-03  16.389999  16.520000  16.360001  16.500000  16.500000  64539500
1  2023-01-04  16.420000  16.549999  16.370001  16.450001  16.450001  64245000
2  2023-01-05  16.299999  16.459999  16.100000  16.129999  16.129999  94519600
3  2023-01-06  16.299999  16.420000  16.180000  16.350000  16.350000  74002500
4  2023-01-09  16.610001  16.690001  16.500000  16.540001  16.540001  62439400
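
A caveat when sorting dates loaded from a CSV: read_csv keeps the column as text unless you ask it to parse dates. ISO-style strings such as 2023-01-03 happen to sort chronologically, but other formats do not. The sketch below uses a hypothetical in-memory CSV with month/day/year dates (the column names and values are invented):

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file with month/day/year dates
csv_text = "Date,Close\n10/3/2022,15.9\n1/15/2023,16.5\n2/10/2023,16.1\n"
raw = pd.read_csv(io.StringIO(csv_text))

# Sorting the text column compares character by character, not by date
as_text = raw.sort_values("Date")
print(as_text["Close"].tolist())   # [16.5, 15.9, 16.1]

# Convert to real datetimes first, then sort chronologically
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y")
as_dates = raw.sort_values("Date")
print(as_dates["Close"].tolist())  # [15.9, 16.5, 16.1]
```

Passing parse_dates=['Date'] to read_csv is another way to get a datetime column up front.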

Example 2: US Baby Names

df = pd.read_csv('us_baby_names.csv')
print(df.head())

Output:

    Name Gender State  Count  Year
0   Mary      F    AK     14  1910
1   Anna      F    AK     10  1910
2  Helen      F    AK      8  1910
3  Elsie      F    AK      6  1910
4   Lucy      F    AK      6  1910

To find the rows with the highest counts (each row is one name in one state and year, so this is not yet an overall ranking):

df_sorted = df.sort_values('Count', ascending=False)
print(df_sorted.head())

Output:

          Name Gender State  Count  Year
66090    James      M    CA   5082  1947
66085   Robert      M    CA   5054  1947
66098     John      M    CA   4917  1947
66094  William      M    CA   3749  1947
66096  Richard      M    CA   3379  1947
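
Sorting by 'Count' ranks individual rows, and each row covers a single state and year. For a ranking across the whole dataset you would usually add the counts up per name first and then sort the totals. A sketch with a few hypothetical rows (not taken from the real file):

```python
import pandas as pd

# Hypothetical rows in the same shape as the baby names data
names = pd.DataFrame({
    "Name": ["James", "James", "Robert", "Robert", "John"],
    "State": ["CA", "NY", "CA", "NY", "CA"],
    "Count": [50, 40, 60, 10, 80],
})

# The single largest row belongs to John...
print(names.sort_values("Count", ascending=False).iloc[0]["Name"])

# ...but summing per name before sorting puts James on top
totals = names.groupby("Name")["Count"].sum().sort_values(ascending=False)
print(totals.index.tolist())  # ['James', 'John', 'Robert']
```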

These examples demonstrate how sorting can help you quickly gain insights from real-world datasets.

Conclusion

Sorting is a core operation in data analysis, and Pandas makes it easy with the versatile sort_values method. Whether you're exploring a dataset for the first time or preparing data for machine learning, knowing how to effectively sort your data is an essential skill.

In this guide, we've covered:

  • Why sorting is useful in data analysis
  • The key parameters of sort_values like by, ascending, and na_position
  • How to sort by single or multiple columns
  • Examples of sorting different data types
  • Performance considerations for large datasets
  • How sorting works under the hood in Pandas
  • Alternatives to sort_values like sort_index and np.sort
  • Real-world examples of sorting with stock market and baby name data

Armed with this knowledge, you're well-equipped to tackle sorting in your own Pandas projects. So go forth and sort!
