pandas.DataFrame.sort_values – How To Sort Values in Pandas
When working with data in Pandas, sorting is an essential task that allows you to organize and analyze your data effectively. The pandas.DataFrame.sort_values method provides a flexible way to sort a DataFrame by one or more columns. In this guide, we'll take an in-depth look at how to use sort_values to sort your data in Pandas.
Why Sort Values in Pandas?
Sorting is a fundamental operation in data analysis and manipulation. Here are a few reasons why you might want to sort your data:
- Making data more readable and understandable by putting it in a logical order
- Finding the top or bottom values in a dataset
- Preparing data for further analysis or visualization
- Grouping related data together
Pandas provides powerful methods like sort_values to make sorting quick and easy. Let's dive into the details of how it works.
The sort_values Method
The basic syntax for sorting a DataFrame with sort_values is:
df.sort_values(by, ascending=True, inplace=False)
The key parameters are:
- by: The column or list of columns to sort by
- ascending: Whether to sort in ascending (True) or descending (False) order
- inplace: Whether to modify the original DataFrame (True, in which case the method returns None) or return a new sorted DataFrame (False)
By default, sort_values will:
- Sort in ascending order
- Return a new sorted DataFrame rather than modifying the original
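Here's a quick sketch that checks these defaults. The scores DataFrame and its column names are made up purely for illustration:

```python
import pandas as pd

# Hypothetical data, used only to illustrate the defaults
scores = pd.DataFrame({"player": ["a", "b", "c"], "score": [30, 10, 20]})

# Default call: ascending order, and a new DataFrame is returned
ranked = scores.sort_values("score")
print(ranked["score"].tolist())   # the sorted copy
print(scores["score"].tolist())   # the original, still in its old order

# inplace=True reorders the original and returns None
result = scores.sort_values("score", inplace=True)
print(result)
```

Because inplace=True returns None, writing scores = scores.sort_values(..., inplace=True) would replace your DataFrame with None, so use one style or the other.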
Sorting by a Single Column
The most basic use of sort_values is to sort a DataFrame by a single column. Here's an example:
import pandas as pd
data = {'name': ['John', 'Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35, 28],
        'income': [50000, 60000, 80000, 70000]}
df = pd.DataFrame(data)
print(df)
Output:
      name  age  income
0     John   25   50000
1    Alice   30   60000
2      Bob   35   80000
3  Charlie   28   70000
To sort by the 'age' column in ascending order:
df_sorted = df.sort_values('age')
print(df_sorted)
Output:
      name  age  income
0     John   25   50000
3  Charlie   28   70000
1    Alice   30   60000
2      Bob   35   80000
The DataFrame is now sorted from youngest to oldest.
To sort in descending order, set ascending=False:
df_sorted = df.sort_values('age', ascending=False)
print(df_sorted)
Output:
      name  age  income
2      Bob   35   80000
1    Alice   30   60000
3  Charlie   28   70000
0     John   25   50000
Now the DataFrame is sorted from oldest to youngest.
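Notice that the original index labels travel with their rows. If you would rather have a fresh 0..n-1 index after sorting, sort_values also accepts an ignore_index parameter (available in Pandas 1.0 and later). A minimal sketch, where df_demo is just an illustrative name for a copy of the data above:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Charlie"],
    "age": [25, 30, 35, 28],
})

# Default: the original index labels are kept
kept = df_demo.sort_values("age", ascending=False)

# ignore_index=True: the result is relabeled 0, 1, 2, ...
renumbered = df_demo.sort_values("age", ascending=False, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```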
Sorting by Multiple Columns
You can also sort by multiple columns by passing a list of column names to the by parameter. The sorting is done in order, so the first column is used first, then rows with matching values in the first column are sorted by the next column, and so on.
For example, to sort by 'age' and then by 'income':
df_sorted = df.sort_values(['age', 'income'])
print(df_sorted)
Output:
      name  age  income
0     John   25   50000
3  Charlie   28   70000
1    Alice   30   60000
2      Bob   35   80000
The DataFrame is first sorted by 'age'; 'income' is only used to break ties between rows with the same age. Every age in this example is unique, so the result matches the single-column sort.
You can specify ascending/descending order for each column individually by passing a list to ascending:
df_sorted = df.sort_values(['age', 'income'], ascending=[True, False])
print(df_sorted)
Output:
      name  age  income
0     John   25   50000
3  Charlie   28   70000
1    Alice   30   60000
2      Bob   35   80000
Here the DataFrame is sorted by ascending age, and any rows that share an age would be ordered by descending income. Because no ages repeat in this data, the output is unchanged.
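To see the second sort key actually take effect, you need rows that tie on the first key. Here's a small sketch with repeated ages; the people DataFrame and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical data with repeated ages so the second sort key matters
people = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dev"],
    "age": [30, 25, 30, 25],
    "income": [60000, 52000, 75000, 48000],
})

# Ascending age; within each age, descending income
ordered = people.sort_values(["age", "income"], ascending=[True, False])
print(ordered)
```

The two 25-year-olds come first, with the higher income (Ben) ahead of the lower (Dev), followed by the two 30-year-olds in the same high-to-low income order.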
Sorting Different Data Types
sort_values can handle different data types like strings, numbers, and dates. Here are a few examples:
Strings:
data = {'fruit': ['apple', 'banana', 'orange', 'grape']}
df = pd.DataFrame(data)
df_sorted = df.sort_values('fruit')
print(df_sorted)
Output:
    fruit
0   apple
1  banana
3   grape
2  orange
Dates:
data = {'date': ['2023-01-15', '2023-02-10', '2023-01-01', '2023-03-05']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
df_sorted = df.sort_values('date')
print(df_sorted)
Output:
        date
2 2023-01-01
0 2023-01-15
1 2023-02-10
3 2023-03-05
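The pd.to_datetime step matters most when dates are not written year-first. As a rough sketch (the events DataFrame and its day/month/year strings are made up), compare sorting the raw text with sorting the parsed dates:

```python
import pandas as pd

# Hypothetical dates written as day/month/year strings
events = pd.DataFrame({"date": ["15/01/2023", "01/03/2023", "10/02/2023"]})

# As plain strings, the values are compared character by character
as_text = events.sort_values("date")["date"].tolist()

# After conversion, the values are compared as real dates
events["parsed"] = pd.to_datetime(events["date"], format="%d/%m/%Y")
as_dates = events.sort_values("parsed")["date"].tolist()

print(as_text)
print(as_dates)
```

The text sort puts 1 March first because its string starts with "01", while the datetime sort gives the chronological order: January, February, March.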
Sorting Missing Values (NaNs)
By default, missing values are sorted to the end of the DataFrame regardless of the ascending/descending order. You can change this behavior with the na_position parameter:
data = {'name': ['John', 'Alice', 'Bob', None],
        'age': [25, None, 35, 28]}
df = pd.DataFrame(data)
df_sorted = df.sort_values('age', na_position='first')
print(df_sorted)
Output:
    name   age
1  Alice   NaN
0   John  25.0
3   None  28.0
2    Bob  35.0
Setting na_position='first' puts the missing values at the beginning instead of the end.
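For comparison, here's a minimal sketch of the default behavior, using the same age values in a DataFrame named ages for illustration. The missing value stays at the end whether you sort up or down:

```python
import pandas as pd

ages = pd.DataFrame({"age": [25, None, 35, 28]})

# Default na_position='last': NaN stays at the end in both directions
asc = ages.sort_values("age")
desc = ages.sort_values("age", ascending=False)

print(asc.index.tolist())
print(desc.index.tolist())
```

In both results, row 1 (the NaN) is the final entry.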
Performance Considerations
For small to medium sized DataFrames, sort_values is quite fast. But for very large DataFrames with millions of rows, sorting can be a costly operation in terms of time and memory.
If you only need the top or bottom N rows, consider using nlargest or nsmallest instead of sorting the entire DataFrame. The calls below assume a DataFrame with 'age' and 'income' columns, like the one in the first example:
top5 = df.nlargest(5, 'age')
bottom5 = df.nsmallest(5, 'income')
This avoids a full sort, which is typically faster when N is small compared to the number of rows.
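As a quick check that the two approaches agree, here's a sketch on the small name/age data from earlier (staff is just an illustrative variable name):

```python
import pandas as pd

staff = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Charlie"],
    "age": [25, 30, 35, 28],
})

# Two ways to get the two oldest people
top2 = staff.nlargest(2, "age")
top2_via_sort = staff.sort_values("age", ascending=False).head(2)

print(top2)
print(top2.equals(top2_via_sort))
```

On data this small there is no meaningful speed difference; any benefit only shows up on large DataFrames, so it is worth timing both on your own data.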
How Sorting Works in Pandas
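Under the hood, Pandas relies on NumPy's sorting routines to order the underlying data. Specifically:
- sort_values accepts a kind parameter ('quicksort', 'mergesort', 'heapsort', or 'stable') that selects the NumPy algorithm; the default is 'quicksort', and for DataFrames the option only applies when sorting by a single column
- 'mergesort' and 'stable' are the only stable choices, meaning rows with equal keys keep their original relative order
These routines run in compiled code and are highly optimized for performance.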
When you call sort_values on a DataFrame, Pandas performs these high-level steps:
- Extract the columns to sort by into NumPy arrays
- Compute the order of row positions that sorts those arrays, using NumPy's sorting routines
- Use that order to rearrange the rows in the DataFrame
This is a simplification, but it captures the key ideas. By leveraging NumPy, Pandas is able to sort large amounts of data very efficiently.
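The steps above can be sketched roughly as follows. This illustrates the idea with np.argsort and iloc; it is not the actual Pandas internals, and df_demo is a made-up name:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Charlie"],
    "age": [25, 30, 35, 28],
})

# 1. Pull the sort column out as a NumPy array
ages = df_demo["age"].to_numpy()

# 2. Ask NumPy for the row positions that would sort it
order = np.argsort(ages, kind="stable")

# 3. Reorder the rows by position
manual = df_demo.iloc[order]

# Compare with the built-in method
print(manual.equals(df_demo.sort_values("age", kind="stable")))
```

The key point is that the data is never sorted column by column. A single ordering of row positions is computed and then applied to every column at once, which keeps each row intact.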
Alternatives to sort_values
While sort_values is the main way to sort a DataFrame, there are a few alternatives worth mentioning.
To sort a DataFrame by its index instead of its columns, you can use sort_index:
df_sorted = df.sort_index()
This is useful if your DataFrame has a meaningful index, like a DatetimeIndex for time series data.
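Here's a minimal sketch of that time series case, with a few made-up closing prices whose rows arrive out of order:

```python
import pandas as pd

# Hypothetical prices indexed by date, deliberately out of order
prices = pd.DataFrame(
    {"close": [101.5, 99.0, 100.2]},
    index=pd.to_datetime(["2023-01-05", "2023-01-03", "2023-01-04"]),
)

# Sort rows by their index labels (the dates)
chronological = prices.sort_index()
print(chronological)
```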
If you just need the underlying data sorted, you can access the NumPy array directly and use np.sort:
import numpy as np
sorted_data = np.sort(df['age'].to_numpy())
This can be slightly faster than using sort_values, but you get back a bare array, so the index and the other columns are lost.
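To make the trade-off concrete, here's a short sketch on the name/age data from earlier (df_demo is an illustrative name):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Charlie"],
    "age": [25, 30, 35, 28],
})

# np.sort returns a plain NumPy array of the sorted values
sorted_ages = np.sort(df_demo["age"].to_numpy())
print(type(sorted_ages).__name__)
print(sorted_ages.tolist())

# sort_values keeps each row together, so the names follow their ages
names_by_age = df_demo.sort_values("age")["name"].tolist()
print(names_by_age)
```

With np.sort you can no longer tell which age belonged to which person, so it only fits cases where the values alone are what you need.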
Real-World Examples
Let's look at a couple of real-world datasets to see sorting in action.
Example 1: Stock Market Data
df = pd.read_csv('stock_prices.csv')
print(df.head())
Output:
         Date       Open       High        Low      Close  Adj Close    Volume
0  2023-01-03  16.389999  16.520000  16.360001  16.500000  16.500000  64539500
1  2023-01-04  16.420000  16.549999  16.370001  16.450001  16.450001  64245000
2  2023-01-05  16.299999  16.459999  16.100000  16.129999  16.129999  94519600
3  2023-01-06  16.299999  16.420000  16.180000  16.350000  16.350000  74002500
4  2023-01-09  16.610001  16.690001  16.500000  16.540001  16.540001  62439400
To sort by 'Date' in ascending order:
df_sorted = df.sort_values('Date')
print(df_sorted.head())
Output:
         Date       Open       High        Low      Close  Adj Close    Volume
0  2023-01-03  16.389999  16.520000  16.360001  16.500000  16.500000  64539500
1  2023-01-04  16.420000  16.549999  16.370001  16.450001  16.450001  64245000
2  2023-01-05  16.299999  16.459999  16.100000  16.129999  16.129999  94519600
3  2023-01-06  16.299999  16.420000  16.180000  16.350000  16.350000  74002500
4  2023-01-09  16.610001  16.690001  16.500000  16.540001  16.540001  62439400
The first rows match the unsorted output because they were already in chronological order. The Date column is read as text here, which still sorts correctly because dates written as YYYY-MM-DD order the same way as strings and as dates.
Example 2: US Baby Names
df = pd.read_csv('us_baby_names.csv')
print(df.head())
Output:
    Name  Gender  State  Count  Year
0   Mary       F     AK     14  1910
1   Anna       F     AK     10  1910
2  Helen       F     AK      8  1910
3  Elsie       F     AK      6  1910
4   Lucy       F     AK      6  1910
To find the records with the highest counts, that is, the most popular name in a single state and year:
df_sorted = df.sort_values('Count', ascending=False)
print(df_sorted.head())
Output:
          Name  Gender  State  Count  Year
66090    James       M     CA   5082  1947
66085   Robert       M     CA   5054  1947
66098     John       M     CA   4917  1947
66094  William       M     CA   3749  1947
66096  Richard       M     CA   3379  1947
These examples demonstrate how sorting can help you quickly gain insights from real-world datasets.
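Note that sorting the raw rows ranks individual name/state/year records. If you want to rank names by their total count across states and years, one option is to add up the counts per name before sorting. A minimal sketch on a few made-up rows that reuse the column names shown above:

```python
import pandas as pd

# Made-up rows using the same column names as the dataset above
names = pd.DataFrame({
    "Name": ["James", "Mary", "James", "Mary", "John"],
    "State": ["CA", "CA", "NY", "NY", "CA"],
    "Count": [500, 450, 300, 400, 600],
    "Year": [1947, 1947, 1947, 1947, 1947],
})

# Total the counts per name, then sort the totals from high to low
totals = names.groupby("Name")["Count"].sum().sort_values(ascending=False)
print(totals)
```

Here John has the largest single record (600), but Mary comes out on top once her two records are added together (850).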
Conclusion
Sorting is a core operation in data analysis, and Pandas makes it easy with the versatile sort_values method. Whether you're exploring a dataset for the first time or preparing data for machine learning, knowing how to effectively sort your data is an essential skill.
In this guide, we've covered:
- Why sorting is useful in data analysis
- The key parameters of sort_values like by, ascending, and na_position
- How to sort by single or multiple columns
- Examples of sorting different data types
- Performance considerations for large datasets
- How sorting works under the hood in Pandas
- Alternatives to sort_values like sort_index and np.sort
- Real-world examples of sorting with stock market and baby name data
Armed with this knowledge, you're well-equipped to tackle sorting in your own Pandas projects. So go forth and sort!