Master "df sort by column" Like a Pro: The Ultimate Guide to Sorting df Output by Column

Managing data effectively often requires organizing information in a specific order, and the Linux command line provides a straightforward way to do this for disk usage reports: sorting `df` output by column. The `df` command reports file system disk space usage, while `sort` orders lines of text. Combining these utilities allows system administrators to take raw disk usage statistics and display them in a more meaningful sequence, such as by usage percentage or by mount point. This technique turns a simple list of mounted file systems into an actionable overview that draws attention to the most critical volumes first.

Understanding the Default Behavior of df

Before manipulating the output, it is essential to understand what `df` produces by default. Running the command without arguments generates a table that includes the file system name, total size, used space, available space, usage percentage, and the mount point. The columns are separated by spaces, but the exact number of spaces can vary depending on the system and the length of the device names. This irregular spacing is important to remember when constructing parsing commands, as it means the data is not delimited by a single, consistent character like a comma.
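As a small illustration of why the uneven padding is manageable, the sketch below runs `awk` over a captured report. The sample rows are hypothetical, written to resemble the one-line-per-file-system layout of `df -P`, and `show_usage_and_mount` is an illustrative helper name rather than a standard command.

```shell
# Hypothetical df-style report with uneven padding (illustration only).
sample='Filesystem           1024-blocks      Used Available Capacity Mounted on
/dev/sda1               41152736  37037460   2001844      95% /
/dev/mapper/vg0-home   103081248   9277312  88544672      10% /home'

# awk splits fields on runs of blanks by default, so the width of the
# padding does not change which field number a column has.
show_usage_and_mount() {
  awk 'NR > 1 { print $5, $6 }'   # skip the header, print fields 5 and 6
}

printf '%s\n' "$sample" | show_usage_and_mount
```

On a live system the same helper could be fed from `df -P` instead of the sample variable.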

Sorting by the Usage Percentage Column

The most common requirement is to sort `df` output by usage percentage to identify which drives are closest to full. Since the percentage column usually appears as the fifth field, the command `df | sort -k5 -n` is frequently employed. The `-k5` flag starts the sort key at the fifth field, and the `-n` flag compares the leading number in that key numerically rather than as a string, so the trailing `%` is ignored. Writing the key as `-k5,5` limits it to the fifth field alone instead of letting it run to the end of the line. Without the numeric flag, "100%" would sort before "90%" because string sorting compares characters sequentially, leading to incorrect results for percentage values.
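A minimal sketch of the numeric sort on the fifth field follows. It works on hypothetical sample rows (header omitted) so the result is reproducible, and `sort_by_usage` is an illustrative name. The key is written as `-k5,5n`, which restricts the key to field 5 and applies numeric comparison to it.

```shell
# Hypothetical df data rows, header omitted (illustration only).
sample='/dev/sda1               41152736  37037460   2001844   95% /
tmpfs                    8123456         0   8123456    0% /dev/shm
/dev/mapper/vg0-home   103081248   9277312  88544672   10% /home
/dev/sdb1              515928320 515928320         0  100% /data'

# Sort on field 5 only (-k5,5) and compare it as a number (n).
# Numeric comparison reads the leading digits and ignores the "%".
sort_by_usage() {
  sort -k5,5n
}

printf '%s\n' "$sample" | sort_by_usage
```

The rows come out from the emptiest file system to the fullest, with the 100% row last.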

Handling the Header Row

A challenge with sorting the output of `df` is that the first line is a header describing the columns. `sort` treats the header like any other line, so it is moved along with the data. Under `-n`, its non-numeric `Use%` field counts as zero, which places it among the emptiest file systems in ascending order and at the very bottom in descending order. To solve this, users can extract the header separately and concatenate it with the sorted data. This involves using `head -n 1` to grab the first line and `tail -n +2` to skip it, ensuring the header remains at the top of the report regardless of the sort order applied to the subsequent lines.
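The header handling described above can be sketched as a small shell function. The sample report is hypothetical, `sort_keep_header` is an illustrative name, and the function assumes the usage percentage is field 5.

```shell
# Hypothetical df-style report including its header (illustration only).
sample='Filesystem 1024-blocks      Used Available Capacity Mounted on
/dev/sda1     41152736  37037460   2001844      95% /
tmpfs          8123456         0   8123456       0% /dev/shm
/dev/sdc1    103081248   9277312  88544672      10% /home'

# Read the whole report once, print the header unchanged,
# then sort only the remaining rows numerically on field 5.
sort_keep_header() {
  report=$(cat)
  printf '%s\n' "$report" | head -n 1
  printf '%s\n' "$report" | tail -n +2 | sort -k5,5n
}

printf '%s\n' "$sample" | sort_keep_header
```

Reading the input into a variable first means `df` only has to run once, so the header and the rows come from the same snapshot. On a live system the function could be used as `df -P | sort_keep_header`.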

Sorting in Reverse Order for Critical Alerts

For monitoring scripts or quick health checks, sorting in descending order is usually more practical than ascending order. By default, `sort` arranges numbers from smallest to largest, which would list the emptiest drives first. To reverse this and highlight the drives closest to capacity, the `-r` flag is added to the command. The combination `df | sort -k5 -n -r` lists the fullest volumes first, allowing administrators to see which ones require expansion or cleanup without scanning the entire list manually.
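A short sketch of the descending sort follows, again on hypothetical sample data. The helper name `fullest_first` is illustrative. It drops the header with `tail -n +2` before sorting, since a reversed numeric sort would otherwise push the header to the bottom.

```shell
# Hypothetical df-style report including its header (illustration only).
sample='Filesystem 1024-blocks      Used Available Capacity Mounted on
/dev/sda1     41152736  37037460   2001844      95% /
tmpfs          8123456         0   8123456       0% /dev/shm
/dev/sdc1    103081248   9277312  88544672      10% /home
/dev/sdb1    515928320 515928320         0     100% /data'

# Drop the header, then sort field 5 numerically (n) in reverse (r),
# so the fullest file systems are listed first.
fullest_first() {
  tail -n +2 | sort -k5,5nr
}

printf '%s\n' "$sample" | fullest_first
```

Appending `| head -n 3` to the pipeline would limit the report to the three fullest file systems.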

Sorting by Mount Point for Organization

While monitoring usage is critical, organizing the output alphabetically by mount point can be beneficial for documentation and inventory purposes. If you need to sort the `df` output by mount point, which is typically the last (sixth) column, you can use `df | sort -k6`. This is particularly useful when managing servers with a complex directory structure. Alphabetical sorting helps verify that specific mount points are present and confirms their associated device names, which is helpful during audits or troubleshooting sessions.
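The mount point sort can be sketched as follows, using hypothetical sample rows with the header omitted. Two details go beyond the plain `sort -k6` and are choices of this sketch: `-b` ignores the blanks in front of the key so that uneven padding does not affect the order, and `LC_ALL=C` forces byte-wise comparison so the result does not depend on the system locale. The name `sort_by_mount` is illustrative.

```shell
# Hypothetical df data rows, header omitted (illustration only).
sample='/dev/sdc1    103081248   9277312  88544672   10%   /home
/dev/sda1     41152736  37037460   2001844   95% /
/dev/sdb1    515928320 515928320         0  100%  /data
tmpfs          8123456         0   8123456    0% /dev/shm'

# Sort on everything from field 6 onward, ignoring leading blanks (-b),
# with plain byte ordering (LC_ALL=C) for a locale-independent result.
sort_by_mount() {
  LC_ALL=C sort -b -k6
}

printf '%s\n' "$sample" | sort_by_mount
```

Leaving the key open-ended (`-k6` rather than `-k6,6`) keeps a mount point that contains spaces together as a single key.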

Dealing with Variable Whitespace

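One technical nuance when working with `df` output is the inconsistent whitespace between columns. By default, `sort` and `awk` both split fields on runs of blanks, so padding of varying width does not by itself change the field numbers. What does shift them is a change in the number of columns, such as the extra type column printed by GNU `df -T`, or a long device name that makes some older `df` versions wrap a row onto a second line (the POSIX format, `df -P`, keeps each file system on one line). A more robust approach counts fields from the end. `sort` keys can only be counted from the start of the line, but `awk` exposes the last field as `$NF` and the one before it as `$(NF-1)`. Since the mount point is normally the last column and the usage percentage the second-to-last, keying on those positions keeps the parsing logic working when columns are added on the left. The assumption breaks for mount points that contain spaces.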
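One way to count from the end is a decorate, sort, undecorate pipeline: `awk` copies the second-to-last field to the front of each row, `sort` orders the rows by that prefix, and `cut` strips it again. This is a sketch under stated assumptions, not the only approach. The sample is hypothetical and includes a type column as printed by `df -T`, the name `sort_by_second_to_last` is illustrative, and the method assumes mount points without spaces.

```shell
# Hypothetical df -T style report: the extra Type column moves the
# percentage to field 6, but it is still the second-to-last field.
sample='Filesystem Type  1024-blocks      Used Available Capacity Mounted on
/dev/sda1  ext4     41152736  37037460   2001844      95% /
tmpfs      tmpfs     8123456         0   8123456       0% /dev/shm
/dev/sdb1  xfs     515928320 515928320         0     100% /data'

# Decorate: prefix each data row with its second-to-last field and a tab.
# Sort: numeric sort reads the leading number of each decorated line.
# Undecorate: cut removes the prefix (tab is the default delimiter).
sort_by_second_to_last() {
  awk 'NR > 1 { print $(NF-1) "\t" $0 }' | sort -n | cut -f2-
}

printf '%s\n' "$sample" | sort_by_second_to_last
```

The same function works unchanged on six-column output, because the key is located relative to the end of each row rather than by a fixed field number.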