This post describes the purpose and how to use SQL window functions. If you have looked up window functions in the official documentation, you might have noticed that it can be quite difficult to understand, but I am here to clarify a few things and help you make sense of it. You will be using these almost daily before you know it.

Let’s start with PostgreSQL‘s documentation definition ad it is one of the better definitions:


window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

PostgreSQL Documentation

Instead of covering more technical definitions, let’s just dive into some use cases.

For the examples outlined below, we will be using this short little table of a companies sales team.

EmployeeRegionPeriodSales
John DavisWestJanuary $          8,414
Track JohnsonEastJanuary $        10,033
Steve DiasEastJanuary $        11,038
Jack DorsoWestJanuary $        13,841
Frost MillerEastJanuary $          9,466
Dina GillianWestJanuary $          9,262
Lisa FortWestJanuary $          7,354
Harry FordEastJanuary $        14,403
Ben OakleyWestJanuary $          7,422
John DavisWestFebruary $        10,573
Track JohnsonEastFebruary $          7,260
Steve DiasEastFebruary $          8,154
Jack DorsoWestFebruary $        14,392
Frost MillerEastFebruary $        13,520
Dina GillianWestFebruary $        14,212
Lisa FortWestFebruary $          9,028
Harry FordEastFebruary $          7,273
Ben OakleyWestFebruary $        13,843

Rank Sales Team by Sales in Each Region

Who is the best performing seller for each region? This is a question a manager might ask on a monthly basis, or maybe more frequently. Let’s take a look at how we could accomplish this for the month of February.

SELECT
    employee,
    region,
    sales,
    RANK() OVER(PARTITION BY region ORDER BY sales DESC) as sales_rank
FROM sell
WHERE period = 'February';

The above query, if it works correctly, should show us all the employees along with their respective region and sales numbers, then give us a number for how they rank within a specific region. Here is what the output would look like:

employeeregionsalessales_rank
Frost MillerEast135201
Steve DiasEast81542
Track JohnsonEast72603
Harry FordEast71734
Jack DorsoWest143921
Dina GillianWest142122
Ben OakleyWest138433
John DavisWest105734
Lisa FortWest90285

As you can see, Frost Miller and Jack Dorso were both the top selling employees in their regions for the month of February.

What Are The Sales of Each Employee as a Percentage of Regional Sales?

This is a question that most sales leads would ask themselves, and there is nothing fancy about it. Getting this information should be pretty straight forward. We are first going to look at what this might look like if we were to NOT use window functions.

SELECT
    s.employee,
    s.sales,
    s.region,
    s.sales / r.region_sales as percent_of_region
FROM sell s
LEFT JOIN (SELECT
               region,
               SUM(sales) as region_sales
		    FROM sell
            GROUP BY region) r ON s.region = r.region
ORDER BY percent_of_region DESC;

This is obviously a very simple example, but you should get the idea of how quickly this query can grow if we were to add any more logic to it. Below is an example of what this query would look like if we employed a window function instead.

SELECT
    employee,
    region,
    sales,
    sales / SUM(sales) OVER(PARTITION BY region) as percent_of_region
FROM sell
ORDER BY percent_of_region DESC;
employeeregionsalespercent_of_region
Harry FordEast144030.1777116981504559
Frost MillerEast135200.16681678532209707
Steve DiasEast110380.13619257961429787
Jack DorsoWest143920.13283982979666054
Dina GillianWest142120.13117840891260002

How Does Each Salesperson Compare to Last Month?

Another question a sales manager might ask is how his team compares to the previous month? This is virtually impossible to answer without using window functions.

To answer this question, we are going to utilize the LAG function.

SELECT
    employee,
    period,
    sales,
    sales - LAG(sales) OVER(PARTITION BY employee ORDER BY period DESC) as month_over_month
FROM sell;

Running this function produces a null for the first value per salesperson (in this case for January). In order to eliminate these, we simply wrap this function in another SELECT statement and ask for the rows that are not null.

SELECT * FROM (
  SELECT
      employee,
      period,
      sales,
      sales - LAG(sales) OVER(PARTITION BY employee ORDER BY period DESC) as month_over_month
  FROM sell) f
WHERE month_over_month is not null;

The resulting output looks like this (after I tacked on an ‘ORDER BY month_over_month DESC’):

employeeperiodsalesmonth_over_month
Ben OakleyFebruary138436421
Dina GillianFebruary142124950
Frost MillerFebruary135204054
John DavisFebruary105732159
Lisa FortFebruary90281674
Jack DorsoFebruary14392551
Track JohnsonFebruary7260-2773
Steve DiasFebruary8154-2884
Harry FordFebruary7173-7230

Functions To Use With Window Functions

There are two different types of window functions; aggregate and non-aggregate functions.

Here is a list of the supported aggregate functions in MySQL 8.0:

NameDescription
AVG()Return the average value of the argument
BIT_AND()Return bitwise AND
BIT_OR()Return bitwise OR
BIT_XOR()Return bitwise XOR
COUNT()Return a count of the number of rows returned
COUNT(DISTINCT)Return the count of a number of different values
GROUP_CONCAT()Return a concatenated string
JSON_ARRAYAGG()Return result set as a single JSON array
JSON_OBJECTAGG()Return result set as a single JSON object
MAX()Return the maximum value
MIN()Return the minimum value
STD()Return the population standard deviation
STDDEV()Return the population standard deviation
STDDEV_POP()Return the population standard deviation
STDDEV_SAMP()Return the sample standard deviation
SUM()Return the sum
VAR_POP()Return the population standard variance
VAR_SAMP()Return the sample variance
VARIANCE()Return the population standard variance

And here is a list of the non-aggregate functions:

NameDescription
CUME_DIST()Cumulative distribution value
DENSE_RANK()Rank of current row within its partition, without gaps
FIRST_VALUE()Value of argument from first row of window frame
LAG()Value of argument from row lagging current row within partition
LAST_VALUE()Value of argument from last row of window frame
LEAD()Value of argument from row leading current row within partition
NTH_VALUE()Value of argument from N-th row of window frame
NTILE()Bucket number of current row within its partition.
PERCENT_RANK()Percentage rank value
RANK()Rank of current row within its partition, with gaps
ROW_NUMBER()Number of current row within its partition

Summary

As you can probably tell from this short introduction, these functions can be very handy, and very useful for most business use cases. Do yourself a favor and start to play around with them more, try to employ them in your day to day queries.

I hope you found this blog post useful. Drop me a comment if you feel like I dropped the ball on anything or if you have any questions.

I have half a decade of experience working with data science and data engineering in a variety of fields both professionally and in academia. I ahve demonstrated advanced skills in developing machine learning algorithms, econometric models, intuitive visualizations and reporting dashboards in order to communicate data and technical terminology in an easy to understand manner for clients of varying backgrounds.

Write A Comment