Oracle's PERCENTILE_CONT implementation using SQL Server 2005 analytic functions...

I saw an interesting question today in the MSDN Transact-SQL forum about implementing PERCENTILE_CONT analytic function that is available in Oracle in SQL Server. The function description as noted in the Oracle docs is below:

 

The PERCENTILE_CONT function is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.

 

The details of the function can be obtained from PERCENTILE_CONT topic in the Oracle docs. Implemention such an analytic function in a query using SQL Server 2000 is tough due to the absence of windowed aggregates. SQL Server 2005 however implements some of the basic analytic functions like ROW_NUMBER, NTILE, OVER clause for standard aggregate functions which makes the port easier to do.

 

We will use the sample EMP table from Oracle for our examples. I have the equivalent TSQL script that creates the sample schema in SQL Server.  Let us take a look at the data in the EMP table for relevant columns first:

 

      EMPNO DEPTNO SAL
----------- ------ ---------------------------------------
7839 10 5000.00
7782 10 2450.00
7934 10 1300.00
7902 20 3000.00
7788 20 3000.00
7566 20 2975.00
7876 20 1100.00
7369 20 800.00
7698 30 2850.00
7499 30 1600.00
7844 30 1500.00
7521 30 1250.00
7654 30 1250.00
7900 30 950.00

The Oracle query using PERCENTILE_CONT to get the median salary is:

 

SELECT deptno,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal DESC) as percentile_sal
FROM emp
GROUP BY deptno;

 

The result of the query in Oracle is:

 

deptno percentile_sal
------ ---------------------------------------
10 2450.000
20 2975.000
30 1375.000

 

The same query in Oracle can be written using some of the primitive analytic functions as below:

 

select deptno, sum(case grp
when 1 then (case PRN when RN then sal end)
else (case PRN

                        when FRN then (CRN - RN) * sal

                        when CRN then (RN - FRN) * sal end)
end) as percentile_sal
from (
select deptno, sal, N, PRN, RN, CRN, FRN

   , case when e3.CRN = e3.FRN and e3.CRN = e3.RN then 1 else 0 end as grp
from (
select deptno, sal, N, PRN, RN, CEIL(RN) as CRN, FLOOR(RN) as FRN
from (
select deptno, sal, PRN, N, 1 + (0.5* (N-1)) as RN
from (
select deptno, sal

, row_number() over(partition by deptno order by sal desc) as PRN

, count(*) over(partition by deptno) as N
from emp
) e1
) e2
) e3
) er
group by deptno;

 

Now, it is easy to write the same query in SQL Server 2005 also. I started of actually by writing a simpler but slower version of the query first based on the Oracle implementation details and easier to understand. This particular query is shown below:

 

declare @p float;
set @p = .5;
with e1 as (
select deptno, count(*) as N
from emp
group by deptno
), e2 as (
select deptno, N, 1 + (@p * (N-1)) as RN
from e1
), e3 as (
select deptno, N, RN, CEILING(RN) as CRN, FLOOR(RN) as FRN from e2
), er as (
select deptno, sal, row_number() over(partition by deptno order by sal desc) as RN
from emp
)
select deptno,
case
when e3.CRN = e3.FRN and e3.CRN = e3.RN

            then (select er.sal from er where er.deptno = e3.deptno and er.RN = e3.RN)
else ((e3.CRN - e3.RN) *

                  (select er.sal from er where er.deptno = e3.deptno and er.RN = e3.FRN)) +
((e3.RN - e3.FRN) *

                  (select er.sal from er where er.deptno = e3.deptno and er.RN = e3.CRN))
end as percentile_sal

from e3;

go

 

This query can however be simplified to scan the data only once and optimized further. The simplified and better performing version of the query is shown below:

 

declare @p float;
set @p = .5;
with e1 as (
select deptno, sal, row_number() over(partition by deptno order by sal desc) as PRN, count(*) over(partition by deptno) as N
from emp
), e2 as (
select deptno, sal, PRN, N, 1 + (@p * (N-1)) as RN
from e1
), e3 as (
select deptno, sal, N, PRN, RN, CEILING(RN) as CRN, FLOOR(RN) as FRN
from e2
), er as (
select deptno, sal, N, PRN, RN, CRN, FRN, case when e3.CRN = e3.FRN and e3.CRN = e3.RN then 1 else 0 end as grp
from e3
)
select deptno, sum(case grp
when 1 then (case PRN when RN then sal end)
else (case PRN

                     when FRN then (CRN - RN) * sal

                     when CRN then (RN - FRN) * sal end)
end) as percentile_sal
from er
group by deptno;
go

 

The ORDER BY clause in the PERCENTILE_CONT function goes in the ROW_NUMBER OVER clause in TSQL query and the GROUP BY clause becomes the PARTITION BY clause in the TSQL query. And the percentile value is specified in a variable.

Oracle PERCENT_CONT Implementation.sql