Tuesday, December 10, 2013

Oracle SQL - Analytical Functions

Why?
Why do I need to write about analytical functions?
Several people have asked me to write about analytical functions in Oracle SQL. Many great articles and explanations about them are already available, but I find that very few people understand them and use the power they deliver. During my recent recruitment drive, I met a lot of people who have been writing SQL for 4-5 years but have never used a single analytical function. Some have actually used 2-3 of them, but they don't understand them properly and just remember a few specific problems they solved with them.


To be efficient in delivering technology solutions, you must know and understand what tools are available to you and what kinds of problems they solve. If you are unaware of those tools, you may still be able to solve your problem, but your solution may not be optimal and may incur unnecessary extra effort, time and cost. Often I find people in my own team showing me their solutions with pride, expecting some appreciation because they think they solved a unique problem in a unique way. Unfortunately, I may have to break their hearts by telling them that there is a much easier solution which they could have reached in a fraction of the time. Analytical functions in Oracle SQL are one such tool you must understand properly to avoid such mistakes.

Start with Aggregation Basics
To explain analytical functions, I have to start with aggregate functions. Aggregate functions are easy to understand and almost every SQL developer uses them often. Some of the most common aggregate functions are COUNT(), AVG(), MAX() and MIN(). Haven't you used SQLs like the ones below?



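Typical aggregate queries of this kind, written against the EMPLOYEES table of the standard HR schema, look something like this (the exact queries in my screenshots may differ slightly):

```sql
-- Total number of employees
SELECT COUNT(*) FROM employees;

-- Number of employees per job
SELECT job_id, COUNT(*)
FROM employees
GROUP BY job_id;

-- Highest, lowest and average salary per department
SELECT department_id, MAX(salary), MIN(salary), AVG(salary)
FROM employees
GROUP BY department_id;
```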
If you haven't seen such SQLs, step back and learn aggregate functions before you try to learn analytical functions.
Let's look at some of the above SQLs and their results. I will use Oracle Express Edition (XE), which comes with the preloaded HR schema. If you want to play with it yourself, you can get it free from OTN and install it on your machine in less than 10 minutes.
Let’s look at the first SQL and its output.



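The first SQL is a plain COUNT over the whole table:

```sql
SELECT COUNT(*) FROM employees;
```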
What do you get as output? I got just one row telling me the total number of records in the table. Isn't that correct? That's what I wanted.
Let’s look at the second SQL.



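The second SQL adds a GROUP BY on JOB_ID:

```sql
SELECT job_id, COUNT(*)
FROM employees
GROUP BY job_id;
```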
I get 19 rows, each telling me the total number of records for one unique JOB_ID. That's again correct. The COUNT() function gives me the total number of rows, and GROUP BY restricts the COUNT() function to work only on a subset of the entire table.

Aggregation is loss of details
But do you notice a loss of details in these query results? Aggregate functions are fantastic, but they hide all the details behind them. I had 107 records in my table and several columns. I wanted to count employees without losing any details. My expectation was a result like the one shown below: I need the count as highlighted in red, but I want all rows and columns as well. How do I get that?



That's where analytical functions play their role. I can get the above results using the simplest form of an analytical function. The above results are generated by the analytical query shown below.

 

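The simplest form uses an empty OVER() clause, which makes the aggregate work over the whole result set while keeping every row; a query along these lines produces the result above (the column list and the total_count alias are my own choice):

```sql
SELECT first_name, last_name, job_id, salary,
       COUNT(*) OVER () AS total_count
FROM employees;
```

This returns all 107 rows, with the same total count repeated on every row.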
Partition By – When do we need it?
What about the grouping problem? I want to count the number of employees sharing the same JOB_ID without losing any details. We have seen similar results using an aggregate function and GROUP BY, but in that solution I lost all the rows and columns.
The grouping problem is solved using the PARTITION BY clause of analytical functions. To get the answer for "count of employees sharing the same job_id", we can partition the data by JOB_ID; the query below shows how to do this.



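A PARTITION BY inside the OVER() clause restricts the count to each employee's own JOB_ID group (column list and alias assumed):

```sql
SELECT first_name, last_name, job_id, salary,
       COUNT(*) OVER (PARTITION BY job_id) AS job_count
FROM employees;
```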
You can see that my data is partitioned on JOB_ID and the COUNT() function is restricted within each partition. This is exactly the same as what we get using GROUP BY with aggregate functions, but I haven't lost any details. You can extend this partitioning to multiple columns, similar to what you do when you GROUP BY multiple columns.
You may ask me a question: why would I need my entire data and still have some aggregates calculated on it?
I have seen several people write complex SQLs and save the results in MS Excel. We often get requests from business users to get them data in MS Excel. They then analyze that data in Excel by adding additional columns. That's exactly what analytical functions can do for us directly in SQL.
To summarize what we have discussed up to this point: "Whenever you need aggregates on the result set of your query without losing anything from your result, you can start thinking of analytical functions."

Order By – When do we need it?
Let’s pick one partition from the above example.
In this partition, I have five employees who all share the same job, "FI_ACCOUNT", but their salaries are different. Luis is the least paid and Daniel is the highest paid. I need to calculate the delta salary for each employee compared with the highest paid employee.
Difficult, isn’t it? I see a simple 3 step logic to get this for each partition.
  1. Order by salary in decreasing order.
  2. Pick the first value, which will be the highest salary.
  3. Calculate the difference between the highest salary and each employee's salary.
Simple, isn't it? The only thing you need to know is how to perform each of the above steps using an analytical function.
  1. You can extend PARTITION BY with ORDER BY.
  2. Instead of COUNT(), you can use the FIRST_VALUE() function.
  3. Calculating the difference is simple arithmetic and I don't need to tell you that.
Let's look at the SQL and results. The highlighted part of the SQL gets you the highest salary within each partition; the rest is just a calculation.



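Putting the three steps together, the query would look something like this (the highest_salary and delta_salary aliases are my own):

```sql
SELECT first_name, job_id, salary,
       FIRST_VALUE(salary)
         OVER (PARTITION BY job_id ORDER BY salary DESC) AS highest_salary,
       FIRST_VALUE(salary)
         OVER (PARTITION BY job_id ORDER BY salary DESC) - salary AS delta_salary
FROM employees;
```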
We have seen the combined effect of ORDER BY and FIRST_VALUE, but you must also understand the individual work done by each of them. ORDER BY just orders my data within the partition, and FIRST_VALUE picks the topmost record. We did an ORDER BY ... DESC, so the first record is the highest salary and the FIRST_VALUE function picks that. If you order ASC instead, the first record will be the lowest salary and FIRST_VALUE will pick the lowest one.
You can use ORDER BY without PARTITION BY if you want delta salaries across all employees. Try it yourself.

Order By – Be careful and cautious
It is important to note that the behavior of a specific function might be different when it is used with an ORDER BY clause. You have to understand the specific behavior of any aggregate function you use with ORDER BY; a lack of understanding of how a function behaves with ORDER BY will cause confusion. Let's take the example of the COUNT() function, and I bet you will be surprised and confused.

 

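Adding an ORDER BY to the partitioned count from before changes its meaning completely:

```sql
SELECT first_name, job_id, salary,
       COUNT(*) OVER (PARTITION BY job_id ORDER BY salary) AS running_count
FROM employees;
```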
  1. Highlighted in green – COUNT() is showing cumulative counts: the first record is 1, the second record counts the first and second records and is hence 2, and so on.
  2. Highlighted in red – both records show 2. Why is the behavior different from the other partition, highlighted in green? Because we have ordered by salary and there is a tie on salary.
If you don't understand this behavior, don't worry about it, as I will explain it again. Just follow the principle of not carrying a gun if you don't need it: don't use ORDER BY unless the order is actually important to you. When I want to calculate a count, the order within a partition is not relevant, so I will never use ORDER BY while calculating a count. While finding the maximum salary, however, the order was important because I wanted to take the first value after ordering, so I used ORDER BY.

Order By – Ensure deterministic ordering
Sometimes you genuinely need an ORDER BY, but your function's behavior gives you trouble. Let's take an example. You need a running total of employee salaries in order of FIRST_NAME. Getting a running total is not possible unless we have a defined order of records. The SQL and results for this are shown below.



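A running total only needs SUM() with an ORDER BY inside the OVER() clause (the running_total alias is my own):

```sql
SELECT first_name, salary,
       SUM(salary) OVER (ORDER BY first_name) AS running_total
FROM employees;
```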
The above results look perfect. Let me change the question slightly: we need a running total of employee salaries in order of increasing salary. As expected, I just changed the ORDER BY from FIRST_NAME to SALARY. The new SQL and results are shown below.



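The same query with the ordering column swapped:

```sql
SELECT first_name, salary,
       SUM(salary) OVER (ORDER BY salary) AS running_total
FROM employees;
```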
It's broken, and the cause is equal salaries. That's what I wanted to highlight: you need to be careful when using ORDER BY in analytical queries. Now let's understand why both of these rows show the same value. We did an ORDER BY on salary. The salaries for Steven and Hazel are the same, so Oracle cannot tell which record it should apply the SUM() function to first; my ORDER BY clause is not able to ensure a deterministic ordering. In that case the analytical function is applied to both of them together and the same value is returned for both.
How can we fix the above problem? We know the reason why it happened, so let's fix that reason. If I add another column to the ORDER BY, I will be able to ensure a deterministic ordering. The solution is shown below.



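Adding a tie-breaking column (here FIRST_NAME; any unique column such as EMPLOYEE_ID would work as well) makes the ordering deterministic:

```sql
SELECT first_name, salary,
       SUM(salary) OVER (ORDER BY salary, first_name) AS running_total
FROM employees;
```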

Windowing
We understood that the PARTITION BY clause creates a work window for analytical functions and that the ORDER BY clause orders the records in that work window. Finally, the analytical function is applied to records within the work window, starting from the first record, then the second, the third and so on. What does that mean? Let's look at one example to discover what I am trying to explain.
Check the example below, which shows a running total of salaries in order of salary and first_name. It is the same example we used earlier, with the same results, but with an additional RANGE clause in the query.



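Spelling out the default window explicitly gives the same result as before:

```sql
SELECT first_name, salary,
       SUM(salary)
         OVER (ORDER BY salary, first_name
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM employees;
```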
Do you notice the RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW? This clause tells the SUM() function which window to use. It defines the window from the first record (UNBOUNDED PRECEDING) up to the current row. This is the default windowing clause of analytical functions, which is why your query behaves exactly the same even if you omit it.
The above explanation hints that we can play with the windowing clause to redefine our window, which by default starts at the first row and ends at the current row. You have two options: RANGE BETWEEN <Start> AND <End> and ROWS BETWEEN <Start> AND <End>. RANGE defines a logical window and ROWS defines a physical window. We will see examples of both types, but before that we need to understand how to define the start and end of the window.
We have following methods to define start.
  1. Start at the first row of the partition* - UNBOUNDED PRECEDING
  2. Start at the nth row before the current row - value_expr PRECEDING
  3. Start at the current row - CURRENT ROW
  4. Start at the nth row after the current row - value_expr FOLLOWING
*-Note that if you don't have any partition, then the entire set is considered a single partition.
We have following methods to define end.
  1. End at the last row of the partition - UNBOUNDED FOLLOWING
  2. End at the nth row after the current row - value_expr FOLLOWING
  3. End at the current row - CURRENT ROW
  4. End at the nth row before the current row - value_expr PRECEDING
You must take care that your start evaluates to a point at or before your end. This means that starting at CURRENT ROW and ending one row before the current row (1 PRECEDING), or any such combination, is not valid.
Now it's time for an example. We will first look at an example of ROWS BETWEEN, because physical rows are easier to understand than logical rows. You can see in the example below that I am able to access the previous and next physical rows for each current row. This kind of access is almost impossible to achieve without a loop in a procedural language like PL/SQL.



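One way to reach the neighbouring physical rows is a one-row window on each side; the previous_row and next_row column names match the discussion that follows (the WHERE filter on FI_ACCOUNT is assumed from the earlier partition example):

```sql
SELECT first_name, salary,
       FIRST_VALUE(salary)
         OVER (PARTITION BY job_id ORDER BY salary
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS previous_row,
       LAST_VALUE(salary)
         OVER (PARTITION BY job_id ORDER BY salary
               ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS next_row
FROM employees
WHERE job_id = 'FI_ACCOUNT';
```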
Let’s look at the logical RANGE example.



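The same query with RANGE instead of ROWS turns the frame into a salary band of 1000 around the current row, which is the example discussed below:

```sql
SELECT first_name, salary,
       FIRST_VALUE(salary)
         OVER (PARTITION BY job_id ORDER BY salary
               RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW) AS previous_row,
       LAST_VALUE(salary)
         OVER (PARTITION BY job_id ORDER BY salary
               RANGE BETWEEN CURRENT ROW AND 1000 FOLLOWING) AS next_row
FROM employees
WHERE job_id = 'FI_ACCOUNT';
```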
Let's first look at previous_row. For Luis, the previous_row window starts at his own row because there is no record before it and 1000 PRECEDING has no meaning; we took the first value, hence it is 6900. For Jose Manuel, the previous_row window starts from a salary which is within a range of 1000 (7800-1000=6800), which is again the first record, because 6900 is in the range 6800 to 7800. We took the first value, hence it is 6900.
Now let's look at next_row. For Luis, the next_row window starts at the current row and ends at a salary which is within a range of 1000 (6900+1000=7900), which is 7800, because 7800 is in the range 6900 to 7900. Note that the next record shows 8200, which is beyond the range. We took the last value, hence it is 7800.
I hope you get the meaning of a logical range. It is a numeric value if you order by a numeric column. If you order by a date column, it can be an interval or a numeric number of days; a numeric value of 365 will create a range of one year.
Analytical functions are commonly used in data warehousing environments and are one of the most powerful tools at hand. It's a difficult topic, so try it yourself and play with it for some time if you really want to grasp all of it.
Good luck.