Introduction
It is very common in information systems to have functions where the users are able to search the data by selecting freely among many possible search fields. When you implement such a function with SQL Server, there are two challenges: to produce the correct result and to get good performance.
When it comes to the latter, there is a key theme: there is no single execution plan that is good for all possible search criteria. Rather, you want the query plan to be different depending on user input. There are two ways to achieve this. You can write a static SQL query and add the hint OPTION (RECOMPILE), which forces SQL Server to compile the query every time. Or you can use dynamic SQL to build a query string which includes only the search conditions the user specified. We will look at both these approaches in this article. They are both viable, and as a good SQL programmer you should have both of them in your toolbox, since both have their strengths and weaknesses.
This article assumes that you are on SQL 2008 or later. Note that if you still are on SQL 2008 you should be on the last service pack, that is, SP4 for SQL 2008 and SP3 for SQL 2008 R2.
The article starts with two short chapters. The first looks at some methods which are good for very simple cases where you only need to handle a very small set of choices and where the more general methods are overkill. The second chapter introduces the task in focus for the rest of the article: the requirement to implement the routine search_orders in the Northgale database, the demo database for this article. Then follow the two main chapters, one devoted to static SQL and one to dynamic SQL.
The Northgale Demo Database
The demo database for this article is Northgale, which is an inflated version of the old demo database Northwind from Microsoft. To build Northgale you first need to run Northwind.sql, which is the original script from Microsoft for Northwind, where I have replaced the deprecated LOB data types with the MAX data types, so that it can be installed with any collation. Then you can run Northgale.sql, which takes most of its data from Northwind.
Rather than running the files separately, you can download dynsearch.zip and unpack it somewhere. The zip file includes the scripts for the two databases, as well as all stored procedures in this article. The zip file also includes two files to run all SQL files: build_northgale.bat (for Windows) and build_northgale.sh (for Linux). Both take server/instance as their first argument. The Windows file assumes Windows authentication and takes no further arguments. The Linux file assumes SQL authentication and takes username and password as the second and third arguments. If you want something else on either platform, just edit the script. (If you do this, be sure that you don't lose the -I option to SQLCMD. This option is required to get QUOTED_IDENTIFIER ON.)
Northwind always installs itself in the same folder as where you have the master database, whereas the script for Northgale respects your default settings for data and log directories. Northwind is less than 10 MB in size and has 830 orders, whereas Northgale takes up 1.3 GB and has a little more than one million orders. (You will need an extra 4 GB to build the database, for a secondary log file which is dropped once the script completes.) This size is big enough to show that some approaches do not work well. However, keep in mind that while Northgale may be big for a demo database, it is still small by most standards. A real-life production database could easily have a thousand times as many orders. Thus, just because something runs fast on Northgale, there is no promise that it will be good for a real-world case.
The schema in Northgale is based on the schema of Northwind, but I have made a few alterations to accommodate examples in this article. (As well as for some presentations where I also use Northgale.) The data in Northgale is largely constructed by cross-joining tables in Northwind with themselves and recombining the data, and you will see this if you look at for instance the Customers table. This makes the data a little less well-distributed than in a real-world database. I’ve added some data on my own to support particular points I want to make or to simply add more base data where Northwind is thin.
Note: If you already have a copy of Northgale installed after having attended any of my presentations in 2020 or earlier, or you have Northgale from an earlier version of this article, I recommend that you download and install a new one, since I have made some additions for this article. (And compared to the Northgale that came with older versions of the article, this one has three times as many orders.) If you have Northgale from any presentation of mine in 2021 or later, there can be minor differences, as I make adjustments to the script every once in a while. These differences are less likely to matter, though.
Alternate Key Lookup
Using IF statements
Problems with dynamic search conditions come in more than one flavour. In the typical case, there is a search form where the user can select between many search conditions, and this is also the main focus of this article. But sometimes you encounter problems with a small number of conditions that are more or less mutually exclusive. A typical example would be a form where a user can look up a customer by entering one of: 1) The customer id. 2) The customer's tax registration number. 3) The name of the customer. There are indexes on all three columns.
None of the solutions in the main body of the article are really suitable here. Forcing a recompile every time with OPTION (RECOMPILE) can add too much load to the system, particularly if these lookups are frequent. And dynamic SQL is just too much hassle for a simple problem like this one.
So let us look at more lightweight solutions that fit this problem. A very simple-minded way is to use IF:
IF @custid IS NOT NULL
   SELECT ... FROM Customers WHERE CustomerID = @custid
ELSE IF @vatno IS NOT NULL
   SELECT ... FROM Customers WHERE VATno = @vatno
ELSE IF @custname IS NOT NULL
   SELECT TOP (20) ...
   FROM   Customers
   WHERE  CustomerName LIKE @custname + '%'
   ORDER  BY CustomerName
ELSE
   RAISERROR('No search condition given!', 16, 1)
(The TOP 20 for the search on customer name limits the output in case the user would enter a very short search string, so that we don’t overwhelm the user with customer names.)
If you need to return data from other tables as well, and you don’t want to repeat the join, you could enter all matching customer ids into a table variable or a temp table, and then do your final join:
IF @custid IS NOT NULL
   INSERT @cust (CustomerID) VALUES (@custid)
ELSE IF @vatno IS NOT NULL
   INSERT @cust (CustomerID)
      SELECT CustomerID FROM Customers WHERE VATno = @vatno
ELSE IF @custname IS NOT NULL
   INSERT @cust (CustomerID)
      SELECT TOP (20) CustomerID
      FROM   Customers
      WHERE  CustomerName LIKE @custname + '%'
      ORDER  BY CustomerName
ELSE
   RAISERROR('No search condition given!', 16, 1)

SELECT ...
FROM   @cust c
JOIN   Customers cst ON cst.CustomerID = c.CustomerID
JOIN   ...
There is however a potential performance problem here. No matter which choice the user makes, we want the optimizer to use the index on the chosen search column. But the way SQL Server builds query plans, this may not always happen. When the procedure is invoked and there is no plan in the cache, SQL Server builds the plan for the entire stored procedure and “sniffs” the current input values for the parameters. Say that the first user to make a search enters a customer ID. This means that the branches for tax-registration number and customer name are optimised for NULL and under unfortunate circumstances this could lead to a plan with a table scan, which is not what you want. (We will come back to parameter sniffing a few times in this article, but for an in-depth discussion on parameter sniffing, see my article Slow in the Application – Fast in SSMS.)
To prevent this from happening, there are a couple of precautions you can take. One is to push down the three SELECT statements into three subprocedures, but admittedly this is a bit bulky. Another approach is to add explicit index hints, but you should always be restrictive with index hints. For instance, what if someone renames the index? That would cause the query to fail.
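To illustrate the first of these precautions, here is a sketch of the sub-procedure pattern (the sub-procedure names and the column list are my own invention for this illustration, not code that ships with Northgale). Each sub-procedure gets its own plan, compiled from the actual value it is called with, so the branch-sniffing problem above goes away:

-- Each sub-procedure is compiled when it is first executed, so its plan is
-- sniffed from a real search value rather than from a NULL in another branch.
CREATE PROCEDURE find_customer_by_id @custid nchar(5) AS
   SELECT CustomerID, VATno, CustomerName, City, Country
   FROM   Customers
   WHERE  CustomerID = @custid
go
CREATE PROCEDURE find_customer_by_vatno @vatno varchar(15) AS
   SELECT CustomerID, VATno, CustomerName, City, Country
   FROM   Customers
   WHERE  VATno = @vatno
go
CREATE PROCEDURE find_customer_by_name @custname nvarchar(40) AS
   SELECT TOP (20) CustomerID, VATno, CustomerName, City, Country
   FROM   Customers
   WHERE  CustomerName LIKE @custname + '%'
   ORDER  BY CustomerName
go
CREATE PROCEDURE find_customer @custid   nchar(5)     = NULL,
                               @vatno    varchar(15)  = NULL,
                               @custname nvarchar(40) = NULL AS
   IF @custid IS NOT NULL
      EXEC find_customer_by_id @custid
   ELSE IF @vatno IS NOT NULL
      EXEC find_customer_by_vatno @vatno
   ELSE IF @custname IS NOT NULL
      EXEC find_customer_by_name @custname
   ELSE
      RAISERROR('No search condition given!', 16, 1)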
Rather, the best option is probably to use the OPTIMIZE FOR hint:
SELECT TOP (20) CustomerID
FROM   Customers
WHERE  CustomerName LIKE @custname + '%'
ORDER  BY CustomerName
OPTION (OPTIMIZE FOR (@custname = N'ZZZZZZZ'))
This hint causes SQL Server to build the query plan for the value you specify. Obviously, you should pick a value which is selective enough.
Whatever strategy you choose, you should test on production-size data that you get the plans you expect. Due to the sniffing issue, your test should look something like this:
EXEC find_customer @custid = N'ALFKI'
EXEC find_customer @vatno = 'SW030072560007'
EXEC find_customer @custname = 'Centro'
EXEC sp_recompile find_customer   -- flush the plan for the procedure
EXEC find_customer @vatno = 'SW030072560007'
EXEC find_customer @custid = N'ALFKI'
EXEC find_customer @custname = 'Centro'
EXEC sp_recompile find_customer
EXEC find_customer @custname = 'Centro'
EXEC find_customer @custid = N'ALFKI'
EXEC find_customer @vatno = 'SW030072560007'
That is, you should test with all three parameters as the parameter “sniffed” when the plan is built.
In this particular example, there is one more issue with the @custname parameter that I have ignored so far: the user could add a leading %, in which case a scan would be a better choice. If you need to support searches with a leading %, the best option is to split this into two branches:
IF left(@custname, 1) <> '%'
   -- query as above
ELSE
   -- same query, but with a different value in OPTIMIZE FOR.
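Here is a slightly fuller sketch of the two branches (the values in the OPTIMIZE FOR hints are only examples; pick something representative for your data):

IF left(@custname, 1) <> '%'
BEGIN
   -- No leading wildcard: optimise for a selective seek on CustomerName.
   SELECT TOP (20) CustomerID, VATno, CustomerName, City, Country
   FROM   Customers
   WHERE  CustomerName LIKE @custname + '%'
   ORDER  BY CustomerName
   OPTION (OPTIMIZE FOR (@custname = N'ZZZZZZZ'))
END
ELSE
BEGIN
   -- A leading % rules out a seek on CustomerName, so optimise for a scan.
   SELECT TOP (20) CustomerID, VATno, CustomerName, City, Country
   FROM   Customers
   WHERE  CustomerName LIKE @custname + '%'
   ORDER  BY CustomerName
   OPTION (OPTIMIZE FOR (@custname = N'%'))
END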
Using OR
If you don’t like the multiple IF statements, you may be delighted to know that it is in fact perfectly possible to do it all in one query, as long as you can ignore leading % in @custname. Look at this procedure:
CREATE PROCEDURE find_customer @custid   nchar(5)     = NULL,
                               @vatno    varchar(15)  = NULL,
                               @custname nvarchar(40) = NULL AS

SELECT TOP (20) CustomerID, VATno, CustomerName, ContactName, City, Country
FROM   Customers
WHERE  (CustomerID = @custid AND @custid IS NOT NULL) OR
       (VATno = @vatno AND @vatno IS NOT NULL) OR
       (CustomerName LIKE @custname + '%' AND @custname IS NOT NULL)
ORDER  BY CustomerName
From a logical perspective, the WHERE clause is the same as:
CustomerID = @custid OR
VATno = @vatno OR
CustomerName LIKE @custname + '%'
The added conditions on the variables with IS NOT NULL may just seem like fluff, but they do serve a purpose. Run this call in Northgale with Display Actual Execution Plan enabled:
EXEC find_customer @custid = N'BERGS'
The right half of the plan is the interesting one:
You can see that the plan is accessing all three indexes. However, there is one very interesting detail hiding here. Hover with the mouse over the middle Filter operator and then over the Index Seek operator to the right of it. You will see this (with my emphasis added):
The Filter operator includes a Startup Expression Predicate. Such a filter operator is often referred to as a Startup Filter. That is, the plan is dynamically filtering on the variable @vatno at run time. If you look at the Index Seek operator, you can see that Number of Executions is 0, that is, this index is never touched. In this particular plan, the gain of the startup filter is limited, since the Index Seek is likely to be quick anyway. But sometimes a startup filter may prevent more expensive operators from being executed.
I said above that you should try many permutations of the calls with sp_recompile mixed in, to test that you get the intended plan no matter what parameter values the optimizer sniffs, and this applies here as well. Indeed, if you run:
EXEC sp_recompile find_customer
EXEC find_customer @custname = 'Centro'
The optimizer now settles for a plain Clustered Index Scan and there are no startup filters. You can address this with the OPTIMIZE FOR hint, as I discussed above.
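For instance, the query in find_customer could be hinted like this (the value in the hint is just an example, and as always you need to verify that you actually get the startup-filter plan you are after):

SELECT TOP (20) CustomerID, VATno, CustomerName, ContactName, City, Country
FROM   Customers
WHERE  (CustomerID = @custid AND @custid IS NOT NULL) OR
       (VATno = @vatno AND @vatno IS NOT NULL) OR
       (CustomerName LIKE @custname + '%' AND @custname IS NOT NULL)
ORDER  BY CustomerName
OPTION (OPTIMIZE FOR (@custname = N'ZZZZZZZ'))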
I would like to repeat that the conditions on @custid IS NOT NULL etc. are required for the optimizer to consider startup filters. What I have also found to be a requirement is that all columns in the query must be indexed. If one is unindexed, you will always get a scan. I have rarely (if ever) been able to get startup filters if the search terms are in different tables. In any case, you should never use this strategy blindly, but always verify that you get the plan – and the performance – you intended.
We will see a few more examples of startup filters later in the article.
The Case Study: Searching Orders
We will now turn to a more general case with many search terms. We will work with implementing a stored procedure that retrieves information about orders in the Northgale database. This is the interface that we expose to the user (well rather to a GUI or middle-layer programmer):
CREATE PROCEDURE search_orders
                 @orderid     int             = NULL,
                 @status      char(1)         = NULL,
                 @fromdate    date            = NULL,
                 @todate      date            = NULL,
                 @minprice    decimal(10,2)   = NULL,
                 @maxprice    decimal(10,2)   = NULL,
                 @custid      nchar(5)        = NULL,
                 @custname    nvarchar(40)    = NULL,
                 @city        nvarchar(25)    = NULL,
                 @region      nvarchar(15)    = NULL,
                 @country     nvarchar(15)    = NULL,
                 @prodid      int             = NULL,
                 @prodname    nvarchar(40)    = NULL,
                 @employeestr varchar(MAX)    = NULL,
                 @employeetbl intlist_tbltype READONLY AS

SELECT o.OrderID, o.OrderDate, o.EmployeeID, o.Status,
       c.CustomerID, c.CustomerName, c.Address, c.City, c.Region,
       c.PostalCode, c.Country, c.Phone,
       od.UnitPrice, od.Quantity,
       p.ProductID, p.ProductName, p.UnitsInStock, p.UnitsOnOrder
FROM   Orders o
JOIN   [Order Details] od ON o.OrderID = od.OrderID
JOIN   Customers c ON o.CustomerID = c.CustomerID
JOIN   Products p ON p.ProductID = od.ProductID
WHERE  ???
ORDER  BY o.OrderID
You see in the SELECT list what information the user gets. Here is a specification of the parameters:
Parameter | Function |
---|---|
@orderid | Retrieve this order only. |
@status | Retrieve only orders with this status. There are four possible status values: N (New), P (in Progress), E (Error), C (Completed). Over 99 % of the orders have Status = 'C'. |
@fromdate | Retrieve only orders made on @fromdate or later. |
@todate | Retrieve only orders made on @todate or earlier. |
@minprice | Retrieve only order details that cost at least @minprice. |
@maxprice | Retrieve only order details that cost at most @maxprice. |
@custid | Retrieve only orders from this customer. |
@custname | Retrieve only orders from customers of which the name starts with @custname. |
@city | Retrieve only orders from customers in this city. |
@region | Retrieve only orders from customers in this region. |
@country | Retrieve only orders from customers in this country. |
@prodid | Retrieve only order details with this product. |
@prodname | Retrieve only order details with a product of which the name starts with @prodname. |
@employeestr @employeetbl | These two parameters serve the same purpose: return only orders for the specified employees. @employeestr is a comma-separated list with employee IDs, while @employeetbl is a table-valued parameter (TVP). |
If the user leaves out a search condition, that search condition should not apply to the search. Thus, a plain EXEC search_orders should return all orders in the database. In this text, I will discuss some different implementations of search_orders, unimaginatively named search_orders_1 etc. A few of them are included in whole in this text, others only in parts, and some are only mentioned. As noted earlier, all procedures are available in the zip file dynsearch.zip, but there are also links to each procedure the first time we encounter it.
Not all these procedures include both parameters for employees and some leave out both of them. A few procedures have extra parameters to illustrate specific points that I discuss.
It is possible that in a real-world application a search on a certain product or a certain price range should return all details for the matching orders. However, to keep the example simple, search_orders only returns the matching order lines.
Most of the search columns are indexed, the exception is the UnitPrice column in Order Details. The index on Orders.Status is a filtered index:
CREATE INDEX status_ix ON dbo.Orders(Status) WHERE Status <> 'C'
The search_orders procedure as specified above is not overly complicated; all conditions can be implemented with a single condition using =, <=, >=, LIKE or IN. In a real-life application you may encounter more complex requirements:
- User should be able to select how the output is to be sorted.
- Depending on input parameters you may need to access different tables or columns.
- Users should be able to choose the comparison operator, for instance @country = 'Germany' or @country != 'Germany'.
- Users should be able to add or remove columns from the output, and for an aggregation query what to aggregate on.
- Anything else you can imagine – or you were not able to imagine, but the users wanted it anyway.
The article includes discussions and examples on the first two points. For the last three, there is only a generic discussion in the section Advanced Options for Dynamic Searches. (As a preview: maybe such conditions should not be implemented in a stored procedure in T‑SQL at all.)
Static SQL with OPTION (RECOMPILE)
Why Static SQL?
Solutions for dynamic search conditions in static SQL almost always include the query hint OPTION (RECOMPILE), although there are a few simple cases where the hint is not needed, and we saw an example of this in the introductory chapter Alternate Key Lookup.
The advantages with these solutions are:
- As long as the search conditions are moderately complex, the code you get is compact and relatively easy to maintain.
- Since the query is recompiled every time, you get a query plan that is optimised for the exact search conditions and values at hand.
- You don’t have to worry about permissions; it works like it always does for stored procedures. That is, the user only needs to have permission to run the stored procedure; no direct permissions on the tables are needed. (Presuming that ownership chaining is in effect, which is the normal case.)
But there are also disadvantages:
- When the requirements grow in complexity, the complexity of the query tends to grow non-linearly, so that what once was a relatively simple query evolves into a beast that hardly anyone understands, let alone wants to touch.
- If the search routine is called with high frequency, the recurring compilation can cause an overload on the server.
In the following sections I will elaborate on these points in more detail.
The Basic Technique
The basic technique for static SQL with OPTION (RECOMPILE) is illustrated by search_orders_1, which I initially show in a simplified form without the parameters @employeestr and @employeetbl.
CREATE PROCEDURE search_orders_1
                 @orderid   int           = NULL,
                 @status    char(1)       = NULL,
                 @fromdate  date          = NULL,
                 @todate    date          = NULL,
                 @minprice  decimal(10,2) = NULL,
                 @maxprice  decimal(10,2) = NULL,
                 @custid    nchar(5)      = NULL,
                 @custname  nvarchar(40)  = NULL,
                 @city      nvarchar(25)  = NULL,
                 @region    nvarchar(15)  = NULL,
                 @country   nvarchar(15)  = NULL,
                 @prodid    int           = NULL,
                 @prodname  nvarchar(40)  = NULL AS

SELECT o.OrderID, o.OrderDate, o.EmployeeID, o.Status,
       c.CustomerID, c.CustomerName, c.Address, c.City, c.Region,
       c.PostalCode, c.Country, c.Phone,
       od.UnitPrice, od.Quantity,
       p.ProductID, p.ProductName, p.UnitsInStock, p.UnitsOnOrder
FROM   Orders o
JOIN   [Order Details] od ON o.OrderID = od.OrderID
JOIN   Customers c ON o.CustomerID = c.CustomerID
JOIN   Products p ON p.ProductID = od.ProductID
WHERE  (o.OrderID = @orderid OR @orderid IS NULL)
  AND  (o.Status = @status OR @status IS NULL)
  AND  (o.OrderDate >= @fromdate OR @fromdate IS NULL)
  AND  (o.OrderDate <= @todate OR @todate IS NULL)
  AND  (od.UnitPrice >= @minprice OR @minprice IS NULL)
  AND  (od.UnitPrice <= @maxprice OR @maxprice IS NULL)
  AND  (o.CustomerID = @custid OR @custid IS NULL)
  AND  (c.CustomerName LIKE @custname + '%' OR @custname IS NULL)
  AND  (c.City = @city OR @city IS NULL)
  AND  (c.Region = @region OR @region IS NULL)
  AND  (c.Country = @country OR @country IS NULL)
  AND  (od.ProductID = @prodid OR @prodid IS NULL)
  AND  (p.ProductName LIKE @prodname + '%' OR @prodname IS NULL)
ORDER  BY o.OrderID
OPTION (RECOMPILE)
The effect of all the @x IS NULL clauses is that if an input parameter is NULL, then the corresponding AND-condition is always true. Thus, the only conditions that are in effect are those where the search parameter has a non-NULL value. Sounds simple enough, but there is a very big difference in performance with or without that last line present:
OPTION (RECOMPILE)
The hint instructs SQL Server to recompile the query every time. Without this hint, SQL Server produces a plan that will be cached and reused. This has a very important implication: the plan must work with all possible input values of the parameters. When SQL Server builds an execution plan for a stored procedure, it “sniffs” the values of the parameters and optimises the queries in the procedure for these values – but in such a way that the plan will be correct no matter which values are passed. To illustrate this, let’s first play with the procedure search_orders_0, which is the same as the above, save for the very last line. That is, OPTION (RECOMPILE) is not there.
Then run this:
EXEC search_orders_0 @orderid = 11000
This appears to run well and the result set appears more or less directly. But now try:
EXEC search_orders_0 @prodid = 76
This ran for 11 seconds on my machine, which definitely is not acceptable. If you want to study this more, you can run sp_recompile on search_orders_0 and run the two calls in reverse order. You will find that the search on product ID is now quick, but the search for a single order runs for several seconds. Thus, we can conclude that search_orders_0 without the compile hint is of little value.
Note: At a casual look, the pattern may seem similar to what we saw in find_customer earlier, and you may think that startup filters could come into play here. But observe that in search_orders_1, AND and OR have swapped positions compared to find_customer. Startup filters are not applicable to the pattern in search_orders_1.
In many cases when you write a stored procedure, you want caching, because compilation comes with a price. But this assumes that the cached plan is good for all parameter values. This is not the case with a procedure of this kind. When the user provides a single order ID, we want the optimizer to use the indexes on OrderID in Orders and Order Details. But if the user performs a search on a product ID or a product name, we want to use the index on ProductID in Order Details and work from there. That is, we want different plans for different search conditions.
This is exactly what we achieve with the hint OPTION (RECOMPILE). Since SQL Server is instructed to recompile the query every time, there is no need to cache the plan, and thus there is no need to consider any other parameter values. That is, SQL Server can handle all the variables as if they were constants. Thus, if search_orders_1 is called like this:
EXEC search_orders_1 @orderid = 11000
SQL Server will in essence optimise this WHERE clause:
WHERE (o.OrderID = 11000 OR 11000 IS NULL)
  AND (o.Status = NULL OR NULL IS NULL)
  AND (o.OrderDate >= NULL OR NULL IS NULL)
  AND (o.OrderDate <= NULL OR NULL IS NULL)
  AND (od.UnitPrice >= NULL OR NULL IS NULL)
  AND (od.UnitPrice <= NULL OR NULL IS NULL)
  AND (o.CustomerID = NULL OR NULL IS NULL)
  ...
SQL Server is smart enough to remove all these NULL IS NULL from the query, so in reality the optimizer works with this WHERE clause:
WHERE o.OrderID = 11000
The choice of using the indexes on OrderID to drive the query becomes a no-brainer. And if you take this call:
EXEC search_orders_1 @custid = 'ALFKI'
The WHERE clause effectively becomes:
WHERE (o.OrderID = NULL OR NULL IS NULL)
  AND (o.Status = NULL OR NULL IS NULL)
  AND (o.OrderDate >= NULL OR NULL IS NULL)
  AND (o.OrderDate <= NULL OR NULL IS NULL)
  AND (od.UnitPrice >= NULL OR NULL IS NULL)
  AND (od.UnitPrice <= NULL OR NULL IS NULL)
  AND (o.CustomerID = N'ALFKI' OR N'ALFKI' IS NULL)
  ...
The optimizer decides that the index on CustomerID is good.
Here are some more test cases you can try, looking at the query plans. (If you are running the full search_orders_1, there is a plan from an initial query not shown above; just ignore this plan.)
EXEC search_orders_1 @prodid = 76
EXEC search_orders_1 @prodid = 76, @custid = 'RATTC'
EXEC search_orders_1 @fromdate = '19980205', @todate = '19980209'
EXEC search_orders_1 @city = N'Bräcke', @prodid = 76
You will find that all plans are different.
It is interesting to compare these two calls:
EXEC search_orders_1 @custid = 'ERNTC', @fromdate = '19980218', @todate = '19980219'
EXEC search_orders_1 @custid = 'BOLSR', @fromdate = '19980101', @todate = '19981231'
The parameter profile is the same, but there is still a big difference. The customer ERNTC accounts for 10 % of the orders in Northgale, whereas BOLSR only has a single order. On the other hand, the first call only asks for information for a date interval of two days, whereas the second call asks for an entire year. There are indexes on both CustomerID and OrderDate, but there is no composite index with these columns. You may sense that the optimal index choice is not the same for these two calls, and indeed if you look at the plans, you will find that they are different. The first call utilises both indexes and then performs a hash join on them, whereas the second only uses the index on CustomerID. That is, with OPTION (RECOMPILE), we can not only get different plans depending on which parameters the caller passes, but we can also get different plans depending on the actual parameter values!
The Isnull/Coalesce Trap
Rather than using OR like above, some people write one of:
o.OrderID = isnull(@orderid, o.OrderID)
o.OrderID = coalesce(@orderid, o.OrderID)
isnull and coalesce both return the first argument if it is non-NULL, else they return the second value. (There are some differences between them, but that is outside the scope of this article.) Thus, if @orderid is NULL, you get o.OrderID = o.OrderID, a complete no-op – or so it may seem. You can see a full example of this in search_orders_2. This procedure is a little more compact than search_orders_1 with all its ORs. But while it may seem appealing, my strong recommendation is that you stay away from this pattern. You see, there is a trap. Run this:
EXEC search_orders_1 @orderid = 10654
EXEC search_orders_2 @orderid = 10654
The first call returns four rows, but the second call returns no rows at all! Why? The reason is that for the customer on this order, the column Region is NULL. When @region is NULL, the condition
c.Region = coalesce(@region, c.Region)
in essence becomes NULL = NULL. But in SQL, NULL is not equal to NULL. NULL stands for “unknown value”, and any comparison with NULL yields neither TRUE nor FALSE in the three-valued logic of SQL, but UNKNOWN. Hence, no rows are returned.
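You can see this three-valued logic in a tiny standalone test, which prints nothing at all:

-- With ANSI_NULLS ON (the default), a comparison where either side is NULL
-- yields UNKNOWN, so neither IF branch fires and nothing is printed.
DECLARE @region nvarchar(15) = NULL
IF @region = @region  PRINT 'equal'
IF @region <> @region PRINT 'not equal'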
To avoid this trap, some people write things like:
coalesce(c.Region, '') = coalesce(@region, c.Region, '')
This is not only kludgier, but since the column is entangled into an expression, this may preclude the use of any index on the column.
I have also seen people apply the “workaround” of writing this particular condition as (c.Region = @region OR @region IS NULL). But it seems better to use a single pattern that works under all conditions.
Handling Multi-Valued Parameters
Let’s now go back to search_orders_1 and look at how the multi-valued parameters @employeestr and @employeetbl are handled.
For @employeestr there is this condition in the WHERE clause:
AND (o.EmployeeID IN (SELECT n FROM intlist_to_tbl(@employeestr)) OR @employeestr IS NULL)
The table-valued function intlist_to_tbl takes a comma-separated list and cracks it into a table of integers. For instance, this call:
SELECT * FROM intlist_to_tbl('15,26, 17')
produces this result set:
listpos     n
----------- -----------
1           15
2           26
3           17
intlist_to_tbl is not a built-in function in SQL Server, but is one that I have added to Northgale. You may know, though, that starting with SQL 2016, SQL Server ships with a built-in function string_split for this purpose, and with this function, the query to unpack the table becomes:
SELECT convert(int, value) FROM string_split(@employeestr, ',')
This is a little bulkier, since string_split returns strings and there is no built-in function specifically for integer lists. Another issue is that string_split does not return the list position, which sometimes is useful. There is also a potential performance problem with string_split that I will cover in the next section.
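If you are curious what a splitter like intlist_to_tbl may look like, here is a minimal sketch of a multi-statement function (for illustration only; it is not necessarily identical to the function that ships with Northgale):

-- Walks the string with charindex and returns the list position and the
-- value converted to int.
CREATE FUNCTION intlist_to_tbl (@list nvarchar(MAX))
   RETURNS @tbl TABLE (listpos int IDENTITY(1,1) NOT NULL,
                       n       int NULL) AS
BEGIN
   DECLARE @pos      int = 1,   -- where the current element starts
           @nextpos  int,       -- position of the next comma (0 = none)
           @valuelen int        -- length of the current element

   WHILE @pos <= len(@list)
   BEGIN
      SELECT @nextpos = charindex(',', @list, @pos)
      SELECT @valuelen = CASE WHEN @nextpos > 0 THEN @nextpos
                              ELSE len(@list) + 1
                         END - @pos
      INSERT @tbl (n)
         VALUES (convert(int, ltrim(rtrim(substring(@list, @pos, @valuelen)))))
      SELECT @pos = CASE WHEN @nextpos > 0 THEN @nextpos + 1
                         ELSE len(@list) + 1
                    END
   END
   RETURN
END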
No matter if you use a user-written function or string_split, the effect of the OR condition is that if @employeestr is NULL and OPTION (RECOMPILE) is present, SQL Server will factor out the function call as well as the condition on EmployeeID, so the function is only invoked if the caller actually passes a list of employees.
Note: For a general discussion of splitting lists of values into table format, see my article Arrays and Lists in SQL Server.
Here is a test case:
EXEC search_orders_1 @employeestr = '402,109,207', @custid = 'PERTH'
The call should return 20 rows.
While passing comma-separated lists is popular, I myself think it is cleaner to pass the values in a table-valued parameter (TVP). If you are unacquainted with these, I have an article, Using Table-Valued Parameters in SQL Server and .NET, where I show how to use them with .NET, and it includes an example of how to crack a comma-separated list client-side to pass the values as a TVP.
For this reason, search_orders_1 also supports the parameter @employeetbl. You may have noticed in the procedure definition earlier that there is no default value specified for this parameter, but don’t be lured. A table-valued parameter always has a default value of an empty table, so you can always leave it out.
To be able to use a TVP, you need to define a table type, and Northgale comes with this table type:
CREATE TYPE intlist_tbltype AS TABLE (val int NOT NULL PRIMARY KEY)
For a dynamic search with OPTION (RECOMPILE), a table-valued parameter requires an extra step before the actual search query to find out if any rows were passed or not:
DECLARE @hasemptbl bit = CASE WHEN EXISTS (SELECT * FROM @employeetbl) THEN 1 ELSE 0 END
and then this variable reappears in the main query:
AND (o.EmployeeID IN (SELECT val FROM @employeetbl) OR @hasemptbl = 0)
With OPTION (RECOMPILE), this means that the subquery against @employeetbl and the check against EmployeeID are factored out when @hasemptbl is 0. The variable @hasemptbl is needed, because if you were to put the EXISTS subquery directly in the main query, it would not work out; a table variable is never handled as a constant.
Here is a test case:
DECLARE @tbl intlist_tbltype
INSERT @tbl (val) VALUES(402), (109), (207)
EXEC search_orders_1 @employeetbl = @tbl, @custid = 'PERTH'
Again, 20 rows should be returned.
Multi-Valued Parameters and Performance
Tip: if you are in a bit of a hurry, you can skip this section. It digs into some details that may be a little bewildering for inexperienced readers.
So far, it may seem that multi-valued parameters are quite the same as plain scalar parameters. However, there is a difference with regards to performance. For scalar parameters, the optimizer knows the exact values and is able to build the plan accordingly, so even if the parameter profile is the same, you can still get different execution plans, and we saw an example of this earlier.
However, this cannot happen with the multi-valued parameters in search_orders_1. All the optimizer knows about @employeetbl is the number of rows in the table variable. If the distribution of the values in the search column is fairly even, this information is sufficient. But this is not the case with the employees in Northgale where the most active employee has 5 736 orders, and the least active only has 11. This means that depending on which employee(s) the user has selected in combination with other parameters, different plans may be preferable.
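If you want to verify the skew yourself, a simple aggregation over Orders does the job:

-- Orders per employee in Northgale, most active employee first.
SELECT   EmployeeID, COUNT(*) AS no_of_orders
FROM     Orders
GROUP BY EmployeeID
ORDER BY no_of_orders DESC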
If you pass a comma-separated list in @employeestr, the situation can be even bleaker, depending on how you crack the list into a table and which version of SQL Server you have:
- With a user-defined multi-statement function such as intlist_to_tbl the following applies:
  - Up to SQL 2012 (or more precisely up to compatibility level 110), the optimizer assumes that the function returns one single row, which is a decent guess, assuming that users would rarely enter more than just a handful of values.
  - In SQL 2014 and SQL 2016 (compatibility levels 120 and 130), the assumption is 100 rows, which is likely to be way too high in this context.
  - In SQL 2017 and later (compatibility level 140 and up), the optimizer employs interleaved execution: it runs the function before compiling the rest of the query and bases the estimate on the actual number of rows returned. This puts UDFs on par with table variables. However, this is only true for read-only statements, for instance SELECT statements that return data. For a SELECT that is part of an INSERT or SELECT INTO, there is no interleaved execution and the estimate is fixed at 100 rows. The same is true for UPDATE and DELETE.
- If you use string_split, there is a blind guess of 50 rows (again too high) and there is never any interleaved execution no matter the SQL version.
- If you use a string splitter based on the CLR, XML or JSON, the optimizer makes a blind guess of too many rows, the exact number depending on the method, and there is no interleaved execution.
And in none of these cases does the optimizer have any knowledge of the values in the list. While all these alternatives are a bit confusing, there is a way out of this that also addresses skews. To wit, bounce the data over a temp table:
DECLARE @hasemps bit = 0

CREATE TABLE #emps (empid int NOT NULL PRIMARY KEY)

IF @employeestr IS NOT NULL
BEGIN
   INSERT #emps(empid)
      SELECT n FROM intlist_to_tbl(@employeestr)
   UPDATE STATISTICS #emps
   SELECT @hasemps = 1
END
You can also do this for a table-valued parameter:
IF EXISTS (SELECT * FROM @employeetbl)
BEGIN
   INSERT #emps(empid)
      SELECT val FROM @employeetbl
   UPDATE STATISTICS #emps
   SELECT @hasemps = 1
END
And the condition in the query goes:
AND (o.EmployeeID IN (SELECT empid FROM #emps) OR @hasemps = 0)
You find a procedure that utilises this method in search_orders_3.
A temp table has distribution statistics, and for a short list with only a handful of values, the optimizer knows exactly what values there are in the temp table, and therefore it can produce a more accurate plan.
Pay attention to the command UPDATE STATISTICS. This is necessary to be sure that the temp table has correct statistics. You may have heard of autostats, but the rules for autostats are such that they may not always kick in if the number of values is small. (An interesting behaviour is that without UPDATE STATISTICS, you may be running with the statistics from the previous execution of the procedure, due to how SQL Server caches temp-table definitions. My fellow MVP Paul White discusses this further in this blog post.)
Here are some examples you can try:
DECLARE @tbl intlist_tbltype
INSERT @tbl (val) VALUES(900), (901), (902)
EXEC search_orders_1 @employeetbl = @tbl
EXEC search_orders_1 @employeestr = '900,901,902'
EXEC search_orders_3 @employeestr = '900,901,902'
These three employees have in total only 34 orders, and the search returns 287 rows. When you look at the execution plans, I suggest that you focus on the operator just to the right of the SELECT operator and compare Actual Number of Rows For All Executions to Estimated Number of Rows Per Execution. (Note that the actual text of the labels varies between different versions of SSMS.) The general pattern is that the two calls to search_orders_1 have gross over-estimates. The exact numbers depend on your version of SQL Server (or more precisely, which compatibility level you have set Northgale to). The call with @employeetbl has an estimate in the range of 25 000 to 29 000 rows, whereas for @employeestr the range is from 9 000 rows on SQL 2008 to 96 000 on SQL 2016, due to the changes in estimates for table-valued functions that I discussed above. The call to search_orders_3 has a much better estimate, around 2 800 rows. Yes, this is still an over-estimate by a factor of ten, but this can be explained by the fact that there are in total over 750 employees in Northgale, so the histogram does not have unique counts per employee. These three employees are in a histogram step where other employees have more orders.
Next, you can try:
DECLARE @tbl intlist_tbltype
INSERT @tbl (val) VALUES(337)
EXEC search_orders_1 @employeetbl = @tbl
EXEC search_orders_1 @employeestr = '337'
EXEC search_orders_3 @employeetbl = @tbl
This is the most active employee in Northgale, and each call returns 42 705 rows. But if you check the estimates, the estimate when you pass @employeetbl is between 8 500 and 9 600 rows, that is, lower than before. The reason for this underestimate is quite obvious given the information above. Since all the optimizer knows is how many employees you pass in, it can only make an assumption from a statistical average. Thus, it is logical that the estimate is only a third of the estimate for when we searched for three employees. For search_orders_3, I get estimates from 30 000 rows to just below 39 000 rows depending on the SQL Server version, which is decently close to the real number.
Since all these calls return quickly, you may wonder if these numbers really matter. But take these numbers as general examples. Overall, the closer the estimates are to the correct values, the more likely it is that SQL Server will find a good plan. Conversely, estimates that are off may result in less optimal index choices. Poor estimates can also lead to memory grants that are excessive or too small, in the latter case leading to Sort or Hash operators spilling to disk. In fact, this is something you can explore by looking at the execution plans, although exactly how this exhibits depends on the compatibility level / SQL Server version. Tip: it is better to explore the effects on memory grants on compatibility level 140 or lower. By a quirk of fate, search_orders_3 also results in an excessive memory grant on SQL 2019 for the search on employees 900 to 902.
Before we leave the multi-valued search conditions, I would like to cover an approach which is more elaborate (and possibly overly elaborate). It seems reasonable to assume that most of the time users will only search for a very small number of employees. In that case, you could opt to read these few values into variables, so that you have a condition like this:
AND (o.EmployeeID IN (@emp1, @emp2, @emp3, @emp4) OR @emp1 IS NULL)
If the user only enters one value, you put that value in @emp1 and leave the other three NULL. With OPTION (RECOMPILE), this means that this is no different to a single-valued search parameter like @custid. If the user selects two employees you put these values in @emp1 and @emp2. If the user enters four values, you populate all four variables. If the user enters five or more, you leave all variables NULL and instead you put the values in a temp table. You find a full implementation of this idea in search_orders_4.
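Here is a hedged sketch of how the variables could be populated (the cutoff of four values and the helper logic are mine for illustration; see search_orders_4 for the actual implementation):

-- Split the list once into a temp table; if it holds at most four values,
-- move them to the variables and empty the table, else keep the temp table.
DECLARE @emp1 int, @emp2 int, @emp3 int, @emp4 int,
        @hasemps bit = 0

CREATE TABLE #emps (empid int NOT NULL PRIMARY KEY)

IF @employeestr IS NOT NULL
BEGIN
   INSERT #emps (empid)
      SELECT n FROM intlist_to_tbl(@employeestr)

   IF (SELECT COUNT(*) FROM #emps) <= 4
   BEGIN
      SELECT @emp1 = MIN(CASE rowno WHEN 1 THEN empid END),
             @emp2 = MIN(CASE rowno WHEN 2 THEN empid END),
             @emp3 = MIN(CASE rowno WHEN 3 THEN empid END),
             @emp4 = MIN(CASE rowno WHEN 4 THEN empid END)
      FROM   (SELECT empid, row_number() OVER (ORDER BY empid) AS rowno
              FROM   #emps) AS e
      TRUNCATE TABLE #emps
   END
   ELSE
   BEGIN
      UPDATE STATISTICS #emps
      SELECT @hasemps = 1
   END
END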
If you want to compare plans, you can try these calls:
EXEC search_orders_3 @employeestr = '900,901,902', @custid = 'SAVSH'
EXEC search_orders_4 @employeestr = '900,901,902', @custid = 'SAVSH'
When I test, I find that the call to search_orders_3 uses only the index on CustomerID, whereas search_orders_4 uses both the indexes on EmployeeID and CustomerID. However, I have not evaluated which of the two has the better performance.
I would suggest that the approach with the temp table will be good enough in many situations, but you may face a situation where this trick with variables works better for you.
Choice of Sort Order
If users need to be able to choose the sort order, this can be handled with static SQL and OPTION (RECOMPILE) as in this example:
ORDER BY CASE @sortcol
              WHEN 'OrderID'    THEN o.OrderID
              WHEN 'EmployeeID' THEN o.EmployeeID
              WHEN 'ProductID'  THEN od.ProductID
         END,
         CASE @sortcol
              WHEN 'CustomerName' THEN c.CustomerName
              WHEN 'ProductName'  THEN p.ProductName
         END,
         CASE @sortcol
              WHEN 'OrderDate' THEN o.OrderDate
         END
That is, you have a parameter that holds the column to sort by (or some other identifier that maps to the column name) and then you use CASE to select that parameter. But why are there three CASE expressions? Recall that a CASE expression has a static data type, which is determined according to the rules of data-type precedence in SQL Server. That is, THEN-expressions that are of types with lower precedence will be converted to the type with the highest precedence in the CASE expression. Specifically, string types have lower precedence than numeric and date/time data types. Thus, if you mix string columns with other types in the same CASE expression, attempts to sort on a string column will die with a conversion error. Therefore, you need to have one CASE expression for numeric columns, one for string columns, one for dates etc.
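To see what happens if you ignore this, here is a small demonstration where a date column and a string column are crammed into the same CASE expression; the attempt to sort on the string column dies at run time with a conversion error:

-- The CASE expression takes the type with the highest precedence (date),
-- so the CustomerName values must be converted to date - which fails.
DECLARE @sortcol nvarchar(30) = 'CustomerName'
SELECT TOP (5) o.OrderID, o.OrderDate, c.CustomerName
FROM   Orders o
JOIN   Customers c ON o.CustomerID = c.CustomerID
ORDER  BY CASE @sortcol
               WHEN 'OrderDate'    THEN o.OrderDate
               WHEN 'CustomerName' THEN c.CustomerName
          END
OPTION (RECOMPILE)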
While there are three entries in the ORDER BY clause, at least two of them will always be NULL. With OPTION (RECOMPILE), the optimizer is able to figure this all out, so if the caller passes @sortcol = 'CustomerName', it will be no different than if the code had said ORDER BY c.CustomerName.
If you want to support both ascending and descending sorts, you will need to double everything:
ORDER BY CASE WHEN @isdesc = 1
              THEN CASE @sortcol
                        WHEN 'OrderID'    THEN o.OrderID
                        WHEN 'EmployeeID' THEN o.EmployeeID
                        WHEN 'ProductID'  THEN od.ProductID
                   END
         END DESC,
         CASE WHEN @isdesc = 0
              THEN CASE @sortcol
                        WHEN 'OrderID'    THEN o.OrderID
                        WHEN 'EmployeeID' THEN o.EmployeeID
                        WHEN 'ProductID'  THEN od.ProductID
                   END
         END ASC
For brevity, I included only the numeric columns here, but as you see, it’s starting to get a little ugly. Now, imagine that users should be able to select multiple sort columns with different data types and also ascending/descending for each column. If you were to try the above strategy, it would grow into something completely unmanageable. In order to deter you, I have assembled search_orders_5, which supports two sort columns, each of which can be either ascending or descending. The ORDER BY clause is longer than the rest of the procedure!
Jonathan Van Houtte contacted me and offered a suggestion with only one CASE expression for a sorting on a single column ascending:
ORDER BY CASE @sortcol
              WHEN 'OrderID'      THEN row_number() OVER(ORDER BY o.OrderID)
              WHEN 'EmployeeID'   THEN row_number() OVER(ORDER BY o.EmployeeID)
              WHEN 'ProductID'    THEN row_number() OVER(ORDER BY od.ProductID)
              WHEN 'CustomerName' THEN row_number() OVER(ORDER BY c.CustomerName)
              WHEN 'ProductName'  THEN row_number() OVER(ORDER BY p.ProductName)
              WHEN 'OrderDate'    THEN row_number() OVER(ORDER BY o.OrderDate)
         END
He evades the need for using multiple CASE expressions by using row_number. This does add some extra operators to the query plan, but I have not been able to detect any noticeable performance penalty, nor do I have any reason to assume that there would be any.
However, while this is a little leaner, it does not help much to reduce the complexity if you want to support multiple sort columns or ascending/descending. The code will still grow out of hand. That is, as long as you give users complete freedom to combine sort columns. But Jonathan suggested that you can work with pre-defined sort sets, as in this example:
ORDER BY CASE @sortset
              WHEN 'Natural'    THEN row_number() OVER (ORDER BY o.OrderID, od.ProductID)
              WHEN 'Geographic' THEN row_number() OVER (ORDER BY c.Country, c.City, c.CustomerName, o.OrderID)
              WHEN 'ByProduct'  THEN row_number() OVER (ORDER BY p.ProductName, od.OrderID)
         END
We can conclude from this section that it is only practical to support dynamic sorting in a solution with static SQL if you either only permit a single sort column, with no choice of ascending/descending, or you work with pre-defined sort sets. If your requirements go beyond that, it is time to consider a solution with dynamic SQL instead, which, as we shall see later, is more manageable. Or – and this is not an option that should be overlooked – sort the data client-side.
Optional Tables
Sometimes you may need to access a table only if a certain condition is given. Let’s add one more parameter to our procedure: @suppl_country. If this parameter is provided, the procedure should only return information about products with a supplier from the given country. However, we are not required to return any data from the Suppliers table. A suggestion that I originally got from Phillipp Sumi is to use an EXISTS clause in this way:
@suppl_country IS NULL OR
EXISTS (SELECT *
        FROM   Suppliers s
        WHERE  s.SupplierID = p.SupplierID
          AND  s.Country = @suppl_country)
You find this condition in search_orders_6. If you run this:
EXEC search_orders_6 @fromdate = '19980101', @todate = '19980115', @country = 'Sweden'
EXEC search_orders_6 @fromdate = '19980101', @todate = '19980115', @suppl_country = 'Sweden'
and then look at the query plans, you will see that the first plan does not include Suppliers. This should not really come as a surprise, given what we have learnt about OPTION (RECOMPILE).
There is an interesting observation to make here. Comment out OPTION (RECOMPILE) from search_orders_6 and recreate it. Run the first call above, this time preceded by the command SET STATISTICS IO ON. If you look at the query plan, you will find that Suppliers now appears in the plan. However, when you look at the output from SET STATISTICS IO ON, you will see something like this:
Table 'Order Details'. Scan count 607, logical reads 1842
Table 'Orders'. Scan count 9, logical reads 25191
Table 'Customers'. Scan count 9, logical reads 691
Table 'Products'. Scan count 9, logical reads 238
Table 'Worktable'. Scan count 0, logical reads 0
Table 'Worktable'. Scan count 0, logical reads 0
The exact output depends on your version of SQL Server, but whichever version you have, Suppliers is missing from the output. This is a variation of what we saw in the section Using OR for alternate key lookup. If you look in the plan, you will find that right above the Clustered Index Seek operator on Suppliers there is a Filter operator. If you hover over this operator, you will find that it has a Startup Expression Predicate, which prevents Suppliers from being accessed as long as @suppl_country is NULL.
Thus, if you encounter a situation where your only dynamic search condition is that depending on a parameter you should filter the rows on their presence in a certain table, it may be possible to avoid OPTION (RECOMPILE) and rely on a startup filter instead. However, always check that you actually get the filter you expect.
Note: some people with a background in languages like C++ may find this trivial and think that this is just a matter of operator shortcutting. However, there is no operator shortcutting in SQL; rather, operands can be computed in any order, and the behaviour would be the same if the conditions were written as EXISTS (...) OR @suppl_country IS NULL.
Alternate Tables
You could find yourself in a situation where depending on some condition you need to read data from different tables. To take a simple example, in Northgale there is a second set of order tables, HistoricOrders and HistoricOrderDetails. To permit the user to search these tables rather than Orders and Order Details, we add a parameter @ishistoric which defaults to 0. To implement this requirement, we replace the join between Orders and Order Details with a derived table, which is a UNION ALL of two join queries that both have a WHERE clause which includes or excludes the join query depending on the value of @ishistoric:
FROM  (SELECT o.OrderID, o.OrderDate, od.UnitPrice, od.Quantity,
              o.CustomerID, od.ProductID, o.EmployeeID
       FROM   Orders o
       JOIN   [Order Details] od ON o.OrderID = od.OrderID
       WHERE  @ishistoric = 0
       UNION  ALL
       SELECT o.OrderID, o.OrderDate, od.UnitPrice, od.Quantity,
              o.CustomerID, od.ProductID, o.EmployeeID
       FROM   HistoricOrders o
       JOIN   HistoricOrderDetails od ON o.OrderID = od.OrderID
       WHERE  @ishistoric = 1) AS u
JOIN  Customers c ON c.CustomerID = u.CustomerID
JOIN  Products p ON p.ProductID = u.ProductID
You find the complete solution in search_orders_7. Thanks to OPTION (RECOMPILE), the execution plan will include only one set of order tables.
I should hasten to point out that if the choice of tables is the only dynamic part of your stored procedure, there is no need to use OPTION (RECOMPILE) for this pattern. Without OPTION (RECOMPILE), both sets of tables will be in the plan, but there will be startup filters based on the variable, so that only one set is accessed at run time.
It is worth observing that the query has a certain amount of repetition – the SELECT list and the join conditions. The more alternate tables there are, the stronger the case for dynamic SQL grows, since with dynamic SQL you can reduce that repetition, as we shall see later.
When OPTION (RECOMPILE) Hurts You
We have now seen many of the advantages with using OPTION (RECOMPILE), but before you start to use it all over town, you need to understand that too frivolous use of OPTION (RECOMPILE) can cause severe pain to your system. There is after all a reason why SQL Server in the normal case caches query plans. Without plan caching many systems would be crumbling under the load of query compilation.
When you consider using OPTION (RECOMPILE), you need to have an idea of how often your query will be executed. Say that at peak activity, you can expect users to run search_orders a few times per minute. In this case, the extra time you spend on compilation is clearly negligible, not least if 50 ms of compilation can reduce execution time from five minutes to 100 ms. But assume instead that there are over 100 calls to the procedure every second. And assume furthermore that in the majority of the calls @orderid is the only input parameter. Recompiling the query every time to produce the same plan all over again is a perfect waste of CPU resources.
One way to alleviate this situation is to introduce an IF statement so that you have:
IF @orderid IS NOT NULL
BEGIN
   SELECT ...
   WHERE  o.OrderID = @orderid
     AND  -- Conditions on Order Details here.
   -- No OPTION (RECOMPILE) here!
END
ELSE
BEGIN
   SELECT ...
   WHERE  -- same conditions as before
   OPTION (RECOMPILE)
END
By adding a separate branch for @orderid that does not have OPTION (RECOMPILE), the common requests for a single order can be served from a cached plan, whereas other conditions still result in compilation every time. Assume now that once you have this in production, you find that the load from compilation is still a tad high, and you identify that there are many requests with @custid and @fromdate, with @fromdate typically being at most a week ago.
This calls for one more branch without OPTION (RECOMPILE), and you can see a complete solution in search_orders_8. However, there is a risk that this will not work out well. You may recall the discussion on parameter sniffing earlier in the article, and it could strike here. Say that the first call to search_orders_8 happens to be a search on @prodid. This means that the searches in the other two branches will be optimised for a search on this product ID, and with NULL values for the order ID, the customer ID and the order date. This means that you may not get the best plans for the two fixed branches.
Should you run into trouble – or just be paranoid when it comes to performance – you can avoid the parameter-sniffing issue by pushing the fixed branches to inner procedures which are only invoked if these search conditions are given. You find this pattern in search_orders_9, which includes two sub-procedures. However, you may not even have to look at it to realise that this is not a very attractive alternative from the perspective of code maintainability.
And this is about as far as this path can take you. If you find more common input combinations that cause too much compilation, this gets out of hand, and that’s when you need to look into dynamic SQL, since with dynamic SQL you can get one cached plan for each combination of input parameters. More about that later.
Then again, not all search problems are the same. Say that of the many search conditions you expose, only three are indexed, and furthermore you force the users to select one of these, and the rest are only additional filters. Under these circumstances, having three fixed branches with no OPTION (RECOMPILE) may be a perfectly feasible solution.
You should be aware that with OPTION (RECOMPILE) you miss out on a feature known as memory-grant feedback. It was introduced in SQL 2017, where it only applies to queries which involve at least one columnstore index. Starting with SQL 2019, it applies to all queries. When the optimizer produces a plan, it estimates how much memory is needed for operations like sorting and hashing, and this memory grant is fixed during the execution. If the grant is too small, the operation spills to disk and slows down the query, and if the grant is a lot higher than the actual need, this wastes resources and can cause other queries that need memory to wait. With memory-grant feedback, SQL Server can adjust a grant so that it is closer to the actual value in recent executions. But this can only happen to cached plans from their second execution and on, and with OPTION (RECOMPILE) there is never any second execution.
The History of Forced Recompilation
You can consider this entire section a side note with some extra reading, so feel free to skip this section.
Up to SQL 2000, the only way to request compilation on every execution was to add the clause WITH RECOMPILE to the procedure header:
CREATE PROCEDURE search_orders ... WITH RECOMPILE AS
The procedure search_orders_10 has this option, but it does not include OPTION (RECOMPILE). If you run
SET STATISTICS TIME ON
EXEC search_orders_10 @prodid = 76
EXEC search_orders_10 @orderid = 11000
SET STATISTICS TIME OFF
you will find that both run with acceptable speed, which was not really the case when we tried search_orders_0 without OPTION (RECOMPILE). However, if you look at the execution plan for the second call, you will find that there is no Clustered Index Seek; instead you will find an Index Scan of one of the non-clustered indexes. If you look at the last output line from SET STATISTICS TIME (which is the time it took to run search_orders_10), you will see an execution time of some 200-300 ms, which as such is tolerable. However, Northgale only has one million orders, and as I noted earlier, a real-world database could easily have a hundredfold volume or more. Which means that the execution time will also grow a hundredfold, as there are a hundredfold more orders to scan. That is, it would take 20-30 seconds to find a single order, which is entirely unacceptable.
The reason for this is that while the procedure is recompiled every time, the optimizer does not perform any flow analysis, but optimises each statement in isolation. That is, it must consider the possibility that before execution comes to the SELECT statement, there may have been other statements which have modified the values of the parameters. Thus, the optimizer cannot handle the parameters as constants as with OPTION (RECOMPILE). It can sniff the parameter values to optimise the plan for these specific values, but the plan must still be correct for any set of values. This precludes seeks, so scans must be used instead.
With SQL 2005, Microsoft introduced statement recompile for stored procedures (previously, stored procedures were always recompiled in whole). They also exposed this feature with the hint OPTION (RECOMPILE). However, the recompiled statement still produced a plan to be put in the cache – and thus it had to be correct for all possible values. That is, the situation was no different from WITH RECOMPILE in SQL 2000 and earlier.
But that was not really what I and other users expected when we learnt about the RECOMPILE hint, and Microsoft listened to our laments. When SQL 2008 was released, OPTION (RECOMPILE) handled variables as constants, which meant a revolution for dynamic search conditions. Previously, dynamic SQL had been the only tenable solution. It did not last long, though. There was a bug, so if two users called the same procedure with OPTION (RECOMPILE) simultaneously, they could get each other’s result sets, and by CU4 of SQL 2008, Microsoft reverted to the behaviour from SQL 2005, because it was not an easy bug to fix.
However, thankfully, there was a company in Bulgaria that had invested so heavily in the new behaviour of OPTION (RECOMPILE) that there was a business case which Microsoft had to support. So in SQL 2008 SP1 CU5, OPTION (RECOMPILE) again handled variables as constants, and the hint has continued to work in that way since then. Maybe Microsoft would have fixed OPTION (RECOMPILE) at some point anyway, but this customer certainly made it happen sooner rather than later.
When SQL 2014 was relatively new, one more bug around OPTION (RECOMPILE) where users were getting each other's results was uncovered. But while the bug in early SQL 2008 was very simple to repro, this bug only manifested itself under very rare circumstances. In any case, Microsoft has fixed that bug as described in KB2965069, and the fix was also backported to SQL 2012. (But not to SQL 2008, which had gone out of mainstream support by then.)
Overall, while OPTION (RECOMPILE) certainly can be abused, it is a very powerful device when it comes to performance tuning. Not least, it is simple to use, and not only for dynamic searches.
Dynamic SQL
Why Dynamic SQL?
It might as well be said directly: solutions with dynamic SQL require more from you as a programmer. Not only in skill, but foremost in discipline and understanding of what you are doing. Dynamic SQL is a wonderful tool when used correctly, but in the hands of an inexperienced novice it far too often leads to solutions that are flawed and hopeless to maintain.
That said, the main advantages with dynamic SQL are:
- Dynamic SQL gives you a lot more flexibility; as the complexity of the requirements grow, the complexity of the code tends to grow linearly or less than so.
- Query plans are cached by the query string, meaning that commonly recurring search criteria will not cause unnecessary recompilation.
The disadvantages with dynamic SQL are:
- As I said above, poor coding discipline can lead to code that is difficult to maintain.
- Dynamic SQL introduces a level of difficulty from the start, so for problems of low to moderate complexity, it is a bit too heavy artillery.
- Because queries are built dynamically, testing is more difficult, and some odd combination of parameters can prove to yield a syntax error when a poor user tries it.
- You need to consider permissions on the tables accessed by dynamic SQL. Users do not get permission just because the dynamic query is constructed in a stored procedure; it does not work that way.
It may seem from this list that there are more disadvantages with dynamic SQL than advantages, but if you look more closely, you will see that the list is more or less complementary to the corresponding list for static SQL. Static SQL with OPTION (RECOMPILE) is good for many everyday situations, but when the requirements of what the users should be able to do become more complex, or some of the searches are too frequent for recurring recompilation to be permissible, that is when you turn to dynamic SQL, fully aware that it comes with a price.
On my web site, I have a longer article, The Curse and Blessings of Dynamic SQL, that discusses dynamic SQL in more detail, and which gives lots of tips for how to write good dynamic SQL, when to use it – and when not to. In the text that follows, I make several references to sections in that article for a more detailed discussion about issues we encounter.
The main focus of this chapter is how to implement dynamic searches by using dynamic SQL in T‑SQL procedures. However, it is also perfectly possible to implement dynamic searches in client code, and I briefly discuss this option in the last section of this chapter, Advanced Options for Dynamic Searches.
Note: a third possibility is to implement dynamic searches in CLR procedures written in .NET, and older versions of this article included examples for this, but I have removed this section. I think that the CLR option is only of interest, if you are already using the CLR, and in that case you probably already understand how to use it for dynamic search conditions.
Dynamic SQL Fundamentals
It is far too common to see people who build query strings by inlining values like this:
' AND col = ''' + convert(varchar, @value) + ''''
This is a very wrong way of doing it. It opens the door to SQL injection, it litters the plan cache, it gives problems for users who want to search for customers named O’Brien, and you can get a nightmare with date and time values. And most of all: it is very difficult to get right, as will be clear from later sections in this chapter where I take a closer look at inlining.
The normal way to build dynamic SQL is to build parameterised queries with sp_executesql, so that the above fragment reads
' AND col = @value'
We will look more at exactly how to do this when we arrive at the call to sp_executesql in search_orders_20, our first procedure that uses dynamic SQL. But before we go there, we need to discuss a few more fundamentals.
SQL Injection
When you work with dynamic SQL, you must always beware of the risk of SQL injection. As I mentioned, building SQL strings by concatenating values into them is an unacceptable anti-pattern, which is far too common. Consider this line:
' AND col = ''' + @value + ''''
Say that whatever is in @value is something the user behind the keyboard has full control over. It could come from a text box, a URL or a cookie. The user could enter something with a single quote that terminates the intended statement, inject whatever commands he or she likes to run, and end the input with two hyphens to comment out the original closing quote. If the application connects with the user’s own credentials, and the user can connect to the database by other means, for instance SSMS, it does not really matter. But it is not uncommon that web applications run with a high-privileged account – despite the fact that this is a gross violation of best practice. In this case, an SQL injection hole like the one above permits a malicious user to run both this and that on the server, in the worst case taking it over completely.
I will point out the risk for SQL injection in several places and show what an attacker may do. You may find some of these examples far-fetched and think that there are very few people who will even try to exploit them. For one thing, you might say, it would take them forever to figure out how to perform an injection attack for a specific case. However, you could not be more wrong. SQL injection is a serious issue, and it is probably one of the most common ways that sites get hacked. People who engage in these attacks are by no means just teenage kids who do this for fun; this is an industry. These people have plenty of time and good tools to find injection holes, because there is a lot of money to be made, for instance by installing ransomware or stealing data.
For a longer discussion on SQL injection, see the chapter Beware of SQL Injection in my Curse and Blessings article.
Permissions
With stored procedures using static SQL, you normally don’t have to bother about permissions. As long as the procedure and the tables have the same owner, it is sufficient for the users to have rights to run the stored procedures because of a feature known as ownership chaining. But ownership chaining never applies to dynamic SQL. Even if you build the query string inside a stored procedure, the string is never part of the procedure itself, but constitutes its own owner-less scope.
Therefore, as long as you make no other arrangements, you will have to grant the users who run your stored procedures SELECT permissions on the tables, be that per table or on schema or database level. This can be perfectly acceptable, but there can also be situations where this is a total no-no. Thankfully, there are alternatives:
- Create a certificate and sign the procedure with this certificate. Create a user from the certificate, and grant this user the required SELECT permissions.
- Add the clause
EXECUTE AS 'someuser'
to the procedure. This should be a user created WITHOUT LOGIN and which has been granted the required SELECT permissions.
I will not go into further details on these techniques here. Rather I refer you to my article Packaging Permissions in Stored Procedures, where I discuss these techniques in detail.
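As a minimal sketch of the EXECUTE AS alternative (the user and procedure names below are my own inventions and not part of the Northgale scripts), the idea looks roughly like this:

-- A user without a login that exists only to carry permissions.
CREATE USER search_user WITHOUT LOGIN
GRANT SELECT ON dbo.Orders TO search_user
go
-- A trivial procedure that runs as that user, so callers need no SELECT permission of their own.
CREATE PROCEDURE count_orders_demo @custid nchar(5) WITH EXECUTE AS 'search_user' AS
   DECLARE @sql nvarchar(MAX) =
      N'SELECT COUNT(*) AS cnt FROM dbo.Orders WHERE CustomerID = @custid'
   EXEC sp_executesql @sql, N'@custid nchar(5)', @custid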
Implementing search_orders with a Parameterised Query
After these preliminaries, it’s time to look at search_orders_20, our first search procedure to use dynamic SQL. Because it is the first procedure, I include it in its entirety. It is quite long, and for this reason I have numbered the lines in the right margin. (These numbers are not included in the file for the procedure.)
CREATE PROCEDURE search_orders_20                                          --  1
                 @orderid     int           = NULL,                        --  2
                 @status      char(1)       = NULL,                        --  3
                 @fromdate    date          = NULL,                        --  4
                 @todate      date          = NULL,                        --  5
                 @minprice    decimal(10,2) = NULL,                        --  6
                 @maxprice    decimal(10,2) = NULL,                        --  7
                 @custid      nchar(5)      = NULL,                        --  8
                 @custname    nvarchar(40)  = NULL,                        --  9
                 @city        nvarchar(25)  = NULL,                        -- 10
                 @region      nvarchar(15)  = NULL,                        -- 11
                 @country     nvarchar(15)  = NULL,                        -- 12
                 @prodid      int           = NULL,                        -- 13
                 @prodname    nvarchar(40)  = NULL,                        -- 14
                 @employeestr varchar(MAX)  = NULL,                        -- 15
                 @employeetbl intlist_tbltype READONLY,                    -- 16
                 @debug       bit           = 0 AS                         -- 17
                                                                           -- 18
DECLARE @sql       nvarchar(MAX),                                          -- 19
        @paramlist nvarchar(4000),                                         -- 20
        @nl        char(2) = char(13) + char(10)                           -- 21
                                                                           -- 22
SELECT @sql =                                                              -- 23
   'SELECT o.OrderID, o.OrderDate, o.EmployeeID, o.Status,                 -- 24
           c.CustomerID, c.CustomerName, c.Address, c.City, c.Region,      -- 25
           c.PostalCode, c.Country, c.Phone, od.UnitPrice, od.Quantity,    -- 26
           p.ProductID, p.ProductName, p.UnitsInStock, p.UnitsOnOrder      -- 27
    FROM   dbo.Orders o                                                    -- 28
    JOIN   dbo.[Order Details] od ON o.OrderID = od.OrderID                -- 29
    JOIN   dbo.Customers c ON o.CustomerID = c.CustomerID                  -- 30
    JOIN   dbo.Products p ON p.ProductID = od.ProductID                    -- 31
    WHERE  1 = 1' + @nl                                                    -- 32
                                                                           -- 33
IF @orderid IS NOT NULL                                                    -- 34
   SELECT @sql += ' AND o.OrderID = @orderid' + @nl                        -- 35
                                                                           -- 36
IF @status IS NOT NULL                                                     -- 37
   SELECT @sql += ' AND o.Status = @status' + @nl                          -- 38
                                                                           -- 39
IF @fromdate IS NOT NULL                                                   -- 40
   SELECT @sql += ' AND o.OrderDate >= @fromdate' + @nl                    -- 41
                                                                           -- 42
IF @todate IS NOT NULL                                                     -- 43
   SELECT @sql += ' AND o.OrderDate <= @todate' + @nl                      -- 44
                                                                           -- 45
IF @minprice IS NOT NULL                                                   -- 46
   SELECT @sql += ' AND od.UnitPrice >= @minprice' + @nl                   -- 47
                                                                           -- 48
IF @maxprice IS NOT NULL                                                   -- 49
   SELECT @sql += ' AND od.UnitPrice <= @maxprice' + @nl                   -- 50
                                                                           -- 51
IF @custid IS NOT NULL                                                     -- 52
   SELECT @sql += ' AND o.CustomerID = @custid' + @nl                      -- 53
                                                                           -- 54
IF @custname IS NOT NULL                                                   -- 55
   SELECT @sql += ' AND c.CompanyName LIKE @custname + ''%''' + @nl        -- 56
                                                                           -- 57
IF @city IS NOT NULL                                                       -- 58
   SELECT @sql += ' AND c.City = @city' + @nl                              -- 59
                                                                           -- 60
IF @region IS NOT NULL                                                     -- 61
   SELECT @sql += ' AND c.Region = @region' + @nl                          -- 62
                                                                           -- 63
IF @country IS NOT NULL                                                    -- 64
   SELECT @sql += ' AND c.Country = @country' + @nl                        -- 65
                                                                           -- 66
IF @prodid IS NOT NULL                                                     -- 67
   SELECT @sql += ' AND od.ProductID = @prodid' + @nl                      -- 68
                                                                           -- 69
IF @prodname IS NOT NULL                                                   -- 70
   SELECT @sql += ' AND p.ProductName LIKE @prodname + ''%''' + @nl        -- 71
                                                                           -- 72
IF @employeestr IS NOT NULL                                                -- 73
   SELECT @sql += ' AND o.EmployeeID IN' +                                 -- 74
                  ' (SELECT n FROM dbo.intlist_to_tbl(@employeestr))' + @nl -- 75
                                                                           -- 76
IF EXISTS (SELECT * FROM @employeetbl)                                     -- 77
   SELECT @sql += ' AND o.EmployeeID IN (SELECT val FROM @employeetbl)' + @nl -- 78
                                                                           -- 79
SELECT @sql += ' ORDER BY o.OrderID' + @nl                                 -- 80
                                                                           -- 81
IF @debug = 1                                                              -- 82
   PRINT @sql                                                              -- 83
                                                                           -- 84
SELECT @paramlist = '@orderid     int,                                     -- 85
                     @status      char(1),                                 -- 86
                     @fromdate    date,                                    -- 87
                     @todate      date,                                    -- 88
                     @minprice    decimal(10,2),                           -- 89
                     @maxprice    decimal(10,2),                           -- 90
                     @custid      nchar(5),                                -- 91
                     @custname    nvarchar(40),                            -- 92
                     @city        nvarchar(25),                            -- 93
                     @region      nvarchar(15),                            -- 94
                     @country     nvarchar(15),                            -- 95
                     @prodid      int,                                     -- 96
                     @prodname    nvarchar(40),                            -- 97
                     @employeestr varchar(MAX),                            -- 98
                     @employeetbl dbo.intlist_tbltype READONLY'            -- 99
                                                                           -- 100
EXEC sp_executesql @sql, @paramlist,                                       -- 101
                   @orderid, @status, @fromdate, @todate, @minprice,       -- 102
                   @maxprice, @custid, @custname, @city, @region,          -- 103
                   @country, @prodid, @prodname, @employeestr, @employeetbl -- 104
Variable Declarations
On line 19, I declare the variable @sql which will hold my query string. The type should always be nvarchar(MAX). It should be MAX, so that you can fit any query text into the variable. It must be nvarchar, or else sp_executesql will not accept it. I’m ignoring the variable @paramlist for now. On line 21, I define the variable @nl which I set to the standard line-ending in Windows, CR-LF. While technically a variable, @nl is a constant in this procedure.
Query Nucleus
On lines 23-32, I compose the nucleus of the dynamic SQL query. That is, the query we will get when all input parameters are left out. The important thing to observe here is that I use two-part notation for all tables and also specify the schema, although it is always dbo. I don’t do this in the stored procedures for static SQL, as in that case there is no actual benefit, and it only creates a bit of noise to repeat dbo. all over again. This is different with dynamic SQL, where you can face a performance penalty if you leave out the schema, for reasons I will explain in the section Compilation and Caching below.
The condition WHERE 1 = 1 on line 32 is there so that all other conditions can be added as “AND something” without having to worry about whether there already is a WHERE or not. The optimizer is smart enough to entirely ignore this condition, so there is no penalty for it. I add @nl to the string to introduce a line break to make the generated string easier to read, something I will return to.
Adding the Conditions
On lines 34-71, I check all the single-valued search parameters. If a parameter is non-NULL, I add a condition for the corresponding column to the SQL string. I use the += operator, which is a shortcut for @sql = @sql + .... Note that if I want to include a quote in the query string, I need to double it; see for instance line 71. Again, I concatenate @nl after all conditions.
On lines 73-75, I handle the @employeestr parameter in the same manner as I did in search_orders_1. That is, I use the function intlist_to_tbl to crack the list into table format. You may think that you could just do something like '... IN (' + @employeestr + ')', but that is a very bad idea that opens the door to SQL injection, and we will see an example of this later. Never do this, promise! Observe that just like for the tables, I need to refer to the function with two-part notation.
On lines 77-78, I handle the parameter @employeetbl, and this is perfectly straightforward. The only deviation is that I use EXISTS to see whether this parameter was specified rather than checking for NULL.
Finally, on line 80 I add the ORDER BY clause.
The Debug PRINT
On lines 82-83, I do something very important: if the parameter @debug is 1, I print the SQL string. This is one of those things that cannot be emphasised enough: always include a parameter that permits you to print the generated SQL string. One of the distinct disadvantages with dynamic SQL is that you can happen to concatenate the strings incorrectly, leading to syntax errors, maybe only with some combinations of input parameters. Finding such an error by looking at the code that generates the dynamic SQL is hard, but once you see the SQL string, the error may be immediately apparent. For instance, a typical error is a missing space, leading to code that reads:
WHERE 1 = 1 AND o.OrderDate <= @todateAND p.ProductName LIKE @prodname
If you look closely at the procedure code, you will see that I have already taken precautions to avoid such goofs by having a leading space in the string literals, and also by adding the “constant” @nl to each condition.
However, the main purpose of @nl is a different one: to make sure that the generated query string looks reasonably pretty. This may seem overly elaborate if you think that the string is only generated to be executed. But every once in a while, the query will have a syntax error or not do what you intended, and you will need to look at it. And in those situations you will appreciate it if it is easily legible. Having the query split up over multiple lines also means that you can use the line number in an error message to get an idea of where the error might be. (Although with modern versions of SSMS, this is anything but trivial, since SSMS remaps the line numbers SQL Server reports. See the section Error Messages and Line Numbers in my Curse and Blessings article for more details about this problem.)
If the query string is very long, it may appear to be truncated in SQL Server Management Studio because it only prints the first 4000 characters. If you run into this issue, see the section The Necessity of Debug Prints in my main article on dynamic SQL where I discuss various workarounds.
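If you just need something quick and dirty, here is a crude sketch of my own that could replace the plain PRINT on line 83: it prints the string in 4000-character chunks, which may break a line in the middle, but at least nothing is lost. The article referenced above discusses more polished alternatives.

DECLARE @pos int = 1,
        @len int = len(@sql)
WHILE @pos <= @len
BEGIN
   -- PRINT truncates long nvarchar values, so emit at most 4000 characters at a time.
   PRINT substring(@sql, @pos, 4000)
   SELECT @pos += 4000
END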
Running the Query
On lines 85 to 99, I set up the parameter list, and on lines 101 to 104 I execute the query using sp_executesql. It is extremely important to understand this procedure. This is the vehicle to run parameterised queries.
sp_executesql is a system procedure that takes two fixed parameters. The first parameter is a batch of SQL statements and the second parameter is a parameter list. These two must be of the type nvarchar; you cannot pass varchar. (And thus I have declared the variables @sql and @paramlist accordingly.) The remaining parameters are the actual values passed to the parameters in @paramlist. They must match the data types in the parameter list in the same way as when you call a regular stored procedure; that is, for these parameters, implicit conversion is permitted. The system procedure executes the batch using the parameter values you pass it.
If this sounds a little abstract, you can think of sp_executesql this way:
CREATE PROCEDURE sp <parameter list> AS
   <SQL batch>
go
EXEC sp <actual parameter values>
That is, you create a nameless stored procedure and execute it in one go.
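To make this concrete, here is a minimal stand-alone example of my own (it is not taken from any of the search_orders procedures):

DECLARE @sql nvarchar(MAX) =
   N'SELECT o.OrderID, o.OrderDate FROM dbo.Orders o WHERE o.OrderID = @orderid'
-- The second argument is the parameter list of the nameless procedure,
-- and the remaining arguments are the actual values.
EXEC sp_executesql @sql, N'@orderid int', @orderid = 11000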
In search_orders_20, the parameter names in the query string are the same as the names of the parameters in the surrounding stored procedure. But don’t be fooled. They are physically separate, and I could have used completely different names and the procedure would still have worked. The key is that the dynamic SQL cannot see any variables in the calling procedure; the dynamic SQL is a procedure of its own.
You may wonder about the syntax for the contents of @paramlist, and it follows the syntax for stored procedures, so if you find use for an OUTPUT parameter with a dynamic search procedure, yes, you can do that. Note that @paramlist includes all parameters passed to search_orders_20, save the @debug parameter, no matter whether they appear in the generated query string. There are languages where doing things like that will reward you with an error or a warning, but (thankfully) T‑SQL doesn’t mind.
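As a sketch of what that could look like (my own example, not part of search_orders_20), here is an OUTPUT parameter used to return a count from the dynamic SQL:

DECLARE @sql nvarchar(MAX) =
           N'SELECT @cnt = COUNT(*) FROM dbo.Orders WHERE CustomerID = @custid',
        @cnt int

EXEC sp_executesql @sql, N'@custid nchar(5), @cnt int OUTPUT',
                   @custid = N'ALFKI', @cnt = @cnt OUTPUT

SELECT @cnt AS order_count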
Finally, observe that I refer to the table type in the parameter list in two-part notation, just like the tables and the table-valued function.
For more information and examples with sp_executesql, please see the section on sp_executesql in my article The Curse and Blessings of Dynamic SQL.
Trying it out
Here are some test cases:
EXEC search_orders_20 @orderid = 11000, @debug = 1
EXEC search_orders_20 @custid = 'ALFKI', @debug = 1
EXEC search_orders_20 @prodid = 76, @debug = 1
EXEC search_orders_20 @prodid = 2754, @custname = 'Alfreds snabb', @debug = 1
EXEC search_orders_20 @fromdate = '19980205', @todate = '19980209', @debug = 1
EXEC search_orders_20 @city = 'Bräcke', @prodname = 'Lakka', @debug = 1
EXEC search_orders_20 @employeestr = '402,109,207', @custid = 'PERTH', @debug = 1
DECLARE @tbl intlist_tbltype
INSERT @tbl (val) VALUES(402), (109), (207)
EXEC search_orders_20 @employeetbl = @tbl, @custid = 'PERTH', @debug = 1
I have added the @debug parameter to all calls, so that you can look at the generated SQL.
If you try these and inspect the query plans, you will see that the available indexes on the search columns are used in most cases, although there are a few exceptions. For instance, the index on Orders.EmployeeID is not used in the last two examples. You can also compare the plans with the plans you get with search_orders_1. To a great extent they are identical, but there are some deviations. When I tested, I got different plans for search_orders_1 and search_orders_20 for the search on customer ID on all versions but SQL 2019. On the other hand, for the search on @city and @custname, I got different plans on SQL 2019 but on no other version. (A casual look at the plan shapes may lead you to think that there are differences on all versions for the plans that use @custname or @prodname, but that is only because search_orders_20 has some extra operators to deal with the expression which adds the per-cent character to the search pattern.)
We will see later a few more cases where we get different plans for static SQL and dynamic SQL and why that happens.
Compilation and Caching
You have learnt that sp_executesql defines a nameless stored procedure and executes it directly. You may ask if this procedure is saved in the database. No, it is not. But, and this is the key, the query plan is saved in the cache. So the next time a user runs your search procedure with exactly the same set of search parameters, SQL Server will reuse the existing plan for that query.
That is, when you use static SQL and you make these three calls:
EXEC search_orders_1 @orderid = 11000
EXEC search_orders_1 @orderid = 11001
EXEC search_orders_1 @orderid = 11002
There are also three compilations, because nothing is cached and recompilation is forced every time. Whereas with:
EXEC search_orders_20 @orderid = 11000
EXEC search_orders_20 @orderid = 11001
EXEC search_orders_20 @orderid = 11002
There is a compilation the first time, and on the second and third calls, the existing plan is reused. Add one more parameter:
EXEC search_orders_20 @orderid = 11000, @prodid = 76
And this will be a new compilation and a new cache entry. The existing plan for the search on OrderID alone is unaffected.
Now, if the procedure does not have a name, how does SQL Server find it in the cache? SQL Server looks up the procedure in the cache by computing a hash of the query string (including the parameter list). And that is the exact query string: there is no normalisation of upper/lower case, spacing, removal of comments etc. Add a single space or a comment, and it is a new cache entry. On the other hand, the parameter values are not part of the query string – just like the parameter values you pass to a stored procedure are not part of the procedure definition.
As you may understand by now, the fact that plans are cached by parameter profile is the reason why dynamic SQL is preferable when the call frequency is so high that the cost of compiling every time becomes a problem.
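If you want to verify this for yourself, you can peek into the plan cache; a rough sketch (this requires the VIEW SERVER STATE permission):

-- Lists cached entries whose text contains the tell-tale WHERE 1 = 1 from search_orders_20,
-- together with how many times each entry has been reused.
SELECT cp.usecounts, cp.objtype, st.text
FROM   sys.dm_exec_cached_plans cp
CROSS  APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE  st.text LIKE '%WHERE 1 = 1%'
  AND  st.text NOT LIKE '%dm_exec_cached_plans%'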
In the previous section, I noted in a few places that I use two-part notation and this is also related to caching. When you write a stored procedure with static SQL, there is no problem with saying only FROM Orders
and leaving out the schema, since the schema is resolved from the procedure owner, which is normally dbo, who cannot have a different default schema than dbo.
With dynamic SQL it is different, since names are resolved from the current user, who could have any schema as their default. For instance, take the user Jill with the default schema Jill. In this case, Orders alone should resolve firstly to Jill.Orders, and only if that table does not exist, to dbo.Orders. Even if there is no Jill.Orders when the query is compiled, SQL Server needs to consider that Jill.Orders could appear at any time, however unlikely that may be. But this means that Jill cannot share a cache entry with the user Jack who has the default schema Jack. Now, imagine that all users have their own default schema – the cache will be littered with plans for the same query. By prefixing all objects with dbo you remove the ambiguity, and all users can share the same cache entry. Yes, it is very common that all users have dbo as their default schema, in which case this is a non-issue. However, this is nothing you should rely on; you should use two-part notation with dynamic SQL as a matter of best practice.
In search_orders_20 this applies to all the table names, the user-defined function and not to forget, the user-defined type in the parameter list. As I mentioned above, the parameter list is also part of the nameless stored procedure.
Confession: it only took me twelve years from the first publication of this article until I had dbo in all places where it is needed in search_orders_20. It took about seven years until I prefixed the UDF and another five years until I prefixed the table type. Very embarrassing.
Special Search Conditions
For static SQL, we looked at how to implement some special search conditions. Let’s review these for dynamic SQL as well.
Optional Tables
The example with the @suppl_country parameter is of course trivial to handle with dynamic SQL:
IF @suppl_country IS NOT NULL
   SELECT @sql += ' AND EXISTS (SELECT *
                                FROM   dbo.Suppliers s
                                WHERE  s.SupplierID = p.SupplierID
                                  AND  s.Country = @suppl_country)' + @nl
It is just another condition to add. You find this piece of code in search_orders_21.
Alternate Tables
This is also something which is simple to handle with dynamic SQL. To implement the parameter @ishistoric, I have changed the lines 28-29 in search_orders_20 with these in search_orders_21:
FROM dbo.' + CASE @ishistoric
                  WHEN 0 THEN 'Orders'
                  WHEN 1 THEN 'HistoricOrders'
             END + ' o
JOIN dbo.' + CASE @ishistoric
                  WHEN 0 THEN '[Order Details]'
                  WHEN 1 THEN 'HistoricOrderDetails'
             END + ' od ON o.OrderID = od.OrderID
Compared to the solution with static SQL in search_orders_7, this is a little more compact and less repetitive.
Some readers may think that you could just pass a table name from the application, but that is not a very good idea. What if the table is renamed or moved to a different schema? If all references to tables are in stored procedures, they are relatively easy to find, even if they are in dynamic SQL. (For a discussion of finding references in stored procedures, see my short story Where Is that Table Used?) But an unexpected reference from client code can easily be missed, leading to code that unexpectedly breaks.
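As a quick, though not foolproof, sketch of such a search (it only finds references that appear literally in the module text):

SELECT object_schema_name(m.object_id) AS schema_name,
       object_name(m.object_id)        AS module_name
FROM   sys.sql_modules m
WHERE  m.definition LIKE '%HistoricOrders%'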
Also, if you pass a table name from the application to splice it into the query string, you need to consider SQL injection. You can wrap the name in quotename() (discussed more closely below), but what if the client passes the name in two-part notation? Sure, this can be handled, and I discuss how in the section Receiving Names of Tables and Other Objects as Parameters in my Curse and Blessings article. But rather than doing all that, it is easier to map a variable to a table name through CASE. Not the least in this example when you can use one parameter to map to a pair of tables.
Choice of Sort Order
It may seem that this could be done as simply as:
@sql += ' ORDER BY ' + @sortcol
It cannot be denied that it does have a certain allure. @sortcol does not have to be a single column; the application could pass a value like 'CustomerID, OrderID DESC'. However, this creates a dependency between client and database, as I discussed for alternate tables above. The DBA could get the idea to refactor the database, so that columns are now called order_id, customer_id etc. This could be done without affecting the client, as SELECT lists returning these columns could be changed to read:
SELECT O.order_id AS OrderID, ...
But if the client passes actual column names, the client will also be affected by the name change. Now, you may argue that such radical name changes never happen in real life, and you may be right. But it has certainly happened that I have changed an occasional column name or two because of misspelling or whatever reason.
However, the real issue here is SQL injection. Yeah, it could be that the users select the sort conditions from a dropdown and there is no way they can type something. But when you write your SQL code, you should never make any such assumptions. What is true for the app today may not be true tomorrow. Someone slaps on a web UI and the parameter values come in through a URL. Maybe that web application connects as sa. (Bad! Bad! Bad!) And then some user out on the evil Internet passes '1 DROP DATABASE Important --'. When you write your stored procedures, you should always have SQL injection in mind and assume the worst.
There is a simple way to prevent SQL injection. You can wrap @sortcol in quotename():
@sql += ' ORDER BY ' + quotename(@sortcol)
quotename() adds brackets around the value, doubling any right brackets there may be in it, so that you still have a legal identifier. However, this means that @sortcol can only be a single column name, and there can be no ASC/DESC. Well, you could parse the string to get the components, but then it is no longer as simple as it looked initially.
So in the end, I think it is much better to use CASE to map input values to sort columns:
SELECT @sql += ' ORDER BY ' +
               CASE @sortcol WHEN 'OrderID'      THEN 'o.OrderID'
                             WHEN 'EmployeeID'   THEN 'o.EmployeeID'
                             WHEN 'ProductID'    THEN 'od.ProductID'
                             WHEN 'CustomerName' THEN 'c.CompanyName'
                             WHEN 'ProductName'  THEN 'p.ProductName'
                             WHEN 'OrderDate'    THEN 'o.OrderDate'
                             ELSE 'o.OrderID'
               END +
               CASE @isdesc WHEN 0 THEN ' ASC' ELSE ' DESC' END
This is reminiscent of how we did it for static SQL, but it is a lot simpler, since we don’t need to have different CASE expressions for different data types. And we can add a parameter for ASC/DESC without having to double everything. Note here the ELSE branch of the CASE expression. This is needed in case the client passes NULL or something unexpected in @sortcol. Without the ELSE, the CASE expression would return NULL, and @sql would be set to NULL, and nothing would happen when the procedure runs. The @debug output would be blank, so it would be difficult to understand what is going on.
If you want multiple sort columns, you could repeat the above, but a better option is to introduce a mapping table:
DECLARE @namecolmap TABLE (sortname nvarchar(30)  NOT NULL PRIMARY KEY,
                           colexpr  nvarchar(100) NOT NULL)

INSERT @namecolmap (sortname, colexpr)
   VALUES('OrderID',      'o.OrderID'),
         ('EmployeeID',   'o.EmployeeID'),
         ('ProductID',    'od.ProductID'),
         ('CustomerName', 'c.CustomerName'),
         ('ProductName',  'p.ProductName'),
         ('OrderDate',    'o.OrderDate')

IF @sortcol1 IS NOT NULL
BEGIN
   SELECT @sql += ' ORDER BY ' +
                  isnull((SELECT colexpr FROM @namecolmap WHERE sortname = @sortcol1),
                         '[Unknown sort option]') +
                  CASE @isdesc1 WHEN 0 THEN ' ASC' ELSE ' DESC' END
END

IF @sortcol2 IS NOT NULL
BEGIN
   SELECT @sql += ', ' +
                  isnull((SELECT colexpr FROM @namecolmap WHERE sortname = @sortcol2),
                         '[Unknown sort option]') +
                  CASE @isdesc2 WHEN 0 THEN ' ASC' ELSE ' DESC' END
END
This is the code you find in search_orders_21 where the defaults for @sortcol1 and @sortcol2 are NULL, so by default nothing is sorted. There is more than one way to handle the situation where the lookup in the mapping table finds no row. Here I have opted to use something that produces a compilation error, but you may prefer something else. Whatever, just make sure that you don’t produce a NULL @sql string!
OPTION (RECOMPILE) and Dynamic SQL
We will now start to look at situations where search_orders_20 performs less well than search_orders_1 and how we can address these issues.
You may remember this pair from the chapter on static SQL with OPTION (RECOMPILE):
EXEC search_orders_1 @custid = 'ERNTC', @fromdate = '19980218', @todate = '19980219'
EXEC search_orders_1 @custid = 'BOLSR', @fromdate = '19980101', @todate = '19981231'
While these two calls have the same parameter profile, there are still differences. The customer ERNTC is by far the most active customer in Northgale, while BOLSR only has a single order. At the same time, the date intervals are quite different in length. You may also recall that we got different execution plans for these two calls with search_orders_1.
This does not happen if you instead perform the searches with search_orders_20. The plan for ERNTC will be placed into the cache, and will also be used for the search on BOLSR, but if you check the actual execution plan for BOLSR, you will find a Hash Match operator that spills to disk, which is bad.
We will look at several solutions to this dilemma, and we start with the most simple-minded of them all: OPTION (RECOMPILE). I have said that one of the advantages of using dynamic SQL over static SQL with OPTION (RECOMPILE) is that we can get the benefit of a cached plan and reduce the cost of compilation. But that does not mean we cannot use OPTION (RECOMPILE) at all in combination with dynamic SQL.
If you are using dynamic SQL for the sole reason that you think that a solution with static SQL is too messy, but you are not expecting any particularly high call frequency, there is all reason to change line 80 in search_orders_20 to read:
SELECT @sql += ' ORDER BY o.OrderID OPTION (RECOMPILE)' + @nl
Even if you expect a call frequency so high that you deem that you cannot use OPTION (RECOMPILE) on a general basis, you may still deem that the frequency of the particular combination above is such that it pays off to use OPTION (RECOMPILE):
IF @custid IS NOT NULL AND (@fromdate IS NOT NULL OR @todate IS NOT NULL)
   SELECT @sql += ' OPTION (RECOMPILE)' + @nl
Or, going back to the discussion we had for static SQL, what if the vast majority of the calls are by @orderid, and the other combinations are used a lot less? In such a case, this may be a good bet:
SELECT @sql += ' ORDER BY o.OrderID' + @nl

IF @orderid IS NULL
   SELECT @sql += ' OPTION (RECOMPILE)' + @nl
And there is not really an end to it. Thanks to the flexibility of dynamic SQL, the possibilities are nearly endless.
In the following sections, we assume that OPTION (RECOMPILE), for one reason or another, is not an option for us. We will encounter situations where dynamic SQL with cached plans gives us problems that we don’t get with OPTION (RECOMPILE), and we will look at how we can address these issues.
Columns with Filtered Indexes
Here are two calls to find new orders, one with static SQL with OPTION (RECOMPILE) and one with dynamic SQL. I’ve bracketed these in SET STATISTICS IO, so that we can compare the two approaches.
SET STATISTICS IO ON
EXEC search_orders_1  @status = 'N'
EXEC search_orders_20 @status = 'N'
SET STATISTICS IO OFF
The output from STATISTICS IO indicates that there are some differences:
Table 'Order Details'. Scan count 6, logical reads 18
Table 'Customers'. Scan count 0, logical reads 12
Table 'Orders'. Scan count 1, logical reads 20
Table 'Products'. Scan count 1, logical reads 80
Table 'Worktable'. Scan count 0, logical reads 0

Table 'Orders'. Scan count 9, logical reads 25170
Table 'Products'. Scan count 0, logical reads 380
Table 'Order Details'. Scan count 6, logical reads 18
Table 'Customers'. Scan count 0, logical reads 12
If you look into the query plans, you find that with static SQL and OPTION (RECOMPILE) in search_orders_1, there is an Index Seek on the index on Orders.Status. But with dynamic SQL in search_orders_20 that index is not used, but the Orders table is accessed through a Clustered Index Scan.
This is not just bad luck, but there is good reason for this. Recall that the index on Status is defined like this:
CREATE INDEX status_ix ON Orders(Status) WHERE Status <> 'C'
Why does SQL Server use the index in one case but not the other? When OPTION (RECOMPILE) is in effect, @status can be handled as a constant, and the index can be used, as the value 'N' is covered by the index. But with dynamic SQL, there is a plan that is entered into the cache for possible reuse, so SQL Server must produce a plan which gives the correct result also if @status is 'C'.
One way to handle the situation is to add the index filter to the query when the search parameter has a value included by the index filter. To do this, change these lines:
IF @status IS NOT NULL
   SELECT @sql += ' AND o.Status = @status' + @nl
to read:
IF @status IS NOT NULL
   SELECT @sql += ' AND o.Status = @status' +
                  CASE WHEN @status <> 'C'
                       THEN ' AND o.Status <> ''C'''
                       ELSE ''
                  END + @nl
You find these lines in search_orders_22 and if you try this call, you will find that there is an Index Seek:
EXEC search_orders_22 @status = 'N', @debug = 1
Search on a Single Date
To study this case, we need an extra index:
CREATE INDEX ExtraIX ON Orders(OrderDate, EmployeeID)
This index is not included in the script that builds Northgale, so you need to create it. If you are on SQL 2019 or later, lower the compatibility level to 140 (I will explain why later):
ALTER DATABASE Northgale SET COMPATIBILITY_LEVEL = 140
Then run these two calls to search_orders_1 and search_orders_20, which both search for orders handled by certain employees on one specific day:
SET STATISTICS IO ON
DECLARE @employees intlist_tbltype
INSERT @employees VALUES(331), (339), (538)
EXEC search_orders_1  @fromdate = '19980217', @todate = '19980217',
                      @employeetbl = @employees
EXEC search_orders_20 @fromdate = '19980217', @todate = '19980217',
                      @employeetbl = @employees
SET STATISTICS IO OFF
I expect you to get an output like this:
Table 'Products'. Scan count 0, logical reads 180
Table 'Order Details'. Scan count 33, logical reads 99
Table 'Customers'. Scan count 0, logical reads 66
Table 'Worktable'. Scan count 0, logical reads 0
Table 'Orders'. Scan count 3, logical reads 108
Table '#A944F117'. Scan count 1, logical reads 2

Table 'Products'. Scan count 0, logical reads 180
Table 'Order Details'. Scan count 33, logical reads 99
Table 'Customers'. Scan count 0, logical reads 66
Table 'Worktable'. Scan count 0, logical reads 0
Table 'Orders'. Scan count 3, logical reads 30679
Table '#A944F117'. Scan count 1, logical reads 2
As you can see, with search_orders_20, we read a lot more from the Orders table, and if you look at the execution plans, you find that search_orders_1 uses ExtraIX, but this is not the case for search_orders_20. Why is this?
In search_orders_1 with OPTION (RECOMPILE), SQL Server is able to shortcut the interval condition on @fromdate and @todate
OrderDate >= @fromdate AND OrderDate <= @todate
to be
OrderDate = '19980217'
This makes it very easy to utilise ExtraIX. SQL Server knows that all matching rows are located in one contiguous area in the index.
But this shortcut is not possible with search_orders_20, since the optimizer is building a plan that will be stored in cache, and which must work with any length of the date interval. With an interval, the matching rows are scattered into several places in the index in the general case, and SQL Server would have to scan the index between @fromdate and @todate, which is not so good when the range is long.
This is another example where OPTION (RECOMPILE) gives us something “for free” but where we have to work a little harder when we use dynamic SQL. But only a little harder. In search_orders_22, I have this code:
IF @fromdate = @todate
   SELECT @sql += ' AND o.OrderDate = @fromdate' + @nl
ELSE
BEGIN
   IF @fromdate IS NOT NULL
      SELECT @sql += ' AND o.OrderDate >= @fromdate' + @nl
   IF @todate IS NOT NULL
      SELECT @sql += ' AND o.OrderDate <= @todate' + @nl
END
If you run the example above with search_orders_22, you will see that the execution plan is the same as with search_orders_1 and so is the output from SET STATISTICS IO.
I think there is all reason to handle date intervals this way as a matter of routine, unless you believe that searches for a single day would be a rare thing. (Personally, I think that for many searches where you have a from-date and a to-date, the most common search is for a single day.) For other intervals, this may not be equally compelling, although it entirely depends on whether searches for a single value will be common. (And whether there is any index where it actually can matter. For the pair @minprice/@maxprice there is little point, since there is no index on the column UnitPrice.)
The reason I asked you to lower the compatibility level to 140 for SQL 2019 is that on compatibility level 150, the optimizer comes up with a quite clever plan. It reads all rows for 1998-02-17 from ExtraIX and then filters the employees against a bitmap which is part of a Hash Match. While the plan is likely to still be less efficient than the plan for search_orders_1, the difference is nowhere as striking as on older versions.
Before you move on, drop the extra index and restore the compatibility level if you are on SQL 2019:
DROP INDEX ExtraIX ON Orders
ALTER DATABASE Northgale SET COMPATIBILITY_LEVEL = 150
Dealing with Parameter Sniffing
In this section, we will look at how we can deal with parameter sniffing when we cannot take the cheap way out with OPTION (RECOMPILE). We have looked at this search a few times:
EXEC search_orders_1 @custid = 'ERNTC', @fromdate = '19980218', @todate = '19980219'
EXEC search_orders_1 @custid = 'BOLSR', @fromdate = '19980101', @todate = '19981231'
As noted previously, we get different plans for the two searches with search_orders_1, but we get the same cached plan for both with search_orders_20, and this plan is not really good for the search on BOLSR, where it leads to hash spill. What measures can we take to prevent this?
Changing the Query Text
As I discussed previously, SQL Server looks up queries in the cache by hashing the query text as-is, without any normalisation. This means that two queries with different query text are different entries in the cache, even if they are logically equivalent and all that differs is a space or a comment. So an idea here is to look at the length of the date interval and add a comment to produce different query strings and thus different cache entries. And, hopefully, different query plans when needed. This is taken from search_orders_23.
IF @fromdate = @todate
   SELECT @sql += ' AND o.OrderDate = @fromdate' + @nl
ELSE
BEGIN
   IF @fromdate IS NOT NULL AND @todate IS NOT NULL
   BEGIN
      DECLARE @days int = datediff(DAY, @fromdate, @todate) + 1
      SELECT @sql += ' -- ' +
                     CASE WHEN @days < 7  THEN convert(varchar, @days)      + ' days.'
                          WHEN @days < 35 THEN convert(varchar, @days / 7)  + ' weeks.'
                          ELSE                 convert(varchar, @days / 30) + ' months.'
                     END + @nl
   END
   IF @fromdate IS NOT NULL
      SELECT @sql += ' AND o.OrderDate >= @fromdate' + @nl
   IF @todate IS NOT NULL
      SELECT @sql += ' AND o.OrderDate <= @todate' + @nl
END
First there is the special-casing for @fromdate and @todate being the same that I introduced in the previous section. If both @fromdate and @todate are given, but are different, I compute the length of the interval, and then I add a comment about the length to the query text. As you can see, as long as the period is less than a week, I add the length in days. Up to four weeks, I add the length in weeks, and beyond that in months. Thus, intervals of 2, 3, 4, 5 and 6 days will all get different cache entries, but 7, 8, 9 up to 13 days will share the same cache entry, and so on. A simpler version would be to simply add the interval length in days, but that could lead to quite a few cache entries with the same plan, and if date intervals are frequently combined with other search conditions, this could lead to some undesirable cache littering. The approach I’m suggesting is still likely to produce multiple cache entries with the same plan, but hopefully in a more reasonable manner.
But as they say, your mileage may vary, and you may have to fine-tune this for your search problem.
Before we run search_orders_23, we will look at a second approach.
Inlining Some Values
The customer ERNTC accounts for 10 % of the orders in Northgale. This makes it conceivable that plans that are good for smaller customers are not good for this customer and vice versa. Also, when someone from ERNTC calls and enquires about their orders, you want the best query plan and response time you can get. Therefore, it makes sense to say that if the customer is ERNTC, you simply put that into the query string to give the optimizer as much information as possible. This is also in search_orders_23:
IF @custid IS NOT NULL
   SELECT @sql += ' AND o.CustomerID = ' +
                  CASE @custid WHEN 'ERNTC' THEN '''ERNTC'''
                               ELSE '@custid'
                  END + @nl
You can try these cases. Look both at the debug output and the execution plans:
EXEC search_orders_23 @custid = 'ERNTC', @fromdate = '19980218', @todate = '19980219', @debug = 1
EXEC search_orders_23 @custid = 'QUISH', @fromdate = '19980218', @todate = '19980219', @debug = 1
EXEC search_orders_23 @custid = 'BOLSR', @fromdate = '19980101', @todate = '19981231', @debug = 1
QUISH is a decently active customer with 1 257 orders. When you look at the execution plans, the plan for ERNTC is the same as before. The plan for QUISH is similar to the plan for ERNTC in that it uses both the index on CustomerID and the index on OrderDate and joins them with a Hash Match, but the plan for QUISH has the indexes in reverse order. For BOLSR we now get a more sensible plan that only uses the index on CustomerID, the same plan as we got with search_orders_1.
When it comes to inlining values like this, this is something you may consider when you have big skews. 60 % of your customers are in Berlin (because that is where you are located)? All reason to hard-code Berlin in the query string when Berlin is selected. You have a high-selling product that accounts for 25 % of your sales? By all means. But keep in mind that since each new query string is a new cache entry, inlining values too frivolously will lead to cache pollution which is not good for the overall system performance. You need to know your data.
Here we only looked at inlining constant values, known in advance. In the next section we will discuss inlining of variable values, a more advanced scenario.
Index Hints and Other Hints
Sometimes index hints, or other query hints than OPTION (RECOMPILE), can be useful. Returning to our example with ERNTC and BOLSR, we can make the observation that the plan we get for ERNTC, where the seeks on the indexes on CustomerID and OrderDate are joined in a Hash Match, is not really good for the search on BOLSR, because it leads to hash spills. But the opposite does not really apply. The plan for BOLSR, which uses only the index on CustomerID, works decently for ERNTC as well. Thus, we could opt to force that index:
FROM dbo.Orders o ' +
     CASE WHEN @custid IS NOT NULL AND
               (@fromdate IS NOT NULL OR @todate IS NOT NULL)
          THEN 'WITH (INDEX = custid_ix) '
          ELSE ''
     END
I cannot say that this is my favourite strategy. Overall, index hints are something you should use sparingly. Casually used, they can cause performance problems because you force the optimizer to use the completely wrong index.
A hint that makes a little more sense is the hint OPTIMIZE FOR. If you want plans to always be compiled with the big customers in mind, you could add
IF @custid IS NOT NULL
   SELECT @sql += ' OPTION (OPTIMIZE FOR (@custid = ''ERNTC''))'
Or if you want to discourage the optimizer from producing different plans depending on what it sniffs for the date parameters, you could add:
IF @fromdate IS NOT NULL AND @todate IS NOT NULL
   SELECT @sql += ' OPTION (OPTIMIZE FOR (@fromdate UNKNOWN, @todate UNKNOWN))'
The optimizer will now apply its standard assumption for a closed interval, which is 10 % or so, which typically is too much to make the index interesting. It goes without saying that for this to make sense, the searches on date intervals should normally be broad. If the typical search is for a single day, OPTIMIZE FOR UNKNOWN is not that bright – unless you special-case this like I did in search_orders_22.
Adding More Indexes
Before I close this section, I would like to add the remark that one reason why you may encounter a situation where you want different plans for different parameters is simply that your indexing is imperfect. If we take the example with the query on customer ID and a date range, the best solution to that particular problem may be to add a composite index on (CustomerID, OrderDate).
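As a sketch, such an index could look like this (the index name is my own invention):

CREATE INDEX custid_orderdate_ix ON dbo.Orders (CustomerID, OrderDate)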
Inlining Variable Values
In the previous section we looked at inlining constant values. In this section we will look at how to inline values we get from variables. The situations where you have reason to consider doing so are not that common. In the search_orders example, there are two things that come to mind:
- The @status parameter, which only has four possible values. There is a big skew, with 'C' taking up over 99 % of the rows. The distribution of the other three is about the same, but it is definitely reasonable to say that a search for new orders is a different search than a search for orders with errors. Thus, it can make sense to inline @status into the query string. This also addresses the problem with the filtered index.
- Rather than having a single big customer like ERNTC, you have a handful, and maybe they change over time. So you maintain a table with their customer IDs, and if you get a match, you inline the customer ID into the query (see the sketch after this list). I would say that this certainly counts as “advanced”.
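Here is a sketch of that second idea, assuming a hypothetical table dbo.BigCustomers that you maintain yourself (it is not part of Northgale). Since CustomerID is nchar, the inlined literal gets an N prefix and is wrapped in quotename(); more about that in the sections on inlining string values below.

IF @custid IS NOT NULL
   SELECT @sql += ' AND o.CustomerID = ' +
                  CASE WHEN EXISTS (SELECT * FROM dbo.BigCustomers b
                                    WHERE  b.CustomerID = @custid)
                       THEN 'N' + quotename(@custid, '''')
                       ELSE '@custid'
                  END + @nl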
There is a second purpose with this section. It is far too common to see developers who inline everything and who do not use parameterised queries. So I like to show how much more difficult this is to get right and demonstrate the risk for SQL injection. We will work with the procedure search_orders_24 where all parameters (except for @employeetbl) are inlined. This is not a very good procedure, since about every new search will generate a different query string, a different hash value and a different cache entry. And possibly a unique query plan, but more likely the same plan will be generated multiple times. Note that this is different from when we use static SQL with OPTION (RECOMPILE). Take these three calls:
EXEC search_orders_1 @orderid = 11000
EXEC search_orders_1 @orderid = 11001
EXEC search_orders_1 @orderid = 11002
You get three compilations, and one single cache entry (although it is never really used). Change the sequence to call search_orders_20 instead. Now there is one compilation and there is one cache entry which is reused. If you use search_orders_24, there are again three compilations, but there are now three cache entries. The worst of both worlds so to speak.
Note: there are two settings in SQL Server to alleviate the effects of inlining galore.
- There is a database setting FORCED PARAMETERIZATION. When this is in effect, SQL Server will parameterise all literals in a query. This is not a setting that I recommend, but it may be needed when developers have refused to adhere to best practice and inlined everything.
- There is a server configuration option optimize for ad hoc workloads (which from SQL 2019 is also a database-scoped configuration option). With this setting, the plan for an unparameterised query is initially not placed in the cache; the cache entry holds only the query text. If that query text reappears a second time, the plan is cached. It is generally considered best practice to have this setting enabled, but note that it does not remove the cache littering entirely; it only reduces the memory footprint of the litter. (A sketch of how to enable these settings follows after this list.)
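For reference, here is a sketch of how these settings are enabled; adapt to your environment and test before enabling anything in production:

-- Database setting for forced parameterisation.
ALTER DATABASE Northgale SET PARAMETERIZATION FORCED

-- Server-wide setting; optimize for ad hoc workloads is an advanced option.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'optimize for ad hoc workloads', 1
RECONFIGURE

-- On SQL 2019 and later, the corresponding database-scoped option.
ALTER DATABASE SCOPED CONFIGURATION SET OPTIMIZE_FOR_AD_HOC_WORKLOADS = ON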
Inlining Numeric Values
Inlining integer and decimal values is straightforward:
IF @orderid IS NOT NULL
   SELECT @sql += ' AND o.OrderID = '  + convert(varchar(10), @orderid) + @nl +
                  ' AND od.OrderID = ' + convert(varchar(10), @orderid) + @nl
The only thing you need to consider is the length to convert to. (And if you are on SQL 2012 or later, you can avoid this issue by using the concat function, as in the sketch below. I’m not using it here, to permit the code to run on SQL 2008.)
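A sketch of what the concat variant could look like (concat converts the numeric values to strings implicitly, so no explicit length is needed):

IF @orderid IS NOT NULL
   SELECT @sql += concat(' AND o.OrderID = ',  @orderid, @nl,
                         ' AND od.OrderID = ', @orderid, @nl)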
I would think that numeric types are the most likely to be inlined, for instance customer and product IDs. (Yes, in Northgale the customer IDs are nchar(5), a legacy from Northwind, but that is certainly quite odd.)
Inlining String Values
Inlining string values is something you should avoid, for the simple reason that there is more than one way you can go wrong, as this subsection will show you.
For most parameters in search_orders_24 I have adhered to best practices for inlining, if there is anything that can be called “best practice” for something which in itself generally is a bad practice. I have also added two parameters to show the problems with worst practices, and one of them is @city_bad (we will come to the other in the next section). A naïve and inexperienced user who has not learnt to use parameterised statements may implement the City parameter like this:
SELECT @sql += ' AND c.City = ''' + @city_bad + ''''
This has a number of problems. One is the missing N for nvarchar. Try this:
EXEC search_orders_24 @city_bad = N'Łódź'
EXEC search_orders_24 @city_bad = N'Luleå'
While there are orders from customers in both Łódź (in Poland) and Luleå (in Sweden), you are likely to find that at least one of the calls returns an empty result set; exactly which depends on your collation. (If you have a UTF-8 collation, introduced in SQL 2019, both calls will return data, though.) String literals preceded by N are nvarchar, while those without are varchar and can only hold characters from the code page of the collation; characters not present in the code page are replaced by fallback characters. You can see this by running this SELECT:
SELECT 'Łódź', 'Luleå'
Again, exactly what you see depends on your collation, but you may see Lódz, Lodz, Lulea, or even have question marks in some places. Put N before the literals, and both names will come back correctly.
That is, however, the small problem. Let’s say that we want to find orders for customers who are located in the same city as the customer TCHAD:
DECLARE @chad nvarchar(25)
SELECT @chad = City FROM Customers WHERE CustomerID = 'TCHAD'
EXEC search_orders_24 @city_bad = @chad, @debug = 1
This fails, and here is the debug output and the error messages:
SELECT o.OrderID, o.OrderDate, o.EmployeeID, o.Status,
c.CustomerID, c.CustomerName, c.Address, c.City, c.Region,
c.PostalCode, c.Country, c.Phone, od.UnitPrice, od.Quantity,
p.ProductID, p.ProductName, p.UnitsInStock, p.UnitsOnOrder
FROM dbo.Orders o
JOIN dbo.[Order Details] od ON o.OrderID = od.OrderID
JOIN dbo.Customers c ON o.CustomerID = c.CustomerID
JOIN dbo.Products p ON p.ProductID = od.ProductID
WHERE 1 = 1
  AND c.City = 'N'Djamena'
 ORDER BY o.OrderID

Msg 102, Level 15, State 1, Line 293
Incorrect syntax near 'Djamena'.
Msg 105, Level 15, State 1, Line 293
Unclosed quotation mark after the character string '
ORDER BY o.OrderID
'.
That customer is indeed located in Chad, more precisely in the capital, N’Djamena, and the syntax error is caused by the single quote in the name. Not really a good user experience, is it?
There is also the issue of SQL injection. Although the parameter only permits 25 characters, some damage could still be done. You can try the effect of:
EXEC search_orders_24 @city_bad = '''DROP TABLE Important --', @debug = 1
I recommend that you study the debug output to see how the injection is possible.
Let us now look at the proper way to inline a string parameter. We have already mentioned the function quotename() a few times, but it is time to give it a closer look. This built-in function delimits a string with the delimiter(s) you specify, and, this is the important part: if the string includes the closing delimiter, this character is doubled, so that you still have a legal identifier. The default delimiter is [], and the main purpose of quotename() is to quote identifiers when you generate queries from metadata, but you can use other delimiters, as seen in these examples:
SELECT quotename('Order Details')                      -- => [Order Details]
SELECT quotename('Two o' + char(39) + 'clock', '''')   -- => 'Two o''clock'
SELECT quotename('abc{}def', '{')                      -- => {abc{}}def}
This is how the proper @city parameter in search_orders_24 is implemented:
IF @city IS NOT NULL
   SELECT @sql += ' AND c.City = N' + quotename(@city, '''') + @nl
There are three changes to @city_bad:
- There is an N to indicate that the literal is nvarchar, so that no characters can be lost.
- I wrap the variable in quotename() with single quote as the delimiter.
- There is an @nl at the end of the line to make the output easier to read.
You can try these calls:
DECLARE @chad nvarchar(25)
SELECT @chad = City FROM Customers WHERE CustomerID = 'TCHAD'
EXEC search_orders_24 @city = @chad, @debug = 1
EXEC search_orders_24 @city = N'Łódź', @debug = 1
EXEC search_orders_24 @city = N'Luleå', @debug = 1
EXEC search_orders_24 @city = '''DROP TABLE Important --', @debug = 1
The first three all return data, while the last one does not for obvious reasons. And none of them fail. Or drop any tables.
When you inline string parameters in dynamic SQL, you should wrap them in quotename() as a matter of routine. Well, almost. quotename() returns NULL if the input exceeds 128 characters. I find it difficult to see a situation where it makes sense to inline such a long value in a dynamic search query, though. But would you have need for this elsewhere, you can use my function quotestring() from my Curse and Blessings article.
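You can see the limit for yourself with a quick test:

SELECT quotename(replicate('x', 128), '''')   -- Returns a quoted 128-character string.
SELECT quotename(replicate('x', 129), '''')   -- Returns NULL.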
I mentioned that one parameter in search_orders that is reasonable to inline is the @status parameter. Here is how it looks in search_orders_24:
IF @status IS NOT NULL
   SELECT @sql += ' AND o.Status = ' + quotename(@status, '''') + @nl
While @status is only char(1), making SQL injection kind of impossible, there is still no reason not to use quotename(). What if the application unexpectedly sends a single quote? Should that lead to a syntax error?
Apart from that, do you see a difference compared to the example with the @city parameter? Yes, the N for nvarchar is not there. And, no, this is not a whim, it is very important. If you have an nvarchar column and an nvarchar parameter, you need to have the N there, so that characters outside the set for varchar are not mutilated. But if you have a varchar column, you should never put an N before the literal, because this can have disastrous effects on performance. When two data types meet in SQL Server, SQL Server applies its rules for data-type precedence, and converts the value of lower precedence to the other type. char and varchar have lower precedence than nchar and nvarchar, so would you incorrectly say:
IF @status IS NOT NULL
   SELECT @sql += ' AND o.Status = N' + quotename(@status, '''') + @nl
the Status column would be converted to nvarchar. If you have a Windows collation, this is only a speed bump, as the index on Status can still be utilised, if in a little less efficient way. But if you have an SQL collation, the index cannot be used at all, which means that you could be in for a full table scan.
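If you want to see the effect for yourself, you can compare the plans of the two queries below. This is only a sketch: the value 'N' is a placeholder, and the actual impact depends on whether there is an index on Status and on the collation of the column.

SELECT COUNT(*) FROM dbo.Orders WHERE Status = 'N'    -- comparison stays in varchar
SELECT COUNT(*) FROM dbo.Orders WHERE Status = N'N'   -- column is converted to nvarchar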
Inlining Date/time Values
It is difficult to see a case where it would make sense to inline a date or time value into a query string, so I will be brief. Here is how @todate is handled in search_orders_24:
IF @todate IS NOT NULL
   SELECT @sql += ' AND o.OrderDate <= ' + quotename(convert(char(8), @todate, 112), '''') + @nl
This is the pattern you should follow if you ever get the idea to inline a date value into a query string. Convert the date value to char(8), using exactly format code 112 (which gives you YYYYMMDD), and then wrap that in quotename(). To inline a value with both date and time, use code 126 instead and a longer data type. If you want to know why and more details, I refer you to my fellow MVP Tibor Karaszi's article The ultimate guide to the datetime datatypes.
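As a quick illustration of the two format codes, here is a small sketch with a made-up datetime value:

DECLARE @d datetime = '2024-03-15T14:30:00'
SELECT quotename(convert(char(8), @d, 112), '''')    -- => '20240315'
SELECT quotename(convert(char(23), @d, 126), '''')   -- => '2024-03-15T14:30:00.000'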
Multi-Valued Parameters with Dynamic SQL
If you know that OPTION (RECOMPILE) will always be present in the SQL string when your multi-valued parameter is specified, it can certainly make sense to inline the values to give the optimizer full information about them. And the converse applies: if OPTION (RECOMPILE) is not included in the string, I find it very difficult to see a situation where it would be a good idea to inline the list of values. Keep in mind that with multi-valued parameters there cannot really be any parameter-sniffing issues, since the optimizer does not know the values in the list. (It can sniff the number of elements in a TVP, and on SQL 2017 also for a comma-separated list under some circumstances as I discussed earlier, though.)
It seems simple, doesn’t it? This code is in search_orders_24:
IF @employeebad IS NOT NULL
   SELECT @sql += ' AND o.EmployeeID IN (' + @employeebad + ')' + @nl
And sure, this gives the desired result:
EXEC search_orders_24 @employeebad = '402,109,207', @debug = 1
But as you may guess from the parameter name, there are things that are not so good. Look at this:
EXEC search_orders_24 @debug = 1, @employeebad =
'0) CREATE DATABASE KilroyWasHere
SELECT TOP(1) * FROM Orders o --'
At first, it may seem that we are passing a piece of garbage which should end with a conversion error or similar. But if you run it, you may notice a small delay whereupon you get back two result sets, of which the first is empty. And if you check the database dropdown in SSMS, you can see that you now have a database named KilroyWasHere.
So how did this SQL injection attack work? This was the SQL string that was generated:
SELECT o.OrderID, o.OrderDate, o.EmployeeID, o.Status,
c.CustomerID, c.CustomerName, c.Address, c.City, c.Region,
c.PostalCode, c.Country, c.Phone, od.UnitPrice, od.Quantity,
p.ProductID, p.ProductName, p.UnitsInStock, p.UnitsOnOrder
FROM dbo.Orders o
JOIN dbo.[Order Details] od ON o.OrderID = od.OrderID
JOIN dbo.Customers c ON o.CustomerID = c.CustomerID
JOIN dbo.Products p ON p.ProductID = od.ProductID
WHERE 1 = 1
AND o.EmployeeID IN (0) CREATE DATABASE KilroyWasHere
SELECT TOP(1) * FROM Orders o --)
ORDER BY o.OrderID
The first line in the input, 0), closes the intended IN list. The next line is the injected command, and an attacker could put anything here; the only limitation is the permissions of the login running the procedure. (So if it is a web application logging in as sa, guess what.) The SELECT TOP on the last line serves to neutralise the ORDER BY clause in the original query string. The comment characters at the end of the injection string kill the right parenthesis that was originally intended to end the IN expression.
And please don’t tell me "But I pass my comma-separated list from SSRS, so there can be no SQL injection". Never make such assumptions. Next year someone decides to call your procedure from a web app, passing the comma-separated list in a URL which an attacker could manipulate to his or her liking.
If you want to inline a comma-separated list, you need to first parse it to table format, and then form a comma-separated list from the table so that you know what you are putting into the SQL string. search_orders_24 has the parameter @employeestr which has this proper implementation:
IF @employeestr IS NOT NULL
BEGIN
   CREATE TABLE #emps (empid int NOT NULL PRIMARY KEY)
   INSERT #emps(empid)
      SELECT n FROM dbo.intlist_to_tbl(@employeestr)
   IF @@rowcount < 64
   BEGIN
      SELECT @employeestr = (SELECT convert(varchar(10), empid) + ','
                             FROM   #emps
                             FOR XML PATH(''))
      SELECT @employeestr = substring(@employeestr, 1, len(@employeestr) - 1)
      SELECT @sql += ' AND o.EmployeeID IN (' + @employeestr + ')' + @nl
   END
   ELSE
      SELECT @sql += ' AND o.EmployeeID IN (SELECT empid FROM #emps)' + @nl
END
To build the comma-separated list I use FOR XML PATH, which can be used to build concatenated strings from a result set. With this method, the list gets an extraneous trailing comma, which I remove with the call to substring on the next line. The code above works on all versions from SQL 2008 and up. On SQL 2017 and later, the list can be built in a more natural way, with no need for the call to substring:
SELECT @employeestr = string_agg(convert(varchar(10), empid), ',') FROM #emps
You may note above that I only build a comma-separated list if there are fewer than 64 values. The reason for this limit is that if there are 64 or more values in an IN list, SQL Server will lump these values into an internal temp table, so then you may just as well do it yourself. And more importantly, if there are very many values in the list, thousands or more, there starts to be a performance penalty in parsing and compiling that many values. The temp table is a protection against such surprises.
If you try
EXEC search_orders_24 @debug = 1, @employeestr =
'0) CREATE DATABASE KilroyWasHere
SELECT TOP(1) * FROM Orders o --'
you will see that it fails with the expected conversion error, and thus we have prevented the SQL injection.
If you pass your multi-valued parameter in a TVP, you don’t need a temp table, as you can use the TVP itself when there are more than 63 rows. I’m not showing the code here, but you can see the pattern in search_orders_24.
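For the sake of the discussion, here is a sketch of what that pattern could look like. The parameter name @employeetbl and its column empid are assumptions of mine, the actual code in search_orders_24 may differ in details, and the TVP is assumed to be passed to sp_executesql together with the other parameters.

-- A sketch only: inline the values when they are few, else refer to the TVP.
IF EXISTS (SELECT * FROM @employeetbl)
BEGIN
   IF (SELECT COUNT(*) FROM @employeetbl) < 64
   BEGIN
      -- Few values: inline them, so that the optimizer sees them.
      DECLARE @emplist varchar(MAX)
      SELECT @emplist = (SELECT convert(varchar(10), empid) + ','
                         FROM   @employeetbl
                         FOR XML PATH(''))
      SELECT @emplist = substring(@emplist, 1, len(@emplist) - 1)
      SELECT @sql += ' AND o.EmployeeID IN (' + @emplist + ')' + @nl
   END
   ELSE
      -- Many values: refer to the TVP directly in the dynamic SQL.
      SELECT @sql += ' AND o.EmployeeID IN (SELECT empid FROM @employeetbl)' + @nl
END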
If you don’t have OPTION (RECOMPILE) with your multi-valued parameter, you might well be content as long as you get correct cardinality estimates, since you are striving for a cached plan. However, as noted in the section Multi-Valued Parameters and Performance, the blind estimates for a comma-separated list are way too high in most cases. The best way to evade this issue, in my opinion, is to pass the data in a table-valued parameter. If this is not feasible, because it does not work well with your client code, try to use a string-splitter that permits SQL Server to apply interleaved execution. (See the section Multi-Valued Parameters and Performance for details.)
What about bouncing the data over a temp table, as I discussed for static SQL? This will give you statistics and better estimates – but you will also get frequent recompiles due to autostats. And the scenario here is that you are striving for a cached plan, isn’t it? Nevertheless, if you decide to use a temp table, you should create the temp table inside the dynamic SQL, at least if you are on SQL 2017 or earlier. You can see an example of how to do this in search_orders_25.
It may seem simpler to create and populate the temp table in the main procedure, outside the dynamic SQL as we did in search_orders_24. However, if you do this on SQL 2017 or earlier, there will be compilation every time just as if you had specified OPTION (RECOMPILE). Because the temp table is created outside the scope where it is used, SQL Server thinks it is a new table, and the schema may not be the same as on last execution, so it recompiles the statement as a safety measure. Microsoft did actually address this in SQL 2019, which is able to determine that the schema has not changed, and therefore an existing plan can be used. (But autostats can still cause recompiles.)
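To make the idea concrete, here is a stripped-down sketch of the arrangement, not the actual code in search_orders_25; the search query is reduced to a bare minimum, and @employeestr is passed as a parameter to sp_executesql:

DECLARE @employeestr varchar(MAX) = '402,109,207'
-- The temp table is both created and populated inside the dynamic SQL batch,
-- so SQL Server sees the same table definition on every execution.
DECLARE @sql nvarchar(MAX) =
   'CREATE TABLE #emps (empid int NOT NULL PRIMARY KEY)
    INSERT #emps (empid)
       SELECT n FROM dbo.intlist_to_tbl(@employeestr)
    SELECT o.OrderID, o.OrderDate, o.EmployeeID
    FROM   dbo.Orders o
    WHERE  o.EmployeeID IN (SELECT empid FROM #emps)
    ORDER  BY o.OrderID'
EXEC sp_executesql @sql, N'@employeestr varchar(MAX)', @employeestr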
Advanced Options for Dynamic Searches
In the beginning of the article, I mentioned some possible requirements for dynamic searches that I have not touched so far:
- Users should be able to choose the comparison operator, for instance @country = 'Germany' or @country != 'Germany'.
- Users should be able to add or remove columns from the output, and for an aggregation query choose what to aggregate on.
- Anything else you can imagine – or you were not able to imagine, but the users wanted it anyway.
Can requirements like these be implemented in a stored procedure that builds dynamic SQL? Sure, they can. It is all a matter of building strings. The challenge is the parameter list. How would you express such complex conditions in a parameter list with scalar parameters? OK, you could pass the conditions in an XML document. Still… It does not really sound fun to me. No, if it has to be this dynamic, I think it is better to do this in client code where you can expose a nice, pretty object-oriented interface. I think the only reason to do this from a stored procedure would be that it is absolutely out of the question to grant users direct SELECT permissions on the tables, leaving a stored procedure that is signed with a certificate as the only option.
Let me say this: there is absolutely nothing wrong with implementing dynamic searches by building queries in client code. The same rules apply as when you do it in T‑SQL: you must use parameterised statements and never inline parameter values. I am not going to take up space here with how to do that, but I refer you to the section Running Dynamic SQL From Client Code in my Curse and Blessings article. After all, building dynamic SQL strings is a matter of string processing, and client-side languages are better at string processing than T‑SQL.
You may have settled on a policy of “only stored procedures”, and as long as your dynamic search problem falls within what I have discussed in this article up to this section, there is little reason not to stick to that policy. But should the requirements prove to be more complex, as outlined above, I think you should let pragmatism win over principles (unless permissions are involved).
You may get the idea that you should do a hybrid and build some part of the SQL, for instance the WHERE clause, client-side and send that to SQL Server. DON’T DO THAT, full stop! This has all sorts of problems. That procedure would be wide open to SQL injection. Furthermore, you would now have a very tight coupling between client and server, which will give you a maintenance nightmare. Either do it all client-side, or do it in SQL Server.
Conclusion
You have now learnt that there are two ways to implement dynamic search conditions: static SQL and dynamic SQL. Solutions with static SQL almost always use OPTION (RECOMPILE), except in very simple cases where the optimizer’s use of startup filters can be good enough. You have also seen that solutions with static SQL are easy to implement as long as the requirements are moderately complex. Once the requirements increase in complexity, the solutions with static SQL easily become unwieldy. A second disadvantage with solutions using OPTION (RECOMPILE) is that very frequent searches can incur an unwanted load on the system because of all the compilations.
If you count pages in this article, you will find that more pages were spent on dynamic SQL than on static SQL. This is not a sign that the author favours dynamic SQL; it simply reflects the fact that dynamic SQL is more difficult to work with and requires more understanding from you as a programmer. Dynamic SQL has the advantage that when the conditions to handle increase in diversity, the complexity of the code grows more linearly than with static SQL. Also, as we have seen, correctly used, dynamic SQL reduces the amount of resources needed for query compilation. A special issue with dynamic SQL that you must not forget is that you need to cater for permissions on the tables accessed by the dynamic SQL. Certificate signing is the best way to resolve this when direct SELECT permissions on the tables are not acceptable. You must always be on guard to make sure that you don’t open your code for SQL injection.
It is important to stress that you cannot apply the methods in this article on auto-pilot. You need to make your own judgements. And moreover, you need to test your queries, both for logic and for performance. Since you may have a multitude of parameter combinations, it may not be feasible to test all combinations, but you should at least test all parameters in isolation, as well as combinations you expect to be common. With dynamic SQL, you should be careful to test all parameters in combination with some other parameter, because it is easy to slip so that a certain combination results in a syntax error.
Finally, when testing for performance, you need a database of some size. The article was based on Northgale, and while larger than its source Northwind, it is still a very small database.
Original article on https://bit.ly/3uRdp12