Overview
Clauses such as SELECT
, WHERE
, and HAVING
(among others) allow you to enter U-SQL expressions.
An expression in a programming language is a combination of explicit values, constants, variables, operators, and functions that are interpreted according to the particular rules of precedence and of association for a particular programming language, which computes and then produces another value.
The simplest way of thinking of a U-SQL expression is that it is a merely C# expression with some U-SQL extensions such as the AND
, OR
, NOT
operators.
Casting types
Expressions can also be converted to a different type
@output=
SELECT
Start,
Region,
((double) Duration) AS DurationDouble
FROM @searchlog;
Calling methods .NET Types
RowSet columns are strongly typed. U-SQL allows you to call methods defined on those types in the SELECT clause. Because U-SQL supports .NET types such as string and DateTime, you can use all the methods available on those types.
// Find what day of year each session took place
@output=
SELECT
Start,
Region,
Start.DayOfYear AS StartDayOfYear
FROM @searchlog;
Order of Evaluation in Expressions
There’s a common pattern C# developers are used to, as shown below:
if ( (QueryString!=null) && (QueryString.StartsWith("bing") )
{
// do something
}
This pattern depends on a C# behavior (common to many languages) called “short-circuiting.” Simply put, in the above example, when the code runs there’s no logical reason to check both conditions if the first one returns false. Short circuiting is useful because evaluating each condition may be expensive. Thus, it is a technique that compilers use to improve the performance of C# code.
When trying to do the same thing in U-SQL there are two paths you can pick. Both are valid expressions, but one will cause problems that may not be obvious at first.
The Right Choice: Use && to keep the desired short-circuiting behavior
@rs1 =
SELECT *
FROM @data
WHERE ((Name!=null) && (Name.StartsWith("bing"));
The Wrong Choice: Use AND which does NOT match the short-circuiting behavior.
@rs1 =
SELECT *
FROM @data
WHERE ((Name!=null) AND (Name.StartsWith("bing"));
The second translation that uses AND will sometimes fail saying that a NullReferenceException has occurred. (sometimes = it might work on your local box but might fail in the cluster)
The reason is simple and by-design: with AND/OR U-SQL will try to perform certain optimizations that result in better performance – for example it may evaluate the second part of the expression first because it assumes that there is no relationship between the two conditions.
This is a standard optimization technique and the same thing is done in many systems such as SQL. The gain this optimization provides in performance is well worth the occasional confusion it causes for new U-SQL users – so this behavior will never change.
Summary: if you need this short-circuiting behavior use && and ||
.
As an alternative you can use the SQL-like ALL/ANY operators which are equivalent to && and ||.
You CANNOT circumvent the order of evaluation by using multiple U-SQL statements
Of course, then you’ll be tempted to write your script by splitting apart the expression as shown below.
@rs1 =
SELECT *
FROM @data
WHERE Name != null;
@rs2 =
SELECT *
FROM @rs1
WHERE Name.StartsWith("bing");
The assumption here is that the first statement executes, before the second. This assumption is wrong.
This won’t work either. U-SQL is declarative language not an imperative one. Just because @rs1
is defined earlier than @rs2
in the script above it does NOT imply that the WHERE condition in @rs1
is evaluated before the WHERE in @rs2
. U-SQL reserves the right to combine multiple statements together and perform optimizations. You MUST use the && operator if you want to perform short-circuiting.
Useful expressions
If you are a C# developer, the expressions below will seem obvious. If you have a SQL background or come from another language, please read this section. It will save you a lot of time later.
if null, pick a default value, otherwise use the value
stringcol1 ?? "defval"
The ??
is the C# null coalescing operator.
if an expression is true, pick one value, otherwise pick a different one
<expr> ? "was_true" : "was_false"
if an expression is true, pick one value, otherwise pick a different one
<expr> ? "was_true" : "was_false"
Test if null
<expr> == null : "was_false"
String equality
Case-sensitive equality
"ABC" == "ABC"
stringcol1 == "ABC"
Case-insensitive equality
string.Equals(
"Abc",
"ABC",
System.Text.StringComparison.CurrentCultureIgnoreCase )
String starts/ends with text
stringcol1.StartsWith( "prefix" )
stringcol1.EndsWith( "prefix" )
String contains text
stringcol1.Contains("prefix" )
Substrings
Get every character after the second one
stringcol1.Substring(2)
Get a string of 5 characters starting after the second character in the original string
stringcol1.Substring(2, 5)
String length
stringcol.Length
Merging strings
stringcol1 + "_foo_" + stringcol2
Or
string.Format( "{0}_foo_{1}" , stringcol1, stringcol2 );
Remove whitespace
From the beginning and the end of the sting
stringcol1.Trim()
From the beginning
stringcol1.TrimStart()
From the end
stringcol1.TrimEnd()
Change case
stringcol1.ToLower()
stringcol1.ToUpper()
Tips for SQL developers
Lots of people come to U-SQL from SQL and ask how U-SQL accomplishes things they are familiar with in SQL. A great example in creating an uppercase string.
Given, what we’ ve covered so far A SQL developer will expect to write the following in U-SQL
@output =
SELECT
UPPER( Region ) AS NewRegion
FROM @searchlog;
Unfortunately, said developer will be disappointed to find out that U-SQL has no UPPER()
method. The C# developer, on the other hand, knows what to do: use the string type’s intrinsic ToUpper()
method.
@output =
SELECT
Region.ToUpper() AS NewRegion
FROM @searchlog;