Part 4: U-SQL Expressions

Overview

Clauses such as SELECTWHERE, and HAVING (among others) allow you to enter U-SQL expressions.

An expression in a programming language is a combination of explicit values, constants, variables, operators, and functions that are interpreted according to the particular rules of precedence and of association for a particular programming language, which computes and then produces another value.

The simplest way of thinking of a U-SQL expression is that it is a merely C# expression with some U-SQL extensions such as the ANDORNOT operators.

Casting types

Expressions can also be converted to a different type

@output= 
    SELECT 
        Start, 
        Region, 
        ((double) Duration) AS DurationDouble
    FROM @searchlog;

Calling methods .NET Types

RowSet columns are strongly typed. U-SQL allows you to call methods defined on those types in the SELECT clause. Because U-SQL supports .NET types such as string and DateTime, you can use all the methods available on those types.

// Find what day of year each session took place

@output= 
    SELECT 
        Start, 
        Region, 
        Start.DayOfYear AS StartDayOfYear
    FROM @searchlog;

Order of Evaluation in Expressions

There’s a common pattern C# developers are used to, as shown below:

if ( (QueryString!=null) && (QueryString.StartsWith("bing") )
{
    // do something
}

This pattern depends on a C# behavior (common to many languages) called “short-circuiting.” Simply put, in the above example, when the code runs there’s no logical reason to check both conditions if the first one returns false. Short circuiting is useful because evaluating each condition may be expensive. Thus, it is a technique that compilers use to improve the performance of C# code.

When trying to do the same thing in U-SQL there are two paths you can pick. Both are valid expressions, but one will cause problems that may not be obvious at first.

The Right Choice: Use && to keep the desired short-circuiting behavior

@rs1 = 
    SELECT * 
    FROM @data
    WHERE ((Name!=null) && (Name.StartsWith("bing"));

The Wrong Choice: Use AND which does NOT match the short-circuiting behavior.

@rs1 = 
    SELECT * 
    FROM @data
    WHERE ((Name!=null) AND (Name.StartsWith("bing"));

The second translation that uses AND will sometimes fail saying that a NullReferenceException has occurred. (sometimes = it might work on your local box but might fail in the cluster)

The reason is simple and by-design: with AND/OR U-SQL will try to perform certain optimizations that result in better performance – for example it may evaluate the second part of the expression first because it assumes that there is no relationship between the two conditions.

This is a standard optimization technique and the same thing is done in many systems such as SQL. The gain this optimization provides in performance is well worth the occasional confusion it causes for new U-SQL users – so this behavior will never change.
Summary: if you need this short-circuiting behavior use && and ||.
As an alternative you can use the SQL-like ALL/ANY operators which are equivalent to && and ||.

You CANNOT circumvent the order of evaluation by using multiple U-SQL statements

Of course, then you’ll be tempted to write your script by splitting apart the expression as shown below.

@rs1 = 
    SELECT * 
    FROM @data
    WHERE Name != null;

@rs2 =
    SELECT *
    FROM @rs1
    WHERE Name.StartsWith("bing");

The assumption here is that the first statement executes, before the second. This assumption is wrong.

This won’t work either. U-SQL is declarative language not an imperative one. Just because @rs1 is defined earlier than @rs2 in the script above it does NOT imply that the WHERE condition in @rs1 is evaluated before the WHERE in @rs2. U-SQL reserves the right to combine multiple statements together and perform optimizations. You MUST use the && operator if you want to perform short-circuiting.

Useful expressions

If you are a C# developer, the expressions below will seem obvious. If you have a SQL background or come from another language, please read this section. It will save you a lot of time later.

if null, pick a default value, otherwise use the value

stringcol1 ?? "defval"

The ?? is the C# null coalescing operator.

if an expression is true, pick one value, otherwise pick a different one

<expr> ? "was_true" : "was_false"

if an expression is true, pick one value, otherwise pick a different one

<expr> ? "was_true" : "was_false"

Test if null

<expr> == null : "was_false"

String equality

Case-sensitive equality

"ABC" == "ABC"
stringcol1 == "ABC"

Case-insensitive equality

string.Equals(
    "Abc",
    "ABC",
System.Text.StringComparison.CurrentCultureIgnoreCase )

String starts/ends with text

stringcol1.StartsWith( "prefix" )
stringcol1.EndsWith( "prefix" )

String contains text

stringcol1.Contains("prefix" )

Substrings

Get every character after the second one

stringcol1.Substring(2)

Get a string of 5 characters starting after the second character in the original string

stringcol1.Substring(2, 5)

String length

stringcol.Length

Merging strings

stringcol1 + "_foo_" + stringcol2

Or

string.Format( "{0}_foo_{1}" , stringcol1, stringcol2 );

Remove whitespace

From the beginning and the end of the sting

stringcol1.Trim()

From the beginning

stringcol1.TrimStart()

From the end

stringcol1.TrimEnd()

Change case

stringcol1.ToLower()
stringcol1.ToUpper()

Tips for SQL developers

Lots of people come to U-SQL from SQL and ask how U-SQL accomplishes things they are familiar with in SQL. A great example in creating an uppercase string.

Given, what we’ ve covered so far A SQL developer will expect to write the following in U-SQL

@output = 
    SELECT 
        UPPER( Region ) AS NewRegion
    FROM @searchlog;

Unfortunately, said developer will be disappointed to find out that U-SQL has no UPPER() method. The C# developer, on the other hand, knows what to do: use the string type’s intrinsic ToUpper() method.

@output = 
    SELECT 
        Region.ToUpper() AS NewRegion
    FROM @searchlog;