Part 2: Transforming RowSets

In this chapter you’ll focus on the fundamental ways in which rowsets can be created and modified.

Creating a RowSet from another RowSet

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM "/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT *
    FROM @searchlog;

OUTPUT @output 
    TO "/SearchLog_output.tsv"
    USING Outputters.Tsv();

Effectively this script just copies the data without transforming it.

However, if you look at the output file you will notice some things about the default behavior of Outputters.Tsv:

  • values have been surrounded by double-quotes – which is the default behavior of Outputters.Tsv. You can disable the quoting by using Outputters.Tsv( quoting:false ).
  • DateTime values are output in a longer standard format
  • The default encoding is UTF-8 and the BOM (Byte Order Mark) is not written.

Creating constant RowSets

RowSets can be directly created in a U-SQL script.

@departments =
  SELECT * 
  FROM (VALUES
      (31, "Sales"),
      (33, "Engineering"),
      (34, "Clerical"),
      (35, "Marketing")
    ) AS D( DepID, DepName );
DepIDDepName
31Sales
33Engineering
34Clerical
35Marketing

RowSet refinement

Rowsets can be defined from other rowsets. In fact rowsets can be defined from themselves. This is called RowSet refinement and can be a powerful way to make your code easier to read.

@output = 
    SELECT
        *,
        Urls.Length AS UrlsLength
    FROM @searchlog;

@output = 
    SELECT 
        Url,
        UrlLength
    FROM @output;

Adding new columns with SELECT

We can create new Columns with the SELECT clause. Simply use a C# expression and give it a column name with AS.

IMPORTANT: For the sake of brevity we will start omitting the EXTRACT and OUTPUT statements in the code samples.

@output = 
    SELECT 
        *, 
        (Query + Query) AS Query2
    FROM @searchlog;

In the example above, we added a column without having to explicitly list out all the other columns. Below you can see how we can add just specific columns.

@output = 
    SELECT 
        Region,
        Query, 
        (Query + Query) AS Query2, 
        Urls.Length AS UrlsLength
    FROM @searchlog;

Filtering records with WHERE

Now let’s transform the data by filtering out records with the WHERE clause

@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM "/SearchLog.tsv"
    USING Extractors.Tsv();

@output = 
    SELECT *
    FROM @searchlog
    WHERE Region == "en-gb";

OUTPUT @output 
    TO "/SearchLog_output.tsv"
    USING Outputters.Tsv();

The SQL Logical Operators

The AND/OR/NOT operators can be combined with parentheses to create more complex logical expressions

@output =
    SELECT Start, Region, Duration
    FROM @searchlog
    WHERE (Duration >= 2*60 AND Duration <= 5*60) OR NOT (Region == "en-gb");

The C# Logical Operators

U-SQL also supports the C# logical operators

@output =
 SELECT Start, Region, Duration
 FROM @searchlog
 WHERE (Duration >= 2*60 && Duration <= 5*60) || (!(Region == "en-gb"));

SQL Logical Operators (AND OR) versus C# Operators ( && || )

These operators behave the same except for their short-circuiting behavior:

  • SQL-style logical operators: These DO NOT short-circuit
  • C#-style logical operators: These DO short-circuit

Use the SQL-style logical operators unless you MUST have the short-circuiting behavior. The reasons why this is important are covered in a later chapter that describes the order of evaluation of predicates in expressions.

Find all the sessions occurring before a date

@output =
 SELECT Start, Region, Duration
 FROM @searchlog
 WHERE Start <= DateTime.Parse("2012/02/17");

Find all the sessions occurring between two dates

@output = 
  SELECT Start, Region, Duration
  FROM @searchlog
  WHERE 
   Start >= DateTime.Parse("2012/02/16") 
   AND Start <= DateTime.Parse("2012/02/17");

Filtering on calculated columns

Consider this case where SELECT is used to define a new column called DurationInMinutes

@output =
  SELECT 
      Start, 
      Region,
      Duration/60.0 AS DurationInMinutes
  FROM @searchlog;

There are a couple of approaches for filtering rows based on the DurationInMinutes value

The first option is to use RowSet refinement

@output =
  SELECT 
    Start,
    Region,
    Duration/60.0 AS DurationInMinutes
  FROM @searchlog;

@output =
  SELECT *
  FROM @output
  WHERE DurationInMinutes>= 20;

The second option is to repeat the expression in the WHERE clause

@output =
  SELECT 
    Start,
    Region,
    Duration/60.0 AS DurationInMinutes
  FROM @searchlog
    WHERE Duration/60.0>= 20;

WHERE does not work on calculated columns in the same statement

WHERE filters rows coming into to the statement. The DurationInMinutes column doesn’t exist in the input. Therefore WHERE cannot operate on it. So, the example below will not compile

// SYNTAX ERROR: WHERE cannot be used with columns created by SELECT
@output =
  SELECT Start, Region,Duration/60.0 AS DurationInMinutes
  FROM @searchlog
  WHERE DurationInMinutes>= 20;

Numbering rows

Using the ROW_NUMBER windowing function aggregate is how to assign row numbers. ROW_NUMBER is part of Windowing Functions and that topic too complex for this tutorial. See the Windowing Functions documentation for details. However, for now we do want to show you the proper way to number rows in U-SQL using ROW_NUMBER because it is a popular topic.

@rs1 =
 SELECT
   ROW_NUMBER() OVER ( ) AS RowNumber,
   Start,
   Region
 FROM @searchlog;

RowSets must lead to an output

There’s no syntax error here. However, compilation will fail. The reason is that all rowsets must eventually contribute to data being written to a file or table.

The following script fail will to compile the @smallrows does not eventually result in an output to a file or table

@rows =
 EXTRACT
   Name string,
   Amount int,
   FROM "/input.csv"
 USING Extractors.Csv();

@smallrows =
  SELECT Name, Amount
  FROM @rows
  WHERE Amount < 1000;

OUTPUT @rows
  TO "/output/output.csv"
  USING Outputters.Csv();

Escaping Column names

If you need column names that contain whitespace you can enclose the name in [ and ].

The following script shows a column called Order Number.

@b =
  SELECT 
    [Order Number], 
    Part
  FROM @a;