In this chapter you’ll focus on the fundamental ways in which rowsets can be created and modified.
Creating a RowSet from another RowSet
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int,
Urls string,
ClickedUrls string
FROM "/SearchLog.tsv"
USING Extractors.Tsv();
@output =
SELECT *
FROM @searchlog;
OUTPUT @output
TO "/SearchLog_output.tsv"
USING Outputters.Tsv();
Effectively this script just copies the data without transforming it.
However, if you look at the output file you will notice some things about the default behavior of Outputters.Tsv
:
- values have been surrounded by double-quotes – which is the default behavior of
Outputters.Tsv
. You can disable the quoting by usingOutputters.Tsv( quoting:false )
. DateTime
values are output in a longer standard format- The default encoding is UTF-8 and the BOM (Byte Order Mark) is not written.
Creating constant RowSets
RowSets can be directly created in a U-SQL script.
@departments =
SELECT *
FROM (VALUES
(31, "Sales"),
(33, "Engineering"),
(34, "Clerical"),
(35, "Marketing")
) AS D( DepID, DepName );
DepID | DepName |
---|---|
31 | Sales |
33 | Engineering |
34 | Clerical |
35 | Marketing |
RowSet refinement
Rowsets can be defined from other rowsets. In fact rowsets can be defined from themselves. This is called RowSet refinement and can be a powerful way to make your code easier to read.
@output =
SELECT
*,
Urls.Length AS UrlsLength
FROM @searchlog;
@output =
SELECT
Url,
UrlLength
FROM @output;
Adding new columns with SELECT
We can create new Columns with the SELECT clause. Simply use a C# expression and give it a column name with AS.
IMPORTANT: For the sake of brevity we will start omitting the EXTRACT and OUTPUT statements in the code samples.
@output =
SELECT
*,
(Query + Query) AS Query2
FROM @searchlog;
In the example above, we added a column without having to explicitly list out all the other columns. Below you can see how we can add just specific columns.
@output =
SELECT
Region,
Query,
(Query + Query) AS Query2,
Urls.Length AS UrlsLength
FROM @searchlog;
Filtering records with WHERE
Now let’s transform the data by filtering out records with the WHERE clause
@searchlog =
EXTRACT UserId int,
Start DateTime,
Region string,
Query string,
Duration int,
Urls string,
ClickedUrls string
FROM "/SearchLog.tsv"
USING Extractors.Tsv();
@output =
SELECT *
FROM @searchlog
WHERE Region == "en-gb";
OUTPUT @output
TO "/SearchLog_output.tsv"
USING Outputters.Tsv();
The SQL Logical Operators
The AND/OR/NOT operators can be combined with parentheses to create more complex logical expressions
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE (Duration >= 2*60 AND Duration <= 5*60) OR NOT (Region == "en-gb");
The C# Logical Operators
U-SQL also supports the C# logical operators
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE (Duration >= 2*60 && Duration <= 5*60) || (!(Region == "en-gb"));
SQL Logical Operators (AND OR) versus C# Operators ( && || )
These operators behave the same except for their short-circuiting behavior:
- SQL-style logical operators: These DO NOT short-circuit
- C#-style logical operators: These DO short-circuit
Use the SQL-style logical operators unless you MUST have the short-circuiting behavior. The reasons why this is important are covered in a later chapter that describes the order of evaluation of predicates in expressions.
Find all the sessions occurring before a date
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Start <= DateTime.Parse("2012/02/17");
Find all the sessions occurring between two dates
@output =
SELECT Start, Region, Duration
FROM @searchlog
WHERE
Start >= DateTime.Parse("2012/02/16")
AND Start <= DateTime.Parse("2012/02/17");
Filtering on calculated columns
Consider this case where SELECT is used to define a new column called DurationInMinutes
@output =
SELECT
Start,
Region,
Duration/60.0 AS DurationInMinutes
FROM @searchlog;
There are a couple of approaches for filtering rows based on the DurationInMinutes
value
The first option is to use RowSet refinement
@output =
SELECT
Start,
Region,
Duration/60.0 AS DurationInMinutes
FROM @searchlog;
@output =
SELECT *
FROM @output
WHERE DurationInMinutes>= 20;
The second option is to repeat the expression in the WHERE clause
@output =
SELECT
Start,
Region,
Duration/60.0 AS DurationInMinutes
FROM @searchlog
WHERE Duration/60.0>= 20;
WHERE does not work on calculated columns in the same statement
WHERE
filters rows coming into to the statement. The DurationInMinutes column doesn’t exist in the input. Therefore WHERE cannot operate on it. So, the example below will not compile
// SYNTAX ERROR: WHERE cannot be used with columns created by SELECT
@output =
SELECT Start, Region,Duration/60.0 AS DurationInMinutes
FROM @searchlog
WHERE DurationInMinutes>= 20;
Numbering rows
Using the ROW_NUMBER
windowing function aggregate is how to assign row numbers. ROW_NUMBER
is part of Windowing Functions and that topic too complex for this tutorial. See the Windowing Functions documentation for details. However, for now we do want to show you the proper way to number rows in U-SQL using ROW_NUMBER
because it is a popular topic.
@rs1 =
SELECT
ROW_NUMBER() OVER ( ) AS RowNumber,
Start,
Region
FROM @searchlog;
RowSets must lead to an output
There’s no syntax error here. However, compilation will fail. The reason is that all rowsets must eventually contribute to data being written to a file or table.
The following script fail will to compile the @smallrows
does not eventually result in an output to a file or table
@rows =
EXTRACT
Name string,
Amount int,
FROM "/input.csv"
USING Extractors.Csv();
@smallrows =
SELECT Name, Amount
FROM @rows
WHERE Amount < 1000;
OUTPUT @rows
TO "/output/output.csv"
USING Outputters.Csv();
Escaping Column names
If you need column names that contain whitespace you can enclose the name in [
and ]
.
The following script shows a column called Order Number.
@b =
SELECT
[Order Number],
Part
FROM @a;