U-SQL (Intro) - Techfura

U-SQL is a language that unifies the benefits of SQL with the expressive power of your own code to process all data at any scale. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the store and across relational stores such as Azure SQL Database. It enables you to process unstructured data by applying schema on read, insert custom logic and UDF’s, and includes extensibility to enable fine grained control over how to execute at scale.

There are some differences from ANSI SQL or T-SQL. For example, its keywords such as SELECT have to be in UPPERCASE.

It’s type system and expression language inside select clauses, where predicates etc are in C#. This means the data types are the C# types and the data types use C# NULL semantics, and the comparison operations inside a predicate follow C# syntax (e.g., a == “foo”). This also means, that the values are full .NET objects, allowing you to easily use any method to operate on the object (eg “f o o o”.Split(‘ ‘) ).

For more information, see U-SQL Reference.

You must complete Tutorial: develop U-SQL scripts using Data Lake Tools for Visual Studio.

In the tutorial, you ran a Data Lake Analytics job with the following U-SQL script:

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog   
    TO "/output/SearchLog-first-u-sql.csv"
USING Outputters.Csv();

This script doesn’t have any transformation steps. It reads from the source file called SearchLog.tsv, schematizes it, and outputs the rowset back into a file called SearchLog-from-adltools.csv.

Notice the question mark next to the data type of the Duration field. That means the Duration field could be null.

Some concepts and keywords found in the script:

Rowset variables: Each query expression that produces a rowset can be assigned to a variable. U-SQL follows the T-SQL variable naming pattern, for example, @searchlog in the script. Note the assignment does not force execution. It merely names the expression and gives you the ability to build-up more complex expressions.
EXTRACT gives you the ability to define a schema on read. The schema is specified by a column name and C# type name pair per column. It uses a so-called Extractor, for example, Extractors.Tsv() to extract tsv files. You can develop custom extractors.
OUTPUT takes a rowset and serializes it. The Outputters.Csv() output a comma-separated file into the specified location. You can also develop custom Outputters.
Notice the two paths are relative paths. You can also use absolute paths. For example adl://<ADLStorageAccountName>.azuredatalakestore.net:443/Samples/Data/SearchLog.tsv You must use absolute path to access the files in the linked Storage accounts. The syntax for files stored in linked Azure Storage account is: wasb://<BlobContainerName>@<StorageAccountName>.blob.core.windows.net/Samples/Data/SearchLog.tsv [AZURE.NOTE] Azure Blob container with public blobs or public containers access permissions are not currently supported.

Use scalar variables

You can use scalar variables as well to make your script maintenance easier. The previous U-SQL script can also be written as the following:

DECLARE @in  string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM @in
    USING Extractors.Tsv();

OUTPUT @searchlog   
    TO @out
    USING Outputters.Csv();

Transform rowsets

Use SELECT to transform rowsets:

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
WHERE Region == "en-gb";

OUTPUT @rs1   
    TO "/output/SearchLog-transform-rowsets.csv"
    USING Outputters.Csv();

The WHERE clause uses C# boolean expression. You can use the C# expression language to do your own expressions and functions. You can even perform more complex filtering by combining them with logical conjunctions (ands) and disjunctions (ors).

The following script uses the DateTime.Parse() method and a conjunction.

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
WHERE Region == "en-gb";

@rs1 =
    SELECT Start, Region, Duration
    FROM @rs1
    WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");

OUTPUT @rs1   
    TO "/output/SearchLog-transform-datatime.csv"
    USING Outputters.Csv();

Notice that the second query is operating on the result of the first rowset and thus the result is a composition of the two filters. You can also reuse a variable name and the names are scoped lexically.

Aggregate rowsets

U-SQL provides you with the familiar ORDER BY, GROUP BY and aggregations.

The following query finds the total duration per region, and then outputs the top 5 durations in order.

U-SQL rowsets do not preserve their order for the next query. Thus, to order an output, you need to add ORDER BY to the OUTPUT statement as shown below:

DECLARE @outpref string = "/output/Searchlog-aggregation";
DECLARE @out1    string = @outpref+"_agg.csv";
DECLARE @out2    string = @outpref+"_top5agg.csv";

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@rs1 =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM @searchlog
GROUP BY Region;

@res =
SELECT *
FROM @rs1
ORDER BY TotalDuration DESC
FETCH 5 ROWS;

OUTPUT @rs1
    TO @out1
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();
OUTPUT @res
    TO @out2 
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

U-SQL ORDER BY clause has to be combined with the FETCH clause in a SELECT expression.

U-SQL HAVING clause can be used to restrict the output to groups that satisfy the HAVING condition:

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@res =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM @searchlog
GROUP BY Region
HAVING SUM(Duration) > 200;

OUTPUT @res
    TO "/output/Searchlog-having.csv"
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

Join data

U-SQL provides common join operators such as INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, SEMI JOIN, to join not only tables but any rowsets (even those produced from files).

The following script joins the searchlog with an advertisement impression log and gives us the advertisements for the query string for a given date.

@adlog =
    EXTRACT UserId int,
            Ad string,
            Clicked int
    FROM "/Samples/Data/AdsLog.tsv"
    USING Extractors.Tsv();

@join =
    SELECT a.Ad, s.Query, s.Start AS Date
    FROM @adlog AS a JOIN <insert your DB name>.dbo.SearchLog1 AS s 
                    ON a.UserId == s.UserId
    WHERE a.Clicked == 1;

OUTPUT @join   
    TO "/output/Searchlog-join.csv"
    USING Outputters.Csv();

U-SQL only supports the ANSI compliant join syntax: Rowset1 JOIN Rowset2 ON predicate. The old syntax of FROM Rowset1, Rowset2 WHERE predicate is NOT supported. The predicate in a JOIN has to be an equality join and no expression. If you want to use an expression, add it to a previous rowset’s select clause. If you want to do a different comparison, you can move it into the WHERE clause.

Create databases, table-valued functions, views, and tables

U-SQL allows you to use data in the context of a database and schema. So you don’t have to always read from or write to files.

Every U-SQL script runs with a default database (master) and default schema (dbo) as its default context. You can create your own database and/or schema. To change the context, use the USE statement to change the context.

Create a table-valued function (TVF)

In the previous U-SQL script, you repeated using EXTRACT reading from the same source file. U-SQL table-valued function enables you to encapsulate the data for future reuse.

The following script creates a TVF called Searchlog() in the default database and schema:

DROP FUNCTION IF EXISTS Searchlog;
CREATE FUNCTION Searchlog() 
RETURNS @searchlog TABLE
(
            UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
)
AS BEGIN 
@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();
RETURN;
END;

The following script shows you how to use the TVF defined in the previous script:

@res =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM Searchlog() AS S
GROUP BY Region
HAVING SUM(Duration) > 200;

OUTPUT @res
    TO "/output/SerachLog-use-tvf.csv"
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

Create views

If you only have one query expression that you want to abstract and do not want to parameterize it, you can create a view instead of a table-valued function.

The following script creates a view called SearchlogView in the default database and schema:

DROP VIEW IF EXISTS SearchlogView;

CREATE VIEW SearchlogView AS  
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
USING Extractors.Tsv();

The following script demonstates using the defined view:

@res =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM SearchlogView
GROUP BY Region
HAVING SUM(Duration) > 200;

OUTPUT @res
    TO "/output/Searchlog-use-view.csv"
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

Create tables

Similar to relational database table, U-SQL allows you to create a table with a predefined schema or create a table and infer the schema from the query that populates the table (also known as CREATE TABLE AS SELECT or CTAS).

The following script create a database and two tables:

DROP DATABASE IF EXISTS SearchLogDb;
CREATE DATABASE SeachLogDb
USE DATABASE SearchLogDb;

DROP TABLE IF EXISTS SearchLog1;
DROP TABLE IF EXISTS SearchLog2;

CREATE TABLE SearchLog1 (
            UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string,

            INDEX sl_idx CLUSTERED (UserId ASC) 
                PARTITIONED BY HASH (UserId)
);

INSERT INTO SearchLog1 SELECT * FROM master.dbo.Searchlog() AS s;

CREATE TABLE SearchLog2(
    INDEX sl_idx CLUSTERED (UserId ASC) 
            PARTITIONED BY HASH (UserId)
) AS SELECT * FROM master.dbo.Searchlog() AS S; // You can use EXTRACT or SELECT here

Query tables

You can query the tables (created in the previous script) in the same way as you query over the data files. Instead of creating a rowset using EXTRACT, you now can just refer to the table name.

The transform script you used previously is modified to read from the tables:

@rs1 =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM SearchLogDb.dbo.SearchLog2
GROUP BY Region;

@res =
    SELECT *
    FROM @rs1
    ORDER BY TotalDuration DESC
    FETCH 5 ROWS;

OUTPUT @res
    TO "/output/Searchlog-query-table.csv"
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

Note that you currently cannot run a SELECT on a table in the same script as the script where you create that table.

Conclusion

What is covered in the tutorial is only a small part of U-SQL. Because of the scope of this tutorial, it can’t cover everything, such as:

Use CROSS APPLY to unpack parts of strings, arrays and maps into rows.
Operate partitioned sets of data (file sets and partitioned tables).
Develop user defined operators such as extractors, outputters, processors, user-defined aggregators in C#.
Use U-SQL windowing functions.
Manage U-SQL code with views, table-valued functions and stored procedures.
Run arbitrary custom code on your processing nodes.
Connect to Azure SQL Databases and federate queries across them and your U-SQL and Azure Data Lake data.