Part 6: The U-SQL Catalog and Assemblies

The U-SQL Catalog is the way U-SQL organizes data and code for re-use and sharing.

Catalog organization

ADLA Account
|
|--Catalog
   |
   |--Databases
      |
      |--Schemas
      |  |--Tables
      |  |--Views      
      |  |--Table-valued functions
      |  |--Procedures
      |  |--Assemblies
      |  |--Credentials
      |  |--External Data Sources
      |
      |--Assemblies
  • Every ADLA account has a single U-SQL catalog. The catalog cannot be deleted.
  • Each U-SQL catalog contains one or more U-SQL databases.
  • Every catalog has a master database that cannot be deleted.
  • Each U-SQL database can contain code and data.
  • Data is stored in the form of U-SQL tables.
  • U-SQL code is stored in a database in the form of views, table-valued functions, procedures.
  • .NET code (.NET assemblies) is stored in the database in the form of U-SQL assemblies.

Creating a database

To create a database is simple.

CREATE DATABASE MyDB;

This command will fail if the database already exists. Often, it will be the case that you want to create a database only if it does not exist. In this case the following command is used.

CREATE DATABASE IF NOT EXISTS MyDB;

Deleting a Database

DROP DATABASE MyDB;

DROP DATABASE IF EXISTS MyDB;

Reusing U-SQL code with a table-valued function (TVF)

Many of the scripts in the tutorial have required reading from the searchlog and the code to read from the searchlog is shown below

@searchlog =    
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM "/SearchLog.tsv"
    USING Extractors.Tsv();

Now instead of writing this code over and over in every script we will store the code as a TVF in a database. The name of this function is going to be MyDB.dbo.ExtractSearchLog. The dbo part of that name is the “schema”.

CREATE FUNCTION MyDB.dbo.ExtractSearchLog()
RETURNS @rows 
AS BEGIN
  @rows = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM "/SearchLog.tsv"
    USING Extractors.Tsv();
    RETURN;
END;

Notice that CREATE FUNCTION indicates it will return a RowSet call @rows. Then in the @rows RowSet is defined.

Now that the TVF is created, we can call it this way.

@searchlog = MyDB.dbo.ExtractSearchLog();

U-SQL Tables

U-SQL tables offer a way to store data in form that preserves it schema and organize data to high-performance queries which is very important as data sizes grow.

Creating a table from a RowSet

If you have a RowSet creating a table from it is very simple. The one thing you must remember is that a table must have an index defined. The table gets its schema from the schema of the rowset.

@customers  = 
  SELECT * FROM 
    (VALUES
       ("Contoso",   123 ),
       ("Woodgrove", 456 )
    ) AS D( Customer, Id  );

DROP TABLE IF EXISTS MyDB.dbo.Customers;

CREATE TABLE MyDB.dbo.Customers
( 
    INDEX idx  
    CLUSTERED(customer ASC)
    DISTRIBUTED BY HASH(customer) 
) AS SELECT * FROM @customers;

Creating an empty table and filling it latter

If you need don’t have the data available at the time of table creation. You can create an empty table as shown below. Notice that this time the schema has to be specified.

DROP TABLE IF EXISTS MyDB.dbo.Customers;

CREATE TABLE MyDB.dbo.Customers
( 
    Customer string, 
    Id int, 
    INDEX idx  
        CLUSTERED(Customer ASC)
        DISTRIBUTED BY HASH(Customer) 
);

Then separately, you can fill the table.

@customers  = 
  SELECT * FROM 
    (VALUES
       ("Contoso",   123 ),
       ("Woodgrove", 456 )
    ) AS D( Customer, Id  );

INSERT INTO MyDB.dbo.Customers
    SELECT * FROM @customers;

Reading from a table

@rs = 
  SELECT * 
  FROM MyDB.dbo.Customers;

Assemblies

Step 1: create a .NET assembly

Create a .NET assembly with a filename of OrdersLib.dll with this code. In this case we will simply crate a single static string in the assembly that we will reuse in a U-SQL script.

namespace OrdersLib 
{ 
  public static class Helpers 
  { 
    public static string CustPrefix = "CUST_";
  }
}

Step 2 Upload the assembly into the default ADLS store of the ADLA account

For example place the assembly in this location

"/DLLs/OrdersLib.dll"

Step 3: Register the assembly in a U-SQL database

For example place the assembly in this location

CREATE ASSEMBLY MyDB.OrdersLibAsm
  FROM @"/DLLs/OrdersLib.dll";

The identifier OrdersLibAsm is the name the U-SQL catalog uses for the assembly. It is independent of the assembly filename "OrdersLib.dll".

The dll has been copied to the U-SQL database called MyDB. You can now safely delete the file from "/DLLs/OrdersLib.dll".

Step 4: Run a script that uses the assembly

The REFERENCE ASSEMBLY statement makes the code from the assembly named OrdersLibAsm to the script.

REFERENCE ASSEMBLY MyDB.OrdersLibAsm;

@customers  = 
  SELECT * FROM 
    (VALUES
       ("Contoso",   123 ),
       ("Woodgrove", 456 )
    ) AS D( Customer, Id  );

@customers =
  SELECT 
    (OrdersLib.Helpers.CustPrefix + Customer) AS Customer, 
    Id
  FROM @customers;

Using assemblies

A U-SQL Assembly let’s you share code.

Although a U-SQL assembly is often used to share .NET code – specifically a single .NET assembly. As you’ll see in this section. A U-SQL assembly can do much more than that.

System assemblies

Some assemblies are part of the .NET Base Class Library. They aren’t in the U-SQL catalog, but we still need a way to make them available to a U-SQL script.

REFERENCE SYSTEM ASSEMBLY [System.Xml];

Notice that no database name is provided.

User-provided assemblies

Referencing assemblies in the U-SQL Catalog

CREATE ASSEMBLY MyDB.OrdersLibAsm
  FROM @"/DLLs/OrdersLib.dll";

Below is the standard syntax for referencing an assembly that is in the U-SQL catalog.

REFERENCE SYSTEM ASSEMBLY [DBName].[AssemblyName];

Simplifying code with the USING statement

Similar, to C#’s using statement, U-SQL lets you provide an alias for a namespace, this improve the readability of your code.

USING Exc = System.Exceptions;

Referencing assemblies in the U-SQL Catalog of other ADLA accounts

If the assembly you want is in another ADLA account, and you have access to that assembly you can reference it as shown below.

REFERENCE SYSTEM ASSEMBLY [Account][DBName].[AssemblyName];