Part 8: Set operations and Joins

Set operations are a way of merging rowsets together based on set theoretic operations such as union (UNION). intersection (INTERSECT), complement (EXCEPT).

Sample data

Let’s define two RowSets: @a and @b. Notice that both RowSets have duplicate rows.

@a  =
    SELECT * FROM
        (VALUES
            (1,    "Smith"),
            (1,    "Smith"),
            (2,    "Brown"),
            (3,    "Case")
        ) AS D( DepID, DepName );
@b  =
    SELECT * FROM
        (VALUES
            (1,    "Smith"),
            (1,    "Smith"),
            (1,    "Smith"),
            (2,    "Brown"),
            (4,    "Dey"),
            (4,    "Dey")
        ) AS D( DepID, DepName );

UNION

UNION combines two rowsets.

UNION ALL

As you can see UNION ALL clearly leaves in duplicate rows.

@union_all = 
    SELECT * FROM @a
    UNION ALL
    SELECT * FROM @b;
DepIDName
1Smith
1Smith
2Brown
3Case
1Smith
1Smith
1Smith
2Brown
4Dey
4Dey

UNION DISTINCT

UNION DISTINCT discards duplicate rows.

@union_distinct =
    SELECT * FROM @a
    UNION DISTINCT
    SELECT * FROM @b;
DepIDName
1Smith
2Brown
3Case
4Dey

Schema requirements

UNION by default require that the RowSets have the same schema.

  • Each column must have the same name and data type
  • The columns must appear in the same order in the rowset schema

Finding Common Rows with INTERSECT

Sometimes, we only care about the rows both rowsets have in common. We use the INTERSECT operator to accomplish this. INTERSECT ALL preserves duplicates while INTERSECT removes duplicates.

INTERSECT ALL

@intersect_all =
    SELECT * FROM @a
    INTERSECT ALL
    SELECT * FROM @b;

@intersect_all

DepIDname
1Smith
1Smith
2Brown

INTERSECT DISTINCT

@intersect_distinct =
    SELECT * FROM @a
    INTERSECT DISTINCT
    SELECT * FROM @b;

@intersect_distinct

DepIDname
1Smith
2Brown

Finding Rows That Are NOT in the Other RowSet with EXCEPT

The EXCEPT operator returns all the rows in the left RowSet that are not in the right RowSet.

EXCEPT ALL

@a_except_b_all =
   SELECT * FROM @a
   EXCEPT ALL
   SELECT * FROM @b;

@a_except_b_all

DepIDname
3Case
@b_except_a_all =
   SELECT * FROM @b
   EXCEPT ALL
   SELECT * FROM @a;

@b_except_a_all

DepIDname
1Smith
4Dey
4Dey

EXCEPT DISTINCT

@a_except_b_distinct =
   SELECT * FROM @a
   EXCEPT DISTINCT
   SELECT * FROM @b;

@a_except_b_distinct

DepIDname
3Case
@b_except_a_distinct =
   SELECT * FROM @b
   EXCEPT DISTINCT
   SELECT * FROM @a;

@b_except_a_distinct

DepIDname
4Dey

OUTER UNION

By default UNION requires both RowSets to have matching schemas. OUTER UNION allows the schemas to be different. If one RowSet is missing a column that the other has, that row will be included in the result with a default value for the missing columns.

NOTE: OUTER UNION only supports ALL. It does not support DISTINCT.

The following script will union the two rowsets @left and @right with the partially overlapping schema on columns A and K while filling in null into the “missing cells” of column C and 0 as the default value for type int for column B.

@left =
    SELECT *
    FROM (VALUES ( 1, "x", (int?) 50 ),
                 ( 1, "y", (int?) 60 )
         ) AS L(K, A, C);

@right =
    SELECT *
    FROM (VALUES ( 5, "x", 1 ),
                 ( 6, "x", 2 ),
                 (10, "y", 3 )
         ) AS R(B, A, K);

@res =
    SELECT * FROM @left
    OUTER UNION BY NAME ON (*)
    SELECT * FROM @right;

The result is:

KACB
1“x”500
1“x”5
1“y”600
2“x”6
3“y”10

Joins

A JOIN is a way of combining rowsets together based on key.

Sample data

@departments =
    SELECT * FROM
    (VALUES
    ("31", "Sales"),
    ("33", "Engineering"),
    ("34", "Clerical"),
    ("35", "Marketing")
    ) AS D( DepID, DepName );
@employees =
SELECT * FROM
    (VALUES
    ("31", "Rafferty"),
    ("33", "Jones"),
    ("33", "Heisenberg"),
    ("34", "Robinson"),
    ("34", "Smith"),
    ((string)null, "Williams")
    ) AS D( DepID, EmpName );

CROSS JOIN

@cross_join =
    SELECT
        @departments.DepID AS DepID_Dep,
        @employees.DepID AS DepID_Emp,
        @employees.EmpName, 
        @departments.DepName
    FROM @employees CROSS JOIN @departments;
DepIDDepDepIDEmpEmpNameDepName
3131RaffertySales
3131RaffertySales

|31|33|Jones|Sales| |31|33|Heisenberg|Sales| |31|34|Robinson|Sales| |31|34|Smith|Sales| |31|ERR|Williams|Sales| |33|31|Rafferty|Engineering| |33|33|Jones|Engineering| |33|33|Heisenberg|Engineering| |33|34|Robinson|Engineering| |33|34|Smith|Engineering| |33|ERR|Williams|Engineering| |34|31|Rafferty|Clerical| |34|33|Jones|Clerical| |34|33|Heisenberg|Clerical| |34|34|Robinson|Clerical| |34|34|Smith|Clerical| |34|ERR|Williams|Clerical| |35|31|Rafferty|Marketing| |35|33|Jones|Marketing| |35|33|Heisenberg|Marketing| |35|34|Robinson|Marketing| |35|34|Smith|Marketing| |35|NULL|Williams|Marketing|

CROSS JOIN

@cross_join =
    SELECT
        @departments.DepID AS DepID_Dep,
        @employees.DepID AS DepID_Emp,
        @employees.EmpName,
        @departments.DepName
     FROM 
        @employees CROSS JOIN @departments;
DepID_DepDepID_EmpEmpNameDepName
3131RaffertySales
3133JonesSales
3133HeisenbergSales
3134RobinsonSales
3134SmithSales
31NULLWilliamsSales
3331RaffertyEngineering
3333JonesEngineering
3333HeisenbergEngineering
3334RobinsonEngineering
3334SmithEngineering
33NULLWilliamsEngineering
3431RaffertyClerical
3433JonesClerical
3433HeisenbergClerical
3434RobinsonClerical
3434SmithClerical
34NULLWilliamsClerical
3531RaffertyMarketing
3533JonesMarketing
3533HeisenbergMarketing
3534RobinsonMarketing
3534SmithMarketing
35NULLWilliamsMarketing

SEMIJOIN

SEMIJOIN is a way to filter a RowSet using another RowSet.

A simple example is: “Find all the employees in this RowSet A, where the employee appears also in RowSet B”.

There are two variants:

  • LEFT SEMIJOIN -> Give only those rows in the left rowset that have a matching row in the right rowset.
  • RIGHT SEMIJOIN -> Give only those rows in the right rowset that have a matching row in the left rowset.

NOTE: If you leave out LEFT or RIGHT, and instead simply write SEMIJOIN then what you get is LEFT SEMIJOIN. Do not leave out LEFT or RIGHT always explicitly it.

Find all employees that are in valid departments

@left_semijoin1 =
    SELECT
        @employees.DepID,    
        @employees.EmpName
    FROM @employees
    LEFT SEMIJOIN @departments
        ON @employees.DepID == @departments.DepID;
DepIDEmpName
31Rafferty
33Jones
33Heisenberg
34Robinson
34Smith

Find all departments that has an employee listed in the employee RowSet.

@left_semijoin2 =
    SELECT 
        @departments.DepID,
        @departments.DepName
    FROM @departments
    LEFT SEMIJOIN @employees
    ON @departments.DepID == @employees.DepID;
DepIDDepName
31Sales
33Engineering
34Clerical

ANTISEMIJOIN

An ANTISEMIJOIN is the opposite of SEMIJOIN

There are two variants:

  • LEFT ANTISEMIJOIN -> Give only those rows in the left RowSet that DO NOT have a matching row in the right rowset.
  • RIGHT ANTISEMIJOIN -> Give only those rows in the right RowSet that DO NOT have a matching row in the left RowSet.

Find all departments that don’t have an employee listed in the employees RowSet.

@left_antisemijoin =
    SELECT 
        @departments.DepID,
        @departments.DepName
    FROM @departments
    LEFT ANTISEMIJOIN @employees
        ON @departments.DepID == @employees.DepID;
DepIDDepName
35Marketing

Complex Types

Nullability

Complex types are reference types. You MUST check if they are null first before using them.

Maximum size of a complex value

There is no specific limit inherent to Complex Types

However, rowsets must still fit within the limits defined by U-SQL (currently no row can be bigger than 4MB)

Persistence

U-SQL Tables

Complex Types can be stored in and read from U-SQL Tables without any additional effort.

Files

To read/write Complex Types with files, you must use your own Custom Outputter & Extractor.

ARRAYs

Overview

  • An ordered list of values (all of the same type)
  • Immutable
  • Single-Dimension only

Creating arrays initialized with data

You can create a SQlArray using the SqlArray constructor or use the SqlArray.Create static method

The SqlArray.Create static method can also be used to create arrays.

@output =
    SELECT * 
    FROM
    ( VALUES
        ( "West Virginia", new SqlArray<string>( new string [] { "Charleston", "Huntington", "Parkersburg", "Morgantown", "Wheeling"} ) ),
        ( "Wisconsin", SqlArray.Create( new string [] { "Milwaukee", "Madison", "Green Bay", "Kenosha", "Racine"} ) )

    ) AS T(State, Cities);

Creating Empty arrays

new SqlArray<string> ( )

or

new SqlArray<string> { }

Creating an ARRAY from an .NET Array

DECLARE @wv_cities = new string [] { "Charleston", "Huntington", "Parkersburg", "Morgantown", "Wheeling"};
DECLARE @wi_cities = new string [] { "Milwaukee", "Madison", "Green Bay", "Kenosha", "Racine" };

@cities =
    SELECT * 
    FROM
    ( VALUES
        ( "West Virginia", new SqlArray<string>( @wv_cities ) ),
        ( "Wisconsin", SqlArray.Create( @wi_cities ) )

    ) AS T(State, Cities);

Sample data

@cities =
    SELECT * 
    FROM
    ( VALUES
        ( "Vermont","Burlington,Essex,South,Burlington,Colchester,Rutland" ),
        ( "Virginia","Virginia Beach,Norfolk,Chesapeake,Richmond,Newport News" ),
        ( "Washington","Seattle,Spokane,Tacoma,Vancouver,Bellevue" ),
        ( "West Virginia", "Charleston,Huntington,Parkersburg,Morgantown,Wheeling"),
        ( "Wisconsin","Milwaukee,Madison,Green Bay,Kenosha,Racine"),
        ( "Wyoming","Cheyenne,Casper,Laramie,Gillette,Rock Springs" )
    ) AS T(State, Cities);

@cities =
    SELECT 
        State,
        SqlArray.Create( Cities.Split(',') ) AS Cities 
    FROM @cities;
State stringCities SqlArray
VermontSqlArray{ “Burlington”, “Essex”, “South”, “Burlington”, “Colchester”, “Rutland” }
VirginiaSqlArray{ “Virginia Beach”, “Norfolk”, “Chesapeake”, “Richmond”, “Newport News” }
WashingtonSqlArray{ “Seattle”, “Spokane”, “Tacoma”, “Vancouver”, “Bellevue” }
West VirginiaSqlArray{ “Charleston”, “Huntington”, “Parkersburg”, “Morgantown”, “Wheeling” }
WisconsinSqlArray{ “Milwaukee”, “Madison”, “Green Bay”, “Kenosha”, “Racine” }
WyomingSqlArray{ “Cheyenne”, “Casper”, “Laramie”, “Gillette”, “Rock Springs” }

Array Indexing

Use the array indexing operator [n] where n is a long. The first index is 0 – just like .NET


@output =
    SELECT
        State,
        Cities[0] AS FirstCity
    FROM @output;
StateFirstCity
VermontBurlington
VirginiaVirginia Beach
WashingtonSeattle
West VirginiaCharleston
WisconsinMilwaukee
WyomingCheyenne

Removing members

@output =
    SELECT
        State ,
        SqlArray.Create( Cities.Where( c=>c.StartsWith("C") ) ) AS Cities
    FROM @output;
StateCities
VermontSqlArray{ “Colchester” }
VirginiaSqlArray{ “Chesapeake” }
WashingtonSqlArray{ }
West VirginiaSqlArray{ “Charleston” }
WisconsinSqlArray{ }
WyomingSqlArray{ “Cheyenne”, “Casper” }

Counting members


@output =
    SELECT
        State ,
        SqlArray.Create( Cities.Where( c=>c.StartsWith("C") ) ) AS Cities
    FROM @output;

@output =
    SELECT
        State ,
        Cities ,
        Cities.Count AS NumCities
    FROM @output;
StateCitiesNumCities
VermontSqlArray{ “Colchester” }1
VirginiaSqlArray{ “Chesapeake” }1
WashingtonSqlArray{ }0
West VirginiaSqlArray{ “Charleston” }1
WisconsinSqlArray{ }0
WyomingSqlArray{ “Cheyenne”, “Casper” }2

MAPs

Overview

Creating a SqlMap from constant values

The following snippet creates two rows and each has a SqlMap where the key (K) is a string and the value is also a string.

@projectmembers = 
    SELECT *
    FROM
    ( VALUES
        ( "Website", new SqlMap<string,string> { 
                {"Mallory", "PM"}, 
                {"Bob", "Dev"} ,
                {"Alice", "Dev"} ,
                {"Stan", "Dev"} ,
                {"Chris", "UX"} ,
             } 
        ),
        ( "DB", new SqlMap<string,string> { 
                {"Ted", "Test"}, 
                {"Joe", "Dev"} ,
                {"Chuck", "Dev"} 
             } 
        )
)
AS T(Project, Members);
ProjectMembers
WebsiteSqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Mallory=PM; Stan=Dev }
DBSqlMap{ Chuck=Dev; Joe=Dev; Ted=Test }

Creating Empty maps

new SqlMap<string,string> ( )

or

new SqlMap<string,string> { }

Maps from Maps: Removing members based on keys

@output =
    SELECT Project,
           new SqlMap<string,string>(Members.Where(kv => kv.Key != "Mallory")) AS Members
    FROM @projectmembers;
ProjectMembers
WebsiteSqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Stan=Dev }
DBSqlMap{ Chuck=Dev; Joe=Dev; Ted=Test }

Alternatively you can use the SqlMap.Create static method instead

@output =
    SELECT Project,
           SqlMap.Create(Members.Where(kv => kv.Key != "Mallory")) AS Members
    FROM @projectmembers;
ProjectMembers
WebsiteSqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Stan=Dev }
DBSqlMap{ Chuck=Dev; Joe=Dev; Ted=Test }

Combining rows into maps with MAP_AGG

@projectmembers = 
    SELECT *
    FROM
    ( VALUES
        ( "Website","Mallory", "PM" ), 
        ( "Website","Bob", "Dev" ),
        ( "Website","Alice", "Dev" ) ,
        ( "Website","Stan", "Dev" ) ,
        ( "Website","Chris", "UX" ) ,
        ( "DB", "Ted", "Test" ), 
        ( "DB", "Joe", "Dev" ) ,
        ( "DB", "Chuck", "Dev" ) 
    )
AS T(Project, Employee, Role);

@projectmembers =
    SELECT Project,
           MAP_AGG<string, string>(Employee, Role) AS Members
    FROM @projectmembers_raw
    GROUP BY Project;
ProjectMembers
DBSqlMap{ Chuck=Dev; Joe=Dev; Ted=Test }
WebsiteSqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Mallory=PM; Stan=Dev }

Removing members based on values

@output =
    SELECT Project,
           SqlMap.Create(Members.Where(kv => kv.Value != "Dev")) AS Members
    FROM @projectmembers;
ProjectMembers
WebsiteSqlMap{ Chris=UX; Mallory=PM }
DBSqlMap{ Ted=Test }

counting members

@output =
    SELECT Project,
           Members,
           Members.Count AS Count
FROM @projectmembers;
ProjectMembersCount
WebsiteSqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Mallory=PM; Stan=Dev }5
DBSqlMap{ Chuck=Dev; Joe=Dev; Ted=Test }3

Retrieving values

@output =
    SELECT 
        Project,
        Members["Mallory"] AS MalloryRole
   FROM @projectmembers;

Note that if the key is missing then the default value for the type is returned

ProjectMalloryRole
WebsitePM
DBnull

Checking if a key exists

@output =
    SELECT 
        Project,
        Members.ContainsKey("Mallory") AS ContainsMallory
FROM @projectmembers;
ProjectContainsMallory
WebsiteTrue
DBFalse