Set operations are a way of merging rowsets together based on set theoretic operations such as union (UNION). intersection (INTERSECT), complement (EXCEPT).
Sample data
Let’s define two RowSets: @a
and @b
. Notice that both RowSets have duplicate rows.
@a =
SELECT * FROM
(VALUES
(1, "Smith"),
(1, "Smith"),
(2, "Brown"),
(3, "Case")
) AS D( DepID, DepName );
@b =
SELECT * FROM
(VALUES
(1, "Smith"),
(1, "Smith"),
(1, "Smith"),
(2, "Brown"),
(4, "Dey"),
(4, "Dey")
) AS D( DepID, DepName );
UNION
UNION combines two rowsets.
UNION ALL
As you can see UNION ALL clearly leaves in duplicate rows.
@union_all =
SELECT * FROM @a
UNION ALL
SELECT * FROM @b;
DepID | Name |
---|---|
1 | Smith |
1 | Smith |
2 | Brown |
3 | Case |
1 | Smith |
1 | Smith |
1 | Smith |
2 | Brown |
4 | Dey |
4 | Dey |
UNION DISTINCT
UNION DISTINCT discards duplicate rows.
@union_distinct =
SELECT * FROM @a
UNION DISTINCT
SELECT * FROM @b;
DepID | Name |
---|---|
1 | Smith |
2 | Brown |
3 | Case |
4 | Dey |
Schema requirements
UNION by default require that the RowSets have the same schema.
- Each column must have the same name and data type
- The columns must appear in the same order in the rowset schema
Finding Common Rows with INTERSECT
Sometimes, we only care about the rows both rowsets have in common. We use the INTERSECT operator to accomplish this. INTERSECT ALL preserves duplicates while INTERSECT removes duplicates.
INTERSECT ALL
@intersect_all =
SELECT * FROM @a
INTERSECT ALL
SELECT * FROM @b;
@intersect_all
DepID | name |
---|---|
1 | Smith |
1 | Smith |
2 | Brown |
INTERSECT DISTINCT
@intersect_distinct =
SELECT * FROM @a
INTERSECT DISTINCT
SELECT * FROM @b;
@intersect_distinct
DepID | name |
---|---|
1 | Smith |
2 | Brown |
Finding Rows That Are NOT in the Other RowSet with EXCEPT
The EXCEPT operator returns all the rows in the left RowSet that are not in the right RowSet.
EXCEPT ALL
@a_except_b_all =
SELECT * FROM @a
EXCEPT ALL
SELECT * FROM @b;
@a_except_b_all
DepID | name |
---|---|
3 | Case |
@b_except_a_all =
SELECT * FROM @b
EXCEPT ALL
SELECT * FROM @a;
@b_except_a_all
DepID | name |
---|---|
1 | Smith |
4 | Dey |
4 | Dey |
EXCEPT DISTINCT
@a_except_b_distinct =
SELECT * FROM @a
EXCEPT DISTINCT
SELECT * FROM @b;
@a_except_b_distinct
DepID | name |
---|---|
3 | Case |
@b_except_a_distinct =
SELECT * FROM @b
EXCEPT DISTINCT
SELECT * FROM @a;
@b_except_a_distinct
DepID | name |
---|---|
4 | Dey |
OUTER UNION
By default UNION requires both RowSets to have matching schemas. OUTER UNION allows the schemas to be different. If one RowSet is missing a column that the other has, that row will be included in the result with a default value for the missing columns.
NOTE: OUTER UNION only supports ALL. It does not support DISTINCT.
The following script will union the two rowsets @left
and @right
with the partially overlapping schema on columns A
and K
while filling in null
into the “missing cells” of column C
and 0
as the default value for type int
for column B
.
@left =
SELECT *
FROM (VALUES ( 1, "x", (int?) 50 ),
( 1, "y", (int?) 60 )
) AS L(K, A, C);
@right =
SELECT *
FROM (VALUES ( 5, "x", 1 ),
( 6, "x", 2 ),
(10, "y", 3 )
) AS R(B, A, K);
@res =
SELECT * FROM @left
OUTER UNION BY NAME ON (*)
SELECT * FROM @right;
The result is:
K | A | C | B |
---|---|---|---|
1 | “x” | 50 | 0 |
1 | “x” | 5 | |
1 | “y” | 60 | 0 |
2 | “x” | 6 | |
3 | “y” | 10 |
Joins
A JOIN is a way of combining rowsets together based on key.
Sample data
@departments =
SELECT * FROM
(VALUES
("31", "Sales"),
("33", "Engineering"),
("34", "Clerical"),
("35", "Marketing")
) AS D( DepID, DepName );
@employees =
SELECT * FROM
(VALUES
("31", "Rafferty"),
("33", "Jones"),
("33", "Heisenberg"),
("34", "Robinson"),
("34", "Smith"),
((string)null, "Williams")
) AS D( DepID, EmpName );
CROSS JOIN
@cross_join =
SELECT
@departments.DepID AS DepID_Dep,
@employees.DepID AS DepID_Emp,
@employees.EmpName,
@departments.DepName
FROM @employees CROSS JOIN @departments;
DepIDDep | DepIDEmp | EmpName | DepName |
---|---|---|---|
31 | 31 | Rafferty | Sales |
31 | 31 | Rafferty | Sales |
|31|33|Jones|Sales| |31|33|Heisenberg|Sales| |31|34|Robinson|Sales| |31|34|Smith|Sales| |31|ERR|Williams|Sales| |33|31|Rafferty|Engineering| |33|33|Jones|Engineering| |33|33|Heisenberg|Engineering| |33|34|Robinson|Engineering| |33|34|Smith|Engineering| |33|ERR|Williams|Engineering| |34|31|Rafferty|Clerical| |34|33|Jones|Clerical| |34|33|Heisenberg|Clerical| |34|34|Robinson|Clerical| |34|34|Smith|Clerical| |34|ERR|Williams|Clerical| |35|31|Rafferty|Marketing| |35|33|Jones|Marketing| |35|33|Heisenberg|Marketing| |35|34|Robinson|Marketing| |35|34|Smith|Marketing| |35|NULL|Williams|Marketing|
CROSS JOIN
@cross_join =
SELECT
@departments.DepID AS DepID_Dep,
@employees.DepID AS DepID_Emp,
@employees.EmpName,
@departments.DepName
FROM
@employees CROSS JOIN @departments;
DepID_Dep | DepID_Emp | EmpName | DepName |
---|---|---|---|
31 | 31 | Rafferty | Sales |
31 | 33 | Jones | Sales |
31 | 33 | Heisenberg | Sales |
31 | 34 | Robinson | Sales |
31 | 34 | Smith | Sales |
31 | NULL | Williams | Sales |
33 | 31 | Rafferty | Engineering |
33 | 33 | Jones | Engineering |
33 | 33 | Heisenberg | Engineering |
33 | 34 | Robinson | Engineering |
33 | 34 | Smith | Engineering |
33 | NULL | Williams | Engineering |
34 | 31 | Rafferty | Clerical |
34 | 33 | Jones | Clerical |
34 | 33 | Heisenberg | Clerical |
34 | 34 | Robinson | Clerical |
34 | 34 | Smith | Clerical |
34 | NULL | Williams | Clerical |
35 | 31 | Rafferty | Marketing |
35 | 33 | Jones | Marketing |
35 | 33 | Heisenberg | Marketing |
35 | 34 | Robinson | Marketing |
35 | 34 | Smith | Marketing |
35 | NULL | Williams | Marketing |
SEMIJOIN
SEMIJOIN is a way to filter a RowSet using another RowSet.
A simple example is: “Find all the employees in this RowSet A, where the employee appears also in RowSet B”.
There are two variants:
- LEFT SEMIJOIN -> Give only those rows in the left rowset that have a matching row in the right rowset.
- RIGHT SEMIJOIN -> Give only those rows in the right rowset that have a matching row in the left rowset.
NOTE: If you leave out LEFT or RIGHT, and instead simply write SEMIJOIN then what you get is LEFT SEMIJOIN. Do not leave out LEFT or RIGHT always explicitly it.
Find all employees that are in valid departments
@left_semijoin1 =
SELECT
@employees.DepID,
@employees.EmpName
FROM @employees
LEFT SEMIJOIN @departments
ON @employees.DepID == @departments.DepID;
DepID | EmpName |
---|---|
31 | Rafferty |
33 | Jones |
33 | Heisenberg |
34 | Robinson |
34 | Smith |
Find all departments that has an employee listed in the employee RowSet.
@left_semijoin2 =
SELECT
@departments.DepID,
@departments.DepName
FROM @departments
LEFT SEMIJOIN @employees
ON @departments.DepID == @employees.DepID;
DepID | DepName |
---|---|
31 | Sales |
33 | Engineering |
34 | Clerical |
ANTISEMIJOIN
An ANTISEMIJOIN is the opposite of SEMIJOIN
There are two variants:
- LEFT ANTISEMIJOIN -> Give only those rows in the left RowSet that DO NOT have a matching row in the right rowset.
- RIGHT ANTISEMIJOIN -> Give only those rows in the right RowSet that DO NOT have a matching row in the left RowSet.
Find all departments that don’t have an employee listed in the employees RowSet.
@left_antisemijoin =
SELECT
@departments.DepID,
@departments.DepName
FROM @departments
LEFT ANTISEMIJOIN @employees
ON @departments.DepID == @employees.DepID;
DepID | DepName |
---|---|
35 | Marketing |
Complex Types
Nullability
Complex types are reference types. You MUST check if they are null first before using them.
Maximum size of a complex value
There is no specific limit inherent to Complex Types
However, rowsets must still fit within the limits defined by U-SQL (currently no row can be bigger than 4MB)
Persistence
U-SQL Tables
Complex Types can be stored in and read from U-SQL Tables without any additional effort.
Files
To read/write Complex Types with files, you must use your own Custom Outputter & Extractor.
ARRAYs
Overview
- An ordered list of values (all of the same type)
- Immutable
- Single-Dimension only
Creating arrays initialized with data
You can create a SQlArray using the SqlArray constructor or use the SqlArray.Create static method
The SqlArray.Create static method can also be used to create arrays.
@output =
SELECT *
FROM
( VALUES
( "West Virginia", new SqlArray<string>( new string [] { "Charleston", "Huntington", "Parkersburg", "Morgantown", "Wheeling"} ) ),
( "Wisconsin", SqlArray.Create( new string [] { "Milwaukee", "Madison", "Green Bay", "Kenosha", "Racine"} ) )
) AS T(State, Cities);
Creating Empty arrays
new SqlArray<string> ( )
or
new SqlArray<string> { }
Creating an ARRAY from an .NET Array
DECLARE @wv_cities = new string [] { "Charleston", "Huntington", "Parkersburg", "Morgantown", "Wheeling"};
DECLARE @wi_cities = new string [] { "Milwaukee", "Madison", "Green Bay", "Kenosha", "Racine" };
@cities =
SELECT *
FROM
( VALUES
( "West Virginia", new SqlArray<string>( @wv_cities ) ),
( "Wisconsin", SqlArray.Create( @wi_cities ) )
) AS T(State, Cities);
Sample data
@cities =
SELECT *
FROM
( VALUES
( "Vermont","Burlington,Essex,South,Burlington,Colchester,Rutland" ),
( "Virginia","Virginia Beach,Norfolk,Chesapeake,Richmond,Newport News" ),
( "Washington","Seattle,Spokane,Tacoma,Vancouver,Bellevue" ),
( "West Virginia", "Charleston,Huntington,Parkersburg,Morgantown,Wheeling"),
( "Wisconsin","Milwaukee,Madison,Green Bay,Kenosha,Racine"),
( "Wyoming","Cheyenne,Casper,Laramie,Gillette,Rock Springs" )
) AS T(State, Cities);
@cities =
SELECT
State,
SqlArray.Create( Cities.Split(',') ) AS Cities
FROM @cities;
State string | Cities SqlArray |
---|---|
Vermont | SqlArray{ “Burlington”, “Essex”, “South”, “Burlington”, “Colchester”, “Rutland” } |
Virginia | SqlArray{ “Virginia Beach”, “Norfolk”, “Chesapeake”, “Richmond”, “Newport News” } |
Washington | SqlArray{ “Seattle”, “Spokane”, “Tacoma”, “Vancouver”, “Bellevue” } |
West Virginia | SqlArray{ “Charleston”, “Huntington”, “Parkersburg”, “Morgantown”, “Wheeling” } |
Wisconsin | SqlArray{ “Milwaukee”, “Madison”, “Green Bay”, “Kenosha”, “Racine” } |
Wyoming | SqlArray{ “Cheyenne”, “Casper”, “Laramie”, “Gillette”, “Rock Springs” } |
Array Indexing
Use the array indexing operator [n] where n is a long. The first index is 0 – just like .NET
@output =
SELECT
State,
Cities[0] AS FirstCity
FROM @output;
State | FirstCity |
---|---|
Vermont | Burlington |
Virginia | Virginia Beach |
Washington | Seattle |
West Virginia | Charleston |
Wisconsin | Milwaukee |
Wyoming | Cheyenne |
Removing members
@output =
SELECT
State ,
SqlArray.Create( Cities.Where( c=>c.StartsWith("C") ) ) AS Cities
FROM @output;
State | Cities |
---|---|
Vermont | SqlArray{ “Colchester” } |
Virginia | SqlArray{ “Chesapeake” } |
Washington | SqlArray{ } |
West Virginia | SqlArray{ “Charleston” } |
Wisconsin | SqlArray{ } |
Wyoming | SqlArray{ “Cheyenne”, “Casper” } |
Counting members
@output =
SELECT
State ,
SqlArray.Create( Cities.Where( c=>c.StartsWith("C") ) ) AS Cities
FROM @output;
@output =
SELECT
State ,
Cities ,
Cities.Count AS NumCities
FROM @output;
State | Cities | NumCities |
---|---|---|
Vermont | SqlArray{ “Colchester” } | 1 |
Virginia | SqlArray{ “Chesapeake” } | 1 |
Washington | SqlArray{ } | 0 |
West Virginia | SqlArray{ “Charleston” } | 1 |
Wisconsin | SqlArray{ } | 0 |
Wyoming | SqlArray{ “Cheyenne”, “Casper” } | 2 |
MAPs
Overview
Creating a SqlMap from constant values
The following snippet creates two rows and each has a SqlMap where the key (K) is a string and the value is also a string.
@projectmembers =
SELECT *
FROM
( VALUES
( "Website", new SqlMap<string,string> {
{"Mallory", "PM"},
{"Bob", "Dev"} ,
{"Alice", "Dev"} ,
{"Stan", "Dev"} ,
{"Chris", "UX"} ,
}
),
( "DB", new SqlMap<string,string> {
{"Ted", "Test"},
{"Joe", "Dev"} ,
{"Chuck", "Dev"}
}
)
)
AS T(Project, Members);
Project | Members |
---|---|
Website | SqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Mallory=PM; Stan=Dev } |
DB | SqlMap{ Chuck=Dev; Joe=Dev; Ted=Test } |
Creating Empty maps
new SqlMap<string,string> ( )
or
new SqlMap<string,string> { }
Maps from Maps: Removing members based on keys
@output =
SELECT Project,
new SqlMap<string,string>(Members.Where(kv => kv.Key != "Mallory")) AS Members
FROM @projectmembers;
Project | Members |
---|---|
Website | SqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Stan=Dev } |
DB | SqlMap{ Chuck=Dev; Joe=Dev; Ted=Test } |
Alternatively you can use the SqlMap.Create static method instead
@output =
SELECT Project,
SqlMap.Create(Members.Where(kv => kv.Key != "Mallory")) AS Members
FROM @projectmembers;
Project | Members |
---|---|
Website | SqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Stan=Dev } |
DB | SqlMap{ Chuck=Dev; Joe=Dev; Ted=Test } |
Combining rows into maps with MAP_AGG
@projectmembers =
SELECT *
FROM
( VALUES
( "Website","Mallory", "PM" ),
( "Website","Bob", "Dev" ),
( "Website","Alice", "Dev" ) ,
( "Website","Stan", "Dev" ) ,
( "Website","Chris", "UX" ) ,
( "DB", "Ted", "Test" ),
( "DB", "Joe", "Dev" ) ,
( "DB", "Chuck", "Dev" )
)
AS T(Project, Employee, Role);
@projectmembers =
SELECT Project,
MAP_AGG<string, string>(Employee, Role) AS Members
FROM @projectmembers_raw
GROUP BY Project;
Project | Members |
---|---|
DB | SqlMap{ Chuck=Dev; Joe=Dev; Ted=Test } |
Website | SqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Mallory=PM; Stan=Dev } |
Removing members based on values
@output =
SELECT Project,
SqlMap.Create(Members.Where(kv => kv.Value != "Dev")) AS Members
FROM @projectmembers;
Project | Members |
---|---|
Website | SqlMap{ Chris=UX; Mallory=PM } |
DB | SqlMap{ Ted=Test } |
counting members
@output =
SELECT Project,
Members,
Members.Count AS Count
FROM @projectmembers;
Project | Members | Count |
---|---|---|
Website | SqlMap{ Alice=Dev; Bob=Dev; Chris=UX; Mallory=PM; Stan=Dev } | 5 |
DB | SqlMap{ Chuck=Dev; Joe=Dev; Ted=Test } | 3 |
Retrieving values
@output =
SELECT
Project,
Members["Mallory"] AS MalloryRole
FROM @projectmembers;
Note that if the key is missing then the default value for the type is returned
Project | MalloryRole |
---|---|
Website | PM |
DB | null |
Checking if a key exists
@output =
SELECT
Project,
Members.ContainsKey("Mallory") AS ContainsMallory
FROM @projectmembers;
Project | ContainsMallory |
---|---|
Website | True |
DB | False |