SSIS 2016: An efficient way to handle transform and load of SCD2

Quick Review:
This blog post is about type two slowly changing dimensions (SCD2).  This is when an attribute change in row 1 results in SSIS expiring the current row and inserting a new dimension table row like this –>

Picture

 SSIS comes packaged with a SCD2 task, but just because it works, does not mean that we should use it.  Think of the pre-packaged Microsoft supplied SCD2 task as a suggestion.  If we really want to get the job done, you can also use Cozy Roc or Pragmatic WorksTask Factory (TF).   I strongly suggest Task Factory’s [Dimension Merge Slowly Changing Dimension] add-in to SSIS for the following reasons:

  1. Performance
  2. Easy of use
  3. Rich features

The key to success, in my case, was two-fold

  1. Sort the two OleDB source components (#1 and 3) both in the SELECT statement, and under the Advanced tab of the OleDB source task. 
  2. Be sure to choose the right keys in the TF Upsert destination (#6 and #8).

You can read about additional performance tips from Pragmatic Works on line help (performance tab) and they have an instructional video here.  The point of this blog post is to share a screen print that would have been helpful to me the first time I setup this component.

Picture

Now for the dissection

  1. This is your SELECT transform statement from your source system.  It must contain an ORDER BY clause.  The [OleDB SRC vwDimCustomer] task then must be sorted.  Right mouse click –> Show Advanced Editor –>  Input and Output Properties tab –> Ole DB Source Output –> IsSorted property = True.  On this same tab click on Ole DB Source Output –> Output Columns –> [Your Business Key Column Name]–> SortKeyPosition = 1
  2. This is an optional TF component, but best practice is to not allow NULL values, assuming your data warehouse is modeled and optimized for analytics.  Although all of the columns in vwDimCustomer have IsNull() functions, all the lookup values will come into the SCD2 task as NULL, so this step replaces NULL values with ‘Unknown’, ‘-1’, ‘N/A’, or whatever be your preference.
  3. In order to know if something is new or changed, we need a comparison table and this is the function of #3.  Just like #1, this source data must be sorted.  Example: SELECT * FROM edw.DimCustomer ORDER BY Customer BK.
  4. This is the hinge pin of the SCD2 process.  I will leave you to read up on how to configure the component as I don’t wish to rewrite Pragmatic Works Help.  One word of advice: Do not rest until all warnings have been corrected.  Where it is easy to get lost is configuring the outputs.
  5. The easiest of the four: a BK (business key) does not exist.  Insert it.  This is a traditional OleDB insert and we don’t map the PK (primary key) or SCD2 ExpirationDateTime column.  The PK is an identity seed and is taken care of by SQL Server.  The SCD2 expiration date should be NULL.  If you want a audacious default, like 12/31/9999, then this column must, of course, be mapped.
  6. This TF Upsert takes care of SCD1 rows: Rows that do not have a SCD2 change and are therefore updated completely, including historical columns.  This is a standard t-sql UPDATE.  My personal preference is to use the TF Upsert Column Compare update method so rows that have no actual change, do not have an unnecessary update and meaningless [RowLastUpdated] timestamp.  TF Upsert Column Compare works much like a hash value in many ETL methodologies.  “If exists, go down SCD1 pipeline, but don’t update the row unless there is an actual change.”  Critical success point: key on the BK, not the PK!
  7. We finally get to SCD2 results with #7.  This is where new rows are created because of an identified SCD2 column change.
  8. When a new row is added, the old row must be expired.  This is the functionality of #8.  Because we definitely have a change in the row, there is no purpose to spend the time doing a TF Upsert Column Compare.  Set this TG Upsert to bulk update.  Critical success point: key on PK, not BK!

Conclusion:  Having fulling understood SCD2 concepts, this TF component took me a little bit to configure — I had to actually think about things vs the rhythmic repetition of common data flow tasks.  On first pass I skipped past the sort of the two source components and out of habit, picked up the PK in both TF Upsert components.  I didn’t pay attention to the <New> OleDB mapping and tried to insert my edw.DimCustomer.PK (hello?!) .  My advice is to SLOW DOWN and get the first SCD2 dimension SSIS package built and aggressively tested, then fall back into the rinse and repeat rhythm of SSIS package development.