Performance Optimization Techniques on Azure Data Lake

The following article describes performance optimization techniques on Azure Data Lake. It is assumed that readers have a general idea of ADL, U-SQL, and querying in general. Before we move on, let's understand the key terms that determine query performance:

  • Analytic Unit
  • Stages in a job
  • Job Graph

1.     What is an Analytic Unit?

  • It is a unit of compute resource
  • Currently, each Analytic Unit (AU) is equivalent to 2 CPUs and 6 GB of RAM
  • More Analytic Units increase parallelism and hence speed up execution

2.     Stages in a job:

  • Preparing: the script is uploaded to the cloud and prepared for execution
  • Queuing: the job waits for Analytic Units to become available
  • Running: the job runs; this is where cost is incurred
  • Finalizing: the output is written and finalized

 3.   What happens when you submit a job (Job Graph)?

After a job is submitted, the following steps are performed in the backend (regardless of whether the job eventually succeeds or fails):

  • The U-SQL compiler looks at the script and creates an execution plan; the plan is broken into units of work called vertices
  • Each vertex is run by an AU
  • More AUs means more vertices run in parallel (faster execution at a higher cost), and the execution is displayed as a job graph like the one below

1) The green box is called a super vertex (it groups vertices of a similar type)

2) In this example it is performing an extraction and has 1 vertex

3) In most cases the job graph gives us the execution time and the reads and writes performed by each super vertex.

This helps us narrow down where to focus our troubleshooting.
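
For context, below is a minimal sketch of the kind of script that produces such a graph. The file paths, schema, and column names are illustrative placeholders, not part of the original example.

    // A small script: the EXTRACT becomes an extraction super vertex,
    // the aggregation becomes a separate stage, and OUTPUT writes the result.
    // "/input/sales.csv" and all column names are made-up placeholders.
    @sales =
        EXTRACT Region string,
                Amount decimal
        FROM "/input/sales.csv"
        USING Extractors.Csv(skipFirstNRows : 1);

    @totals =
        SELECT Region,
               SUM(Amount) AS TotalAmount
        FROM @sales
        GROUP BY Region;

    OUTPUT @totals
    TO "/output/totals_by_region.csv"
    USING Outputters.Csv(outputHeader : true);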

Common Troubleshooting techniques:

1) Index Creation:

For example: we have a U-SQL query that is taking a long time and doing lots of joins. In such cases, create an index on the joining columns and on the filtered columns (WHERE clause).

As of today, ADL supports only clustered indexes, and:

  • We can specify an index on multiple columns
  • We can specify which way each column is sorted (ASC or DESC), so that fetching goes directly to the desired data; a sketch of such a table definition follows this list
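
A minimal sketch of such a table definition, assuming a hypothetical Orders table that is frequently joined and filtered on CustomerId (table and column names are illustrative). In U-SQL the clustered index is declared as part of CREATE TABLE, together with a distribution scheme:

    // Hypothetical orders table: the clustered index is placed on the column
    // we join and filter on most often (CustomerId), sorted ascending.
    CREATE TABLE IF NOT EXISTS dbo.Orders
    (
        OrderId long,
        CustomerId int,
        OrderDate DateTime,
        Amount decimal,
        INDEX idx_Orders CLUSTERED (CustomerId ASC)
        DISTRIBUTED BY HASH (CustomerId)
    );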

2) Choose “Number of Analytic Units” Carefully:

  • Don't be under the impression that more Analytic Units automatically mean more speed. If the data is not properly organized, more AUs may only result in higher cost
  • Determine the number of AUs based on the stages and the data size. For example, if we assign 7 AUs to a job with 2 stages, and the first stage needs 7 AUs but the second needs only 2, then the remaining 5 AUs sit idle during the second stage and we still pay for them.

3) Partition: Partitioning (dividing) the data gives better manageability and query performance. The ideal partition size recommended for a table is about 1 GB.

Three kinds of data partitioning can be done:

Horizontal partitioning: Horizontal partitioning involves putting different rows into different tables.

For example: we have a table that covers all regions; we can partition the data so that rows for "USA" and "South America" go into one table and the rest of the regions into another table.

  • In simple terms, horizontal partitioning means we select a subset of the tuples (rows) into each partition, and a later UNION ALL over the partitions returns the entire data set; see the sketch after this list.
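
In ADL tables, this kind of horizontal split is typically expressed with PARTITIONED BY rather than physically separate tables. A minimal sketch, with an illustrative table, columns, and partition values:

    // Hypothetical sales table, horizontally partitioned by region.
    CREATE TABLE IF NOT EXISTS dbo.SalesByRegion
    (
        SaleId long,
        Region string,
        Amount decimal,
        INDEX idx_Sales CLUSTERED (SaleId ASC)
        PARTITIONED BY (Region)
        DISTRIBUTED BY HASH (SaleId)
    );

    // Partitions must exist before rows are inserted into them.
    ALTER TABLE dbo.SalesByRegion
    ADD IF NOT EXISTS PARTITION ("USA"), PARTITION ("South America");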

Vertical partitioning: Involves creating tables with fewer columns and using additional tables to store the remaining columns.

  • The primary purpose of this partitioning is to reduce the I/O and performance cost associated with fetching data; a sketch follows this list.
  • Example: normalized data.
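
A minimal sketch of vertical partitioning, assuming hypothetical customer tables: the frequently read, narrow columns go into one table and the rarely read, wide columns into another, joined on the same key only when needed:

    // Frequently queried, narrow columns.
    CREATE TABLE IF NOT EXISTS dbo.CustomerCore
    (
        CustomerId int,
        Name string,
        Region string,
        INDEX idx_CustomerCore CLUSTERED (CustomerId ASC)
        DISTRIBUTED BY HASH (CustomerId)
    );

    // Rarely read, wide columns kept out of the hot path.
    CREATE TABLE IF NOT EXISTS dbo.CustomerDetails
    (
        CustomerId int,
        ProfileJson string,
        Notes string,
        INDEX idx_CustomerDetails CLUSTERED (CustomerId ASC)
        DISTRIBUTED BY HASH (CustomerId)
    );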

Functional Partitioning: Separate the data that has frequent I/O operations from data that is read-only.
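
One simple way to express this in ADL is to keep the two workloads in separate databases (or separate tables); the database names below are illustrative:

    // Write-heavy event data lives in its own database...
    CREATE DATABASE IF NOT EXISTS EventsDb;

    // ...while static, read-only reference data lives in another,
    // so its I/O pattern does not interfere with the hot path.
    CREATE DATABASE IF NOT EXISTS ReferenceDb;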
