Using Azure Data Lake to copy data from a CSV file to a SQL database


Great, but what is Azure Data Lake?

Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics.

According to Microsoft, customers can provision Azure Data Lakes to store an unlimited amount of structured, semi-structured or unstructured data from a variety of sources. The service does not impose limits on account sizes, file sizes, or the amount of data that can be stored in a data lake.

Pricing for Azure Data Lake depends on several variables, including storage capacity and the number of analytics units (AUs) consumed. As of this writing, the Azure Data Lake Store service is priced at $0.039 per GB per month for pay-as-you-go, with capacity-based discounts of up to 33% for monthly commitments. The Azure Pricing Calculator can help customers determine exact data lake costs.

In this tutorial, you use the Azure portal to create a data factory. Then, you use the Copy Data tool to create a pipeline that copies data from a CSV file in Blob storage to a SQL database.

Prerequisites

  • Azure storage account: Use Blob storage as the source data store. If you don’t have an Azure storage account, see the instructions in Create a storage account.
  • Azure SQL Database: Use a SQL database as the sink data store. If you don’t have a SQL database, see the instructions in Create a SQL database.
  • The Sales Jan 2009 file contains some sample sales transactions. You can download the CSV data here.

Prepare your Blob storage and your SQL database by performing the following steps.

  • Create a container named adfv2 and upload the CSV file to the container. You can use various tools to perform these tasks, such as Azure Storage Explorer, or do it programmatically, as in the sketch below.
[Image: Blob Storage folder]
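If you prefer to script this step, here is a minimal sketch using the azure-storage-blob Python package. The connection string and the blob name (SalesJan2009.csv) are placeholder assumptions; replace them with your own values:

from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

# Placeholder: copy the connection string from the storage account's
# Access keys blade in the Azure portal.
CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)

# Create the adfv2 container; carry on if it already exists.
try:
    service.create_container("adfv2")
except ResourceExistsError:
    pass

# Upload the sales CSV into the container.
blob = service.get_blob_client(container="adfv2", blob="SalesJan2009.csv")
with open("SalesJan2009.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)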

Use the following SQL script to create a database named adl_db on your server (dbo is a schema prefix for tables, not part of a database name):

CREATE DATABASE adl_db;

Also, create the sink table, dbo.dl_sales. Give the DECIMAL columns an explicit precision and scale (a bare DECIMAL defaults to a scale of 0, which would truncate prices and coordinates), and use NVARCHAR rather than the deprecated TEXT type:

CREATE TABLE dbo.dl_sales (
    Transaction_date DATETIME,
    Product NVARCHAR(255),
    Price DECIMAL(10, 2),    -- explicit scale so cents are not truncated
    Payment_Type NVARCHAR(50),
    Name NVARCHAR(255),
    City NVARCHAR(255),
    State NVARCHAR(255),
    Country NVARCHAR(255),
    Account_Created DATETIME,
    Last_Login DATETIME,
    Latitude DECIMAL(9, 6),  -- six decimal places for coordinates
    Longitude DECIMAL(9, 6)
);
[Image: SQL database created]
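Before building the pipeline, it can help to preview the CSV locally and confirm its columns line up with the dl_sales schema. A quick sketch with pandas, assuming the file is named SalesJan2009.csv:

import pandas as pd

# Load the CSV and compare its columns against the sink table's columns.
df = pd.read_csv("SalesJan2009.csv")
print(df.columns.tolist())
print(df.head())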

Let’s create a data factory.


Azure Data Factory is a cloud-based data integration service that lets you compose data storage, movement, and processing services into automated data pipelines. It is built for exactly this kind of scenario: you create data-driven workflows in the cloud that orchestrate and automate data movement and data transformation.

On the left menu, select Create a resource > Data + Analytics > Data Factory. On the New data factory page, under Name, enter adfdatafactoryprod.
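The portal is the quickest route, but the same factory can also be created from code with the azure-mgmt-datafactory Python SDK. This is only a sketch; the subscription ID, resource group, and region below are assumptions to replace with your own:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "my-resource-group"                      # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create (or update) the data factory used in this walkthrough.
factory = adf_client.factories.create_or_update(
    RESOURCE_GROUP, "adfdatafactoryprod", Factory(location="eastus")
)
print(factory.provisioning_state)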
Great! Now launch the Azure Data Factory user interface (UI) in a separate tab by selecting the Author & Monitor tile.
On the Let’s get started page, select Copy Data.
On the Properties page, under Task name, enter CopyMyCSVToSqlPipeline. Then select Next. The Data Factory UI creates a pipeline with the specified task name.
On the Source data store page, select Create new connection. Choose Azure Blob Storage, then give the connection a name and select the storage account where the CSV file is located.

[Image: Blob Storage connection created]
Click Browse, choose your CSV file, and review the dataset preview to confirm the columns and delimiter are detected correctly.
Next, configure the destination. On the Destination data store page, select Azure SQL Database and create a new connection.
Give the connection a name, then enter the SQL server name, the database name (adl_db), and the database credentials. The destination connection is created.
Choose your SQL database and select dbo.dl_sales as the destination table.
On the Deployment page, select Monitor to monitor the pipeline (task).
Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to view activity run details and to rerun the pipeline. Select Refresh to refresh the list.
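The same run status is available outside the portal as well. Here is a hedged sketch with the azure-mgmt-datafactory SDK that lists recent pipeline runs; the subscription ID and resource group are placeholders, and the factory name matches the one created earlier:

from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
RESOURCE_GROUP = "my-resource-group"                      # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Query pipeline runs updated in the last 24 hours and print their status.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
runs = adf_client.pipeline_runs.query_by_factory(
    RESOURCE_GROUP, "adfdatafactoryprod", filters
)
for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status)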

Well done! Verify that the data was inserted into the dbo.dl_sales table in your SQL database.
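If you would rather check from code than from the portal’s query editor, a minimal pyodbc sketch works; the server name, user, and password are placeholders:

import pyodbc

# Placeholders: point this at your own Azure SQL server and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tcp:your-server.database.windows.net,1433;"
    "DATABASE=adl_db;UID=your-user;PWD=your-password;Encrypt=yes;"
)
cursor = conn.cursor()

# The row count should match the number of records in the CSV file.
cursor.execute("SELECT COUNT(*) FROM dbo.dl_sales")
print("rows copied:", cursor.fetchone()[0])

# Spot-check a few rows.
cursor.execute("SELECT TOP 5 Transaction_date, Product, Price FROM dbo.dl_sales")
for row in cursor.fetchall():
    print(row)

conn.close()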


Conclusion

Azure Data Lake is an important part of Microsoft’s ambitious cloud offering. With Data Lake, Microsoft provides a service to store and analyze data of any size at an affordable cost. In related posts, we will learn more about Data Lake Store, Data Lake Analytics, and HDInsight.

Original post at https://bit.ly/2NAA1BF