Quick CSV to SQL with Azure Databricks | MagnusMinds Blog

Apr 14, 2023

In this blog, we will explore Azure Databricks, a cloud-based analytics platform, and how it can be used to parse a CSV file from Azure storage and then store the data in a database. Additionally, we will also learn how to process stream data and use Databricks notebook in Azure Data Pipeline.

 

Azure Databricks Overview

Azure Databricks is an Apache Spark-based analytics platform that provides a collaborative workspace for data scientists, data engineers, and business analysts. It is a cloud-based service that is designed to handle big data and allows users to process data at scale. Databricks also provides tools for data analysis, machine learning, and visualization. With its integration with Azure Storage, Azure Data Factory, and other Azure services, Azure Databricks can be used to build end-to-end data processing pipelines.

 

Parsing CSV File from Azure BlobStorage to Database using Azure Databricks

Azure Databricks can be used to parse CSV files from Azure Storage and then store the data in a database. Here are the steps to accomplish this:

 

Configure Various Azure Components

1. Create Azure Resource Group

Image 1

2. Create Azure DataBricks Resource 

Image 2

3. Create SQL Server Resource 

Image 3

4. Create SQL Database Resource

Image 4

5. Create Azure Storage Account 

Image 5

6. Create Azure DataFactory Resource 

Image 6

7. Launch Databricks Resource Workspace 

Image 7

8. Create Computing Cluster 

Image 8

9. Create New Notebook 

Image 9

 

Parsing CSV File from Azure Storage to Database using Azure Databricks

Azure Databricks can be used to parse CSV files from Azure Storage and then store the data in a database. Here are the steps to accomplish this:

1. Create a cluster: First, create a cluster in Azure Databricks as above. A cluster is a group of nodes that work together to process data.

2. Import all the necessary models in the databricks notebook 

%python

from datetime import datetime, timedelta

from azure.storage.blob import BlobServiceClient, generate_blob_sas, BlobSasPermissions

import pandas as pd

import pymssql

import pyspark.sql

Code 1

3. Mount Azure Storage: Next, mount the Azure Storage account in Databricks as follows

#Configure Blob Connection

storage_account_name = "storage"

storage_account_access_key="***********************************"

blob_container = "blob-container"

Code 2

4. Establish The DataBase Connection

#DB connection

conn = pymssql.connect(server='****************.database.windows.net', user='*****', password='*****', database='DataBricksDB')

cursor = conn.cursor()

Code 3

5. Parse CSV file: Once the storage account is mounted, you can parse the CSV file using the following code

#get a list of all blob from the container

blob_list = []

for blob_i in container_client.list_blobs():

blob_list.append(blob_i.name)

# print(blob_list)

    

df_list = []

#Generate SAS key for each file and load to the dataframe 

for blob_i in blob_list:

    print(blob_i)

    sas_i = generate_blob_sas(account_name = storage_account_name,

                             container_name = blob_container,

                             blob_name = blob_i,

                             account_key = storage_account_access_key,

                             permission = BlobSasPermissions(read=True),

                             expiry = datetime.utcnow() + timedelta(hours=12))

 

    sas_url = 'https://' + storage_account_name +'.blob.core.windows.net/' + blob_container + '/' +blob_i

    print(sas_url)

    

    df=pd.read_csv(sas_url)

    df_list.append(df)

Code 4

6. Transform and Store data in a database: Finally, you can store the data in a database using the following code

#Truncate Table Sales

Truncate_Query = "IF EXISTS (SELECT * FROM sysobjects WHERE name='sales' and xtype='U') truncate table sales"

cursor.execute(Truncate_Query)

conn.commit()

 

# SQL Query For Table Creation

create_table_query = "IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='sales' and xtype='U') CREATE TABLE sales (REGION  varchar(max),COUNTRY  varchar(max),ITEMTYPE  varchar(max),SALESCHANNEL  varchar(max),ORDERPRIORITY  varchar(max),ORDERDATE  varchar(max),ORDERID  varchar(max),SHIPDATE  varchar(max),UNITSSOLD  varchar(max),UNITPRICE  varchar(max),UNITCOST  varchar(max),TOTALREVENUE  varchar(max),TOTALCOST  varchar(max),TOTALPROFIT  varchar(max))IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='sales' and xtype='U') CREATE TABLE sales (REGION  varchar(max),COUNTRY  varchar(max),ITEMTYPE  varchar(max),SALESCHANNEL  varchar(max),ORDERPRIORITY  varchar(max),ORDERDATE  varchar(max),ORDERID  varchar(max),SHIPDATE  varchar(max),UNITSSOLD  varchar(max),UNITPRICE  varchar(max),UNITCOST  varchar(max),TOTALREVENUE  varchar(max),TOTALCOST  varchar(max),TOTALPROFIT  varchar(max))"

cursor.execute(create_table_query)

conn.commit()

 

#Insert Data From Main DataFrame

for rows in df_combined.itertuples(index=False,name=None):

    row = str(list(rows))

    row_data = row[1:-1]

    row_data = row_data.replace("nan","''")

    row_data = row_data.replace("None","''")

insert_query = "insert into sales (REGION,COUNTRY,ITEMTYPE,SALESCHANNEL,ORDERPRIORITY,ORDERDATE,ORDERID,SHIPDATE,UNITSSOLD,UNITPRICE,UNITCOST,TOTALREVENUE,TOTALCOST,TOTALPROFIT) values ("+row_data+")"

    print(insert_query)

    cursor.execute(insert_query)

conn.commit()

Code 5

As, Shown here The data from all the files is loaded to the SQL server Table

Image 10

 

Azure Databricks notebook can be used to process stream data in Azure Data Pipeline. Here are the steps to accomplish this:

1. Create a Databricks notebook: First, create a Databricks notebook in Azure Databricks. A notebook is a web-based interface for working with code and data.

2. Create a job: Next, create a job in Azure Data Factory to execute the notebook. A job is a collection of tasks that can be scheduled and run automatically.

3. Configure the job: In the job settings, specify the Azure Databricks cluster and notebook that you want to use. Also, specify the input and output datasets.

4. Write the code: In the Databricks notebook, write the code to process the stream data. Here is an example code:

#from pyspark.sql.functions import window

stream_data = spark.readStream \

    .format("csv") \

    .option("header", "true") \

    .schema("<schema>") \

    .load("/mnt/<mount-name>/<file-name>.csv")

 

stream_data = stream_data \

    .withWatermark("timestamp", "10 minutes") \

    .groupBy(window("timestamp", "10

Code 6
 

How To Use Azure Databrick notebook in Azure Data Factory pipeline and configure the DataFlow Pipeline Using it.

Image 11

1. Create ADF Pipeline 

Image 12

2. Configure Data Pipeline 

Image 13

3. Add Trigger To the PipeLine 

Image 14

4. Configure the trigger 

Image 15

 

These capabilities make Azure Databricks an ideal platform for building real-time data processing solutions.

Overall, Azure Databricks provides a scalable and flexible solution for data processing and analytics, and it's definitely worth exploring if you're working with big data on the Azure platform. With its powerful tools and easy-to-use interface, Azure Databricks is a valuable addition to any data analytics toolkit.

MySQL Federated Engine Data Migration
Mar 13, 2024

Scenario: If someone say you Hey, can you transfer one of MySQL data to another MySQL data and we think about SSIS or other Thing if yes then these article made for you to reduce your effort and save your time Introduction: In the dynamic landscape of database management, the need to seamlessly access and integrate data from multiple sources has become paramount. Whether it's consolidating information from disparate servers or synchronizing databases for backup and redundancy, MySQL offers a robust solution through its querying capabilities. In this guide, we delve into the art of fetching data from one MySQL server to another using SQL queries. This method, often overlooked in favor of complex data transfer mechanisms, provides a streamlined approach to data migration, enabling developers and database administrators to efficiently manage their resources. Through a combination of MySQL's versatile querying language and the innovative use of the FEDERATED storage engine, we'll explore how to establish connections between servers, replicate table structures, and effortlessly transfer data across the network. From setting up the environment to executing queries and troubleshooting common challenges, this tutorial equips you with the knowledge and tools to navigate the intricacies of cross-server data retrieval with ease. As we know We gonna use FEDERATED feature of MySQL workbench so first we need to check that our workbench support FEDERATED engine or not?   Simply open workbench and run below code show engines;   It shows all engines and check our system support FEDERATED OR NOT   If your system also not support don't worry we gonna enable it Open your folder where you save MySQL serve file In my case it in my C drive C>ProgramData>MySQL>MySQL Server 8.0>my.ini    open it in notepad++ or preferable software    Insert FEDERATED key word in script like below   Now need to restart MySQL Press Window+R button and paste services.msc press ok> find MySQL and restart it Now go to workbence and run show engines;  code   Now your FEDERATED engine get supported It show like below   Now our system Support FEDERATED engine This same process need to apply on destination side because both server (from source to destination server) need to support FEDERATED engine Now we make sure to we have permission of access source server for that we need to make user and and give permission of database and tables   Below code demonstrate to make user and give permission to user CREATE USER 'hmysql'@'192.168.1.173' IDENTIFIED BY 'Hardik...'; GRANT ALL PRIVILEGES ON *.* TO 'hmysql'@'192.168.1.173' WITH GRANT OPTION; FLUSH PRIVILEGES;   Now make connection of that user(we make above on source side) on destination server(our system)    Click on plus(+) icon as shown in image and fill all detail   Below image is for detail of user connection   After filling details our user added like below image   Go to user(hardikmysql) and find from which table we want to take data using MySQL query    Here i am taking 'actor' table from 'sakila' database which look like below   Now we need to run FEDERATED query on our system(destination server) with url string   Our MySQL query like below CREATE TABLE `actor` ( `actor_id` smallint unsigned NOT NULL AUTO_INCREMENT, `first_name` varchar(45) NOT NULL, `last_name` varchar(45) NOT NULL, `last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`actor_id`), KEY `idx_actor_last_name` (`last_name`) ) ENGINE=FEDERATED default charset=utf8mb4 CONNECTION='mysql://hmysql:[email protected]:3306/sakila/actor';   Here main part is below ENGINE=FEDERATED default charset=utf8mb4 CONNECTION='mysql://hmysql:[email protected]:3306/sakila/actor';   Here 'mysql' is mandatory for connection string you can not use other word. 'hmysql' is user name 'Hardik...'  is password for user '192.168.1.173' is server adderess '3306' is port number 'sakila' is database name 'actor' is table name   Now run above table code and you get data in our system(destination server)    

Quick Setup Guide: SQL Server Replication
Jan 12, 2024

Setting up replication in SQL Server can be a powerful way to ensure data consistency and availability across multiple servers. In this step-by-step guide, we'll walk through the process of configuring replication on SQL Servers.   Step 1: Understand Replication Types Before diving into configuration, it's crucial to understand the types of replication available in SQL Server.  Snapshot Replication: Takes a snapshot of the data at a specific point in time. Transactional Replication: Replicates changes in real-time as they occur. Merge Replication: Allows bidirectional data synchronization between servers. Choose the replication type that aligns with your specific needs and database architecture.   Step 2: Prepare Your Environment Ensure that your SQL Server environment is ready for replication. This involves verifying that you have the necessary permissions and establishing proper connectivity between the SQL Server instances. Remember that replication involves three key components: Publisher, Distributor, and Subscribers. The Distributor can be on the same server as the Publisher or a separate server.   Step 3: Configure Distributor If a Distributor isn't already set up, proceed to configure one. This involves specifying the server that will act as the Distributor and setting up distribution databases. Use either SQL Server Management Studio (SSMS) or T-SQL scripts for this configuration.   Step 4: Enable Replication on the Publisher 1. Open SSMS and connect to the Publisher. 2. Right-click on the target database and choose "Tasks" > "Replication" > "Configure Distribution." 3. Follow the wizard, specifying the Distributor configured in Step 3.   Step 5: Choose Articles Define the articles by selecting the tables, views, or stored procedures you want to replicate. This step allows you to fine-tune your replication by specifying data filters, choosing columns to replicate, and configuring additional options based on your specific requirements.   Step 6: Configure Subscribers 1. Connect to the Subscribers in SSMS. 2. Right-click on the Replication folder and choose "Configure Distribution." 3. Follow the wizard, specifying the Distributor and configuring additional settings based on your chosen replication type.   Step 7: Configure Subscription With the Distributor and Subscribers configured, it's time to set up subscriptions. 1. In SSMS, navigate to the Replication folder on the Publisher. 2. Right-click on the Local Publications and choose "New Subscriptions." 3. Follow the wizard to configure the subscription, specifying the Subscribers and defining any additional settings.   Step 8: Monitor and Maintain Regular monitoring and maintenance are essential for a healthy replication environment. - Use the Replication Monitor in SSMS to view the status of publications, subscriptions, and any potential errors. - Implement routine maintenance tasks such as backing up and restoring the replication databases.   Conclusion Configuring replication in SQL Server involves a series of well-defined steps. By understanding your replication needs, preparing your environment, and carefully configuring each component, you can establish a robust and reliable replication setup. Regular monitoring and maintenance ensure the ongoing efficiency and performance of your replication environment.

BI ChatBot in Domo: Step-by-Step Guide
Jan 05, 2024

In the ever-evolving landscape of business intelligence (BI), the need for seamless interaction with data is paramount. Imagine a world where you could effortlessly pose natural language questions to your datasets and receive insightful answers in return. Welcome to the future of BI, where the power of conversational interfaces meets the robust capabilities of Domo. This blog post serves as your comprehensive guide to implementing a BI ChatBot within the Domo platform, a revolutionary step towards making data exploration and analysis more intuitive and accessible than ever before. Gone are the days of wrestling with complex queries or navigating through intricate dashboards. With the BI ChatBot in Domo, users can now simply articulate their questions in plain language and navigate through datasets with unprecedented ease. Join us on this journey as we break down the process into manageable steps, allowing you to harness the full potential of BI ChatBot integration within the Domo ecosystem. Whether you're a seasoned data analyst or a business professional seeking data-driven insights, this guide will empower you to unlock the true value of your data through natural language interactions. Get ready to elevate your BI experience and transform the way you interact with your datasets. Let's dive into the future of business intelligence with the implementation of a BI ChatBot in Domo.   Prerequisites: ChatGPT API Key: Prepare for the integration of natural language to SQL conversion by obtaining a ChatGPT API Key. This key will empower your system to seamlessly translate user queries in natural language into SQL commands. DOMO Access: Ensure that you have the necessary access rights to create a new application within the Domo platform. This step is crucial for configuring and deploying the BI ChatBot effectively within your Domo environment.   1: Integrate the HTML Easy Bricks App. Begin the process by incorporating the HTML Easy Bricks App into your project. Navigate to the AppStore and add the HTML Easy Bricks to your collection. Save it to your dashboard for easy access. Upon opening the App for the first time, it will have a default appearance. To enhance its visual appeal and functionality, customize it by incorporating the HTML and CSS code. This transformation will result in the refined look illustrated below.   Image 1: DOMO HTML Easy Brick UI   2: Map/Connect the Dataset to the Card. In this phase, establish a connection between the dataset and the card where users will pose their inquiries. Refer to the image below, where the "Key" dataset is linked to "dataset0." Extend this mapping to accommodate up to three datasets. If your project involves more datasets, consider using the DDX-TEN-DATASETS App instead of HTML Easy Bricks for a more scalable solution. This ensures seamless integration and accessibility for users interacting with various datasets within your Domo environment.   Image 2: Attach Dataset With Card   3: Execute the Query on the Dataset for Results. In this phase, you'll implement the code to execute a query on the dataset, fetching the desired results. Before this, initiate a call to the ChatGPT API to dynamically generate an SQL query based on the user's natural language question. It's essential to note that the below code is designed to only accept valid column names in the query, adhering strictly to MySQL syntax. To facilitate accurate query generation from ChatGPT, create a prompt that includes the dataset schema and provides clear guidance for obtaining precise SQL queries. Here is a call to the ChatGPT API to get SQL Query. VAR GPTKEY = 'key' VAR Prompt = 'Write effective prompt' $.ajax({             url: 'https://api.openai.com/v1/chat/completions',             headers: {               'Authorization': 'Bearer ' + GPTKEY,               'Content-Type': 'application/json'             },             method: 'POST',             data: JSON.stringify({               model: 'gpt-3.5-turbo',               messages: Prompt,               max_tokens: 100,               temperature: 0.5,               top_p: 1.0,               frequency_penalty: 0.0,               presence_penalty: 0.0             }),             success: function (response) {                   //Write code to store the Query into the variable            } });   Refer to the code snippet below for executing the query on Domo and retrieving the results. var domo = window.domo; var datasets = window.datasets; domo.post('/sql/v1/'+ 'dataset0', SQLQuery, {contentType: 'text/plain'}).then(function(data) {   //Write your Java or JQuery code to print data. });   The above code will accept the SQL queries generated by ChatGPT. It's important to highlight that, in the code, there is a hardcoded specification that every query will be applied to the dataset mapped as 'dataset0'. It's advisable to customize this part based on user selection. The code is designed to accept datasets with names such as 'dataset0', 'dataset1', and so forth. Ensure that any modifications align with the chosen dataset for optimal functionality, you can also use the domo.get method to get data for more information visit here. The outcome will be presented in JSON format, offering flexibility for further processing. You can seamlessly transfer this data to a table format and display or print it as needed.   Conclusion Incorporating a BI ChatBot in Domo revolutionizes data interaction, seamlessly translating natural language queries into actionable insights. The guide's step-by-step approach simplifies integration, offering both analysts and business professionals an intuitive and accessible data exploration experience. As datasets effortlessly respond to user inquiries, this transformative synergy between ChatGPT and Domo reshapes how we extract value from data, heralding a future of conversational and insightful business intelligence. Dive into this dynamic integration to propel your decision-making processes into a new era of efficiency and accessibility.

Hardik Dangar

About the Author

Hardik Dangar

Project Lead in Magnusminds

Hardik is working as Project Lead of MSBI in INDIA. Hardik started his career working on SQL Server and MSBI. Hardik is having 5+ years of experience. In the starting of his career he was working on SQL Server, SSIS and SSRS. Hardik likes to explore technical things on SQL Server.