In today's data-driven world, efficient storage and management of data are paramount for businesses of all sizes. With the number of data sources growing and the demand for analytics increasing rapidly, a reliable and scalable data management solution is essential. Amazon Web Services (AWS) offers a robust set of tools for data storage, including the Simple Storage Service (S3), a highly durable and scalable object storage solution, and AWS Glue, a fully managed extract, transform, and load (ETL) service.
In this blog, we'll walk through the process of storing and archiving data using an AWS S3 bucket and an AWS Glue script. We'll explore the benefits of this approach and provide a step-by-step guide to help you set up your data storage and archiving solution.
Let's learn how to create the S3 bucket and the AWS Glue script.
Step 1) Create an AWS S3 Bucket: Create an S3 bucket in the AWS Management Console. Choose a unique bucket name, select the appropriate Region and access settings for your requirements, and configure the necessary permissions. This bucket will be used to store the archived data.
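If you prefer to create the bucket from code instead of the console, here is a minimal sketch using boto3; the bucket name and Region below are placeholders, not values from this blog:

import boto3

# Hypothetical bucket name and Region -- replace with your own values
s3 = boto3.client("s3", region_name="ap-south-1")
s3.create_bucket(
    Bucket="my-archive-bucket-example",
    CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
)

# Keep the archived data private by blocking public access
s3.put_public_access_block(
    Bucket="my-archive-bucket-example",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)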
Step 2) Configure Lifecycle Policies: Once the bucket is created, we can manage the lifecycle and replication policies of the stored objects. Open the bucket, go to the Management tab, and configure the lifecycle and replication rules as required. For example, we can define rules to transition objects to different storage classes or delete them after a certain period.
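The same kind of rule can also be applied programmatically. The sketch below uses boto3 to transition objects under an "archive/" prefix to Glacier after 90 days and delete them after 365 days; the bucket name, prefix, and time periods are assumptions, so adjust them to your own retention policy:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket-example",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)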
Step 3) Develop an AWS Glue Script: Now that we have a storage location for the archived data, the next step is to develop an AWS Glue script to perform the necessary ETL operations on your data. This may include extracting data from various sources, transforming it into the desired format, and loading it into your S3 bucket. AWS Glue supports Python as the scripting language for defining ETL jobs, making it flexible and easy to use for developers and data engineers.
Here's a detailed breakdown of how to develop an AWS Glue script:
We just have to configure the source, the transformation (if needed), and the destination with the proper connection details. For example, for a relational database source we have to provide a JDBC connection to the server or a Data Catalog table. Once the connection is successful, we can enter the schema and object name and preview the data in AWS Glue.
After successfully configuring the source and destination, AWS Glue automatically generates the ETL script, which we can review in the Script tab.
The structure of the generated script looks like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import

# Initialize the Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the JDBC connection details from the Glue connection
source_jdbc_conf = glueContext.extract_jdbc_conf("ConnectionName")

# Import the JDBC classes needed to call a stored procedure directly
java_import(sc._gateway.jvm, "java.sql.Connection")
java_import(sc._gateway.jvm, "java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm, "java.sql.DriverManager")
java_import(sc._gateway.jvm, "java.sql.SQLException")

# Open a JDBC connection and execute the stored procedure that prepares the data
# (bind the "?" parameter before executing if the procedure expects one)
conn = sc._gateway.jvm.DriverManager.getConnection(
    source_jdbc_conf.get("url") + ";databaseName=DB_NAME",
    source_jdbc_conf.get("user"),
    source_jdbc_conf.get("password"),
)
cstmt = conn.prepareCall("{call dbo.sptoGetthedatandtransferintotable(?)}")
results = cstmt.execute()

# Script generated for node SQL Server table
SQLServertable_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "Data_Table",
        "connectionName": "Connection_name",
    },
    transformation_ctx="SQLServertable_node1",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=SQLServertable_node1,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://s3_newcreatedbucket/"},
    format_options={"compression": "snappy"},
    transformation_ctx="S3bucket_node3",
)

conn.close()
job.commit()
Step 4) Version Control: AWS Glue also supports version control of the job script through Git integration, so we can track and manage changes to the script.
Step 5) Run the Glue Job: Once you're satisfied with the script's functionality, you can run the Glue job on demand or schedule it to run at specific intervals. AWS Glue will execute the script, extract data from the defined sources, perform transformations, and load the transformed data into the specified S3 bucket.
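As an example, the job can be started from code or attached to a schedule. Here is a minimal boto3 sketch; the job name, trigger name, and cron expression are placeholders for illustration:

import boto3

glue = boto3.client("glue")

# Run the job on demand
run = glue.start_job_run(JobName="archive-to-s3-job")  # hypothetical job name
print(run["JobRunId"])

# Or schedule it to run every day at 01:00 UTC
glue.create_trigger(
    Name="daily-archive-trigger",  # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 1 * * ? *)",
    Actions=[{"JobName": "archive-to-s3-job"}],
    StartOnCreation=True,
)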
Step 6) Monitor Job Execution: Monitor the job execution in the AWS Glue console or via Amazon CloudWatch. You can track metrics such as job run time, success/failure status, and resource utilization to ensure that your ETL processes are running smoothly.
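If you want to check job status from code as well, a small sketch with boto3 (reusing the hypothetical job name from the previous step) could look like this:

import boto3

glue = boto3.client("glue")

# Print the ID, state, and run time of the most recent runs of the job
for job_run in glue.get_job_runs(JobName="archive-to-s3-job", MaxResults=5)["JobRuns"]:
    print(job_run["Id"], job_run["JobRunState"], job_run.get("ExecutionTime"))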
After following these steps, you should be able to efficiently store and archive data in an S3 bucket using an AWS Glue script. Before wrapping up, let's recap the benefits of using AWS S3 and AWS Glue.
To summarize, by leveraging AWS S3 and AWS Glue you can build a data storage and archiving solution that is scalable, durable, and cost-effective. Whether you're dealing with large volumes of data or need to automate the process of archiving historical data, AWS provides the tools and services you need to streamline your data management workflows. Start exploring the possibilities today and unlock the full potential of your data with AWS.
Thank you for reading. I hope this blog was helpful and that you found what you were looking for. Best of luck!
Working as a Software Engineer at MagnusMinds IT Solutions. Skilled in SQL, AWS services, Python, and Power BI. I love working with data and am always eager to learn new technologies that improve my technical skills and project outcomes.