In today's data-driven world, efficient storage and management of data are paramount for businesses of all sizes. With the number of data sources growing and the demand for analytics increasing rapidly, a reliable and scalable data management solution is essential. Amazon Web Services (AWS) offers a robust set of tools for data storage, including the Simple Storage Service (S3), a highly durable and scalable object storage solution, and AWS Glue, a fully managed extract, transform, and load (ETL) service.
In this blog, we'll walk through the process of storing and archiving data using an AWS S3 bucket and an AWS Glue script. We'll explore the benefits of this approach and provide a step-by-step guide to help you set up your data storage and archiving solution.
Let's learn how to create the S3 bucket and the AWS Glue script.
Step 1) Create an AWS S3 Bucket: Create an S3 bucket in the AWS Management Console. Choose a unique bucket name, select the appropriate Region and access settings for your requirements, and configure the necessary permissions. This bucket will be used to store the archived data.
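If you prefer to create the bucket from code instead of the console, here is a minimal sketch using boto3; the bucket name and Region below are placeholders, not values from this blog:

import boto3

# Hypothetical bucket name and Region -- replace with your own values
s3 = boto3.client("s3", region_name="ap-south-1")
s3.create_bucket(
    Bucket="my-archive-bucket-example",
    CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
)

# Keep the archived data private by blocking public access
s3.put_public_access_block(
    Bucket="my-archive-bucket-example",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)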
Step 2) Configure Lifecycle Policies: Once the bucket is created, we can manage the lifecycle and replication policies of the stored objects. Open the bucket, go to the Management tab, and configure the lifecycle and replication rules as required. For example, we can define rules to transition objects to different storage classes or delete them after a certain period.
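The same kind of rule can also be applied programmatically. The sketch below uses boto3 to transition objects under an "archive/" prefix to Glacier after 90 days and delete them after 365 days; the bucket name, prefix, and time periods are assumptions, so adjust them to your own retention policy:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket-example",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)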
Step 3) Develop an AWS Glue Script: Now that we have a storage location for the archived data, the next step is to develop an AWS Glue script to perform the necessary ETL operations on your data. This may include extracting data from various sources, transforming it into the desired format, and loading it into your S3 bucket. AWS Glue supports Python as the scripting language for defining ETL jobs, making it flexible and easy to use for developers and data engineers.
Here's a detailed breakdown of how to develop an AWS Glue script:
We just have to configure the source, the transformation (if needed), and the destination with the proper connection details. For example, for a relational database source we have to provide a JDBC connection to the server or a Data Catalog table. Once the connection is successful, we can enter the schema and object name and preview the data in AWS Glue.
After successfully configuring the source and destination, AWS Glue automatically generates the ETL script, which we can review in the Script tab.
The structure of the generated script looks like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from py4j.java_gateway import java_import

# Initialize the Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the JDBC connection details from the Glue connection
source_jdbc_conf = glueContext.extract_jdbc_conf("ConnectionName")

# Import the JDBC classes needed to call a stored procedure directly
java_import(sc._gateway.jvm, "java.sql.Connection")
java_import(sc._gateway.jvm, "java.sql.DatabaseMetaData")
java_import(sc._gateway.jvm, "java.sql.DriverManager")
java_import(sc._gateway.jvm, "java.sql.SQLException")

# Open a JDBC connection and execute the stored procedure that prepares the data
# (bind the "?" parameter before executing if the procedure expects one)
conn = sc._gateway.jvm.DriverManager.getConnection(
    source_jdbc_conf.get("url") + ";databaseName=DB_NAME",
    source_jdbc_conf.get("user"),
    source_jdbc_conf.get("password"),
)
cstmt = conn.prepareCall("{call dbo.sptoGetthedatandtransferintotable(?)}")
results = cstmt.execute()

# Script generated for node SQL Server table
SQLServertable_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "Data_Table",
        "connectionName": "Connection_name",
    },
    transformation_ctx="SQLServertable_node1",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=SQLServertable_node1,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://s3_newcreatedbucket/"},
    format_options={"compression": "snappy"},
    transformation_ctx="S3bucket_node3",
)

conn.close()
job.commit()
Step 4) Version Control: AWS Glue also supports version control of the job script through Git integration, so we can track and manage changes to the script.
Step 5) Run the Glue Job: Once you're satisfied with the script's functionality, you can run the Glue job on demand or schedule it to run at specific intervals. AWS Glue will execute the script, extract data from the defined sources, perform transformations, and load the transformed data into the specified S3 bucket.
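As an example, the job can be started from code or attached to a schedule. Here is a minimal boto3 sketch; the job name, trigger name, and cron expression are placeholders for illustration:

import boto3

glue = boto3.client("glue")

# Run the job on demand
run = glue.start_job_run(JobName="archive-to-s3-job")  # hypothetical job name
print(run["JobRunId"])

# Or schedule it to run every day at 01:00 UTC
glue.create_trigger(
    Name="daily-archive-trigger",  # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 1 * * ? *)",
    Actions=[{"JobName": "archive-to-s3-job"}],
    StartOnCreation=True,
)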
Step 6) Monitor Job Execution: Monitor the job execution in the AWS Glue console or via Amazon CloudWatch. You can track metrics such as job run time, success/failure status, and resource utilization to ensure that your ETL processes are running smoothly.
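If you want to check job status from code as well, a small sketch with boto3 (reusing the hypothetical job name from the previous step) could look like this:

import boto3

glue = boto3.client("glue")

# Print the ID, state, and run time of the most recent runs of the job
for job_run in glue.get_job_runs(JobName="archive-to-s3-job", MaxResults=5)["JobRuns"]:
    print(job_run["Id"], job_run["JobRunState"], job_run.get("ExecutionTime"))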
After following these steps, you should be able to efficiently store and archive data in an S3 bucket using an AWS Glue script. Before wrapping up, let's recap the benefits of using AWS S3 and AWS Glue.
To summarize, by leveraging AWS S3 and AWS Glue you can build a data storage and archiving solution that is scalable, durable, and cost-effective. Whether you're dealing with large volumes of data or need to automate the process of archiving historical data, AWS provides the tools and services you need to streamline your data management workflows. Start exploring the possibilities today and unlock the full potential of your data with AWS.
Thank you for reading. I hope this blog was helpful and that you found what you were looking for. Best of luck!
Working as a Software Engineer at MagnusMinds IT Solutions. Skilled in SQL, AWS services, Python, and Power BI. I love working with data and am always eager to learn new technologies that improve my technical skills and project outcomes.