Data has always been a critical resource for businesses to improve their processes and outperform their competitors. However, to maximize the value of information, organizations must use and store data properly. The achievement of this goal requires powerful solutions. Thus, many businesses that have huge amounts of information and use the ETL process rely on AWS Glue.
In this article, we explain what AWS Glue is, how it works, and when it makes sense to use it.
What is Glue in AWS?
So what is AWS Glue? It is a serverless ETL service that is integrated across a wide range of Amazon services to prepare and load information easily from multiple sources. Data engineers and developers can use the service to create, run, and monitor ETL jobs with high efficiency and ease.
AWS Glue allows users to search for both structured and semi-structured data in Amazon S3 or other sources and gives them a 360-degree view of their assets. On top of that, the service provides customization, orchestration, and monitoring of complex jobs.
The core components
The service relies on the interaction of various components that work together to help you design and maintain ETL processes. Let’s look at these components in more detail:
- AWS Glue Console: allows users to create, view, and manage ETL jobs;
- Data Catalog: metadata storage that allows you to create data queries and transformations, and includes crawlers and tables;
- Job: the business logic that performs ETL tasks;
- Job Scheduling System: a managed scheduling system for running ETL jobs;
- Trigger: executes ETL jobs on demand or at a set time;
- Classifier: a program that determines the schema of the data. Classifiers exist for CSV, JSON, and many other file types, as well as for relational database management systems;
- Crawler: a program that explores data repositories to create metadata tables;
- Data Target: the location where a job stores the transformed information;
- Database: allows users to create and access database sources and targets;
- Development endpoint: provides a development environment in Apache Spark for testing, developing, and debugging ETL job scripts.
Now that we’ve figured out the main components, it’s time to put them together and see how the service operates.
How it works
We’ve mentioned ETL a few times, so let’s clarify what exactly it is before moving forward. When processing their information, companies can choose either an ELT or an ETL approach.
ELT stands for Extract-Load-Transform, which means the data is extracted from the source, loaded into the database in a raw format, and transformed only afterward. ETL (Extract-Transform-Load), by contrast, means that the information is extracted, transformed, and only then loaded into storage. Glue follows the ETL approach, so it’s important to keep that in mind.
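To make the difference in ordering concrete, here is a minimal sketch in plain Python. It does not use Glue at all: the `extract` and `transform` functions, the sample rows, and the in-memory lists standing in for storage are all assumptions made for illustration.

```python
Row = dict[str, str]


def extract() -> list[Row]:
    """Stand-in for reading rows from a source system (hypothetical data)."""
    return [
        {"name": " Alice ", "country": "us"},
        {"name": "Bob", "country": "DE "},
    ]


def transform(rows: list[Row]) -> list[Row]:
    """Trim whitespace and normalize country codes to upper case."""
    return [
        {"name": r["name"].strip(), "country": r["country"].strip().upper()}
        for r in rows
    ]


def run_etl(storage: list[Row]) -> None:
    """ETL: transform first, so only cleaned rows ever reach storage."""
    storage.extend(transform(extract()))


def run_elt(storage: list[Row]) -> None:
    """ELT: load raw rows first, then transform them inside the storage."""
    storage.extend(extract())        # raw data lands first
    storage[:] = transform(storage)  # transformation happens afterward


etl_store: list[Row] = []
elt_store: list[Row] = []
run_etl(etl_store)
run_elt(elt_store)
print(etl_store == elt_store)  # True: same result, different order of steps
```

Both functions end with the same cleaned rows. The practical difference is where the raw data lives in the meantime: with ETL it never reaches the target storage, while with ELT it does.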
Now, back to the topic.
The service uses cloud services provided by Amazon to orchestrate ETL jobs and to construct data warehouses and data lakes. It then brings these services together in one management interface, the AWS Glue Console. The console allows users to create and monitor ETL jobs.
AWS Glue can also automatically detect and catalog the data with the Data Catalog. To put the metadata together, the service uses Crawlers. Crawlers and classifiers scan raw data to extract needed attributes and schema information.
Once the data is cataloged, it instantly becomes searchable, queryable, and ready for the ETL process. Users may schedule ETL jobs or select events that trigger a job by using the Job Scheduling System. Glue then extracts the information, transforms it using code generated in Scala or Python, and loads it into Amazon S3 or Amazon Redshift. The script runs in an Apache Spark environment.
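The cataloging step described above can be illustrated with a small, self-contained sketch. It is not Glue's crawler or classifier code: the functions `classify_value`, `merge_types`, and `crawl_csv`, the sample CSV, and the three-type scheme are assumptions chosen to show the idea of scanning raw data and recording a schema as metadata.

```python
import csv
import io


def classify_value(value: str) -> str:
    """Guess a column type for one cell: bigint, double, or string."""
    try:
        int(value)
        return "bigint"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def merge_types(a: str, b: str) -> str:
    """Widen two guesses to the narrowest type that fits both."""
    if a == b:
        return a
    if {a, b} == {"bigint", "double"}:
        return "double"
    return "string"


def crawl_csv(text: str, table_name: str) -> dict:
    """Scan CSV text and return a catalog-style table description."""
    reader = csv.DictReader(io.StringIO(text))
    schema: dict[str, str] = {}
    for row in reader:
        for column, value in row.items():
            guess = classify_value(value)
            schema[column] = merge_types(schema.get(column, guess), guess)
    return {"name": table_name, "columns": schema}


# Hypothetical sample data standing in for a file in a data store.
sample = "id,price,city\n1,9.99,Berlin\n2,12,Paris\n"
table = crawl_csv(sample, "orders")
print(table)
```

The returned dictionary plays the role of a metadata table: it describes the data without holding it, which is what makes the data searchable before any ETL job runs.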
AWS Glue: main features
The main features of the service can be divided into three key categories based on their functions:
- Organize and investigate the data;
- Prepare, clean, and transform the data for analysis;
- Build and monitor jobs.
As you can see, AWS Glue has the features one may need to work with the information end to end and gain the needed insights from it. Now let’s look at its most interesting features:
- Automated schema discovery: the service enables developers to automate crawlers that collect schema-related data and store it in a catalog;
- Drag-and-drop interface: a drag-and-drop job editor allows users to easily set up their ETL process, and AWS Glue will automatically generate code to extract, transform, and load the data;
- Job scheduling: ETL jobs can be run on demand, on a schedule, or based on an event. Schedulers can also be used to build sophisticated ETL pipelines with dependencies between tasks;
- Automated machine learning: there is a built-in feature called “FindMatches”. The feature detects imperfect copies of records and deduplicates them;
- Integrated data catalog: combines data from disparate sources into one single repository;
- Automatic code generation: Glue automatically generates scripts based on your input data to extract, transform, and load it. You can also use ETL libraries to write your own scripts in Python and Scala, edit existing scripts, and import scripts from external sources;
- Development endpoints: the service offers development endpoints for editing, debugging, and testing ETL code. With interactive sessions, developers can explore and prepare the information from an IDE or a notebook.
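To give a feel for what "detecting imperfect copies of records" means, here is a rough sketch using Python's standard `difflib` module. This is explicitly not the FindMatches algorithm, which is a trained machine learning transform. The `similarity` and `deduplicate` functions, the 0.85 threshold, and the sample records are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate(records: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a record only if it is not too similar to one already kept."""
    kept: list[str] = []
    for record in records:
        if all(similarity(record, seen) < threshold for seen in kept):
            kept.append(record)
    return kept


# Hypothetical customer records; the first two describe the same person.
customers = [
    "Jane Doe, 12 Main St",
    "jane doe, 12 Main Street",
    "John Smith, 4 Oak Ave",
]
print(deduplicate(customers))
# ['Jane Doe, 12 Main St', 'John Smith, 4 Oak Ave']
```

A fixed string-similarity threshold like this is brittle on real data, which is one reason a learned matching approach is attractive for large, messy datasets.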
AWS Glue: when should it be used?
As mentioned earlier, organizations use Glue to run ETL jobs on Apache Spark-based serverless platforms. Usually, AWS Glue is used to:
- Explore available data and connect to a variety of sources;
- Manage data in a centralized catalog;
- Create, run, and monitor ETL pipelines to load the data into data lakes;
- Prepare data for analysis.
Is Glue suitable for all businesses? Take a look at the following: according to Enlyft, based on company revenue, 43% of companies using AWS Glue are small (under $50 million), 8% are medium-size, and 42% are large (over $1 billion). In other words, company size is not the deciding factor.
A company that already runs its applications on AWS should definitely consider Glue. Even then, it will need considerable resources, in particular DevOps engineers, and its developers will need a broad range of technical skills to implement and operate the service.
However, if a company does not use AWS cloud at all, it’s a bit less clear-cut. If a company is willing to invest time and resources in building a data lake from scratch, choosing AWS cloud services may make sense.
Use of AWS Glue: pros and cons
To decide whether to use AWS Glue or not, it is crucial to look at the strengths and weaknesses of the service.
The pros of using AWS Glue include:
- Maintenance and deployment: the service is serverless, which makes maintenance and deployment easy since it is managed by AWS;
- Automatic ETL code: AWS Glue automatically generates ETL code in Scala or Python to streamline data integration operations and also enables you to handle heavy workloads;
- Cost-effective solution: pay only for the resources you use during the job running process;
- Job scheduling: provides easy-to-use tools to create and monitor jobs that run on a schedule, in response to events, or on demand;
- Data visibility: by using a metadata repository for data, the AWS Glue Data Catalog helps a company keep tabs on all its informational assets;
- Support: Glue easily integrates with many AWS services and supports the data stored in Amazon Redshift, Amazon S3, Amazon MSK, and others.
As for the cons and limitations, they are:
- Limited integration: AWS Glue is built to work with AWS services, so integrating it with platforms outside of Amazon may be difficult;
- Limited database support: it does not support traditional relational database queries; only SQL-type queries are supported;
- Lack of testing environment: developers have to test their code on real data because Glue does not provide a testing environment. Hence, the process becomes tedious and time-consuming;
- Need for a specific skill set: AWS Glue runs on Apache Spark. As a result, to modify ETL scripts, developers must know Spark, Scala and Python;
- Not suited to real-time operations: when using Glue, all data is first staged on S3. As a result, incremental sync with the data source is not possible.
Now, being aware of the capabilities of this platform, enterprises may worry about the cost of this solution for their business.
AWS Glue pricing
In addition to the question “what is AWS Glue”, businesses also want to know: how much is AWS Glue?
The good news is that with AWS Glue, you only pay for the time it takes to run your ETL jobs. You don’t have to manage resources or pay for startup or shutdown times. AWS charges you based on the number of data processing units (DPUs) that are used in your ETL job. Each AWS Glue component is also priced separately:
- ETL jobs and interactive sessions: $0.44 per DPU-hour;
- Data Catalog: $1.00 for every 100,000 stored objects per month, and likewise $1.00 for every 1,000,000 requests per month;
- Crawlers: $0.44 per DPU-hour, with a minimum of 10 minutes billed per crawl;
- DataBrew interactive sessions: the first 40 sessions are free, then it’s $1.00 per session;
- DataBrew jobs: you are billed for the time it takes to clean the data, at $0.48 per DataBrew node per hour.
It is important to note that pricing may also vary by region. In addition, bear in mind that this is only a short list of the possible costs. You can find more information on the vendor pricing page about features, pricing options, and types of AWS Glue jobs.
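As a worked example of DPU-based billing, the sketch below estimates costs from the rates quoted in this article. The function names `job_cost` and `crawler_cost` are hypothetical, and the calculation deliberately ignores regional differences and any billing rules not mentioned here, so treat it as rough arithmetic rather than a quote.

```python
DPU_HOUR_RATE = 0.44      # USD per DPU-hour, the rate quoted in this article
CRAWLER_MIN_MINUTES = 10  # minimum billed duration per crawl, as quoted


def job_cost(dpus: int, minutes: float, rate: float = DPU_HOUR_RATE) -> float:
    """Estimated cost of one ETL job run: DPUs x hours x rate."""
    return round(dpus * (minutes / 60) * rate, 2)


def crawler_cost(dpus: int, minutes: float, rate: float = DPU_HOUR_RATE) -> float:
    """Estimated cost of one crawl, applying the 10-minute minimum."""
    billed_minutes = max(minutes, CRAWLER_MIN_MINUTES)
    return round(dpus * (billed_minutes / 60) * rate, 2)


# A 30-minute job on 10 DPUs: 10 * 0.5 h * $0.44 = $2.20
print(job_cost(dpus=10, minutes=30))    # 2.2

# A 3-minute crawl on 2 DPUs is billed as 10 minutes:
# 2 * (10 / 60) h * $0.44 = about $0.15
print(crawler_cost(dpus=2, minutes=3))  # 0.15
```

The crawler example shows why many very short crawls can cost more than expected: each one is rounded up to the minimum billed duration.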
Use cases of AWS Glue
There are several common use cases for Glue listed below:
- Integration with Amazon Athena: Athena is a serverless interactive analytics service that can query the databases and tables defined in the AWS Glue Data Catalog;
- Integration with Amazon S3: you can use it to ingest, clean, transform, and structure your information;
- Integration with Snowflake: users can manage their data integration process programmatically without having to maintain servers or Spark clusters themselves;
- Integration with GitHub: allows you to store and version ETL code on GitHub;
- Creating an event-driven ETL pipeline: with AWS Lambda, you can trigger an ETL job when new data is added to Amazon S3.
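The event-driven pipeline from the last item can be sketched as a Lambda handler that starts a Glue job for each new S3 object. The `start_job_run` call is part of the boto3 Glue client, and the event fields follow the S3 notification format. The job name `example-etl-job`, the argument names `--source_bucket` and `--source_key`, and the optional `glue_client` parameter (added so the handler can be exercised without AWS access) are assumptions made for this example.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "example-etl-job"  # hypothetical job name, replace with your own


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run per S3 object listed in the event."""
    if glue_client is None:
        # Imported lazily so the handler can be tested without boto3 or AWS.
        import boto3

        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Argument names are assumptions; the job script must read them.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"started_runs": run_ids}
```

In a real deployment, the Lambda function would also need an S3 event notification configured on the bucket and an IAM role that is allowed to start the Glue job. Neither is shown here.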
AWS Glue helps companies extract, transform, load, and move their data reliably between multiple sources. Many developers and IT experts have been able to reduce the complexity and manual labor related to ETL thanks to the service. But even though AWS Glue offers a variety of benefits, a business should keep in mind that it still has limitations that should be carefully considered before implementation. We hope our article answered the “what is AWS Glue” question in detail and that it will help you make a well-informed decision.