In recent years, as the importance of big data has grown, efficient data processing and analysis have become crucial factors in a company’s competitiveness. AWS Glue, a serverless data integration service for integrating data across multiple data sources at scale, addresses these data processing needs. Among its features, the AWS Glue Jobs API stands out as a particularly noteworthy tool: a robust interface that lets data engineers and developers programmatically manage and run ETL jobs. With this API, you can automate, schedule, and monitor data pipelines, enabling efficient operation of large-scale data processing tasks. To improve the customer experience with the AWS Glue Jobs API, AWS has added a new property that describes the job mode, corresponding to script, visual, or notebook modes.
JobMode Property
The new JobMode property aims to enhance your user interface experience by categorizing AWS Glue jobs into script, visual, or notebook modes based on how they were developed. AWS Glue users can choose the mode that best fits their workflow. For instance, some ETL developers might prefer the visual mode to create jobs using the AWS Glue Studio visual editor, while data scientists might favor notebook jobs built with AWS Glue Studio notebooks. Data engineers and developers, on the other hand, might prefer implementing scripts through the AWS Glue Studio script editor or a preferred integrated development environment (IDE). Once a job is created in the preferred mode, it becomes easier to find by filtering on job mode within your saved AWS Glue jobs page. Additionally, if you need to migrate existing iPython notebook files to AWS Glue Studio notebook jobs, you can now set the job mode accordingly. This new API property streamlines managing and classifying your AWS Glue jobs based on your development preferences.
How CreateJob API Works with the New JobMode Property
Instantiate AWS Glue Client
To utilize the CreateJob API with the new JobMode property, the first step involves instantiating the AWS Glue client using the AWS SDK for Python, also known as Boto3. This sets the foundation for any interactions or operations involving AWS Glue. Essentially, this step is about preparing your environment to use the API efficiently. Setting up the client ensures that subsequent steps, like preparing job definitions and creating jobs, proceed without any issues.
Prepare Job Definition
In the second step, you need to prepare the job definition. This involves inserting the visual nodes in CODE_GEN_JSON_STR. These nodes represent the stages of the job, such as the S3 source, transformation, and S3 destination. Here, you define the exact actions your job will undertake: the data it will read, the transformations it will apply, and where the final data will be stored. By structuring these nodes accurately, you ensure that your job performs as intended. The visual nodes also provide a clear, visual representation of each step within the job, making development and debugging more intuitive.
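As an illustration, CODE_GEN_JSON_STR could be built along these lines. The node layout follows the shape of Glue's CodeGenConfigurationNodes structure; all names, S3 paths, and mappings below are hypothetical placeholders:

```python
import json

# Hypothetical three-node definition: S3 source -> transformation -> S3 destination.
CODE_GEN_JSON_STR = json.dumps({
    "node-1": {  # S3 source node
        "S3CsvSource": {
            "Name": "S3 source",
            "Paths": ["s3://amzn-s3-demo-bucket/input/"],  # placeholder path
            "Separator": "comma",
            "QuoteChar": "quote",
        }
    },
    "node-2": {  # transformation node
        "ApplyMapping": {
            "Name": "Change Schema",
            "Inputs": ["node-1"],
            "Mapping": [
                {"FromPath": ["id"], "FromType": "string", "ToKey": "id", "ToType": "int"}
            ],
        }
    },
    "node-3": {  # S3 destination node
        "S3DirectTarget": {
            "Name": "S3 target",
            "Inputs": ["node-2"],
            "Path": "s3://amzn-s3-demo-bucket/output/",  # placeholder path
            "Format": "json",
        }
    },
})
```

Each node's Inputs field references the node that feeds it, which is what produces the source-to-target flow described above.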
Create the Job
Now that the AWS Glue client is instantiated and the job definition prepared, you can proceed to create the job. Use the create_job function, setting the JobMode parameter to VISUAL, SCRIPT, or NOTEBOOK as required. This function call instantiates the job in AWS Glue based on the provided parameters and job definition. Your job is now created and will appear in the AWS Glue console or AWS Glue Studio interface. This automated job creation process not only saves time but also ensures consistency and accuracy, reducing the possibility of human error during job setup.
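A sketch of assembling that call; the job name, role ARN, and script location are assumptions, and in practice you would pass the CODE_GEN_JSON_STR definition prepared in the previous step:

```python
import json

def build_create_job_args(name, role_arn, script_location, code_gen_json_str, job_mode="VISUAL"):
    """Assemble CreateJob parameters; job_mode may be "VISUAL", "SCRIPT", or "NOTEBOOK"."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location, "PythonVersion": "3"},
        "GlueVersion": "4.0",
        "JobMode": job_mode,
        "CodeGenConfigurationNodes": json.loads(code_gen_json_str),
    }

# Placeholder single-node definition; use your full CODE_GEN_JSON_STR in practice.
nodes_json = '{"node-1": {"S3CsvSource": {"Name": "S3 source", "Paths": ["s3://amzn-s3-demo-bucket/input/"]}}}'
args = build_create_job_args(
    "my-visual-job",                                      # hypothetical job name
    "arn:aws:iam::123456789012:role/GlueRole",            # placeholder role ARN
    "s3://amzn-s3-demo-bucket/scripts/my-visual-job.py",  # placeholder script location
    nodes_json,
)
# boto3.client("glue").create_job(**args)  # requires boto3 and AWS credentials
```

Keeping the parameter assembly in a helper like this makes it easy to create the same job in a different mode by changing only the job_mode argument.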
Review Created Job
After creating the job, it’s essential to review it to ensure everything is in order. You can check the AWS Glue visual editor to confirm the job creation and visualize the nodes in a directed acyclic graph (DAG). This visualization helps you understand how different components of the job interact with each other, from the data source to the final data destination. In the DAG, you should see stages like the S3 source (node 1), data transformation (node 2), and the S3 destination (node 3). Reviewing the visual representation provides an added level of assurance that your job is set up correctly and is ready to execute as planned.
How CloudFormation Works with the New JobMode Property
Create a Jupyter Notebook
The integration of the new JobMode property offers significant enhancements to AWS CloudFormation by allowing different job modes. To create a notebook job using AWS CloudFormation, begin by drafting your logic and code in a Jupyter Notebook file and saving it as my-glue-notebook.ipynb. The Jupyter Notebook serves as the scripting environment where you design and test your ETL processes, with multiple code cells representing the various stages of data processing. This step is crucial because it lays the groundwork for everything that follows, ensuring you have a functional and complete notebook ready for deployment.
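For reference, a .ipynb file is just JSON in the nbformat layout. A sketch that writes a minimal my-glue-notebook.ipynb with a single code cell (the cell contents are illustrative; in a Glue notebook job, spark is provided by the runtime):

```python
import json

# Minimal nbformat-4 notebook containing one code cell with an illustrative ETL snippet.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": [
                "df = spark.read.json('s3://amzn-s3-demo-bucket/input/')\n",
                "df.show()",
            ],
        }
    ],
}

with open("my-glue-notebook.ipynb", "w") as f:
    json.dump(notebook, f)
```

In practice you would author and test this notebook interactively rather than generating it by hand; the sketch only shows the file structure the next step uploads.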
Upload Notebook to S3
Next, you need to upload the Jupyter Notebook file to an S3 bucket, specifically within the notebooks/ folder. This upload ensures that AWS Glue has access to your notebook file during job execution. The notebook file should be placed in a bucket designated for AWS Glue assets, typically named aws-glue-assets--. This storage arrangement guarantees that AWS Glue can locate and utilize the notebook efficiently, minimizing the risk of file access errors at job runtime.
Draft CloudFormation Template
With the notebook file securely stored in S3, the next step is to create a new CloudFormation template to generate the AWS Glue job. While drafting the template, ensure you set the NotebookJobName parameter to match the name of your notebook file. This parameter links your CloudFormation stack to the specific notebook file, ensuring that the created job executes the logic contained within that notebook. Designing this template accurately is essential, as it acts as the blueprint AWS Glue uses to create, configure, and execute the job based on your specifications.
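A sketch of what such a template could look like, assuming the AWS::Glue::Job resource's JobMode property and using placeholder values for the role and Glue version:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  NotebookJobName:
    Type: String
    Description: Must match the notebook file name (without the .ipynb extension)
Resources:
  NotebookJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Ref NotebookJobName
      JobMode: NOTEBOOK
      Role: arn:aws:iam::123456789012:role/GlueRole   # placeholder role ARN
      GlueVersion: "4.0"                               # assumed version
      Command:
        Name: glueetl
        PythonVersion: "3"
        ScriptLocation: !Sub s3://aws-glue-assets-${AWS::AccountId}-${AWS::Region}/notebooks/${NotebookJobName}.ipynb
```

The ScriptLocation here points at the notebook file uploaded in the previous step, so the parameter value and the file name must agree.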
Deploy CloudFormation Template
Once your template is ready, it’s time to deploy it. Execute the template deployment process and, when prompted, provide the same name as the notebook file for the NotebookJobName parameter. The deployment process uses the information in the template to create a new AWS Glue job. After completion, you can navigate to the AWS Glue console to verify that the newly created job is correctly named and listed. This step ensures that the job was created and configured as intended, ready for execution when needed.
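The deployment itself can also be scripted with Boto3. A sketch, with the stack name and parameter values as assumptions:

```python
def build_stack_request(stack_name: str, template_body: str, notebook_job_name: str) -> dict:
    """Assemble CreateStack parameters, passing NotebookJobName through to the template."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": "NotebookJobName", "ParameterValue": notebook_job_name}
        ],
    }

# Hypothetical stack name; the template body would be the YAML drafted earlier.
request = build_stack_request("glue-notebook-job-stack", "...template YAML...", "my-glue-notebook")
# boto3.client("cloudformation").create_stack(**request)  # requires boto3 and credentials
```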
Confirm Notebook Job Creation
After deploying the CloudFormation template, it’s important to confirm that the Jupyter Notebook job was created correctly. Check the AWS Glue console to verify that the job is listed with the name specified in the CloudFormation template. This verification step guarantees that the job includes the notebook file and all the corresponding cells and logic. It also confirms that the job is configured correctly according to the parameters set in the CloudFormation template. Once verified, your notebook job is ready for use, allowing you to execute complex ETL processes efficiently.
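The verification can also be done programmatically. A sketch that checks the JobMode field in a GetJob response; the job name is an assumption, and the offline sample dict only mimics the response shape:

```python
def is_notebook_job(job: dict) -> bool:
    """Return True when a GetJob response's Job entry was created in NOTEBOOK mode."""
    return job.get("JobMode") == "NOTEBOOK"

# Against a live account (requires boto3 and credentials):
# job = boto3.client("glue").get_job(JobName="my-glue-notebook")["Job"]
# assert is_notebook_job(job)

# Offline illustration with a response-shaped dict:
sample = {"Name": "my-glue-notebook", "JobMode": "NOTEBOOK"}
```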
Console Experience
Navigate to ETL Jobs
Accessing and managing your AWS Glue jobs is now more user-friendly with the enhanced console features. Begin by navigating to the AWS Glue console. In the navigation pane, select ETL Jobs to observe all your ETL jobs listed. The console provides various columns such as Job name, Type, Created by, Last modified, and AWS Glue version. These columns offer detailed information about each job, aiding in quick and effective job management. The user-friendly interface allows you to effortlessly browse through your jobs, making the console a practical tool for everyday operations.
Sort and Filter Jobs
Managing a multitude of jobs can be overwhelming, but the AWS Glue console’s sorting and filtering capabilities simplify the process. You can sort and filter the jobs using different columns. For example, you can organize by Job name to find specific jobs easily. Sorting by Type helps to identify jobs based on their function, such as visual, notebook, or script jobs. Columns like Last modified and AWS Glue version provide insights into the job’s history and compatibility. These sorting and filtering options streamline job management, allowing you to focus on tasks requiring immediate attention.
Utilize JobMode Filtering
One of the standout features introduced is the ability to filter jobs by their respective JobMode. The “Created by” column in the AWS Glue console provides information about the JobMode of each job. This feature allows you to filter jobs created by VISUAL, NOTEBOOK, or SCRIPT modes, making it easier to pinpoint and manage jobs according to their development style. This level of filtering enhances job discoverability, ensuring you can locate and manage jobs quickly, regardless of the mode they were created in. It also helps in maintaining a well-organized console experience, contributing to increased operational efficiency.
Conclusion
The new JobMode property brings a small but meaningful improvement to how AWS Glue jobs are created and organized. Through the CreateJob API, you can now mark a job as VISUAL, SCRIPT, or NOTEBOOK at creation time, and the same property is available when provisioning jobs with AWS CloudFormation, including notebook jobs backed by Jupyter Notebook files stored in S3. In the AWS Glue console, the job mode surfaces alongside your other job metadata, so you can sort and filter your saved jobs by the mode in which they were created. Together, these capabilities make it easier to manage pipelines built by teams with different development preferences, whether they work in the visual editor, in notebooks, or directly in code.