Organizations are evolving IT processes to include data lakes and supporting services. Your organization might start by looking to extend the self-service portals you built using AWS Service Catalog to create data lakes as well. A self-service portal lets users vend required AWS resources within the guardrails defined by your cloud center of excellence (CCOE) team. This removes the heavy lifting from the CCOE team and lets users build their own environments. With AWS Service Catalog, you can also define the constraints on which AWS resources your users can and can’t deploy.
For example, with an appropriately configured self-service portal that supports creation and hydration of data lakes for structured relational data, your users could do the following:
With adequately configured constraints and templates, you can be confident that your users follow the best practices around these services (that is, private subnets, encrypted buckets, and specific security groups).
In this post, I show you how to use AWS Service Catalog to create an IT self-service portal that lets you create a data lake and populate your Data Catalog.
A data lake is a central repository of your structured and unstructured data that you use to store information in a separate environment from your active or compute space. A data lake enables diverse query capabilities, data science use cases, and discovery of new information models. For more information on data lakes, see Data Lakes and Analytics on AWS. Amazon S3 is an excellent choice for building your data lake on AWS because it offers multiple integrations. For more information, see Amazon S3 as the Data Lake Storage Platform.
Before you use your data lake for analytics and machine learning, you must first hydrate it—fill it with data—and create a Data Catalog containing metadata. AWS DMS works well for hydrating your data lake with structured data from your database. With that in place, you can use AWS Glue to automatically discover and categorize data, making it immediately searchable and queryable across data sources such as Glue ETL, Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR.
Other AWS services such as Amazon Data pipeline can be used for hydrating a data lake from a structured or unstructured database in addition to AWS DMS. This post demonstrates a specific solution that uses AWS DMS.
The following diagram shows the typical data lake hydration and cataloging process for databases.
For more information, see to following resources:
Using AWS Service Catalog, you can set up a self-service portal that lets your end users request components from your data lake, along with tools to hydrate it with data and create a Data Catalog.
The automated hydration process using AWS Service Catalog consists of the following:
To allow users from specific teams to vend these resources, you must associate an AWSServiceCatalogEndUserFullAccess managed policy with them. Your users also need IAM permissions to stop or start a crawler and a DMS task.
You can also configure the catalog to use launch constraints, which assume the appropriate, pre-configured IAM roles you configured and execute your CloudFormation template whenever your users activate specific resource. This provides your users capability to execute specific tasks, such as creating a DMS task or S3 bucket within guardrails you define.
After creating these resources, users can run the DMS task and AWS Glue crawler using the AWS console, finally hydrating and populating the Data Catalog.
You can try the above solution by deploying a sample catalog. The sample catalog solution creates a VPC, subnets, and IAM roles. It sets up a sample catalog with service products such as AWS Glue crawlers, DMS tasks, RDS, S3, and corresponding IAM roles for the AWS Glue crawler and DMS target. It also creates an end user and demonstrates how to allow that user to deploy RDS and DMS tasks using only the subnets created for them. The sample catalog also teaches you to configure launch constraints, so you don’t have to grant additional permissions to users. It contains an S3 product that vends service-specific IAM roles with access restricted to a specific S3 bucket.
By configuring catalog in such a manner, you can make implementing the following best practices at scale easier by standardizing and automating:
With AWS CloudFormation, you can automate the manual work by writing a CloudFormation template. With AWS Service Catalog, you can make templates available to end users like data curators, who might not know all the AWS services in detail. With the self-service portal, your users would vend AWS resources only using the CloudFormation template that you standardize and into which you implement your security best practices.
With a self-service portal created using AWS Service Catalog, you can automate the process and leave decision points like RDS engine type, size of the database, VPC, and other configurations to your users. This helps you maintain an appropriate level of security, keeping the nuts and bolts of that security automated behind the scene.
Make sure that you understand how to control the AWS resource you deploy using AWS Service Catalog as well as the general vocabulary before you begin. In this post, you populate a self-service portal using a sample RDS database blueprint from AWS Service Catalog reference blueprints.
To deploy the sample catalog solution discussed earlier, follow these steps.
To deploy this solution, you need administrator access to the AWS account.
A CloudFormation template handles most of the heavy lifting of sample catalog setup:
The templates this post provides are samples and not intended for production use. However, you can review the CloudFormation template to understand the infrastructure it creates.
Next, you can view the catalog:
For this post, set up an RDS database. To do so:
The output contains MasterJDBCConnectionString connection string, which includes the RDS endpoint (the underlined portion in the following example). You can use the same endpoint to connect to the database and create sample data.
This example vends an RDS database, but you can also automate the creation of an Amazon DynamoDB, Amazon Redshift cluster, Amazon Kinesis Data Firehose delivery stream, and other necessary AWS resources.
For security reasons, I provisioned the database in a private subnet. You must set up a bastion host (unless you have VPN or DirectConnect access) to connect it to your database. You can provision an Amazon Linux 2.0-based bastion host in the public subnet and then log on to the same. For more information, see Launch an Amazon EC2 instance.
The my_service_catalog_end_user does not have access to the Amazon EC2 console. Do this step with an alternate user that has permissions to launch an EC2 instance. After you launch an EC2 instance, connect to your EC2 instance.
Next, execute the following commands to create a simple database and a table with two rows:
Switch to the my_service_catalog_end_user role and follow the process outlined in Step 3 to provision a product from the S3 bucket and appropriate roles. If you are using a new AWS account that does not have dms-vpc-role and dms-cloudwatch-logs-role IAM roles, you can select N as parameters; otherwise, you can leave default values for parameters.
After you provision the product, you can see the output and find details of the S3 bucket and DMS/AWS Glue roles that can access the newly created bucket. Make a note of the output, as you need the S3 bucket information and IAM roles in subsequent steps.
Follow the process identical to Step 3 and provision a product from DMS Task product. When you provision the DMS task, on the Parameters page:
Next, open the tasks section of DMS console and locate the newly created task; its status should read Ready. Select the task and choose Restart/Resume. This starts your DMS replication task, which hydrates the S3 bucket you specified earlier with the extract of database chosen.
I granted the my_service_catalog_end_user IAM role additional permissions – dms:StartReplicationTask and dms:StopReplicationTask, to allow users to start and stop DMS tasks. This shows how you can combine minimal permissions outside AWS Service Catalog to enable your users to perform tasks.
After the task completes, its status changes to Load Complete and the S3 bucket you created earlier now contains files filled with data from your database.
Now that you have hydrated your data lake with sample data, you can run AWS Glue crawler to populate your AWS Glue Data Catalog. To do so, follow the process outlined in Step 3 and provision a product from AWS Glue crawler. On the Parameters page, specify the following parameters:
Next, open the crawlers section of the AWS Glue catalog to locate a crawler created by the AWS Service Catalog. The crawler’s status should read Ready. Select the task and then choose Run crawler. When the crawler finishes, you can review what data populated your database from the database console.
As shown in following diagram, you can select the mydb database and see tables to explore the AWS Glue Data Catalog populated by the AWS Glue crawler.
The principles discussed in this post can be extended to logs, streams, and files. You can use various BI tools to extract useful knowledge from your data lake. For information about how to visualize your data, see Harmonize, Query, and Visualize Data from Various Providers using AWS Glue, Amazon Athena, and Amazon QuickSight. You can query your data lake from Amazon SageMaker notebooks. For more information, see Access Amazon S3 data managed by AWS Glue Data Catalog from Amazon SageMaker notebooks.
AWS Service Catalog enables you to build and distribute catalogs of IT services to your organization. In this post, I demonstrated how you can set up a catalog that lets your users vend tools to support creation and hydration of data lakes and maintain tight security standards. You can extend this idea of self-service by supporting resources such as DynamoDB databases, Kinesis Data Firehose delivery streams, Amazon Redshift clusters, and Amazon SageMaker notebooks, granting your users more flexibility and utility in their data lakes within guardrails you define.
If you have questions about implementing the solution described in this post, you can start a new thread on the AWS Service Catalog Forum or contact AWS Support.
Contact us and we will help you cloud deployment: