Data Lakes on AWS

Zahid Un Nabi
7 min read · Apr 25, 2021

We know what S3 is. Now it’s time to discuss how the data is organized in this service.

Amazon S3 is an amazing object container. Like any bucket, you can put content in it in a neat and orderly fashion, or you can just dump it in. But no matter how the data gets there, once it’s there, you need a way to organize it in a meaningful way so you can find it when you need it.

This is where data lakes come in. A data lake is an architectural concept that helps you manage multiple data types from multiple sources, both structured and unstructured, through a single set of tools.

Let’s break that down. A data lake takes Amazon S3 buckets and organizes them by categorizing the data inside the buckets. It doesn’t matter how the data got there or what kind it is. You can store both structured and unstructured data effectively in an Amazon S3 data lake. AWS offers a set of tools to manage the entire data lake without treating each bucket as a separate, unassociated store of objects.
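As a minimal sketch of that idea, the snippet below writes structured, semistructured, and unstructured objects into a single bucket under different prefixes using boto3. The bucket name and key layout are invented for illustration.

```python
import boto3
import json

s3 = boto3.client("s3")
BUCKET = "my-company-data-lake"  # assumed bucket name

# Structured data: a CSV export from a relational system.
s3.upload_file("daily_orders.csv", BUCKET,
               "structured/orders/2021/04/25/daily_orders.csv")

# Unstructured data: raw application logs land in the same lake.
s3.upload_file("app.log", BUCKET, "unstructured/logs/2021/04/25/app.log")

# Semistructured data: a JSON event written directly as an object.
event = {"user_id": 42, "action": "checkout"}
s3.put_object(Bucket=BUCKET,
              Key="semistructured/events/event-0001.json",
              Body=json.dumps(event).encode("utf-8"))
```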

Many businesses end up grouping data together into numerous storage locations called silos. These silos are rarely managed and maintained by the same team, which can be problematic. Inconsistencies in the way data was written, collected, aggregated, or filtered can cause problems when it is compared or combined for processing and analysis.

For example, one team may use the address field to store both the street number and street name, while another team might use separate fields for street number and street name. When these datasets are combined, the address is stored inconsistently, which makes analysis very difficult.

But by using data lakes, you can break down data silos and bring data into a single, central repository that is managed by a single team. That gives you a single, consistent source of truth.

Because data can be stored in its raw format, you don’t need to convert it, aggregate it, or filter it before you store it. Instead, you can leave that pre-processing to the system that processes it, rather than the system that stores it.

In other words, you don’t have to transform the data to make it usable. You keep the data in its original form, however it got there, however it was written. When you’re talking about exabytes of data, you can’t afford to pre-process it into every conceivable shape it might one day need to take.

Let’s talk about having a single source of truth. When we talk about the truth concerning data, we mean the trustworthiness of the data. Is it what it should be? Has it been altered? Can we validate the chain of custody? When creating a single source of truth, we’re creating a dataset, in this case, the data lake, which can be used for all processing and analytics. The bonus is that we know it to be consistent and reliable. It’s trustworthy.

So to bring it all together, we know that businesses need to easily access and analyze data in a variety of ways, using the tools and frameworks of their choice. Remember the second principle we spoke about in the last topic: moving data between storage and processing is costly. Amazon S3 data lakes provide a single storage backbone that meets these requirements, along with tools for analyzing the data in place, without moving it.

In the next topic, we are going to discuss the nature of data stored within data processing applications.

Storing business content has always been a point of contention, and often frustration, within businesses of all types. Should content be stored in folders? Should prefixes and suffixes be used to identify file versions? Should content be divided by department or specialty? The list goes on and on.

The issue stems from the fact that many companies start to implement document or file management systems with the best of intentions but don’t have the foresight or infrastructure in place to maintain the initial data organization.

Out of the dire need for organizing the ever-increasing volume of data, data lakes were born.

A data lake is a centralized repository that allows you to store structured, semistructured, and unstructured data at any scale.

On-premises data movement

Data lakes allow you to import any amount of data. Data is collected from multiple sources and moved into the data lake in its original format.

This process allows you to scale to data of any size while sparing you from defining data structures, schemas, and transformations up front.
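A rough sketch of that movement, assuming an on-premises export directory and a destination bucket (both names invented): walk the local files and upload each one as-is, preserving the original layout under a raw/ prefix.

```python
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-company-data-lake"   # assumed bucket name
LOCAL_ROOT = "/exports/crm"       # assumed on-premises export directory

# Upload every file exactly as it is: no schema, no transformation.
for dirpath, _, filenames in os.walk(LOCAL_ROOT):
    for name in filenames:
        local_path = os.path.join(dirpath, name)
        # Mirror the local layout under a "raw/" prefix in the lake.
        rel = os.path.relpath(local_path, LOCAL_ROOT).replace(os.sep, "/")
        s3.upload_file(local_path, BUCKET, "raw/crm/" + rel)
```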

Machine learning

Data lakes enable an organization to generate different types of insights, including reporting on historical data and machine learning, where models are built to forecast likely outcomes and suggest a range of prescribed actions to achieve the optimal result.
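As a toy illustration only (the dataset, column names, and model here are all invented), a forecasting model might be fit directly against files in the lake:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical historical data already sitting in the lake; pandas can
# read straight from S3 when the s3fs package is installed.
df = pd.read_csv("s3://my-company-data-lake/structured/orders/history.csv")

# Fit a simple model to forecast order volume from marketing spend.
model = LinearRegression()
model.fit(df[["marketing_spend"]].values, df["order_volume"].values)

# Predicted volume at a planned spend level.
forecast = model.predict([[50_000.0]])
```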

Real-time data movement

Data lakes also allow you to import data that arrives in real time. Data can be collected from multiple streaming sources and moved into the data lake in its original format.
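For instance, a producer might push events into an Amazon Kinesis stream with boto3; a delivery pipeline such as Kinesis Data Firehose can then land the unchanged records in S3. The stream name and event fields below are assumptions.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "clickstream-events"  # assumed stream name

# Each record is sent in its original JSON form; a delivery stream can
# move these records into the S3 data lake without altering them.
event = {"user_id": 42, "page": "/checkout", "ts": "2021-04-25T10:00:00Z"}
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["user_id"]),
)
```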

Storage and protection

Data lakes allow you to store relational data from sources such as operational databases and line-of-business applications, and non-relational data from sources such as mobile apps, Internet of Things (IoT) devices, and social media.

A data lake also gives you the ability to understand what is in the lake through crawling, cataloging, and indexing of data.
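With AWS Glue, for example, a crawler can scan the lake and populate the Data Catalog. This sketch assumes a crawler and catalog database that were already configured; the names are invented.

```python
import boto3

glue = boto3.client("glue")

# Assumed crawler, pre-configured to scan the lake's "raw/" prefix.
glue.start_crawler(Name="data-lake-raw-crawler")

# Once the crawl finishes, the discovered tables are searchable
# in the catalog and ready for querying and ETL.
tables = glue.get_tables(DatabaseName="data_lake_raw")  # assumed database
for t in tables["TableList"]:
    print(t["Name"], t["StorageDescriptor"]["Location"])
```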

Finally, data must be secured to ensure your data assets are protected.
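On S3, two sensible baseline protections are default encryption at rest and blocking public access. A sketch with boto3, bucket name assumed:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-company-data-lake"  # assumed bucket name

# Encrypt every object at rest by default.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Block all forms of public access to the lake.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```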

Analytics

Data lakes allow various roles in your organization, such as data scientists, data developers, and business analysts, to access data with their choice of analytic tools and frameworks.

This includes open-source frameworks such as Apache Hadoop, Presto, and Apache Spark and commercial offerings from data warehouse and business intelligence vendors.

Data lakes allow you to run analytics without the need to move your data to a separate analytics system.
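Amazon Athena is one example: it queries the data where it sits in S3. The database, table, and output location below are assumptions carried over from the earlier sketches.

```python
import boto3

athena = boto3.client("athena")

# Query the data in place; only the small result set is written out,
# and no data is copied into a separate analytic system.
response = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page",
    QueryExecutionContext={"Database": "data_lake_raw"},  # assumed database
    ResultConfiguration={
        "OutputLocation": "s3://my-company-data-lake/athena-results/"
    },
)
print(response["QueryExecutionId"])
```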

Data lakes promise the ability to store all data for a business in a single repository. You can leverage data lakes to store large volumes of data instead of persisting that data in data warehouses. Data lakes, such as those built on Amazon S3, are generally less expensive than specialized big data storage solutions. That way, you only pay for the specialized solutions when using them for processing and analytics, not for long-term storage. Your extract, transform, and load (ETL) jobs and analytic processes can still access this data for analytics.

Below are some of the benefits of building a data lake on AWS.

Benefits of a data lake on AWS

  • Are a cost-effective data storage solution. You can durably store a nearly unlimited amount of data using Amazon S3.
  • Implement industry-leading security and compliance. AWS uses stringent data security, compliance, privacy, and protection mechanisms.
  • Allow you to take advantage of many different data collection and ingestion tools to ingest data into your data lake. These services include Amazon Kinesis for streaming data and AWS Snowball appliances for large volumes of on-premises data.
  • Help you to categorize and manage your data simply and efficiently. Use AWS Glue to understand the data within your data lake, prepare it, and load it reliably into data stores. Once AWS Glue catalogs your data, it is immediately searchable, can be queried, and is available for ETL processing.
  • Help you turn data into meaningful insights. Harness the power of purpose-built analytic services for a wide range of use cases, such as interactive analysis, data processing using Apache Spark and Apache Hadoop, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.

Amazon EMR and data lakes

Businesses have begun realizing the power of data lakes. They can place data within a data lake and use their choice of open-source distributed processing frameworks, such as those supported by Amazon EMR. Amazon EMR supports both Apache Hadoop and Apache Spark, which helps businesses easily, quickly, and cost-effectively implement data processing solutions based on Amazon S3 data lakes.
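On an EMR cluster, a Spark job can read raw objects from the lake and write curated results back without any intermediate copy. The paths below are assumptions matching the earlier sketches.

```python
from pyspark.sql import SparkSession

# On EMR, a SparkSession reads directly from the S3 data lake.
spark = SparkSession.builder.appName("event-analysis").getOrCreate()

# Read the raw JSON events ingested earlier (assumed prefix).
events = spark.read.json("s3://my-company-data-lake/semistructured/events/")

# A simple aggregation: count events per action type.
counts = events.groupBy("action").count()

# Write curated results back to the lake as Parquet.
counts.write.mode("overwrite").parquet(
    "s3://my-company-data-lake/curated/action_counts/"
)
```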

Data lake preparation

Data scientists spend 60% of their time cleaning and organizing data, and 19% collecting datasets.

Data preparation is a huge undertaking. There are no easy answers when it comes to cleaning, transforming, and collecting data for your data lake. However, some services can automate many of these time-consuming processes.

Setting up and managing data lakes today can involve a lot of manual, complicated, and time-consuming tasks. This work includes loading the data, monitoring the data flows, setting up partitions for the data, and tuning encryption. You may also need to reorganize data, deduplicate it, match linked records, and audit data over time.

AWS Lake Formation makes it easy to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning. Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon S3 data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and machine learning services.

Lake Formation automatically configures underlying AWS services to ensure compliance with your defined policies. If you have set up transformation jobs spanning AWS services, Lake Formation configures the flows, centralizes their orchestration, and lets you monitor the execution of your jobs.
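As one small example of such a policy, boto3’s Lake Formation client can grant an analyst role SELECT on a cataloged table; the role ARN, database, and table names here are invented:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role SELECT on a cataloged table. Lake Formation
# then enforces this policy for the analytic services that honor it.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"
    },
    Resource={"Table": {"DatabaseName": "data_lake_raw", "Name": "clickstream"}},
    Permissions=["SELECT"],
)
```

Once granted, integrated services such as Athena and Amazon Redshift Spectrum apply the permission automatically, so access control lives in one place rather than being re-implemented per tool.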
