
Lake Life for Industry

  • Writer: Faline Rezvani
  • Jun 6, 2024
  • 3 min read

Updated: Jul 15, 2024


A single source of truth (SSoT) enables an enterprise to track data lineage from ingestion to prescriptive analytics. The concept of a central data repository facilitates transparency, seamless communication, and enhanced customer and employee satisfaction. Is this state of infrastructural bliss truly achievable? This post provides a very brief outline of the steps an organization can take toward an SSoT.

Unlike a data warehouse, which assigns structure, or schema, on import, a data lake is a common ingestion and storage framework fed by a free-flowing stream of sources; the data are stored safely under a governance model until a schema is defined on export.

Because a data lake is data-agnostic, the sources flowing into it may include anything from user experience (UX) logs to local databases. The purposes of this structured (XLS), semi-structured (CSV, XML, JSON), and unstructured (PDF) data are determined at the time of extraction.
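
To make the schema-on-read idea concrete, here is a minimal PySpark sketch that reads raw JSON events from the lake and applies a schema only at query time. The lake path and field names are hypothetical placeholders, not a prescribed layout.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is declared by the consumer at read time, not enforced at ingestion.
ux_log_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical lake path; raw JSON landed here exactly as the source produced it.
raw_events = (
    spark.read
    .schema(ux_log_schema)
    .json("s3a://example-data-lake/raw/ux_logs/")
)

raw_events.filter(raw_events.event_type == "purchase").show()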

Selecting a service for the data lake and its supporting environment requires weighing batch versus streaming sources, cost parameters, compatibility with existing software and systems, and tools that fit the team's skills at each step of the pipeline.

Data Lake Tools:
  • Amazon Simple Storage Service (S3)

  • Azure Data Lake

  • Hive
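
As one illustration of landing raw files in object storage, the following is a minimal sketch using boto3 against Amazon S3. The bucket name, key prefix, and file name are assumptions made for the example.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key; in practice the key follows the lake's zone
# layout (e.g., a "raw" prefix for untouched source exports).
s3.upload_file(
    Filename="ux_logs_2024-06-06.json",          # local export from a source system
    Bucket="example-data-lake",                   # landing bucket for the lake
    Key="raw/ux_logs/2024/06/06/ux_logs.json",
)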


Once a flexible containment area is established, an organization can select equally pliable databases as sources to feed the data lake. NoSQL databases operate on a BASE (Basically Available, Soft state, Eventual consistency) model to deliver scalability and reliability, keeping in mind that availability and consistency must at times be traded off against each other.

NoSQL Database Tools:
  • HBase

  • MongoDB

  • Cassandra

  • Amazon DynamoDB
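
For instance, a source service might persist documents to MongoDB with a sketch like the one below; the connection string, database, and collection names are placeholders. The flexible document model is what later makes the lake's schema-on-read approach practical.

from pymongo import MongoClient

# Placeholder connection string; a managed cluster URI would normally come
# from configuration or a secrets manager.
client = MongoClient("mongodb://localhost:27017")
events = client["game_telemetry"]["session_events"]

# Documents need not share an identical structure; their schema is
# interpreted downstream when the data are read from the lake.
events.insert_one({
    "user_id": "u-1234",
    "event_type": "match_completed",
    "duration_seconds": 312,
})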


To make use of the continuous flow of events generated every second by a vast number of sources, an organization will use an event stream processor (ESP) to ensure streaming data can be ingested into the data lake.

ESP Tools:
  • Apache Kafka

  • Amazon Data Firehose
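
Below is a minimal producer sketch using the kafka-python client; the broker address, topic name, and event fields are assumptions. Each event is serialized to JSON and published to a topic that downstream jobs drain into the lake.

import json
from kafka import KafkaProducer

# Broker address and topic are placeholders for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("ux-events", {"user_id": "u-1234", "event_type": "level_start"})
producer.flush()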


Since a data lake holds data of various types, data preparation processes, such as indexing and tokenization, must take place.

Data Cleansing Tools:
  • Elasticsearch

  • Apache Spark
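
As a sketch of this preparation step, the snippet below uses PySpark to drop duplicates and tokenize a free-text field before the data move on; the paths and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("lake-prep").getOrCreate()

# Hypothetical raw-zone path and columns.
reviews = spark.read.json("s3a://example-data-lake/raw/reviews/")

# Basic cleansing: remove duplicate records and rows with no text.
cleaned = reviews.dropDuplicates(["review_id"]).na.drop(subset=["review_text"])

# Tokenization splits the free text into simple whitespace-delimited terms
# that can later be indexed or searched.
tokenizer = Tokenizer(inputCol="review_text", outputCol="review_tokens")
tokenized = tokenizer.transform(cleaned)

tokenized.write.mode("overwrite").parquet("s3a://example-data-lake/prepared/reviews/")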


An organization will choose software to support its chosen data governance framework.

Data Governance Tools:
  • Databricks Unity Catalog

  • Apache Atlas

  • Egnyte
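
Governance ultimately shows up as metadata and policy on the stored data. As one small example, the hedged sketch below tags an S3 object with a classification that catalog and access-control tooling can act on; the bucket, key, and tag values are placeholders.

import boto3

s3 = boto3.client("s3")

# Attach classification metadata to a landed object; governance tooling can
# key retention and access rules off these tags. Names are illustrative.
s3.put_object_tagging(
    Bucket="example-data-lake",
    Key="raw/ux_logs/2024/06/06/ux_logs.json",
    Tagging={"TagSet": [
        {"Key": "classification", "Value": "internal"},
        {"Key": "owner", "Value": "analytics-team"},
    ]},
)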


When data are finally called upon to earn a company revenue, additional processing and quality-control measures are needed for descriptive, predictive, and prescriptive analytics.

Table Creation:
  • Delta Lake
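
A minimal sketch of promoting prepared data into a Delta table is shown below, assuming a Spark session configured with the Delta Lake libraries; the paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-promotion").getOrCreate()

prepared = spark.read.parquet("s3a://example-data-lake/prepared/reviews/")

# Writing in Delta format layers ACID transactions and time travel on top of
# the underlying object storage.
(prepared.write
    .format("delta")
    .mode("overwrite")
    .save("s3a://example-data-lake/curated/reviews_delta/"))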


Machine Learning Tools:
  • Amazon SageMaker

  • Azure Databricks
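
To illustrate the analytics end of the pipeline, here is a hedged sketch of a simple churn-style classifier trained with Spark MLlib, which runs on platforms such as Databricks; the curated table path, feature columns, and label are hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-sketch").getOrCreate()

# Hypothetical curated Delta table with per-player aggregates and a 0/1 churn label.
players = spark.read.format("delta").load("s3a://example-data-lake/curated/player_features/")

# Combine the numeric aggregates into a single feature vector.
assembler = VectorAssembler(
    inputCols=["sessions_last_30d", "avg_session_minutes", "days_since_last_login"],
    outputCol="features",
)
train_df = assembler.transform(players)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(train_df)
print(model.summary.areaUnderROC)  # rough training-set quality check for the sketch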

 
Putting the concept of an SSoT into practice through data lakes can generate intelligent experiences, from data collection to infusion, and embrace a future of change.

Data Lake Advantages:
  • Utilizing edge computing, the method of conducting analysis close to the data source

  • Serving as an antechamber in which to determine the relevant data to ship to an enterprise data warehouse (EDW)

  • Taking advantage of Industrial Internet of Things (IIoT)


Use Case
The gaming company SEGA Europe has made the journey to a distributed file system, utilizing cloud-native clusters that process 25,000 events per second. The company began by improving the back end of its products, starting with the game Football Manager.

By developing APIs that communicate with Amazon Kinesis, the stream that feeds data to MongoDB and Amazon S3, where it is encrypted, SEGA Europe changed the way gamer data is collected. The information is cleaned with Apache Spark and sent to Delta Lake for further processing. This streamlined process of data collection and manipulation allows SEGA Europe to draw insights, such as customer churn potential, and optimize the gaming experience (Targett, 2024).

Each technology, tool, service, and piece of software featured in this post has a variety of uses. Perhaps one of the biggest challenges a company will face is choosing the right path for its needs.

 


“You can have data without information, but you cannot have information without data.”  - Daniel Keys Moran
 
 

 
References:


