
Lake Life for Industry

  • Writer: Faline Rezvani
  • Jun 6, 2024
  • 3 min read

Updated: Jul 15, 2024


A single source of truth (SSoT) enables an enterprise to track data lineage from ingestion to prescriptive analytics. The concept of a central data repository facilitates transparency, seamless communication, and enhanced customer and employee satisfaction. Is this state of infrastructural bliss truly achievable? This post provides a very brief outline of the steps an organization can take toward an SSoT.

Unlike a data warehouse, which assigns structure, or schema, on import, a data lake is a common ingestion and storage framework fed by a free-flowing stream of sources; the data are stored safely under a governance model until a schema is defined on export.

Because a data lake is data-agnostic, the sources flowing into it may include anything from user experience (UX) logs to local databases. The purposes of this structured (XLS), semi-structured (CSV, XML, JSON), and unstructured (PDF) data are determined at the time of extraction.
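
To make the schema-on-read idea concrete, here is a minimal PySpark sketch that reads raw JSON events from the lake and applies a schema only at query time. The lake path and field names are hypothetical placeholders, not a prescribed layout.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is declared by the consumer at read time, not enforced at ingestion.
ux_log_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical lake path; raw JSON landed here exactly as the source produced it.
raw_events = (
    spark.read
    .schema(ux_log_schema)
    .json("s3a://example-data-lake/raw/ux_logs/")
)

raw_events.filter(raw_events.event_type == "purchase").show()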

Selecting a service for the data lake and its supporting environment requires weighing batch versus streaming sources, cost parameters, compatibility with existing software and systems, and tools that fit the team's skills at each step of the pipeline.

Data Lake Tools:
  • Amazon Simple Storage Service (S3)

  • Azure Data Lake

  • Hive
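
As one illustration of landing raw files in object storage, the following is a minimal sketch using boto3 against Amazon S3. The bucket name, key prefix, and file name are assumptions made for the example.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key; in practice the key follows the lake's zone
# layout (e.g., a "raw" prefix for untouched source exports).
s3.upload_file(
    Filename="ux_logs_2024-06-06.json",          # local export from a source system
    Bucket="example-data-lake",                   # landing bucket for the lake
    Key="raw/ux_logs/2024/06/06/ux_logs.json",
)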


Once a flexible containment area is established, an organization can select equally pliable databases as sources to feed the data lake. NoSQL databases operate on a BASE (Basically Available, Soft state, Eventual consistency) model to deliver scalability and reliability, keeping in mind that availability and consistency must at times be traded off against each other.

NoSQL Database Tools:
  • HBase

  • MongoDB

  • Cassandra

  • Amazon DynamoDB
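
For instance, a source service might persist documents to MongoDB with a sketch like the one below; the connection string, database, and collection names are placeholders. The flexible document model is what later makes the lake's schema-on-read approach practical.

from pymongo import MongoClient

# Placeholder connection string; a managed cluster URI would normally come
# from configuration or a secrets manager.
client = MongoClient("mongodb://localhost:27017")
events = client["game_telemetry"]["session_events"]

# Documents need not share an identical structure; their schema is
# interpreted downstream when the data are read from the lake.
events.insert_one({
    "user_id": "u-1234",
    "event_type": "match_completed",
    "duration_seconds": 312,
})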


To make use of the continuous flow of events generated every second by a vast number of sources, an organization will use an event stream processor (ESP) to ensure streaming data can be ingested into the data lake.

ESP Tools:
  • Apache Kafka

  • Amazon Data Firehose
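
Below is a minimal producer sketch using the kafka-python client; the broker address, topic name, and event fields are assumptions. Each event is serialized to JSON and published to a topic that downstream jobs drain into the lake.

import json
from kafka import KafkaProducer

# Broker address and topic are placeholders for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("ux-events", {"user_id": "u-1234", "event_type": "level_start"})
producer.flush()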


Since a data lake holds data of various types, data preparation processes, such as indexing and tokenization, must take place.

Data Cleansing Tools:
  • Elasticsearch

  • Apache Spark
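
As a sketch of this preparation step, the snippet below uses PySpark to drop duplicates and tokenize a free-text field before the data move on; the paths and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("lake-prep").getOrCreate()

# Hypothetical raw-zone path and columns.
reviews = spark.read.json("s3a://example-data-lake/raw/reviews/")

# Basic cleansing: remove duplicate records and rows with no text.
cleaned = reviews.dropDuplicates(["review_id"]).na.drop(subset=["review_text"])

# Tokenization splits the free text into simple whitespace-delimited terms
# that can later be indexed or searched.
tokenizer = Tokenizer(inputCol="review_text", outputCol="review_tokens")
tokenized = tokenizer.transform(cleaned)

tokenized.write.mode("overwrite").parquet("s3a://example-data-lake/prepared/reviews/")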


An organization will choose software to support its chosen data governance framework.

Data Governance Tools:
  • Databricks Unity Catalog

  • Apache Atlas

  • Egnyte
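
Governance ultimately shows up as metadata and policy on the stored data. As one small example, the hedged sketch below tags an S3 object with a classification that catalog and access-control tooling can act on; the bucket, key, and tag values are placeholders.

import boto3

s3 = boto3.client("s3")

# Attach classification metadata to a landed object; governance tooling can
# key retention and access rules off these tags. Names are illustrative.
s3.put_object_tagging(
    Bucket="example-data-lake",
    Key="raw/ux_logs/2024/06/06/ux_logs.json",
    Tagging={"TagSet": [
        {"Key": "classification", "Value": "internal"},
        {"Key": "owner", "Value": "analytics-team"},
    ]},
)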


When data are finally called upon to earn a company revenue, additional processing and quality-control measures are needed for descriptive, predictive, and prescriptive analytics.

Table Creation:
  • Delta Lake
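
A minimal sketch of promoting prepared data into a Delta table is shown below, assuming a Spark session configured with the Delta Lake libraries; the paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-promotion").getOrCreate()

prepared = spark.read.parquet("s3a://example-data-lake/prepared/reviews/")

# Writing in Delta format layers ACID transactions and time travel on top of
# the underlying object storage.
(prepared.write
    .format("delta")
    .mode("overwrite")
    .save("s3a://example-data-lake/curated/reviews_delta/"))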


Machine Learning Tools:
  • Amazon SageMaker

  • Azure Databricks
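
To illustrate the analytics end of the pipeline, here is a hedged sketch of a simple churn-style classifier trained with Spark MLlib, which runs on platforms such as Databricks; the curated table path, feature columns, and label are hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-sketch").getOrCreate()

# Hypothetical curated Delta table with per-player aggregates and a 0/1 churn label.
players = spark.read.format("delta").load("s3a://example-data-lake/curated/player_features/")

# Combine the numeric aggregates into a single feature vector.
assembler = VectorAssembler(
    inputCols=["sessions_last_30d", "avg_session_minutes", "days_since_last_login"],
    outputCol="features",
)
train_df = assembler.transform(players)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(train_df)
print(model.summary.areaUnderROC)  # rough training-set quality check for the sketch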

 
Putting the concept of an SSoT into practice through data lakes can generate intelligent experiences, from data collection to infusion, and embrace a future of change.

Data Lake Advantages:
  • Utilizing edge computing, the method of conducting analysis close to the data source

  • Serving as an antechamber in which to determine the relevant data to ship to an enterprise data warehouse (EDW)

  • Taking advantage of Industrial Internet of Things (IIoT)


Use Case
The gaming company SEGA Europe has made the journey to a distributed file system, utilizing cloud-native clusters that process 25,000 events per second. The company began by improving the back end of its products, starting with the game Football Manager.

By developing APIs that communicate with Amazon Kinesis, the stream that feeds data to MongoDB and Amazon S3, where it is encrypted, SEGA Europe changed the way gamer data is collected. The information is cleaned with Apache Spark and sent to Delta Lake for further processing. This streamlined process of data collection and manipulation allows SEGA Europe to draw insights, such as customer churn potential, and optimize the gaming experience (Targett, 2024).

Each technology, tool, service, and piece of software featured in this post has a variety of uses. Perhaps one of the biggest challenges a company will face is choosing the right path for its needs.

 


“You can have data without information, but you cannot have information without data.”  - Daniel Keys Moran
 
 

 
References:


