Lake Life for Industry
- Faline Rezvani
- Jun 6, 2024
- 3 min read
Updated: Jul 15, 2024
A single source of truth (SSoT) enables an enterprise to track data lineage from ingestion to prescriptive analytics. The concept of a central data repository facilitates transparency, seamless communication, and enhanced customer and employee satisfaction. Is this state of infrastructural bliss truly achievable? This post provides a very brief outline of the steps an organization can take toward an SSoT.
As opposed to a data warehouse, which assigns structure (a schema) upon import, a data lake is a common ingestion storage framework fed by a free-flowing stream of sources; data are stored safely under a governance model until a schema is defined upon export. In other words, a warehouse is schema-on-write, while a lake is schema-on-read.
Because a data lake is data-agnostic, the sources feeding it may include anything from user experience (UX) logs to local databases. The purpose of structured (XLS), semi-structured (CSV, XML, JSON), and unstructured (PDF) data is determined at the time of extraction.
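The schema-on-read idea can be sketched in a few lines of Python: records of mixed shape are stored in the raw zone untouched, and a structure is imposed only when a consumer extracts them (the field names here are hypothetical):

```python
import json

# Raw zone: heterogeneous events are stored untouched, schema-free.
raw_zone = [
    '{"user": "ada", "clicks": 3}',                  # a UX log event
    '{"user": "ada", "order_id": 17, "total": 9.5}', # a database export
]

def extract(records, fields):
    """Schema-on-read: the consumer names the fields it wants only at
    extraction time; fields absent from a record become None."""
    rows = []
    for rec in records:
        doc = json.loads(rec)
        rows.append({f: doc.get(f) for f in fields})
    return rows

# Two consumers impose two different schemas on the same raw data.
clickstream = extract(raw_zone, ["user", "clicks"])
orders = extract(raw_zone, ["user", "order_id", "total"])
```

Note that neither consumer forced the other's schema onto the raw data, which is exactly the flexibility the lake provides.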
Selecting a service for the data lake and its supporting environment requires weighing batch versus streaming sources, cost parameters, compatibility with existing software and systems, and tools that fit the team's skills at each step along the pipeline.
Data Lake Tools:
Amazon Simple Storage Service (S3)
Azure Data Lake
Hive
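As a minimal sketch of how objects might land in an S3-backed lake, the helper below builds partitioned object keys (the zone names and partition layout are conventions assumed for illustration, not S3 requirements); the actual upload with boto3 is guarded, since it needs AWS credentials and a real bucket:

```python
from datetime import date

def lake_key(zone, source, day, filename):
    """Build a partitioned object key such as
    raw/ux_logs/year=2024/month=06/day=06/events.json."""
    return (f"{zone}/{source}/year={day.year}"
            f"/month={day.month:02d}/day={day.day:02d}/{filename}")

key = lake_key("raw", "ux_logs", date(2024, 6, 6), "events.json")

if __name__ == "__main__":
    # Requires AWS credentials, the boto3 package, and an existing bucket.
    import boto3
    s3 = boto3.client("s3")
    s3.put_object(Bucket="my-data-lake", Key=key, Body=b"{}")
```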
Once a flexible containment area is established, an organization can select equally pliable databases as sources to feed the data lake. NoSQL databases follow a BASE (Basically Available, Soft state, Eventual consistency) model to deliver scalability and reliability, keeping in mind that availability and consistency must at times be traded off against each other, as the CAP theorem makes explicit.
NoSQL Database Tools:
HBase
MongoDB
Cassandra
Amazon DynamoDB
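Eventual consistency, the "E" in BASE, can be illustrated with a toy pair of replicas: a write lands on one node, a read from the other node may briefly return stale data, and an anti-entropy sync later brings the replicas into agreement (everything here is a simplified simulation, not any particular database's API):

```python
class Replica:
    """A toy database node holding a key-value store."""
    def __init__(self):
        self.store = {}
    def write(self, key, value):
        self.store[key] = value
    def read(self, key):
        return self.store.get(key)

def sync(source, target):
    """Toy anti-entropy pass: copy every key from source to target."""
    target.store.update(source.store)

a, b = Replica(), Replica()
a.write("user:1", "ada")    # write accepted by replica a only
stale = b.read("user:1")    # replica b has not converged yet -> None
sync(a, b)                  # replication catches up
fresh = b.read("user:1")    # now consistent -> "ada"
```

The window between the stale read and the sync is the availability-versus-consistency trade-off the paragraph above describes.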
To utilize the continuous flow of data generated per second by a vast number of sources, an organization will use an event stream processor (ESP) to ensure streaming data events can be ingested into the data lake.
ESP Tools:
Apache Kafka
Amazon Data Firehose
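A streaming ingest step with Kafka might look like the sketch below: the event serializer is a pure function, while the producer call (using the third-party kafka-python client; the broker address and topic name are assumptions) only runs when a broker is available:

```python
import json

def serialize_event(source, payload):
    """Encode an event as UTF-8 JSON bytes, ready for a Kafka topic."""
    return json.dumps({"source": source, "payload": payload}).encode("utf-8")

msg = serialize_event("ux_logs", {"user": "ada", "clicks": 3})

if __name__ == "__main__":
    # Requires a running broker and the kafka-python package.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("lake-ingest", msg)
    producer.flush()
```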
Since a data lake holds data of various types, data preparation processes, such as indexing and tokenization, must take place.
Data Cleansing Tools:
Elasticsearch
Apache Spark
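Tokenization and indexing are what engines like Elasticsearch perform at scale; a toy inverted index makes the preparation step concrete:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {1: "Data lakes store raw data", 2: "Warehouses store curated data"}
index = build_index(docs)
# index["data"] -> {1, 2}; index["raw"] -> {1}
```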
An organization will choose software to support its chosen data governance framework.
Data Governance Tools:
Databricks Unity Catalog
Apache Atlas
Egnyte
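At its core, a governance tool keeps metadata about each dataset: its owner, its classification, and its lineage. A minimal in-memory catalog (all dataset and field names hypothetical) sketches the idea:

```python
catalog = {}

def register(dataset, owner, classification, upstream=()):
    """Record governance metadata and upstream lineage for a dataset."""
    catalog[dataset] = {
        "owner": owner,
        "classification": classification,
        "upstream": list(upstream),
    }

def lineage(dataset):
    """Walk upstream links to trace a dataset back to its sources."""
    seen, stack = [], [dataset]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.append(name)
            stack.extend(catalog.get(name, {}).get("upstream", []))
    return seen

register("raw.ux_logs", "platform-team", "internal")
register("analytics.clicks", "bi-team", "internal", upstream=["raw.ux_logs"])
```

Real governance tools add access control, auditing, and discovery on top of exactly this kind of metadata.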
When data are finally called upon to generate revenue, additional processing and quality-control measures are needed for descriptive, predictive, and prescriptive analytics.
Table Creation:
Delta Lake
Machine Learning Tools:
Amazon SageMaker
Azure Databricks
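The quality-control step before analytics can be as simple as validating rows against a rule set before they are written to a curated table; the validation below is plain Python, while the Delta Lake write (assuming the pyspark and delta-spark packages; the table path is hypothetical) is guarded:

```python
def validate(rows, required):
    """Split rows into those with all required fields present and rejects."""
    good, bad = [], []
    for row in rows:
        ok = all(row.get(f) is not None for f in required)
        (good if ok else bad).append(row)
    return good, bad

rows = [{"user": "ada", "total": 9.5}, {"user": None, "total": 3.0}]
good, bad = validate(rows, ["user", "total"])

if __name__ == "__main__":
    # Requires pyspark and delta-spark; only validated rows reach the table.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    spark.createDataFrame(good).write.format("delta").save("/lake/curated/orders")
```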
Putting the concept of an SSoT into practice through data lakes can generate intelligent experiences from data collection to infusion, and prepares an organization to embrace a future of change.
Data Lake Advantages:
Utilizing edge computing, i.e., conducting analysis close to the data source
Serving as an antechamber in which to determine the relevant data to ship to an enterprise data warehouse (EDW)
Taking advantage of Industrial Internet of Things (IIoT)