What is a Data Lake?

Turkish: Data Lake

A data lake stores raw and processed data in scalable storage for analytics, machine learning, exploration, and archiving.

What is a Data Lake?

A data lake is a data architecture that stores structured tables, log files, event streams, images, and other raw files in low-cost, scalable storage. Unlike a data warehouse, it does not require every dataset to fit a predefined schema before it is stored; modeling can happen when the data is read, an approach often called schema-on-read.

A retailer might keep point-of-sale transactions, website clickstream data, campaign logs, and customer support records in the same data lake. Analytics teams can then shape the same data into different forms for reporting, forecasting, or machine learning models.
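The idea of deferring the schema until the data is used can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (source, store, amount, page, session) and the reader functions are hypothetical, and a list of JSON strings stands in for files in the lake.

```python
import json

# Raw events kept exactly as received; records need not share the same fields.
# All field names and values here are made up for illustration.
RAW_LINES = [
    '{"source": "pos", "store": "S1", "amount": "19.90", "ts": "2024-03-01T10:15:00"}',
    '{"source": "web", "page": "/cart", "session": "abc", "ts": "2024-03-01T10:16:30"}',
    '{"source": "pos", "store": "S2", "amount": "5.00", "ts": "2024-03-01T11:00:00"}',
]


def read_sales(lines):
    """Apply a sales schema at read time: keep POS records, parse the amount."""
    for line in lines:
        record = json.loads(line)
        if record.get("source") == "pos":
            yield {"store": record["store"], "amount": float(record["amount"])}


def read_page_views(lines):
    """Apply a different schema to the same raw data: keep web records only."""
    for line in lines:
        record = json.loads(line)
        if record.get("source") == "web":
            yield {"session": record["session"], "page": record["page"]}


sales = list(read_sales(RAW_LINES))
page_views = list(read_page_views(RAW_LINES))
total_sales = sum(row["amount"] for row in sales)
```

The same stored lines serve two consumers, and neither schema had to exist when the data was written.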

How Does It Work?

Data lakes are commonly built on object storage such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. Data arrives from source systems through batch or streaming pipelines, lands in a raw layer, and is then cleaned and transformed into more curated layers.
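As a rough sketch of the landing step, the following Python writes one batch unchanged into a raw area. A temporary local directory stands in for object storage, and the path layout (raw/<source>/date=<YYYY-MM-DD>/), the file name, and the function land_raw_batch are assumptions chosen for illustration, not a required convention.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw_batch(lake_root, source, batch_date, records):
    """Write one batch, unmodified, under raw/<source>/date=<YYYY-MM-DD>/."""
    target_dir = Path(lake_root) / "raw" / source / f"date={batch_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "batch-0001.jsonl"  # hypothetical file name
    with target_file.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return target_file


# A temporary directory plays the role of the bucket in this sketch.
with tempfile.TemporaryDirectory() as lake_root:
    batch = [{"store": "S1", "amount": "19.90"}]
    path = land_raw_batch(lake_root, "pos", date(2024, 3, 1), batch)
    relative_path = path.relative_to(lake_root).as_posix()
    line_count = len(path.read_text(encoding="utf-8").splitlines())
```

Partitioning the path by source and date is one common way to keep later cleaning jobs from scanning the whole raw area; other layouts work as well.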

Typical layers include:

  • Raw/Bronze: Data as received from source systems
  • Clean/Silver: Cleaned and standardized data
  • Curated/Gold: Data prepared for reporting or modeling
  • Catalog: Metadata explaining origin, meaning, ownership, and use
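The layers above can be sketched as a small in-memory pipeline. This is a minimal illustration, assuming hypothetical sales rows and made-up cleaning rules (trim and uppercase the store code, skip rows whose amount cannot be parsed); real pipelines would run on a processing engine and would probably quarantine rejected rows rather than drop them.

```python
from decimal import Decimal, InvalidOperation

# Raw/Bronze: rows as received, including inconsistent and invalid values.
raw = [
    {"store": " s1 ", "amount": "19.90", "day": "2024-03-01"},
    {"store": "S1", "amount": "5.10", "day": "2024-03-01"},
    {"store": "S2", "amount": "not-a-number", "day": "2024-03-01"},
    {"store": "s2", "amount": "7.00", "day": "2024-03-02"},
]


def to_clean(rows):
    """Clean/Silver: standardize store codes and parse amounts."""
    clean_rows = []
    for row in rows:
        try:
            amount = Decimal(row["amount"])
        except InvalidOperation:
            continue  # skipped here; a real pipeline might quarantine the row
        clean_rows.append(
            {"store": row["store"].strip().upper(), "amount": amount, "day": row["day"]}
        )
    return clean_rows


def to_curated(rows):
    """Curated/Gold: daily sales total per store, ready for reporting."""
    totals = {}
    for row in rows:
        key = (row["store"], row["day"])
        totals[key] = totals.get(key, Decimal("0")) + row["amount"]
    return totals


clean = to_clean(raw)
curated = to_curated(clean)

# Catalog: metadata describing the curated dataset (all values hypothetical).
catalog_entry = {
    "dataset": "curated/daily_sales",
    "origin": "raw/pos",
    "meaning": "Total sales amount per store and day",
    "owner": "sales-analytics",
    "use": "reporting",
}
```

Keeping the raw rows untouched means the cleaning rules can be changed and replayed later without going back to the source systems.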

Business Use

Data lakes are useful for big data, IoT, customer behavior analysis, log analytics, and AI projects. Without governance, however, a data lake can become a data swamp: a pile of files with unclear source, quality, ownership, and retention.
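One way to spot early signs of a data swamp is to check catalog entries for missing governance metadata. The sketch below is an assumption-laden illustration: the required fields, the dataset names, and the catalog structure are all hypothetical, and a plain dictionary stands in for a real catalog service.

```python
# Fields every dataset is expected to document (a hypothetical policy).
REQUIRED_FIELDS = ("source", "owner", "quality_status", "retention_days")

# A toy catalog keyed by dataset path; all entries are made up.
catalog = {
    "raw/pos": {
        "source": "point-of-sale system",
        "owner": "sales-data",
        "quality_status": "validated",
        "retention_days": 365,
    },
    "raw/clickstream": {
        "source": "website",
        "owner": None,
        "quality_status": "unchecked",
    },
    "tmp/export_final_v2": {},
}


def missing_metadata(entry):
    """Return the required fields that are absent or set to None."""
    return [field for field in REQUIRED_FIELDS if entry.get(field) is None]


# Map each under-documented dataset to the fields it lacks.
swamp_report = {}
for name, entry in catalog.items():
    missing = missing_metadata(entry)
    if missing:
        swamp_report[name] = missing
```

Here swamp_report lists raw/clickstream and tmp/export_final_v2 along with their missing fields, while raw/pos is left out because it is fully documented.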

ETL and ELT processes make lake data usable for analytics. Security, cataloging, lifecycle rules, and cost monitoring should be designed as part of the architecture rather than added after storage grows.
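Lifecycle rules can be thought of as a policy that maps a dataset's layer and age to an action. The following sketch assumes hypothetical thresholds and a hypothetical lifecycle_action function; cloud object stores offer their own lifecycle configuration, which this code does not attempt to reproduce.

```python
from datetime import date

# Hypothetical retention policy per layer; None means "never delete".
LIFECYCLE_RULES = {
    "raw": {"archive_after_days": 90, "delete_after_days": 730},
    "curated": {"archive_after_days": 365, "delete_after_days": None},
}


def lifecycle_action(layer, last_modified, today):
    """Decide whether an object should be kept, archived, or deleted."""
    rule = LIFECYCLE_RULES[layer]
    age_days = (today - last_modified).days
    delete_after = rule["delete_after_days"]
    if delete_after is not None and age_days >= delete_after:
        return "delete"
    if age_days >= rule["archive_after_days"]:
        return "archive"
    return "keep"


today = date(2024, 6, 1)
actions = [
    lifecycle_action("raw", date(2024, 5, 1), today),      # 31 days old
    lifecycle_action("raw", date(2024, 1, 1), today),      # 152 days old
    lifecycle_action("raw", date(2021, 1, 1), today),      # older than 730 days
    lifecycle_action("curated", date(2021, 1, 1), today),  # old, but never deleted
]
```

Writing the policy down as data, even in a simple form like this, makes it possible to review retention and storage cost before the lake grows.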