VMware Tanzu Data Lake

Store, process, and analyze structured and unstructured data at petabyte scale on private cloud infrastructure. Tanzu Data Lake provides a curated HDFS-based data lakehouse that integrates with Tanzu Greenplum for unified analytics and AI workloads.

Best for

  • Data engineering teams building scalable data lakehouses
  • Organizations needing HDFS-compatible storage on private cloud
  • Enterprises consolidating structured and unstructured data for analytics
  • AI/ML data pipelines requiring petabyte-scale storage

Why Organizations Choose Tanzu Data Lake

Enterprises generate massive volumes of structured and unstructured data across their operations. Managing this data across fragmented storage systems increases cost, slows analytics, and creates blind spots for AI initiatives. Tanzu Data Lake consolidates data into a unified lakehouse on private cloud infrastructure.

Unified Data Platform

Images, documents, videos, and sensor data can be stored alongside structured datasets in a single platform. No more managing separate storage silos for different data types.

Query structured and unstructured data together through a single interface via integration with Tanzu Greenplum.

Lower Total Cost of Ownership

Optimize storage costs with tiered data management. Keep hot data on Tanzu Greenplum nodes for fast analytics while moving cold data to HDFS for cost-effective long-term storage.

Repurpose older hardware for HDFS storage, extending the useful life of existing infrastructure investments.

Private Cloud Data Control

Keep sensitive data on infrastructure you control. Tanzu Data Lake runs on private cloud, giving organizations full control over data residency, access, and governance.

Ideal for regulated industries that need to maintain data sovereignty while building modern analytics capabilities.

Tanzu Data Lake Features

Tiered Hot/Cold Storage

Optimize performance by keeping recent data on Tanzu Greenplum nodes and archiving older partitions to HDFS. Both tiers are queryable through a single unified interface.

Move data seamlessly between tiers, including storing data on HDFS in columnar formats such as Parquet for efficient long-term retention.
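
As an illustration, archived partitions on HDFS can be exposed to Greenplum as an external table through PXF and queried alongside hot data. This is a sketch only; the table names, columns, and HDFS path are hypothetical, and exact PXF options depend on your deployment.

```sql
-- Hypothetical cold tier: sales partitions archived to HDFS as Parquet.
-- The hdfs:parquet PXF profile reads Parquet files directly from HDFS.
CREATE EXTERNAL TABLE sales_2023_cold (
    sale_id    bigint,
    sale_date  date,
    amount     numeric(12,2)
)
LOCATION ('pxf://warehouse/sales/2023?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Query the hot tier (local Greenplum table) and cold tier (HDFS) together.
SELECT sale_date, sum(amount) AS daily_total
FROM (
    SELECT sale_date, amount FROM sales_hot
    UNION ALL
    SELECT sale_date, amount FROM sales_2023_cold
) AS all_tiers
GROUP BY sale_date;
```

The external table is just metadata; no data moves until query time, which is what keeps the cold tier cheap to retain.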

Multi-Format Data Support

Handle Parquet, AVRO, JSON, ORC, CSV, and more. Store and query images, videos, documents, and sensor data alongside structured datasets.

Preprocess unstructured data before moving it to Tanzu Greenplum for faster analytical queries.
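
For instance, raw newline-delimited JSON sensor readings might be flattened to CSV before bulk-loading into Greenplum. A minimal sketch, assuming illustrative field names (`device_id`, `ts`, `temperature`); adapt the schema to your actual payloads.

```python
import csv
import io
import json

def sensor_json_to_csv(json_lines):
    """Flatten newline-delimited JSON sensor readings into CSV rows
    suitable for staging and bulk-loading into an analytics database."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["device_id", "ts", "temperature"])  # header row
    for line in json_lines:
        rec = json.loads(line)
        writer.writerow([rec["device_id"], rec["ts"], rec["temperature"]])
    return out.getvalue()

# Illustrative input records.
raw = [
    '{"device_id": "d-001", "ts": "2024-01-01T00:00:00Z", "temperature": 21.5}',
    '{"device_id": "d-002", "ts": "2024-01-01T00:00:00Z", "temperature": 19.8}',
]
print(sensor_json_to_csv(raw))
```

In practice this step would run where the raw files land (e.g., on HDFS), so only the cleaned, columnar output reaches the analytics tier.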

Legacy Hadoop Migration

Use distcp to efficiently transition from legacy Hadoop clusters to Tanzu Data Lake. Preserve existing data pipelines and HDFS-based workflows during migration.

Flexible deployment options support both co-located and separated compute and storage topologies.
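
A typical distcp invocation might look like the following sketch; the NameNode hostnames and paths are placeholders for your environment, and flags should be tuned to your migration window.

```shell
# Copy a dataset from the legacy cluster's NameNode to the new cluster.
# -update copies only files that differ; -p preserves file attributes.
hadoop distcp \
  -update -p \
  hdfs://legacy-nn.example.com:8020/warehouse/events \
  hdfs://tanzu-nn.example.com:8020/warehouse/events
```

Because distcp runs as a distributed MapReduce job, large datasets can be copied in parallel and the command can be re-run incrementally as cutover approaches.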

When Organizations Choose Tanzu Data Lake

Unified Analytics Across Structured and Unstructured Data

Organizations often store structured data in relational databases and unstructured data in separate file systems. This fragmentation makes it difficult to run cross-dataset analytics without complex ETL pipelines.

Tanzu Data Lake brings both data types into a single queryable platform. Combined with Tanzu Greenplum, teams can run SQL queries and Apache Spark workloads across the entire data estate.

  • Unified querying across structured and unstructured data
  • SQL and Apache Spark access to the data lakehouse
  • Tiered storage for cost-effective data retention
  • Integrated with Tanzu Greenplum for high-performance analytics

Building AI and ML Data Pipelines on Private Cloud

AI and ML projects require access to large volumes of diverse data types. Many organizations need to keep this data on private infrastructure due to regulatory requirements or data sensitivity concerns.

Tanzu Data Lake provides petabyte-scale storage for training data, embeddings, and model artifacts on private cloud. Native vector querying supports agentic AI applications and RAG workflows.

  • Petabyte-scale storage for AI/ML training datasets
  • Native vector querying for RAG and agentic AI applications
  • Private cloud deployment for data sovereignty
  • Ingest unstructured data and generate embedding vectors

Modernizing Legacy Hadoop Environments

Many organizations invested heavily in Hadoop clusters that are now difficult to maintain and upgrade. The original Hadoop ecosystem has fragmented, and finding skilled administrators is increasingly challenging.

Tanzu Data Lake provides a VMware-managed HDFS environment that preserves existing data formats and workflows. Teams can migrate from legacy Hadoop clusters using distcp without rebuilding their data pipelines.

  • Efficient migration from legacy Hadoop with distcp
  • Preserve existing HDFS-based workflows and pipelines
  • VMware-managed infrastructure reduces operational burden
  • Repurpose older hardware for cost-effective HDFS storage

Tanzu Data Lake — Buyer FAQ

What is VMware Tanzu Data Lake?

Tanzu Data Lake is a curated Hadoop (HDFS) deployment that helps enterprises manage hybrid storage models. It integrates with Tanzu Greenplum to provide a scalable data lakehouse that supports structured and unstructured data at petabyte scale on private cloud infrastructure.

It is part of the VMware Tanzu Data Intelligence family of data management products.

Which data formats does Tanzu Data Lake support?

Tanzu Data Lake supports a wide range of formats, including Parquet, AVRO, JSON, ORC, and CSV for structured data. It also handles unstructured data such as images, videos, documents, and sensor data.

This multi-format support allows organizations to consolidate diverse data types into a single queryable platform.

How does Tanzu Data Lake work with Tanzu Greenplum?

Tanzu Data Lake uses a tiered storage approach with Greenplum. Recent hot data lives on Greenplum nodes for fast analytics, while older cold data resides on HDFS for cost-effective storage. Both tiers can be queried together through a single interface.

The Platform Extension Framework (PXF) enables high-speed querying across structured and unstructured datasets stored in HDFS.

Can we migrate from a legacy Hadoop cluster?

Yes. Tanzu Data Lake supports migration from legacy Hadoop clusters using distcp for efficient data transfer. Organizations can transition their existing HDFS-based ecosystems without rebuilding data pipelines.

Flexible deployment options support both co-located and separated compute and storage topologies to match your existing architecture.

Talk to a Data Platform Specialist

VirtualizationWorks helps organizations evaluate Tanzu Data Lake for their analytics and AI data requirements, plan deployment architecture, and understand licensing options.

Contact Us

Have questions about this product, VMware licensing, or deployment options? Fill out the form below and a VirtualizationWorks specialist will follow up.