Store, process, and analyze structured and unstructured data at petabyte scale on private cloud infrastructure. Tanzu Data Lake provides a curated HDFS-based data lakehouse that integrates with Tanzu Greenplum for unified analytics and AI workloads.
Best for
Enterprises generate massive volumes of structured and unstructured data across their operations. Managing this data across fragmented storage systems increases cost, slows analytics, and creates blind spots for AI initiatives. Tanzu Data Lake consolidates data into a unified lakehouse on private cloud infrastructure.
Images, documents, videos, and sensor data can be stored alongside structured datasets in a single platform. No more managing separate storage silos for different data types.
Query both structured and unstructured data through a single SQL interface via integration with Tanzu Greenplum.
Optimize storage costs with tiered data management. Keep hot data on Tanzu Greenplum nodes for fast analytics while moving cold data to HDFS for cost-effective long-term storage.
Repurpose older hardware for HDFS storage, extending the useful life of existing infrastructure investments.
Keep sensitive data on infrastructure you control. Tanzu Data Lake runs on private cloud, giving organizations full control over data residency, access, and governance.
Ideal for regulated industries that need to maintain data sovereignty while building modern analytics capabilities.
Optimize performance by keeping recent data on Tanzu Greenplum nodes and archiving older partitions to HDFS. Both tiers are queryable through a single unified interface.
Move data seamlessly between storage tiers, including storing data on HDFS in formats like Parquet for efficient long-term retention.
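As a sketch of what this tiering looks like in practice, the SQL below archives a cold partition to HDFS as Parquet through PXF and then exposes both tiers behind one view. Table names, paths, and the cutoff date are illustrative, and it assumes PXF is configured against the Tanzu Data Lake HDFS cluster.

```sql
-- Illustrative sketch; names and paths are hypothetical.
-- Writable external table targeting HDFS in Parquet format via PXF
CREATE WRITABLE EXTERNAL TABLE sales_archive_w (LIKE sales)
  LOCATION ('pxf://data/sales/archive?PROFILE=hdfs:parquet')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

-- Move a cold partition out of Greenplum onto HDFS
INSERT INTO sales_archive_w SELECT * FROM sales WHERE sale_date < DATE '2023-01-01';
DELETE FROM sales WHERE sale_date < DATE '2023-01-01';

-- Readable external table over the archived Parquet files
CREATE EXTERNAL TABLE sales_archive (LIKE sales)
  LOCATION ('pxf://data/sales/archive?PROFILE=hdfs:parquet')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- One unified interface spanning the hot and cold tiers
CREATE VIEW sales_all AS
  SELECT * FROM sales
  UNION ALL
  SELECT * FROM sales_archive;
```

Queries against `sales_all` then span both tiers transparently, with Greenplum serving recent rows locally and PXF streaming archived rows from HDFS.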
Handle Parquet, Avro, JSON, ORC, CSV, and more. Store and query images, videos, documents, and sensor data alongside structured datasets.
Preprocess unstructured data before moving it to Tanzu Greenplum for faster analytical queries.
Use distcp to efficiently transition from legacy Hadoop clusters to Tanzu Data Lake. Preserve existing data pipelines and HDFS-based workflows during migration.
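A migration of this kind typically comes down to a pair of distcp invocations, one bulk copy followed by an incremental re-sync before cutover. The hostnames and paths below are hypothetical placeholders.

```
# Illustrative sketch: copy HDFS data from a legacy Hadoop cluster
# to Tanzu Data Lake. Hostnames, ports, and paths are hypothetical.

# Initial bulk copy of a directory tree between namenodes
hadoop distcp \
  hdfs://legacy-nn:8020/data/warehouse \
  hdfs://tdl-nn:8020/data/warehouse

# Incremental re-sync before cutover: -update copies only files that
# changed at the source; -delete removes targets no longer at the source
hadoop distcp -update -delete \
  hdfs://legacy-nn:8020/data/warehouse \
  hdfs://tdl-nn:8020/data/warehouse
```

Because distcp runs as a distributed MapReduce job, large estates copy in parallel, and existing HDFS paths carry over unchanged, which is what lets downstream pipelines keep running.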
Flexible deployment options support both co-located and separated compute and storage topologies.
Organizations often store structured data in relational databases and unstructured data in separate file systems. This fragmentation makes it difficult to run cross-dataset analytics without complex ETL pipelines.
Tanzu Data Lake brings both data types into a single queryable platform. Combined with Tanzu Greenplum, teams can run SQL queries and Apache Spark workloads across the entire data estate.
AI and ML projects require access to large volumes of diverse data types. Many organizations need to keep this data on private infrastructure due to regulatory requirements or data sensitivity concerns.
Tanzu Data Lake provides petabyte-scale storage for training data, embeddings, and model artifacts on private cloud. Native vector querying supports agentic AI applications and RAG workflows.
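As a sketch of what vector querying can look like on the Greenplum side of this architecture, the example below assumes a pgvector-style extension is available; the table, column names, and embedding dimension are illustrative.

```sql
-- Illustrative RAG-style retrieval sketch; assumes a pgvector-style
-- extension. All names and the vector literal are placeholders.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
  id        bigint,
  body      text,
  embedding vector(768)   -- dimension depends on the embedding model
);

-- Nearest-neighbor search: the 5 chunks closest to a query embedding
SELECT id, body
FROM doc_chunks
ORDER BY embedding <-> '[0.12, 0.07, ...]'::vector  -- placeholder literal
LIMIT 5;
```

In a RAG workflow, the retrieved chunks are passed to the model as context, while the underlying documents and embeddings remain on private infrastructure.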
Many organizations invested heavily in Hadoop clusters that are now difficult to maintain and upgrade. The original Hadoop ecosystem has fragmented, and finding skilled administrators is increasingly challenging.
Tanzu Data Lake provides a VMware-managed HDFS environment that preserves existing data formats and workflows. Teams can migrate from legacy Hadoop clusters using distcp without rebuilding their data pipelines.
Tanzu Data Lake is a curated Hadoop (HDFS) deployment designed for enterprises to manage hybrid storage models. It integrates with Tanzu Greenplum to provide a scalable data lakehouse that supports structured and unstructured data at petabyte scale on private cloud infrastructure.
It is part of the VMware Tanzu Data Intelligence family of data management products.
Tanzu Data Lake supports a wide range of formats including Parquet, Avro, JSON, ORC, and CSV for structured data. It also handles unstructured data such as images, videos, documents, and sensor data.
This multi-format support allows organizations to consolidate diverse data types into a single queryable platform.
Tanzu Data Lake uses a tiered storage approach with Greenplum. Recent hot data lives on Greenplum nodes for fast analytics, while older cold data resides on HDFS for cost-effective storage. Both tiers can be queried through a single unified interface.
The Platform Extension Framework (PXF) enables high-speed querying across structured and unstructured datasets stored in HDFS.
Yes. Tanzu Data Lake supports migration from legacy Hadoop clusters using distcp for efficient data transfer. Organizations can transition their existing HDFS-based ecosystems without rebuilding data pipelines.
Flexible deployment options support both co-located and separated compute and storage topologies to match your existing architecture.
VirtualizationWorks helps organizations evaluate Tanzu Data Lake for their analytics and AI data requirements, plan deployment architecture, and understand licensing options.
Have questions about this product, VMware licensing, or deployment options? Fill out the form below and a VirtualizationWorks specialist will follow up.