VMware Tanzu Data Lake

Store, process, and analyze structured and unstructured data at petabyte scale on private cloud infrastructure. Tanzu Data Lake provides a curated HDFS-based data lakehouse that integrates with Tanzu Greenplum for unified analytics and AI workloads.

Best for

  • Data engineering teams building scalable data lakehouses
  • Organizations needing HDFS-compatible storage on private cloud
  • Enterprises consolidating structured and unstructured data for analytics
  • AI/ML data pipelines requiring petabyte-scale storage

Why Organizations Choose Tanzu Data Lake

Enterprises generate massive volumes of structured and unstructured data across their operations. Managing this data across fragmented storage systems increases cost, slows analytics, and creates blind spots for AI initiatives. Tanzu Data Lake consolidates data into a unified lakehouse on private cloud infrastructure.

Unified Data Platform

Images, documents, videos, and sensor data can be stored alongside structured datasets in a single platform. No more managing separate storage silos for different data types.

Query structured and unstructured data together through a single interface via integration with Tanzu Greenplum.

Lower Total Cost of Ownership

Optimize storage costs with tiered data management. Keep hot data on Tanzu Greenplum nodes for fast analytics while moving cold data to HDFS for cost-effective long-term storage.

Repurpose older hardware for HDFS storage, extending the useful life of existing infrastructure investments.

Private Cloud Data Control

Keep sensitive data on infrastructure you control. Tanzu Data Lake runs on private cloud, giving organizations full control over data residency, access, and governance.

Ideal for regulated industries that need to maintain data sovereignty while building modern analytics capabilities.

Tanzu Data Lake Features

Tiered Hot/Cold Storage

Optimize performance by keeping recent data on Tanzu Greenplum nodes and archiving older partitions to HDFS. Both tiers are queryable through a single unified interface.

Move data seamlessly between tiers, including storing data on HDFS in columnar formats such as Parquet for efficient long-term retention.
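
As an illustration, archived partitions on HDFS can be exposed to Greenplum as an external table through PXF and queried alongside hot data. This is a sketch only; the table names, columns, and HDFS path are hypothetical, and exact PXF options depend on your deployment.

```sql
-- Hypothetical cold tier: sales partitions archived to HDFS as Parquet.
-- The hdfs:parquet PXF profile reads Parquet files directly from HDFS.
CREATE EXTERNAL TABLE sales_2023_cold (
    sale_id    bigint,
    sale_date  date,
    amount     numeric(12,2)
)
LOCATION ('pxf://warehouse/sales/2023?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Query the hot tier (local Greenplum table) and cold tier (HDFS) together.
SELECT sale_date, sum(amount) AS daily_total
FROM (
    SELECT sale_date, amount FROM sales_hot
    UNION ALL
    SELECT sale_date, amount FROM sales_2023_cold
) AS all_tiers
GROUP BY sale_date;
```

The external table is just metadata; no data moves until query time, which is what keeps the cold tier cheap to retain.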

Multi-Format Data Support

Handle Parquet, AVRO, JSON, ORC, CSV, and more. Store and query images, videos, documents, and sensor data alongside structured datasets.

Preprocess unstructured data before moving it to Tanzu Greenplum for faster analytical queries.
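
For instance, raw newline-delimited JSON sensor readings might be flattened to CSV before bulk-loading into Greenplum. A minimal sketch, assuming illustrative field names (`device_id`, `ts`, `temperature`); adapt the schema to your actual payloads.

```python
import csv
import io
import json

def sensor_json_to_csv(json_lines):
    """Flatten newline-delimited JSON sensor readings into CSV rows
    suitable for staging and bulk-loading into an analytics database."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["device_id", "ts", "temperature"])  # header row
    for line in json_lines:
        rec = json.loads(line)
        writer.writerow([rec["device_id"], rec["ts"], rec["temperature"]])
    return out.getvalue()

# Illustrative input records.
raw = [
    '{"device_id": "d-001", "ts": "2024-01-01T00:00:00Z", "temperature": 21.5}',
    '{"device_id": "d-002", "ts": "2024-01-01T00:00:00Z", "temperature": 19.8}',
]
print(sensor_json_to_csv(raw))
```

In practice this step would run where the raw files land (e.g., on HDFS), so only the cleaned, columnar output reaches the analytics tier.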

Legacy Hadoop Migration

Use distcp to efficiently transition from legacy Hadoop clusters to Tanzu Data Lake. Preserve existing data pipelines and HDFS-based workflows during migration.

Flexible deployment options support both co-located and separated compute and storage topologies.
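
A typical distcp invocation might look like the following sketch; the NameNode hostnames and paths are placeholders for your environment, and flags should be tuned to your migration window.

```shell
# Copy a dataset from the legacy cluster's NameNode to the new cluster.
# -update copies only files that differ; -p preserves file attributes.
hadoop distcp \
  -update -p \
  hdfs://legacy-nn.example.com:8020/warehouse/events \
  hdfs://tanzu-nn.example.com:8020/warehouse/events
```

Because distcp runs as a distributed MapReduce job, large datasets can be copied in parallel and the command can be re-run incrementally as cutover approaches.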

When Organizations Choose Tanzu Data Lake

Unified Analytics Across Structured and Unstructured Data

Organizations often store structured data in relational databases and unstructured data in separate file systems. This fragmentation makes it difficult to run cross-dataset analytics without complex ETL pipelines.

Tanzu Data Lake brings both data types into a single queryable platform. Combined with Tanzu Greenplum, teams can run SQL queries and Apache Spark workloads across the entire data estate.

  • Unified querying across structured and unstructured data
  • SQL and Apache Spark access to the data lakehouse
  • Tiered storage for cost-effective data retention
  • Integrated with Tanzu Greenplum for high-performance analytics

Building AI and ML Data Pipelines on Private Cloud

AI and ML projects require access to large volumes of diverse data types. Many organizations need to keep this data on private infrastructure due to regulatory requirements or data sensitivity concerns.

Tanzu Data Lake provides petabyte-scale storage for training data, embeddings, and model artifacts on private cloud. Native vector querying supports agentic AI applications and RAG workflows.

  • Petabyte-scale storage for AI/ML training datasets
  • Native vector querying for RAG and agentic AI applications
  • Private cloud deployment for data sovereignty
  • Ingest unstructured data and generate embedding vectors

Modernizing Legacy Hadoop Environments

Many organizations invested heavily in Hadoop clusters that are now difficult to maintain and upgrade. The original Hadoop ecosystem has fragmented, and finding skilled administrators is increasingly challenging.

Tanzu Data Lake provides a VMware-managed HDFS environment that preserves existing data formats and workflows. Teams can migrate from legacy Hadoop clusters using distcp without rebuilding their data pipelines.

  • Efficient migration from legacy Hadoop with distcp
  • Preserve existing HDFS-based workflows and pipelines
  • VMware-managed infrastructure reduces operational burden
  • Repurpose older hardware for cost-effective HDFS storage

Tanzu Data Lake — Buyer FAQ

What is VMware Tanzu Data Lake?

Tanzu Data Lake is a curated Hadoop (HDFS) deployment that helps enterprises manage hybrid storage models. It integrates with Tanzu Greenplum to provide a scalable data lakehouse that supports structured and unstructured data at petabyte scale on private cloud infrastructure.

It is part of the VMware Tanzu Data Intelligence family of data management products.

Which data formats does Tanzu Data Lake support?

Tanzu Data Lake supports a wide range of formats, including Parquet, AVRO, JSON, ORC, and CSV for structured data. It also handles unstructured data such as images, videos, documents, and sensor data.

This multi-format support allows organizations to consolidate diverse data types into a single queryable platform.

How does Tanzu Data Lake work with Tanzu Greenplum?

Tanzu Data Lake uses a tiered storage approach with Greenplum. Recent hot data lives on Greenplum nodes for fast analytics, while older cold data resides on HDFS for cost-effective storage. Both tiers can be queried together through a single interface.

The Platform Extension Framework (PXF) enables high-speed querying across structured and unstructured datasets stored in HDFS.

Can we migrate from a legacy Hadoop cluster?

Yes. Tanzu Data Lake supports migration from legacy Hadoop clusters using distcp for efficient data transfer. Organizations can transition their existing HDFS-based ecosystems without rebuilding data pipelines.

Flexible deployment options support both co-located and separated compute and storage topologies to match your existing architecture.

Talk to a Data Platform Specialist

VirtualizationWorks helps organizations evaluate Tanzu Data Lake for their analytics and AI data requirements, plan deployment architecture, and understand licensing options.

Contact Us

Have questions about this product, VMware licensing, or deployment options? Fill out the form below and a VirtualizationWorks specialist will follow up.