What is Lakeflow Connect?

Lakeflow Connect offers simple and efficient connectors to ingest data from popular enterprise applications, databases, cloud storage, local files, message buses, and more. This page outlines some of the ways that Lakeflow Connect can improve ETL performance. It also covers common use cases and the range of supported ingestion tools, from fully managed connectors to fully customizable frameworks.

Flexible service models

Lakeflow Connect gives you the flexibility to choose between a fully managed service and a custom pipeline. The managed service features out-of-the-box connectors that democratize data access with simple UIs and powerful APIs, so you can quickly create robust ingestion pipelines while minimizing long-term maintenance costs. If you need more customization, you can use DLT or Structured Streaming. This versatility enables Lakeflow Connect to meet your organization's specific needs.

Unification with core Databricks tools

Lakeflow Connect uses core Databricks features to provide comprehensive data management. For example, it offers governance using Unity Catalog, orchestration using Lakeflow Jobs, and holistic monitoring across your pipelines. This helps your organization manage data security, quality, and cost while unifying your ingestion processes with your other data engineering tools. Lakeflow Connect is built on an open Data Intelligence Platform, with full flexibility to incorporate your preferred third-party tools. This ensures a tailored solution that aligns with your existing infrastructure and future data strategies.

Fast, scalable ingestion

Lakeflow Connect uses incremental reads and writes to enable efficient ingestion. When combined with incremental transformations downstream, this can significantly improve ETL performance.
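As a rough illustration of what incremental processing looks like in practice, the following sketch reads only rows appended to a source Delta table since the last run and appends them downstream. It assumes the Databricks-provided `spark` session; the table names and checkpoint path are hypothetical placeholders.

```python
# Each run processes only data added since the previous checkpoint,
# instead of rescanning the whole source table.
(
    spark.readStream
    .table("main.bronze.events")                              # hypothetical source table
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")  # tracks progress between runs
    .trigger(availableNow=True)                               # process the backlog, then stop
    .toTable("main.silver.events")                            # hypothetical target table
)
```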

Common use cases

Customers ingest data to solve their organizations' most challenging problems. Sample use cases include the following:

| Use case | Description |
| --- | --- |
| Customer 360 | Measuring campaign performance and customer lead scoring |
| Portfolio management | Maximizing ROI with historical and forecasting models |
| Consumer analytics | Personalizing your customers' purchasing experiences |
| Centralized human resources | Supporting your organization's workforce |
| Digital twins | Increasing manufacturing efficiency |
| RAG chatbots | Building chatbots to help users understand policies, products, and more |

Layers of the ETL stack

The following table describes the three layers of ingestion products, ordered from most customizable to most managed:

| Layer | Description |
| --- | --- |
| Structured Streaming | Structured Streaming is an API for incremental stream processing in near real time. It provides strong performance, scalability, and fault tolerance. |
| DLT | DLT builds on Structured Streaming, offering a more declarative framework for creating data pipelines. You define the transformations to perform on your data, and DLT manages orchestration, monitoring, data quality, errors, and more. As a result, it offers more automation and less overhead than Structured Streaming. |
| Fully managed connectors | Fully managed connectors build on DLT, adding even more automation for the most popular data sources. They extend DLT's functionality to include source-specific authentication, CDC, edge case handling, long-term API maintenance, automated retries, automated schema evolution, and so on. |
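To make the trade-off between customization and automation concrete, here is a minimal Structured Streaming sketch that ingests from Kafka into a Delta table. It assumes the Databricks-provided `spark` session; the broker address, topic, checkpoint path, and table name are hypothetical placeholders.

```python
from pyspark.sql.functions import col

# Incrementally read events from a Kafka topic (placeholder broker and topic).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "orders")                      # hypothetical topic
    .load()
)

# At this layer, you manage parsing, checkpointing, and the target table yourself.
(
    events.select(col("key").cast("string"), col("value").cast("string"))
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")  # hypothetical path
    .toTable("main.bronze.orders")                            # hypothetical table
)
```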
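Here is the same ingestion expressed one layer up, as a DLT pipeline sketch: you declare the table and a data quality expectation, and DLT handles orchestration, monitoring, and error handling. The broker, topic, and expectation rule are again illustrative assumptions.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Orders ingested from Kafka (hypothetical topic).")
@dlt.expect_or_drop("valid_key", "key IS NOT NULL")  # declarative data quality rule
def orders():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
        .option("subscribe", "orders")                      # hypothetical topic
        .load()
        .select(col("key").cast("string"), col("value").cast("string"))
    )
```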

Some connectors operate at one layer of this ETL stack. For example, Databricks offers fully managed connectors for enterprise (SaaS) applications (for example, Salesforce) and databases (for example, SQL Server). Other connectors operate at multiple layers. For example, you can use Auto Loader with Structured Streaming for full customization or with DLT for a more managed experience, as shown in the sketch below. The same is true for streaming data from Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Apache Pulsar.
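As an illustration of one connector spanning two layers, here is a minimal Auto Loader sketch, first with plain Structured Streaming and then wrapped in a DLT table definition. The cloud storage path, schema location, checkpoint path, and table names are hypothetical placeholders.

```python
# Layer 1: Auto Loader with Structured Streaming (full customization).
# Assumes the Databricks-provided `spark` session.
raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/raw_events")  # you manage schema tracking
    .load("s3://my-bucket/raw-events/")                              # hypothetical path
)
(
    raw.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/raw_events")
    .toTable("main.bronze.raw_events")
)

# Layer 2: the same source declared as a DLT table (more managed);
# DLT manages schema tracking and checkpoints for you.
import dlt

@dlt.table(comment="Raw events ingested with Auto Loader (hypothetical path).")
def raw_events():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/raw-events/")
    )
```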

Databricks recommends starting with the most managed layer. If it doesn't satisfy your requirements (for example, if it doesn't support your data source), drop down to the next layer. Databricks plans to expand support for more connectors across all three layers.

[Diagram: the three layers of the ETL stack, from most customizable (Structured Streaming) through DLT to most managed (fully managed connectors)]

File upload and download

You can ingest files from your local network, files that have been uploaded to a volume, or files downloaded from an internet location. See Files.
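Once a file lands in a Unity Catalog volume, it is readable by path. A minimal sketch, assuming the Databricks-provided `spark` session and a hypothetical catalog, schema, volume, and file name:

```python
# Files in Unity Catalog volumes are addressable under /Volumes/<catalog>/<schema>/<volume>/.
# All names below are hypothetical placeholders.
df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .load("/Volumes/main/default/landing/customers.csv")
)
df.show(5)
```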

Fully managed connectors

You can use fully managed connectors to ingest from SaaS applications (for example, Salesforce) and databases (for example, SQL Server).

Customizable connectors

In addition to the fully managed connectors, Databricks offers many other ways to ingest data, including customizable connectors for cloud object storage and streaming sources like Kafka. See Standard connectors in Lakeflow Connect.