What is Lakeflow Connect?
Lakeflow Connect offers simple and efficient connectors to ingest data from popular enterprise applications, databases, cloud storage, local files, message buses, and more. This page outlines some of the ways that Lakeflow Connect can improve ETL performance. It also covers common use cases and the range of supported ingestion tools, from fully-managed connectors to fully-customizable frameworks.
Flexible service models
Lakeflow Connect offers a broad range of connectors for enterprise applications, cloud storage, databases, message buses, and more. It also gives you the flexibility to choose between a fully-managed service and a custom pipeline. The managed service features out-of-the-box connectors that democratize data access with simple UIs and powerful APIs. This allows you to quickly create robust ingestion pipelines while minimizing long-term maintenance costs. If you need more customization, you can use DLT or Structured Streaming. Ultimately, this versatility enables Lakeflow Connect to meet your organization's specific needs.
Unification with core Databricks tools
Lakeflow Connect uses core Databricks features to provide comprehensive data management. For example, it offers governance using Unity Catalog, orchestration using Lakeflow Jobs, and holistic monitoring across your pipelines. This helps your organization manage data security, quality, and cost while unifying your ingestion processes with your other data engineering tools. Lakeflow Connect is built on an open Data Intelligence Platform, with full flexibility to incorporate your preferred third-party tools. This ensures a tailored solution that aligns with your existing infrastructure and future data strategies.
Fast, scalable ingestion
Lakeflow Connect uses incremental reads and writes to enable efficient ingestion. When combined with incremental transformations downstream, this can significantly improve ETL performance.
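As a minimal sketch of this pattern, the following DLT pipeline chains an incremental read from cloud storage into an incremental downstream transformation. The volume path and table names are hypothetical placeholders.

```python
import dlt
from pyspark.sql import functions as F

# Incremental ingestion: Auto Loader only reads files it has not processed yet.
@dlt.table(name="orders_bronze")
def orders_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/landing/orders")  # hypothetical landing path
    )

# Incremental transformation: only new rows from the bronze table are processed.
@dlt.table(name="orders_silver")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .withColumn("ingested_at", F.current_timestamp())
    )
```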
Common use cases
Customers ingest data to solve their organizations' most challenging problems. Sample use cases include the following:
- Measuring campaign performance and customer lead scoring
- Maximizing ROI with historical and forecasting models
- Personalizing your customers' purchasing experiences
- Supporting your organization's workforce with centralized human resources data
- Increasing manufacturing efficiency
- Building chatbots to help users understand policies, products, and more
Layers of the ETL stack
The following table describes the three layers of ingestion products, ordered from most customizable to most managed:
Layer | Description |
---|---|
Structured Streaming | Structured Streaming is an API for incremental stream processing in near real time. It provides strong performance, scalability, and fault tolerance. |
DLT | DLT builds on Structured Streaming, offering a more declarative framework for creating data pipelines. You define the transformations to perform on your data, and DLT manages orchestration, monitoring, data quality, errors, and more. Therefore, it offers more automation and less overhead than Structured Streaming. |
Fully-managed connectors | Fully-managed connectors build on DLT, offering even more automation for the most popular data sources. They extend DLT's functionality to include source-specific authentication, CDC, edge case handling, long-term API maintenance, automated retries, automated schema evolution, and so on. |
Some connectors operate at only one layer of this ETL stack. For example, Databricks offers fully-managed connectors for enterprise (SaaS) applications (for example, Salesforce) and databases (for example, SQL Server). Other connectors operate at multiple layers of the stack. For example, you can use Auto Loader with Structured Streaming for full customization or with DLT for a more managed experience. The same is true for streaming data from Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Apache Pulsar.
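For contrast, here is roughly what an Auto Loader ingestion looks like at the Structured Streaming layer, where you manage the schema location, checkpoint, trigger, and target table yourself. The paths and table name are hypothetical; in a DLT pipeline or a fully-managed connector, these details are handled for you.

```python
# Structured Streaming with Auto Loader: full control, but you own the schema
# location, checkpoint, trigger cadence, and target table (hypothetical paths).
(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/default/schemas/orders")
    .load("/Volumes/main/default/landing/orders")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("main.default.orders_raw")
)
```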
Databricks recommends starting with the most managed layer. If it doesn't satisfy your requirements (for example, if it doesn't support your data source), drop down to the next layer. Databricks plans to expand support for more connectors across all three layers.
File upload and download
You can ingest files from your local network, files uploaded to a volume, and files downloaded from an internet location. See Files.
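As a minimal sketch, assuming a CSV file has already been uploaded to a Unity Catalog volume, you can read it and save it as a table. The volume path and table name below are hypothetical.

```python
# Read a CSV file uploaded to a Unity Catalog volume (hypothetical path),
# then save it as a managed table.
df = spark.read.csv(
    "/Volumes/main/default/uploads/customers.csv",
    header=True,
    inferSchema=True,
)
df.write.saveAsTable("main.default.customers")
```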
Fully-managed connectors
You can use fully-managed connectors to ingest from SaaS applications and databases. Available connectors include:
Customizable connectors
In addition to the fully-managed connectors, Databricks offers many ways to ingest data. These include customizable connectors for cloud object storage and streaming sources like Kafka. See Standard connectors in Lakeflow Connect.
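As one hedged example of a customizable streaming source, a Structured Streaming read from Kafka might look like the following. The broker address, topic, checkpoint path, and target table are hypothetical.

```python
# Read a Kafka topic with Structured Streaming and write it to a Delta table.
# The bootstrap server, topic, checkpoint path, and table name are hypothetical.
(
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1.example.com:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/kafka_orders")
    .toTable("main.default.kafka_orders_raw")
)
```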