Sumit Mittal

Most likely, the very first thing you will do when working with Databricks is ingest your data, so that it sits in Delta Lake.

What's the benefit of data sitting in Delta Lake compared to a plain data lake?

You get:
- ACID properties
- Perform DML operations
- Time Travel
- Schema Evolution and Enforcement
and much more!
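
For example, once a table lands in Delta Lake you can do things like this from a Databricks notebook (the table name main.bronze.orders is just a placeholder):

```python
# Quick sketch of what Delta Lake allows; main.bronze.orders is a placeholder table.
spark.sql("UPDATE main.bronze.orders SET status = 'shipped' WHERE order_id = 42")  # DML
spark.sql("DELETE FROM main.bronze.orders WHERE status = 'cancelled'")             # DML
spark.sql("SELECT * FROM main.bronze.orders VERSION AS OF 3")                      # time travel
spark.sql("DESCRIBE HISTORY main.bronze.orders")                                   # change history
spark.sql("ALTER TABLE main.bronze.orders ADD COLUMNS (discount DOUBLE)")          # schema evolution
```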

So getting the data into Delta Lake is the first task.

How do you do that?

Earlier, this meant a bunch of third-party tools or in-house tools.

But now things are much simpler: you can use Lakeflow Connect for data ingestion.

Your organization's data might be sitting in cloud storage (ADLS Gen2, Amazon S3, Databricks volumes), or it might be sitting in databases, SaaS applications, etc.

With Lakeflow Connect you can build efficient ingestion pipelines, all within Databricks.

There are different types of connectors:

- File upload (upload files from local storage into a volume)
- Standard connectors (ingest from cloud storage into your Delta Lake)
- Managed connectors (ingest from SaaS applications / databases)

Ways of ingesting data from cloud storage (standard connectors) - a quick sketch follows this list:

- CTAS
- COPY INTO
- Auto Loader
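
Here is a rough sketch of the three approaches in a Databricks notebook. All table names and the abfss:// path are placeholders, so treat it as a pattern rather than copy-paste code:

```python
# 1) CTAS - create (or re-create) a Delta table straight from the source files
spark.sql("""
  CREATE OR REPLACE TABLE main.bronze.orders_ctas AS
  SELECT * FROM read_files(
    'abfss://raw@mystorage.dfs.core.windows.net/orders/',
    format => 'json')
""")

# 2) COPY INTO - idempotent incremental batch; files already loaded are skipped
spark.sql("CREATE TABLE IF NOT EXISTS main.bronze.orders")  # schema-less target is allowed for COPY INTO
spark.sql("""
  COPY INTO main.bronze.orders
  FROM 'abfss://raw@mystorage.dfs.core.windows.net/orders/'
  FILEFORMAT = JSON
  FORMAT_OPTIONS ('inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')
""")

# 3) Auto Loader - the cloudFiles source discovers new files incrementally
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/bronze/_schemas/orders")
    .load("abfss://raw@mystorage.dfs.core.windows.net/orders/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/bronze/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("main.bronze.orders_autoloader"))
```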

For ingesting from databases or SaaS applications, we use the fully managed connectors.

There are different kinds of ingestion modes (see the sketch after this list):

- batch ingestion (all data is re-ingested each time)
- incremental batch (only new data is ingested; previously loaded records are skipped automatically)
- incremental streaming (rows, or batches of rows, are loaded continuously as they are generated, so you can query the data in near real time)
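
A rough sketch of how the three modes differ in practice (placeholder paths and tables; the helper function is just for illustration):

```python
SRC = "abfss://raw@mystorage.dfs.core.windows.net/events/"

# Batch: re-read everything and overwrite the target on every run
(spark.read.json(SRC)
    .write.mode("overwrite")
    .saveAsTable("main.bronze.events_batch"))

# Incremental (batch or streaming): Auto Loader remembers which files it has
# already loaded; the trigger decides whether the stream stops or keeps running.
def ingest(trigger_kwargs, target):
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"/Volumes/main/bronze/_schemas/{target}")
        .load(SRC)
        .writeStream
        .option("checkpointLocation", f"/Volumes/main/bronze/_checkpoints/{target}")
        .trigger(**trigger_kwargs)
        .toTable(f"main.bronze.{target}"))

ingest({"availableNow": True}, "events_incr")          # incremental batch: catch up, then stop
ingest({"processingTime": "1 minute"}, "events_live")  # incremental streaming: near real time
```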

Once the data is ingested into Delta Lake, you do ETL using Lakeflow Declarative Pipelines in Databricks. This was earlier known as DLT (Delta Live Tables).
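
A minimal sketch of what that looks like with the Python dlt module (this code runs as part of a pipeline, not a plain notebook; the table names and path are placeholders):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders, loaded incrementally with Auto Loader")
def bronze_orders():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://raw@mystorage.dfs.core.windows.net/orders/"))

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # rows failing the expectation are dropped
def silver_orders():
    return dlt.read_stream("bronze_orders").where(col("order_id").isNotNull())
```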

I hope you found this helpful.

Do you want a hack for practising in Azure Databricks almost for free, using a paid edition?

I can talk about it in my next post if you're curious!
