Hey!

I'm Sumit Mittal - Founder & CEO of TrendyTech.

I transform the careers of Big Data aspirants through my carefully curated masters program, helping them evolve into Big Data experts. I have put my wholehearted effort into building the best online Big Data course, drawing on the experience of working on multiple challenging Big Data projects as an ex-Cisco and ex-VMware employee.

The journey began in 2018 with my passion for teaching. I started by training a few working professionals, and eventually quit my high-paying job to pursue that passion and bring about a change in the professional lives of many.

I have incorporated effective approaches to mastering Big Data, assimilated over the years as an alumnus of top institutions like NIT Trichy, BITS Pilani, and IIIT Bangalore.

Link to my website: trendytech.in


Sumit Mittal

All you need to know about Deletion Vectors in Databricks

Let's go back to the time before deletion vectors existed.

Let's say we have file1 with 1 million records, and we delete just 1 record from that file.

What will Databricks do in this case?

It will create a new Parquet file, file2, with all the records except that one record.

It will also update the metadata: remove file1, add file2.

So a complete file rewrite is required, which is write-intensive and also consumes more storage.

Now, with deletion vectors, things have changed. Databricks adds metadata that marks rows as logically deleted without rewriting the underlying Parquet files.

It does not rewrite the file; instead, it just adds a small deletion vector file, which acts like a tombstone marker. The metadata now says: remove file1, add file1 with deletion vectors.

This ensures that when we read the data, the deleted record is not returned, even though it is still physically present on disk.

The result is faster deletes, less I/O, and significant storage savings.
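
Here is a minimal sketch of what this looks like in practice, assuming a hypothetical Delta table named `events` with an `event_id` column, on a runtime that supports deletion vectors:

```python
from pyspark.sql import SparkSession

# On Databricks a `spark` session is already defined; this line only matters locally.
spark = SparkSession.builder.getOrCreate()

# `events` is a hypothetical Delta table used purely for illustration.
# Enable deletion vectors on the table (on newer runtimes this may already be the default).
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

# Delete a single row: the row is marked deleted in a small deletion vector file
# instead of the whole Parquet file being rewritten.
spark.sql("DELETE FROM events WHERE event_id = 42")

# Inspect the table history to see how the delete was recorded.
spark.sql("DESCRIBE HISTORY events").show(truncate=False)
```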

However, one downside is that if deletion vectors keep accumulating, they can significantly impact read performance.

When reading the data, the engine still has to scan the deleted rows and filter them out, which is an overhead.

So we need to do periodic maintenance (a quick sketch of these commands follows the list):

- Run the REORG TABLE command to rewrite the affected data files so that rows soft-deleted by deletion vectors are physically purged.

- Run OPTIMIZE to compact small files and improve query performance.

- Run VACUUM to clean up old file versions and reclaim storage space.
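
As a reference, here is how these maintenance commands might be run on the same hypothetical `events` table; treat it as a sketch, not a tuned production job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# Rewrite the affected files so rows soft-deleted via deletion vectors are physically purged.
spark.sql("REORG TABLE events APPLY (PURGE)")

# Compact small files to speed up subsequent scans.
spark.sql("OPTIMIZE events")

# Remove data files that are no longer referenced by the table
# (subject to the retention period, 7 days by default).
spark.sql("VACUUM events")
```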

Based on this, now try answering these 5 scenario-based questions:

1. You are managing a large Delta table with billions of rows.
Deleting even a few thousand rows used to be very expensive because it required rewriting files.
How do deletion vectors help solve this problem?

2. Imagine two cases:
Case A: Deleting 100 rows
Case B: Deleting 100 million rows
In which case do deletion vectors bring the most benefit?

3. Your company must comply with the GDPR. Since deletion vectors only mark rows as deleted, the data still exists in storage.
How would you design a process to ensure deleted records are physically removed for compliance?

4. Over time, analysts complain that queries on the Delta table are becoming slower after frequent deletes. How would you confirm whether deletion vectors are the cause, and what steps can you take to restore performance?

5. After months of deletes and updates, you notice storage costs have gone up and queries are slower. Why does this happen, and what Databricks operations would you apply to balance storage, performance, and compliance?

I have created a dedicated 45-minute video answering all these questions, along with a demo, on my YouTube channel.

I am sure you would have learnt something new today.


Sumit Mittal

I created a video covering 5 scenario-based interview questions related to Deletion Vectors in Databricks (a really hot topic for interviews).

Question 1) You are managing a large Delta table with billions of rows.
Deleting even a few thousand rows used to be very expensive because it required rewriting files.
How do deletion vectors help solve this problem?

Question 2) Imagine two cases:
Case A: Deleting 100 rows
Case B: Deleting 100 million rows
In which case do deletion vectors bring the most benefit, and when would you consider a file rewrite instead?

Question 3) Your company must comply with the GDPR (right to be forgotten). Since deletion vectors only mark rows as deleted, the data still exists in storage.
How would you design a process to ensure deleted records are physically removed for compliance?

Question 4) Over time, analysts complain that queries on the Delta table are becoming slower after frequent deletes. How would you confirm whether deletion vectors are the cause, and what steps can you take to restore performance?

Question 5) After months of deletes and updates, you notice storage costs have gone up and queries are slower, even though deletion vectors were meant to save space and time. Why does this happen, and what Databricks operations would you apply to balance storage, performance, and compliance?

I am sure this video will help you a lot. This is a really hot topic for Interviews these days!

Do support by liking, commenting & sharing if you truly find it valuable :)


Sumit Mittal

Announcing my new YouTube playlist on Data Engineering Interview Preparation, starting 21st August.

In it, I will cover DE interview questions asked in top companies, along with solutions.

I keep getting a lot of DMs asking for help with Data Engineering interview prep.

This will be a big step in helping them, and everyone else who really wants to prepare for interviews in great depth.

I will upload 2 videos every week and will bring
- Top product-based company questions
- Scenario-based questions
& everything you need to know for cracking the most difficult Big Data interviews.

I know the struggle of giving interviews. And if this playlist can make even one person feel a little more prepared, a little more confident, then it’s worth it.


Sumit Mittal

I’m starting a free Data Engineering Interview Prep playlist on YouTube, beginning 21st August.

More details coming in my next post!


Sumit Mittal

14 Important points to understand Serverless Compute in Databricks

1. If you have used classic compute, you know that you typically wait 3-7 minutes for the cluster to start, and scaling up takes a similar amount of time. Serverless reduces this startup time drastically.

2. It's simple: we do not have to tune hundreds of infra settings just to balance the performance-cost tradeoff.

3. For serverless jobs we just have to pick a goal: standard or performance. With standard, performance is on par with classic; with performance, you get a very quick startup time and faster execution of your workloads.

4. Serverless compute runs in Databricks' cloud account (the serverless compute plane). This is a big change from classic compute, which runs in your cloud account. Databricks keeps a fleet of resources already running and grabs a few cores from it for your workload.

5. In the classic compute world, upgrading DBRs (Databricks Runtime versions) is a pain; it has been reported that 20% of the time goes into just this maintenance work. With serverless, all of this is taken care of.

6. Currently, serverless compute is available for SQL workloads, Lakeflow Jobs, notebooks, and declarative pipelines (DLT).

7. Serverless compute is a versionless product; it always runs on the latest version.

8. To track usage and cost, you can set a budget policy with tags and then query the system tables to understand the DBU breakup (see the query sketch after this list).

9. You can set the environment (only applicable to serverless). You can keep the REPL memory at 16 GB or increase it to 32 GB when you are collecting a lot of data in the REPL, for example with Python pandas.

10. Right now, serverless supports the SQL and Python languages.

11. With serverless, the worker type, the number of workers, and scaling up and down are all taken care of. This is done intelligently and should significantly reduce cost.

12. With classic compute, say you ask for an 8-node cluster but are not running Spark on it; you still pay for all 8 nodes even though they sit unused. With serverless, a cluster is created only when you use Spark; otherwise it might run on just one node, for example when you are working with pandas.

13. Serverless does not force you onto the latest version only; if you have a specific need for an older version, you can select it from the environments screen.

14. Serverless is the way to go!
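
To make point 8 concrete, here is a sketch of a DBU breakdown query against the billing system table. The 'team' tag key is hypothetical, and you should verify the system.billing.usage columns available in your workspace:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# DBU consumption over the last 30 days, broken down by SKU and a custom tag.
dbu_breakup = spark.sql("""
    SELECT usage_date,
           sku_name,
           custom_tags['team'] AS team,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY 1, 2, 3
    ORDER BY usage_date
""")

dbu_breakup.show(truncate=False)
```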


Sumit Mittal

In Databricks (Managed table vs External table)

Any table consists of 2 parts.

Table = Data + Metadata

In managed tables, both data and metadata are fully managed by Databricks.

Metadata is stored in the metastore.

Your data is stored in your cloud storage.

Basically, to store the data we need to create an external location and then mention it while creating the catalog.

You might ask: what is an external location?

An external location is a cloud storage path backed by a storage credential, which keeps it fully secured.

You must specify this external location path at the catalog level if it is not specified at the metastore level.

Databricks recommends not giving the external location path at the metastore level, as data from multiple workspaces would end up in the same location.

When creating a schema, specifying the external location is optional, and if we do give it, it takes precedence.

So let's take a scenario: if the external location is defined at all 3 levels (metastore, catalog, schema), then the path mentioned at the schema level takes precedence.

Basically, the external location defined at the lowest level comes into effect.
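
Here is a short sketch of how these levels can be declared; the location name, storage path, and credential name are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# An external location is a cloud path secured by a storage credential
# (the credential `my_cred` is assumed to exist already).
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS finance_loc
    URL 'abfss://finance@mystorage.dfs.core.windows.net/managed'
    WITH (STORAGE CREDENTIAL my_cred)
""")

# Managed storage declared at the catalog level ...
spark.sql("""
    CREATE CATALOG IF NOT EXISTS finance
    MANAGED LOCATION 'abfss://finance@mystorage.dfs.core.windows.net/managed/catalog'
""")

# ... and at the schema level, which takes precedence for tables in this schema.
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS finance.sales
    MANAGED LOCATION 'abfss://finance@mystorage.dfs.core.windows.net/managed/sales'
""")
```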

A managed table is in Delta format by default, and we could also create an Iceberg table.

You might remember the Hive metastore days, where dropping a managed table dropped both the data and the metadata, with no way to recover the data.

For a Unity Catalog managed table, we can run an UNDROP command within 7 days of dropping the table; the data is physically deleted only after 7 days, even though the metadata is dropped immediately.

This makes sure that if a managed table is deleted by mistake, we can still undo things.
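
Here is a minimal sketch using the hypothetical catalog and schema from the earlier snippet (the table name is made up as well):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# A managed table: no LOCATION clause, so data lands in the managed storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS finance.sales.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
""")

# Dropping it removes the metadata immediately, but the data is retained for 7 days ...
spark.sql("DROP TABLE finance.sales.orders")

# ... so the table can be recovered within that window.
spark.sql("UNDROP TABLE finance.sales.orders")
```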

Let's talk about external tables.

The only difference at creation time is that we give an external location path when creating the table.

Unity Catalog governs data access permissions but does not manage the data lifecycle, optimizations, storage location, or layout.

When you drop an external table, the data files are not deleted.

When you create an external table, you can either register an existing directory of data files as a table or provide a path to create a new directory.
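
For example, here is a sketch of registering an existing directory of Delta files as an external table; the path and table name are hypothetical, and the path must fall under an external location you have access to:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# Register existing Delta data in cloud storage as an external table.
spark.sql("""
    CREATE TABLE IF NOT EXISTS finance.sales.orders_ext
    USING DELTA
    LOCATION 'abfss://finance@mystorage.dfs.core.windows.net/raw/orders'
""")

# Dropping an external table removes only the metadata; the files stay in storage.
spark.sql("DROP TABLE finance.sales.orders_ext")
```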

Databricks recommends using external tables for the following use cases:

- You need to register a table backed by existing data that is not compatible with Unity Catalog managed tables.
- You require direct access to the data from non-Databricks clients.

So, whenever possible, use managed tables; Databricks recommends them, and you get all sorts of optimizations with them.

I am sure you would have learned something new from this post.


Sumit Mittal

A lot has changed in Databricks.

Over the years, Databricks has evolved into an end-to-end data platform.

When we talk about Data Engineering, Databricks now offers
- Lakeflow Connect (for data ingestion)
- Lakeflow Declarative Pipelines (for ETL, earlier referred to as DLT)
- Lakeflow Jobs (for orchestration)
- Unity Catalog (for metadata handling and governance)

In the latest versions of Databricks, the workspace is Unity Catalog enabled.

In Unity Catalog, the metastore is at the top level.
Under the metastore we have one or more catalogs.
Under a catalog you have schemas (databases).
Under a schema you have tables.

So it's a 3-level namespace:
catalog -> schema -> table
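
As a quick illustration with hypothetical names (this assumes the metastore or catalog already has a managed storage location configured):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` on Databricks

# catalog -> schema -> table, using made-up names.
spark.sql("CREATE CATALOG IF NOT EXISTS dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS dev.bronze")
spark.sql("CREATE TABLE IF NOT EXISTS dev.bronze.customers (id BIGINT, name STRING)")

# Query with the fully qualified 3-level name.
spark.sql("SELECT * FROM dev.bronze.customers").show()
```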

Whenever we create a Databricks workspace, we get
- an automatically provisioned Unity Catalog metastore
- and a catalog under it, named after the workspace.

The best practice is to have just 1 metastore per region and attach it to all the workspaces in that region.

By the way, what's the need for more than one workspace?
You can have separate workspaces for different environments (dev, stg, prod)
or for different BUs (sales, finance, IT).

Can one workspace have multiple metastores?
No, it will have just one metastore attached to it.

Can multiple workspaces share a single metastore?
Yes, if those workspaces are in the same region.

You might remember the days when the Hive metastore was the default.

It is no longer the default, but we still get one Hive metastore with each workspace.

The Hive metastore follows a 2-level namespace:
schema -> table

Now let's understand: what were the issues with the Hive metastore?

- The Hive metastore offers basic metadata management but lacks fine-grained access controls.
- It lacks built-in data governance features like data lineage and auditing.
- Moreover, the Hive metastore can deal only with structured data.

So basically, tables in the Hive metastore do not benefit from the full set of security and governance features provided by Unity Catalog, such as built-in auditing, lineage, and access control.

It is a legacy feature now.

When developing any solution in Databricks, make sure to always use Unity Catalog.

I am sure you would have learned something new from this post.

In my next post I will talk about tables in Databricks in good depth.


Sumit Mittal

I created a multi-agent system that takes a database and gives you insights without you asking anything.

It involves a supervisor and 3 agents:
Agent 1 acts like an analyst asking questions.
Agent 2 acts like an expert answering questions.
Agent 3 takes the conversation between the analyst and the expert, summarizes it, and generates a PDF report.

Agent 1 needs a tool to connect to the database and get the schema.
Agent 2 needs a tool so that it can run queries on the database.
Agent 3 needs a tool to create the PDF document.

Here is the flow:
- The supervisor calls agent 1 (the analyst).
- The analyst uses its tool to get the schema and, based on that, generates a set of questions to ask.
- The analyst returns these queries to the supervisor.
- The supervisor then calls the expert and sends it those queries.
- The expert answers them using the query-running tool.
- The results are sent back to the supervisor.
- The supervisor then calls the reviewer agent, which takes the conversation, summarizes it, uses its tool to generate a PDF document, and returns to the supervisor.

In the end, we have a final PDF report containing the insights from the database.
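
For anyone curious how this could be wired up, here is a heavily simplified Python sketch. It flattens each agent into a single LLM or tool call, the prompts and function names are invented for illustration, `fpdf` is just one possible PDF library, and `call_llm` is a placeholder you would replace with your own model client:

```python
import sqlite3
from fpdf import FPDF  # one possible PDF library for the report tool

# --- Tools (names and implementations are illustrative) ---

def get_schema(db_path: str) -> str:
    """Schema tool used by the analyst agent."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()
    return "\n".join(r[0] for r in rows if r[0])

def run_query(db_path: str, query: str) -> list:
    """Query tool used by the expert agent."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query).fetchall()

def write_pdf(text: str, out_path: str) -> None:
    """Report tool used by the reviewer agent."""
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Helvetica", size=10)
    pdf.multi_cell(0, 5, text)
    pdf.output(out_path)

def call_llm(prompt: str) -> str:
    """Placeholder: plug in your actual LLM client here."""
    raise NotImplementedError

# --- Supervisor flow ---

def supervisor(db_path: str, report_path: str) -> None:
    schema = get_schema(db_path)

    # Agent 1 (analyst): propose questions/queries from the schema alone.
    queries = call_llm(
        f"You are a data analyst. Given this schema:\n{schema}\n"
        "Write 5 insightful SQL queries, one per line."
    ).splitlines()

    # Agent 2 (expert): answer each query using the query tool.
    conversation = [f"Q: {q}\nA: {run_query(db_path, q)}" for q in queries]

    # Agent 3 (reviewer): summarize the conversation and produce the PDF report.
    summary = call_llm("Summarize these findings as an insights report:\n" + "\n\n".join(conversation))
    write_pdf(summary, report_path)
```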

I hope you liked the solution.


Sumit Mittal

I dedicated my last 6 months fully to creating a world-class Gen AI program.
Super happy with the outcome.


Sumit Mittal

I was talking to one of my students who works at a top product-based company.

She mentioned she was working on a Gen AI project where they need to query a database using natural language.

So I thought it was a really good idea to include this use case in my Gen AI program, and in fact I released it to my students today.

I want to give you all an idea of how to approach this problem.

This is a perfect use case for an agentic approach.

It uses an agent with access to 2 tools:
- a Schema Getter tool to get the tables in the DB, with column names and data types
- a Query Runner tool to run the SQL query on the DB.

When someone asks a question like "I want to know the total sales in the last quarter":

Step 1 - The query goes to the LLM. The LLM is intelligent enough to understand that it needs to call the Schema Getter tool to get the schema.

Step 2 - The Schema Getter tool returns the schema to the LLM.

Step 3 - The LLM generates a SQL query based on this schema.

Step 4 - The LLM then invokes the Query Runner tool.

Step 5 - The Query Runner tool runs the query on the DB and returns the output to the LLM.

Step 6 - The LLM refines the output to make it presentable and gives it back to the user.

So we can see that we bound a few tools (in this case, 2 tools) to our LLM.

We do not have if-else style logic; rather, the LLM decides what to do next.

An important point to note: the tools always give their output back to the LLM, and then the LLM decides what to do next. A minimal sketch of this loop is below.
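
Here is a rough sketch of that tool-calling loop, with SQLite standing in for the database. The tool names, the message format, and the `call_llm` placeholder are all assumptions to be swapped for your actual model and agent framework:

```python
import json
import sqlite3

# --- The two tools bound to the LLM (names are illustrative) ---

def schema_getter(db_path: str) -> str:
    """Return the table definitions so the LLM can write correct SQL."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()
    return "\n".join(r[0] for r in rows if r[0])

def query_runner(db_path: str, sql: str) -> str:
    """Run the SQL generated by the LLM and return the raw result."""
    with sqlite3.connect(db_path) as conn:
        return json.dumps(conn.execute(sql).fetchall())

TOOLS = {"schema_getter": schema_getter, "query_runner": query_runner}

def call_llm(messages: list) -> dict:
    """Placeholder for a chat model with tool calling; returns either
    {'content': ...} or {'tool_call': {'name': ..., 'args': {...}}}."""
    raise NotImplementedError

def answer(db_path: str, question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        reply = call_llm(messages)              # the LLM decides the next step
        if "tool_call" not in reply:            # no tool needed: final, refined answer
            return reply["content"]
        name = reply["tool_call"]["name"]
        args = reply["tool_call"]["args"]       # e.g. {} or {"sql": "SELECT ..."}
        result = TOOLS[name](db_path, **args)   # run the chosen tool
        messages.append({"role": "tool", "name": name, "content": result})
```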

If you were wondering what proof of concept to develop for Gen AI, you can go with this.

I hope you found this informative!

In my next post I will talk about Multi Agent Architecture.
