Azarudeen Shahul
A Data Engineer has a complex Gold-layer table that joins 5 different Silver tables. This table is used by a Power BI dashboard that is refreshed every hour. The users are complaining that the dashboard takes 2 minutes to load because the underlying SQL view is too slow.
Which data entity should the Data Engineer implement to reduce the dashboard load time to seconds while ensuring the data is updated once an hour?
#Databricks #DataEngineering #Azure
A Data Engineer runs an UPDATE command on a Delta table to change the prices of 100 products. After the command finishes, they notice a new JSON file has been created in the _delta_log directory.
The Question:
In Delta Lake, what is the primary purpose of the CRC (Cyclic Redundancy Check) files found within the _delta_log folder?
#Databricks #DataEngineering
You are processing a stream of unstructured "Customer Support Chats." You need to extract specific entities like Order_ID, Phone_Number, and Shipping_Address. Your lead architect suggests using Databricks AI Functions (ai_extract) instead of traditional Regular Expressions (regexp_extract).
From a Data Governance and Maintenance perspective, what is the most significant technical advantage of using ai_extract() over a complex REGEXP_EXTRACT() pattern for this task?
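A minimal sketch of the two approaches being compared (table and column names are hypothetical; the regex patterns are illustrative only):

```sql
-- Hypothetical source: support_chats(chat_text STRING)

-- Regex approach: one hand-written pattern per entity, each of which
-- must be rewritten whenever the chat wording or format drifts
SELECT
  regexp_extract(chat_text, 'Order ID: ([A-Z0-9-]+)', 1)       AS order_id,
  regexp_extract(chat_text, '([0-9]{3}-[0-9]{3}-[0-9]{4})', 1) AS phone_number
FROM support_chats;

-- AI Functions approach: declare the entities by name and let the
-- model locate them in the free text
SELECT
  ai_extract(chat_text, array('order_id', 'phone_number', 'shipping_address')) AS entities
FROM support_chats;
```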
I benchmarked them both in my latest video! Check it out here: [https://youtu.be/SF9LW0r0F_A]
#DataEngineering #PySpark
You are ingesting a JSON file where the customer_data field contains a dynamic list of objects. Each object has a property_name and a property_value.
You need to choose the most efficient Delta Lake schema to store this so that you can quickly retrieve specific properties without scanning the entire column.
Which of these complex types allows you to access a specific element directly using a key-based lookup (e.g., column['key']) without needing to "flatten" or "explode" the data first?
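For reference, the column['key'] syntax in the question is the access pattern of the MAP type. A minimal sketch with hypothetical table and key names:

```sql
-- Store the dynamic property_name/property_value pairs as a MAP
CREATE TABLE customer_profiles (
  customer_id   BIGINT,
  customer_data MAP<STRING, STRING>
) USING DELTA;

-- Key-based lookup of a single property, no explode/flatten required
SELECT
  customer_id,
  customer_data['shipping_address'] AS shipping_address
FROM customer_profiles;
```

By contrast, storing the same pairs as an ARRAY of STRUCTs would require exploding the array and filtering on property_name to reach one value.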
#DataEngineering #Databricks #AIEngineer
You are a Data Engineer at a global furniture retailer. You receive thousands of customer reviews in a Delta table. You need to automatically categorize these reviews into four buckets: clothing, shoes, accessories, or furniture to route them to the correct product teams.
You decided to use Databricks AI Functions to solve this and wrote the query below:
SELECT
review_text,
ai_classify(review_text, ARRAY('clothing', 'shoes', 'accessories', 'furniture')) AS category
FROM customer_reviews;
If a customer leaves a review that says,
"The delivery was late and the box was damaged,"
but doesn't mention a specific product, what will the ai_classify() function return?
#DataEngineering #Databricks
You have a 10TB Delta table that is updated daily with new transaction data. You run the OPTIMIZE command every night to merge small files and maintain performance.
One day, due to a scheduling error, the OPTIMIZE command is triggered twice back-to-back on the exact same dataset.
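One way to see what the back-to-back runs actually did is to inspect the table history afterwards (table name is hypothetical):

```sql
OPTIMIZE transactions;   -- first run: compacts the day's small files
OPTIMIZE transactions;   -- second run: operates on the same, already-compacted data

-- Each OPTIMIZE commit records operationMetrics such as
-- numAddedFiles and numRemovedFiles; comparing the two entries
-- shows how much work (if any) the second run performed
DESCRIBE HISTORY transactions;
```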
#DataEngineering #Databricks
A data engineering team is using a Delta table that was originally partitioned by Region. However, the business has changed, and 90% of new queries now filter by 'CustomerID' instead. Performance is tanking because searching for a 'CustomerID' across all Region folders is causing massive "File Scans."
The team decides to switch to Liquid Clustering to solve this.
If the team runs,
ALTER TABLE orders CLUSTER BY (CustomerID),
which of the following statements is TRUE regarding the existing data?
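A sketch of the full sequence, assuming the table is eligible for Liquid Clustering; note that per the Delta Lake documentation the ALTER statement itself is a metadata-only change, and data files are rewritten only by a subsequent OPTIMIZE:

```sql
-- Change the clustering key: metadata-only, existing files untouched
ALTER TABLE orders CLUSTER BY (CustomerID);

-- Clustering of data files happens when OPTIMIZE runs afterwards
OPTIMIZE orders;
```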
#Databricks #DataEngineering
You are building a Delta Live Tables (DLT) pipeline to process raw JSON sensor data. You have defined a Streaming Table for the Bronze layer and a Materialized View for the Gold layer to show hourly averages.
-- Bronze Layer
CREATE OR REFRESH STREAMING TABLE sensors_bronze
AS SELECT * FROM cloud_files("/raw/data", "json");
-- Gold Layer
CREATE OR REFRESH MATERIALIZED VIEW sensors_gold
AS SELECT sensor_id, window.start, avg(temperature)
FROM LIVE.sensors_bronze
GROUP BY sensor_id, window(timestamp, "1 hour");
If you manually delete a corrupted file from the /raw/data folder and then click "Start" (Incremental Refresh) on the DLT pipeline, what will happen to the data already stored in the sensors_gold Materialized View?
#DataEngineering #Databricks
A Data Engineer has created a Materialized View (MV) to provide a daily summary of transaction data for a Power BI dashboard. To ensure the summary is always up-to-date, they scheduled a REFRESH MATERIALIZED VIEW command to run every 30 minutes.
After one week, the cloud bill shows a massive spike in Serverless Compute costs, even though the source table only receives a few hundred new rows per hour.
Which of the following is the most likely reason for the high cost of this Materialized View?
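A sketch of the two refresh strategies being contrasted (view name, column names, and the SCHEDULE CRON syntax are assumptions drawn from the Databricks docs, not from the post):

```sql
-- Externally scheduled refresh: each invocation allocates serverless
-- compute, and if the view definition cannot be refreshed
-- incrementally, the entire summary is recomputed every 30 minutes
REFRESH MATERIALIZED VIEW daily_txn_summary;

-- Alternative: declare the cadence at creation time and let the
-- platform manage the refresh
CREATE OR REPLACE MATERIALIZED VIEW daily_txn_summary
SCHEDULE CRON '0 0 * * * ?'   -- Quartz expression: top of every hour
AS SELECT txn_date, sum(amount) AS total_amount
   FROM transactions
   GROUP BY txn_date;
```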
#DataEngineering #Databricks
A data engineer is using Auto Loader with 'cloudFiles' to ingest JSON data from a cloud storage bucket. The engineer notices that new files occasionally contain additional columns that were not present when the stream was first started. Which feature of Auto Loader allows the stream to continue and automatically adapt the target table schema without manual intervention?
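The behavior in question is controlled by Auto Loader's schema evolution mode. A minimal sketch in DLT SQL (paths and table names are hypothetical):

```sql
-- "addNewColumns" (the default mode) stops the stream with a schema
-- change when a new column appears; on restart, Auto Loader merges the
-- new column into the tracked schema and ingestion continues
CREATE OR REFRESH STREAMING TABLE events_bronze
AS SELECT * FROM cloud_files(
  "/raw/events",
  "json",
  map("cloudFiles.schemaEvolutionMode", "addNewColumns")
);
```

In a DLT pipeline the restart is handled automatically, so the target table's schema adapts without manual intervention.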