Sumit Mittal
A lot has changed in Databricks.
Over the years, Databricks has evolved into an end-to-end data platform.
When we talk about data engineering, Databricks now offers:
- Lakeflow Connect (for data ingestion)
- Lakeflow Declarative Pipelines (for ETL, earlier known as Delta Live Tables, or DLT)
- Lakeflow Jobs (for orchestration)
- Unity Catalog (for metadata handling and governance)
In the latest versions of Databricks, the workspace is Unity Catalog enabled.
In Unity Catalog, at the top level is the metastore.
Under the metastore you have one or more catalogs.
Under a catalog you have schemas (also called databases).
Under a schema you have tables.
So it's a 3-level namespace:
catalog -> schema -> table
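The 3-level namespace is simply how tables are addressed. A minimal Python sketch of building a fully qualified table name (the names "main", "sales", and "orders" are hypothetical, not from the post):

```python
# Minimal sketch: constructing a fully qualified table name in
# Unity Catalog's 3-level namespace (catalog.schema.table).
def qualified_name(catalog: str, schema: str, table: str) -> str:
    return f"{catalog}.{schema}.{table}"

print(qualified_name("main", "sales", "orders"))  # main.sales.orders
```

In a notebook you would then reference the table with that name, e.g. `SELECT * FROM main.sales.orders`.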
Now, whenever we create a Databricks workspace, we get:
- an automatically provisioned Unity Catalog metastore
- and a catalog under it, named after the workspace.
The best practice is to have just one metastore per region and attach that metastore to all the workspaces in that region.
By the way, what's the need for more than one workspace?
You can have separate workspaces for different environments (dev, stg, prod)
or for different business units (sales, finance, IT).
Can one workspace have multiple metastores?
No, a workspace has exactly one metastore attached to it.
Can multiple workspaces share a single metastore?
Yes, as long as those workspaces are in the same region.
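Those two rules (one metastore per workspace, same-region attachment) can be sketched as a simple check. This is an illustration of the constraint only, not an actual Databricks API:

```python
from typing import Optional

# Illustrative rule check (not a Databricks API): a metastore can be
# attached to a workspace only if the workspace has no metastore yet
# and both live in the same cloud region.
def can_attach(metastore_region: str,
               workspace_region: str,
               current_metastore: Optional[str]) -> bool:
    return current_metastore is None and metastore_region == workspace_region

print(can_attach("us-east-1", "us-east-1", None))      # True
print(can_attach("us-east-1", "eu-west-1", None))      # False: cross-region
print(can_attach("us-east-1", "us-east-1", "ms-001"))  # False: already attached
```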
You might remember the days when the Hive metastore was the default one.
It's no longer the default, but we still get a Hive metastore along with each workspace.
The Hive metastore follows a 2-level namespace:
schema -> table
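To contrast the two addressing schemes: in a Unity Catalog-enabled workspace, legacy Hive tables remain reachable under the built-in catalog named hive_metastore, so the old 2-level name effectively gains a fixed catalog prefix. A sketch with hypothetical schema/table names:

```python
# Sketch: the legacy 2-level Hive name vs. how the same table is
# addressed through Unity Catalog's built-in "hive_metastore" catalog.
# "sales" and "orders" are hypothetical names.
def hive_name(schema: str, table: str) -> str:
    return f"{schema}.{table}"  # 2-level: schema.table

def via_unity_catalog(schema: str, table: str) -> str:
    return f"hive_metastore.{schema}.{table}"  # 3-level, fixed catalog

print(hive_name("sales", "orders"))          # sales.orders
print(via_unity_catalog("sales", "orders"))  # hive_metastore.sales.orders
```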
Now let's understand: what were the issues with the Hive metastore?
- The Hive metastore offers basic metadata management but lacks fine-grained access controls.
- It lacks built-in data governance features like data lineage and auditing.
- Moreover, the Hive metastore can deal only with structured data.
So basically, tables in the Hive metastore do not benefit from the full set of security and governance features provided by Unity Catalog, such as built-in auditing, lineage, and access control.
It's a legacy feature now.
When developing any solution in Databricks, make sure you always use Unity Catalog.
I am sure you learned something new from this post.
In my next post I will talk about tables in Databricks in depth.