Sumit Mittal
In Databricks (Managed table vs External table)
Any table consists of 2 parts:
Table = Data + Metadata
In managed tables, both the data and the metadata are fully managed by Databricks.
The metadata is stored in the metastore.
The data is stored in your cloud storage.
Basically, to keep that data we need to create an external location and then reference it while creating the catalog.
You might ask, what is an external location?
An external location is a cloud storage location backed by a storage credential, which keeps access to it fully secured.
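Setting one up looks roughly like this (the location name, the ADLS path and the credential name are made-up placeholders, and the storage credential is assumed to already exist):

-- Register a cloud path as an external location, secured by a storage credential
CREATE EXTERNAL LOCATION IF NOT EXISTS sales_landing
  URL 'abfss://landing@mystorageacct.dfs.core.windows.net/sales'
  WITH (STORAGE CREDENTIAL my_sales_cred)
  COMMENT 'Root path for sales data';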
If no storage location is specified at the metastore level, then you must specify this external location path at the catalog level.
Databricks recommends not giving the storage location at the metastore level, because data from multiple workspaces would end up in the same location.
When creating a schema, specifying the location is optional, and if we do give one, it takes precedence.
So let's take a scenario: if the location is defined at all 3 levels, metastore, catalog and schema, then the path mentioned at the schema level takes precedence.
Basically, the location defined at the lowest level comes into effect.
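A rough sketch of how that looks (the catalog, schema names and paths are placeholders, assumed to sit under the external location registered above):

-- Catalog-level managed location: managed tables in this catalog default here
CREATE CATALOG IF NOT EXISTS sales
  MANAGED LOCATION 'abfss://landing@mystorageacct.dfs.core.windows.net/sales/catalog_root';

-- Schema-level managed location: overrides the catalog path for tables in this schema
CREATE SCHEMA IF NOT EXISTS sales.eu_orders
  MANAGED LOCATION 'abfss://landing@mystorageacct.dfs.core.windows.net/sales/eu_orders';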
By default, a managed table is created in Delta format, and we could also create a managed Iceberg table.
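Creating one is as simple as this (table and column names are just for illustration); note there is no LOCATION clause, so Unity Catalog decides where the files go:

-- Managed table: Delta by default, stored under the schema's managed location
CREATE TABLE sales.eu_orders.orders (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
);
-- Same thing with the default spelled out: ... USING DELTA
-- A managed Iceberg table is essentially the same statement with USING ICEBERG,
-- assuming that feature is enabled in your workspace.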
You might remember the days of the Hive metastore, where dropping a managed table dropped both the data and the metadata, and there was no way to recover the data.
In the case of a Unity Catalog managed table, after dropping the table we can run an UNDROP command within 7 days. Basically, the data is only marked for deletion and is actually removed after 7 days; the metadata, however, is dropped right away.
This makes sure that if a managed table is deleted by mistake, we can still undo it.
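In practice it looks like this (same placeholder table as above):

-- Drop the managed table: metadata goes away, data is only marked for deletion
DROP TABLE sales.eu_orders.orders;

-- Within 7 days, bring it back
UNDROP TABLE sales.eu_orders.orders;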
Let's talk about External tables.
The only difference in creation is that we give the external location path when creating the table.
For external tables, Unity Catalog governs data access permissions but does not manage the data lifecycle, optimizations, storage location or layout.
When you drop an external table, the data files are not deleted.
When you create an external table, you can either register an existing directory of data files as a table or provide a path to create a new directory.
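Here is a rough sketch of both options (names and paths are placeholders; the path must sit inside a registered external location):

-- External table over a new directory: the LOCATION clause creates it
CREATE TABLE sales.eu_orders.orders_ext (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
)
LOCATION 'abfss://landing@mystorageacct.dfs.core.windows.net/sales/raw/orders';

-- Registering an existing Delta directory instead needs no column list,
-- since the schema comes from the Delta log:
-- CREATE TABLE sales.eu_orders.orders_hist
--   LOCATION 'abfss://landing@mystorageacct.dfs.core.windows.net/sales/raw/orders_hist';

-- Dropping an external table removes only the metadata; the data files stay put
DROP TABLE sales.eu_orders.orders_ext;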
Databricks recommends using external tables for the following use cases:
- You need to register a table backed by existing data that is not compatible with Unity Catalog managed tables.
- You require direct access to the data from non-Databricks clients.
So wherever possible, try using managed tables, as you get all sorts of optimizations out of the box; this is also what Databricks recommends.
I am sure you have learned something new from this post.