Delta Live Tables
Delta Live Tables (DLT) is a declarative ETL framework for the Databricks Data Intelligence Platform that helps data teams simplify streaming and batch ETL cost-effectively.
A declarative framework is a programming or development approach in which you specify what you want to achieve, not how to achieve it: you declare the desired outcome, and the framework or system takes care of the implementation details.
• Focus on what, not how
• Abstraction of implementation details
• Less manual control
Annotations (decorators in Python) provide instructions or configuration details about how a particular piece of code should be treated or processed by a framework or tool. Instead of writing explicit code to specify certain settings, you can use annotations to convey that information more concisely.
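For example, in DLT these annotations are Python decorators. A minimal sketch follows, assuming an illustrative table name, comment, and landing path (none of these come from the text above); the spark session is provided by the DLT pipeline runtime.

    import dlt

    # The decorator carries the configuration (table name, comment, table
    # properties), so no separate imperative setup code is needed.
    @dlt.table(
        name="raw_orders",                        # assumed table name
        comment="Raw orders ingested from cloud storage",
        table_properties={"quality": "bronze"}    # assumed property
    )
    def raw_orders():
        # `spark` is provided by the DLT pipeline runtime
        return spark.read.format("json").load("/mnt/landing/orders/")  # assumed path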
DLT lets you create datasets in the form of streaming tables, materialized views, and views that maintain the results of declarative queries:
• Streaming Table
• Materialized View
• View
How to define a dataset
• To define both a DLT streaming table and a materialized view we use @dlt.table; Databricks decides which one to create based on the type of read performed in the function (see the sketch after this list).
○ To define a DLT materialized view, the function performs a static read against a data source - spark.read.format("csv")
○ To define a DLT streaming table, the function performs a streaming read against a data source -- spark.readStream.format("cloudFiles")
§ Or we can declare it directly ---> dlt.create_streaming_table
• To define a DLT view we use the @dlt.view decorator
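A minimal sketch of all three definitions, assuming illustrative dataset names, source paths, and an event_type column; `spark` is provided by the pipeline runtime.

    import dlt

    # Streaming table: @dlt.table plus a streaming (Auto Loader) read.
    @dlt.table(comment="Incrementally ingested raw events")
    def raw_events():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .load("/mnt/landing/events/")         # assumed landing path
        )

    # Materialized view: the same @dlt.table decorator, but a static (batch) read.
    @dlt.table(comment="Full snapshot of reference data")
    def country_lookup():
        return spark.read.format("csv").option("header", "true").load("/mnt/ref/countries/")

    # View: not published to the catalog, referenced only inside this pipeline.
    @dlt.view(comment="Intermediate cleanup used by downstream datasets")
    def cleaned_events():
        return dlt.read_stream("raw_events").where("event_type IS NOT NULL")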
When to use
• Streaming Table
○ A query is defined against a data source that is continuously or incrementally growing.
○ Query results should be computed incrementally.
○ High throughput and low latency is desired for the pipeline.
• Materialized View
○ You want to view the results of a query during development. Because tables are materialized and can be viewed and queried outside of the pipeline, using tables during development can help validate the correctness of computations.
○ Materialized views should be used for data sources with deletions, for aggregations, and for change data capture (CDC) processing.
• View
○ Delta Live Tables does not publish views to the catalog, so views can be referenced only within the pipeline in which they are defined
○ If you want to reduce storage and compute costs and do not require the materialization of query results.
○ If you have a large or complex query that you want to break into easier-to-manage queries.
○ If you want to validate intermediate results using expectations.
DLT Expectations:
○ Delta Live Tables (DLT) offers out-of-the-box features for handling expectations and data quality monitoring to deliver high-quality data on Delta Lake.
○ We can define one or more validation rules, and these rules are applied to DLT datasets as declarative statements.
○ An expectation consists of three things:
§ A description, which acts as a unique identifier and allows you to track metrics for the constraint.
§ A boolean statement that always returns true or false based on some stated condition.
§ An action to take when a record fails the expectation, meaning the boolean returns false.
○ There are three ways in which we can declare these validations (all three are combined in the sketch after this list):
§ Warning --> @dlt.expect("valid timestamp", "timestamp > '2012-01-01'")
§ Drop --> @dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
§ Fail --> @dlt.expect_or_fail("valid_count", "count > 0")
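A minimal sketch combining all three expectation actions on one dataset; the table name and the upstream dataset raw_page_views are illustrative assumptions.

    import dlt

    @dlt.table(comment="Page views that satisfy the declared quality rules")
    @dlt.expect("valid timestamp", "timestamp > '2012-01-01'")       # warn: keep row, record metric
    @dlt.expect_or_drop(
        "valid_current_page",
        "current_page_id IS NOT NULL AND current_page_title IS NOT NULL"
    )                                                                # drop offending rows
    @dlt.expect_or_fail("valid_count", "count > 0")                  # fail: stop the update
    def validated_page_views():
        return dlt.read_stream("raw_page_views")                     # assumed upstream dataset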
CDC in Delta Live Tables:
• We generally use MERGE statements to perform SCD1 and SCD2, and we need to write complex logic to handle the different conditions.
• However, with the apply_changes function we just need to pass a few details in the syntax provided by Databricks to perform SCD1 and SCD2 (see the sketch after this list).
• ignore_null_updates
○ Allow ingesting updates containing a subset of the target columns. When a CDC event matches an existing row and ignore_null_updates is True, columns with a null will retain their existing values in the target.
• apply_as_deletes
○ Specifies when a CDC event should be treated as a DELETE rather than an upsert.
• apply_as_truncates
○ Specifies when a CDC event should be treated as a full TRUNCATE of the target table. Because this clause triggers a full truncate, it should be used only for specific use cases requiring this functionality.
• sequence_by
○ Specifies the column used to order CDC events. When the same key appears more than once, events are resolved by this sequencing column (for example a timestamp or version number), and the __START_AT and __END_AT columns are set accordingly for SCD2.
• track_history_column_list
○ Maintains historical changes for a row; it determines, based on the listed columns, which column changes should open a new history record.
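A minimal sketch of apply_changes for SCD2 under assumed names (a customers_cdc source, customer_id key, and event_timestamp and operation columns); only the options discussed above are shown.

    import dlt
    from pyspark.sql.functions import col, expr

    # CDC feed read as a stream (assumed source table).
    @dlt.view
    def customers_cdc():
        return spark.readStream.table("raw.customers_cdc")

    # Target streaming table that apply_changes writes into.
    dlt.create_streaming_table("customers_scd2")

    dlt.apply_changes(
        target="customers_scd2",
        source="customers_cdc",
        keys=["customer_id"],                            # assumed business key
        sequence_by=col("event_timestamp"),              # orders events for the same key
        ignore_null_updates=True,                        # null columns keep existing target values
        apply_as_deletes=expr("operation = 'DELETE'"),   # treat these events as deletes
        # apply_as_truncates=expr("operation = 'TRUNCATE'"),  # supported for SCD type 1 only
        stored_as_scd_type=2,                            # SCD2: __START_AT / __END_AT history
        track_history_column_list=["address", "email"]   # only these columns open a new history row
    )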
Append Flow:
• Whenever you want to ingest data into a single target from multiple source tables or streams (see the sketch below).
• For example, you might have a table that combines regional data from every region you’re operating in. As new regions are rolled out, you can add the new region data to the table without performing a full refresh.
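A minimal sketch of @dlt.append_flow with two assumed regional source tables; adding a third region later is just another decorated function, with no full refresh of the target.

    import dlt

    # Target streaming table that the regional feeds append into.
    dlt.create_streaming_table("all_region_orders")

    @dlt.append_flow(target="all_region_orders")
    def us_orders():
        return spark.readStream.table("raw.us_orders")   # assumed regional source

    @dlt.append_flow(target="all_region_orders")
    def eu_orders():
        return spark.readStream.table("raw.eu_orders")   # assumed regional source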