Databricks has announced the General Availability of Databricks Asset Bundles (DABs). At CKDelta, we’ve been testing this framework for a while, automating our workflows, and deploying them using continuous integration and delivery (CI/CD) pipelines. Here we share our experiences and show you some of the functionality we've found most useful. 

What are Databricks Asset Bundles?
Databricks Asset Bundles (DABs) enable easy workflow management on the Databricks platform. The concept follows the infrastructure-as-code approach, allowing you to easily version your pipelines in your Git-based repository. Workflows, cluster configurations, resources, and artifacts can be described in YAML files and maintained as part of your codebase. You can create bundles manually or by using templates. Azure Databricks offers several templates that might give you some inspiration on how to structure bundles. They include examples of using Python and SQL code in Databricks, dbt-core for local development, and MLOps Stacks projects. What’s more, you can create your own templates that can be reused across different projects and teams.

Workflows specified in configuration files can be validated, deployed, and run using the Databricks CLI. Before you start playing with DABs, make sure you use the latest version of the Databricks CLI (0.218.0 or higher); if you have an older version installed, remove it first and install the latest one.
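A quick way to check which version you’re running, and to authenticate to your workspace, is from the terminal. The workspace URL below is a placeholder, and a personal access token set up with databricks configure works just as well as OAuth:

databricks -v
databricks auth login --host https://adb-1234567890123456.7.azuredatabricks.net

Once you’re authenticated, you can create your first bundle from a template with the following command: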

databricks bundle init

This command presents the set of available templates, from which you can choose the one you need. If you have already used dbx for workflow management, you’ll find it fairly easy to switch to DABs: they cover much of the same functionality and add new features that make workflow automation easier.

Creating and running a bundle
Now, let’s look at some components your bundle configuration can include.
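All of these live in a databricks.yml file at the root of your project. At its simplest, a bundle configuration just declares a name (the one below is illustrative), with the other mappings added as you need them:

bundle:
  name: my_first_bundle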
 
1. Cloud and workspace configurations: all information about your workspace (such as profiles, paths, and authentication settings) and your cloud platform (Azure, AWS, or GCP).
 
workspace:
  profile: DEFAULT
  root_path: /Shared/.bundle/
  artifact_path: /Shared/.bundle/
  auth_type: databricks-cli
 

2. Source code: paths to notebooks, Python files, packages, and wheels that your project uses.

resources:
  jobs:
    training_pipeline:
      name: training_pipeline
      tasks:
        - task_key: get_data
          spark_python_task:
            python_file: jobs/data_collection.py
          libraries:
            - whl: ./dist/*.whl
 
 
3. Databricks resources: project elements specific to the Databricks platform, e.g. cluster configurations, MLflow specifications, and Delta Live Tables (DLT) pipelines.
 
resources:
  pipelines:
    weekly_data_pipeline:
      name: weekly_data_pipeline
      libraries:
        - notebook:
            path: ./notebooks/br_pipeline.py
      storage: dev_location
      target: dev_target
      clusters:
        - label: default
          autoscale:
            min_workers: 1
            max_workers: 4
          spark_env_vars:
            PYSPARK_PYTHON: /databricks/python3/bin/python3
 
4. Scheduling rules: DABs allow you to set scheduling rules for each workflow. You can use cron expressions, familiar from other orchestration tools such as Airflow; note that Databricks expects Quartz cron syntax, which includes a seconds field.
 

resources:
  jobs:
    weekly_data_pipeline:
      name: weekly_data_pipeline
      schedule:
        quartz_cron_expression: '0 0 10 ? * MON-FRI'
        timezone_id: Europe/Amsterdam

5. Custom permission rules: permissions can be specified to ensure the security of your production workflows. You can set them jointly for all resources (e.g. experiments, jobs, and pipelines) or separately for each of them.

resources:
  jobs:
    weekly_data_pipeline:
      name: weekly_data_pipeline
      permissions:
        - service_principal_name: company_service_principal
          level: IS_OWNER
        - user_name: developer_email
          level: CAN_MANAGE

 
To learn more about components you can include in your bundle, check out the DABs documentation.
 
Once you’ve created your first bundle, you can validate it to inspect its structure and check for any syntax errors. The following command returns a JSON representation of all the elements of your bundle:
 

databricks bundle validate

If the validation is successful, you can deploy your bundle to a default or specified target. The following command deploys local artifacts to the dev Databricks workspace specified in the configuration file:

databricks bundle deploy -t dev

Finally, run a selected workflow from the configuration file, either from your local machine or from a VM:
 

databricks bundle run -t dev weekly_data_pipeline

Now that we’ve covered the basics of DABs, let’s dive into some of the features that we found the most useful in our day-to-day work.
 
The include keyword
Configuration files can grow really fast. The more pipelines, clusters, and custom permission rules you need, the larger and harder to read a configuration file becomes. An ideal solution would be to split the configuration across several files to create a modular setup. However, YAML doesn’t natively support this. There are workarounds that attach custom constructors (such as !include) to the YAML loader, but they might not be straightforward to use or easy for your team members to understand.
 
DABs introduce the include keyword, which allows you to specify a list of path globs pointing to additional configuration files. All of the files you reference will be included in your bundle. The paths should be specified relative to the main bundle configuration file, e.g.:
 
include:
 - "training.pipelines.yml"
 - "training.jobs.yml"
 - "training.variables.yml"
 
You can also automatically include all configuration files matching a common pattern. In the example below, all configuration files that start with training. and have a .yml extension will be loaded:
 
include:
 - "training.*.yml"
 
This feature allows us to split configuration files easily. Not only does this improve the readability of the configurations, it also allows for faster iteration and is less error-prone.
 
Referencing variables
A standard way of reusing information in YAML files is to use anchors and aliases, which let you reference the same values in multiple places within a configuration file. Although anchors and aliases can make configuration files more concise, they can also complicate them and introduce problems when parsing the YAML.
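For comparison, native YAML reuse with an anchor and an alias looks something like this (the keys and paths are purely illustrative):

defaults:
  notebook_root: &notebook_root /Shared/.bundle/notebooks  # the anchor defines the value once

training_task:
  notebook_task:
    notebook_path: *notebook_root  # the alias reuses it elsewhere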
DABs introduce a new way of creating and referencing variables. You can declare variables in your configuration files under the variables keyword. DABs allow you to set optional descriptions, default values, and lookups that retrieve an ID value. Combining variables with include lets you store your variables in a separate file, which you can then reference in your main bundle configuration. Here’s an example:
 
variables:
  dev_cluster_id:
    description: The ID of an existing development cluster
    default: 11114-2223-2gaa01
  notebook_location:
    description: The path to training notebooks
    default: /Shared/.bundle/notebooks
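Besides a default, a variable can use the lookup option to resolve the ID of an existing object from its name at deployment time. The cluster name below is just a placeholder:

variables:
  shared_cluster_id:
    description: Resolve the cluster ID from the cluster name
    lookup:
      cluster: "Shared Development Cluster"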
 
You can easily reference the above-specified variables using the ${var.<VARIABLE>} syntax, e.g.
 
resources:
  jobs:
    training_pipeline:
      name: training_pipeline
      tasks:
        - task_key: train_model
          existing_cluster_id: ${var.dev_cluster_id}
          notebook_task:
            notebook_path: ${var.notebook_location}
 
There are several ways of setting a variable’s value. One is including a default value in your configuration file (as in the example above). Another is using an environment variable or providing a value as part of the deploy or run command.
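For instance, at the time of writing the CLI reads variables from environment variables prefixed with BUNDLE_VAR_, or from the --var flag:

# set the value through an environment variable...
export BUNDLE_VAR_notebook_location="/Shared/.bundle/notebooks"

# ...or pass it directly when deploying
databricks bundle deploy -t dev --var="notebook_location=/Shared/.bundle/notebooks"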
DABs also allow you to specify different variable values depending on the target:

targets:
  dev:
    variables:
      notebook_location: /my_name/.bundle/notebooks
  prod:
    variables:
      notebook_location: /Shared/.bundle/notebooks

In the example above, the same variable, notebook_location, is given a different value for the dev and prod targets.
 
The overriding functionality
Setting variables in DABs enables the modularisation of your configuration files without compromising their readability. However, it also has some limitations: complex variables are not yet supported, so cluster configurations currently cannot be stored as variables and reused across different tasks or pipelines. That functionality should arrive soon, since a pull request is already open. In the meantime, a workaround uses the overriding functionality introduced in DABs. Let’s look at an example of how to do that.
 
Let’s say that you’re testing a new training job and need a cluster with a maximum of four workers for development purposes. In production you want to use the same cluster configuration as in development, but with a maximum of ten workers. Using the overriding functionality, you can specify the development cluster configuration in the resources mapping and then override it in the targets mapping for production. Here’s how you can do that:

resources:
  jobs:
    training_pipeline:
      name: training_pipeline
      job_clusters:
        - job_cluster_key: autoscale-cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            autoscale:
              min_workers: 1
              max_workers: 4
            spark_env_vars:
              PYSPARK_PYTHON: /databricks/python3/bin/python3

targets:
  production:
    resources:
      jobs:
        training_pipeline:
          name: training_pipeline
          job_clusters:
            - job_cluster_key: autoscale-cluster
              new_cluster:
                autoscale:
                  max_workers: 10

The overriding functionality allows you to reuse all the cluster specifications from the development resources. The only thing you need to specify for production is max_workers, whose value changes to ten. You can use the databricks bundle validate command to inspect the structure of your bundle and check whether your cluster configuration has been populated and overridden correctly.
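For example, to see the resolved configuration for the production target defined above:

databricks bundle validate -t production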
 
Deployment modes
Each stage of the development process has its own set of typical actions. In the development stage, that might mean using personal clusters for interactive coding or running concurrent jobs for fast testing. In the production stage, it’s using a service principal for job deployments and setting up automatic scheduling. DABs follow this convention by introducing the mode mapping. By specifying the mode for your development and production pipelines, the behaviours associated with each stage are applied automatically.
Some of the behaviours available in the development mode include:
 
- adding a prefix with your user name to jobs and pipelines, so you can easily identify your resources in Databricks for faster development,
- using interactive clusters by specifying compute-id: <cluster-id>,
- pausing schedules and triggers on deployed jobs,
- enabling concurrent runs on all jobs.
 
All you have to do to access those functionalities is set the mode keyword to development in the target of your choice. Here’s how to do that:

targets:
  dev:
    mode: development

Similarly, there are behaviours you can enable by setting the mode keyword to production. It will automatically:
- validate that your current Git branch matches the Git branch specified in the production target,
- block overriding cluster configurations with compute-id: <cluster-id>,
- enforce the use of a service principal to run jobs, or alternatively confirm that deployment paths are not overridden for specific users.
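Putting this together, a production target might look something like the sketch below; the workspace host and service principal name are placeholders for your own values:

targets:
  prod:
    mode: production
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.name}
    run_as:
      service_principal_name: company_service_principal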
 
A no-brainer for Databricks users
DABs are a great orchestration tool to enhance your CI/CD pipelines, and we’re looking forward to their continued development. Using DABs is straightforward, and it makes your configuration files modular and easier to understand. It enables fast development and robust production deployments. If you are a Databricks user, adopting DABs is a no-brainer. Let us know what your impressions of DABs are. We’d love to hear from you!