
Databricks Lakehouse

Overview

This destination syncs data to Delta Lake on Databricks Lakehouse. Each stream is written to its own Delta table.

This connector requires a JDBC driver to connect to the Databricks cluster. To use the driver and the connector, you must agree to the JDBC ODBC driver license. This license means that you can only use this connector to connect third-party applications to Apache Spark SQL within a Databricks offering using the ODBC and/or JDBC protocols.

Currently, this connector requires 30+MB of memory for each stream. When syncing multiple streams, it may run into an out-of-memory error if the allocated memory is too small. This performance bottleneck is tracked in this issue. Once this issue is resolved, the connector should be able to sync an almost infinite number of streams with less than 500MB of memory.

Getting started

Databricks AWS Setup

1. Create a Databricks Workspace

TIP: Alternatively, use the Databricks quickstart to create a new workspace

2. Create a metastore and attach it to the workspace

IMPORTANT: The metastore should be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the cloud storage bucket you created earlier.

Setup storage bucket and IAM role in AWS

Follow Configure a storage bucket and IAM role in AWS to set up an AWS bucket with the necessary permissions.

Create metastore

  • Log in to the Databricks account console with admin permissions.

  • Go to the Data tab and click the Create metastore button:

  • Provide all necessary data and click Create:

  • Select the workspaces in the Assign to workspaces tab and click Assign.

3. Create Databricks SQL Warehouse

TIP: If you use a Databricks cluster, skip this step

  • Open the workspace tab and click on the created workspace console:

  • Create a SQL warehouse:

    • Switch to the SQL tab
    • Click the New button
    • Choose SQL Warehouse
  • After the SQL warehouse is created, open its Connection details to configure the connection.

4. Databricks SQL Warehouse connection details

TIP: If you use a Databricks cluster, skip this step

  • Open the workspace console.

  • Go to the SQL Warehouse section and open it.

  • Open Connection Details tab:

IMPORTANT: The Server hostname, Port, and HTTP path are used for the Airbyte connection
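Client tools typically combine these three values into a single JDBC URL. The sketch below shows one way to assemble it; the property names (transportMode, ssl, httpPath, AuthMech, UID) follow common Databricks JDBC driver conventions and should be verified against your driver's documentation, and the personal access token is supplied separately as the password:

```python
# Sketch only: assemble a Databricks JDBC URL from the Connection Details
# values. Property names are assumptions based on Databricks JDBC driver
# conventions; check them against the driver version you use.
def build_jdbc_url(server_hostname: str, http_path: str, port: int = 443) -> str:
    return (
        f"jdbc:databricks://{server_hostname}:{port}/default;"
        f"transportMode=http;ssl=1;httpPath={http_path};AuthMech=3;UID=token"
    )

# Example values taken from the Configuration table in this document:
url = build_jdbc_url(
    "abc-12345678-wxyz.cloud.databricks.com",
    "sql/protocolvx/o/1234567489/0000-1111111-abcd90",
)
```

Note that `ssl=1` must be present, matching the fix noted in the 0.3.1 changelog entry below.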

5. Create Databricks Cluster

TIP: If you use a Databricks SQL warehouse, skip this step

  • Open the workspace tab and click on the created workspace console:

  • Create a cluster:

    • Switch to Data science & Engineering
    • Click the New button
    • Choose Cluster

6. Databricks Cluster connection details

TIP: If you use a Databricks SQL warehouse, skip this step

  • Open the workspace console.

  • Go to the Compute section under Data science & Engineering and click on the cluster link:

  • Open Advanced options under Configuration and choose the JDBC/ODBC tab:

    IMPORTANT: The Server hostname, Port, and HTTP path are used for the Airbyte connection

7. Create Databricks Token

  • Open the workspace console.

  • Open User Settings, go to the Access tokens tab, and click Generate new token:

  • In the new window, enter an optional comment and the token lifetime:

TIP: Lifetime can be set to 0

8. Adding External Locations (Optional)

TIP: Skip this step if no external data source is used.

  • Open the workspace console.

  • Go to the Data section, expand External Location, and click the Create Location button:

  • Fill in the fields and click the Create button:

TIP: A new storage credential can be added in the Storage Credentials tab, or you can reuse the one created for the metastore.

Airbyte Setup

Databricks fields

Data Source

Choose one of the following data source types:

  • Managed tables
  • Amazon S3 (External storage)
  • Azure Blob Storage (External storage)

Managed tables data source type

Please see the Databricks documentation on managed tables.

TIP: No additional setup is needed for this type.

Amazon S3 data source type (External storage)

IMPORTANT: Make sure the external location has been added to the workspace. See the Adding External Locations step.

Provide your Amazon S3 data:

  • S3 Bucket Name - The bucket name
  • S3 Bucket Path - Subdirectory under the above bucket to sync the data into
  • S3 Bucket Region - See here for all region codes.

    IMPORTANT: The metastore should be in the same region as the workspaces you want to use to access the data. Make sure that this matches the region of the cloud storage bucket you created earlier.

  • S3 Access Key ID -
    • See this on how to generate an access key.
    • We recommend creating an Airbyte-specific user. This user will require read and write permissions to objects in the bucket.
  • S3 Secret Access Key - The secret key corresponding to the above key ID
  • S3 Filename pattern - Sets the file-name format for the S3 staging file(s). The following placeholders are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Do not use spaces or unsupported placeholders, as they won't be recognized.
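To illustrate how the placeholders combine into a file name, here is a simple sketch. This is illustrative only: the exact date/timestamp formatting Airbyte applies to each placeholder may differ from the choices made here.

```python
from datetime import datetime, timezone

def render_filename(pattern: str, part_number: int, sync_id: int,
                    fmt_extension: str, now: datetime) -> str:
    """Illustrative expansion of the supported placeholders.

    The concrete format of each value (e.g. date separators, epoch units)
    is an assumption for demonstration, not Airbyte's exact behavior.
    """
    epoch_ms = int(now.timestamp() * 1000)
    values = {
        "{date}": now.strftime("%Y_%m_%d"),
        "{date:yyyy_MM}": now.strftime("%Y_%m"),
        "{timestamp}": str(epoch_ms),
        "{timestamp:millis}": str(epoch_ms),
        "{timestamp:micros}": str(epoch_ms * 1000),
        "{part_number}": str(part_number),
        "{sync_id}": str(sync_id),
        "{format_extension}": fmt_extension,
    }
    for placeholder, value in values.items():
        pattern = pattern.replace(placeholder, value)
    return pattern

now = datetime(2024, 1, 15, tzinfo=timezone.utc)
name = render_filename("{date}_{part_number}{format_extension}", 3, 42, ".parquet", now)
```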

Azure Blob Storage data source type (External storage)

IMPORTANT: This feature is a work in progress.

Sync Mode

| Feature                        | Support | Notes |
| :----------------------------- | :-----: | :---- |
| Full Refresh Sync              |         | Warning: this mode deletes all previously synced data in the configured bucket path. |
| Incremental - Append Sync      |         |       |
| Incremental - Append + Deduped |         |       |
| Namespaces                     |         |       |

Configuration

| Category | Parameter | Type | Notes |
| :--- | :--- | :--- | :--- |
| Databricks | Server Hostname | string | Required. Example: abc-12345678-wxyz.cloud.databricks.com. See documentation. Please note that this is the server for the Databricks Cluster. It is different from the SQL Endpoint Cluster. |
| | HTTP Path | string | Required. Example: sql/protocolvx/o/1234567489/0000-1111111-abcd90. See documentation. |
| | Port | string | Optional. Defaults to "443". See documentation. |
| | Personal Access Token | string | Required. Example: dapi0123456789abcdefghij0123456789AB. See documentation. |
| General | Databricks catalog | string | Optional. The name of the catalog. If not specified, "hive_metastore" will be used. |
| | Database schema | string | Optional. The default schema tables are written to. If not specified, "default" will be used. |
| | Schema evolution | boolean | Optional. Enables automatic schema evolution in the destination tables. |
| | Purge Staging Data | boolean | The connector creates staging files and tables on S3 or Azure. By default, they are purged when the data sync is complete. Set to false for debugging purposes. |
| Data Source - S3 | Bucket Name | string | Name of the bucket to sync data into. |
| | Bucket Path | string | Subdirectory under the above bucket to sync the data into. |
| | Region | string | See documentation for all region codes. |
| | Access Key ID | string | AWS/Minio credential. |
| | Secret Access Key | string | AWS/Minio credential. |
| | S3 Filename pattern | string | Sets the file-name format for the S3 staging file(s). Supported placeholders: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Do not use spaces or unsupported placeholders, as they won't be recognized. |
| Data Source - Azure | Account Name | string | Name of the account to sync data into. |
| | Container Name | string | Container under the above account to sync the data into. |
| | SAS token | string | Shared-access signature token for the above account. |
| | Endpoint domain name | string | Usually blob.core.windows.net. |

⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend you provision a dedicated S3 or Azure resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️

Staging Files (Delta Format)

S3

Data streams are first written as staging delta-table (Parquet + Transaction Log) files on S3, and then loaded into Databricks delta-tables. All the staging files will be deleted after the sync is done. For debugging purposes, here is the full path for a staging file:

s3://<bucket-name>/<bucket-path>/<uuid>/<stream-name>

For example:

s3://testing_bucket/data_output_path/98c450be-5b1c-422d-b8b5-6ca9903727d9/users/_delta_log
     ↑              ↑                ↑                                    ↑     ↑
     |              |                |                                    |     transaction log
     |              |                |                                    stream name
     |              |                sync UUID
     |              bucket path
     bucket name
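The staging path above can be assembled like this. This is a minimal sketch using the example values from this section; `staging_path` is a hypothetical helper for illustration, not part of the connector:

```python
import uuid
from typing import Optional

def staging_path(bucket_name: str, bucket_path: str, stream_name: str,
                 sync_uuid: Optional[str] = None) -> str:
    """Follow the template s3://<bucket-name>/<bucket-path>/<uuid>/<stream-name>."""
    # A fresh UUID is generated when one is not supplied.
    sync_uuid = sync_uuid or str(uuid.uuid4())
    return f"s3://{bucket_name}/{bucket_path}/{sync_uuid}/{stream_name}"

path = staging_path(
    "testing_bucket", "data_output_path", "users",
    sync_uuid="98c450be-5b1c-422d-b8b5-6ca9903727d9",
)
```

The Delta transaction log then lives under `<path>/_delta_log`, as in the example above.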

Azure

Similarly, streams are first written to a staging location, but the Azure option uses CSV format. A staging table is created from the CSV files.

Unmanaged Spark SQL Table

Currently, all streams are synced into unmanaged Spark SQL tables. See documentation for details. In summary, you have full control of the location of the data underlying an unmanaged table. In S3, the full path of each data stream is:

s3://<bucket-name>/<bucket-path>/<database-schema>/<stream-name>

For example:

s3://testing_bucket/data_output_path/public/users
     ↑              ↑                ↑      ↑
     |              |                |      stream name
     |              |                database schema
     |              bucket path
     bucket name

In Azure, the full path of each data stream is:

abfss://<container-name>@<account-name>.dfs.core.windows.net/<database-schema>/<stream-name>

Please keep these data directories on S3/Azure. Otherwise, the corresponding tables will have no data in Databricks.
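Both path templates above can be sketched as simple helpers. These are hypothetical functions for illustration only; they merely restate the templates from this section:

```python
def s3_table_path(bucket_name: str, bucket_path: str, schema: str, stream: str) -> str:
    # s3://<bucket-name>/<bucket-path>/<database-schema>/<stream-name>
    return f"s3://{bucket_name}/{bucket_path}/{schema}/{stream}"

def abfss_table_path(container: str, account: str, schema: str, stream: str) -> str:
    # abfss://<container-name>@<account-name>.dfs.core.windows.net/<database-schema>/<stream-name>
    return f"abfss://{container}@{account}.dfs.core.windows.net/{schema}/{stream}"
```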

Output Schema

Each table will have the following columns:

| Column | Type | Notes |
| :--- | :--- | :--- |
| _airbyte_ab_id | string | UUID. |
| _airbyte_emitted_at | timestamp | Data emission timestamp. |
| _airbyte_data | JSON | The data from your source will be in this column. |

Under the hood, an Airbyte data stream in JSON schema is first converted to an Avro schema, then each JSON object is converted to an Avro record, and finally the Avro record is written out in Parquet format. Because the data stream can come from any data source, the JSON-to-Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.
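As a loose illustration of the kind of type mapping such a conversion involves (this is not Airbyte's actual implementation, which also handles unions, nested objects, name collisions, and the other limitations documented at the link above):

```python
# Illustrative only: a simplified JSON-schema-type to Avro-type mapping.
# The pairs shown follow common JSON-schema/Avro correspondences; Airbyte's
# real converter applies many additional rules.
JSON_TO_AVRO = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
    "object": "record",
    "array": "array",
    "null": "null",
}

def avro_type(json_type: str) -> str:
    # Fall back to string for unrecognized types in this sketch.
    return JSON_TO_AVRO.get(json_type, "string")
```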

If you are interested in learning more about the Databricks connector or the details of how the Delta Lake tables are created, consult the tutorial on How to Load Data into Delta Lake on Databricks Lakehouse.

Changelog

| Version | Date | Pull Request | Subject |
| :--- | :--- | :--- | :--- |
| 3.0.0 | 2024-07-12 | #40689 | (Private release, not to be used for production) Add _airbyte_generation_id column, and sync_id entry in _airbyte_meta |
| 2.0.0 | 2024-05-17 | #37613 | (Private release, not to be used for production) Alpha release of the connector to use Unity Catalog |
| 1.1.2 | 2024-04-04 | #36846 | (incompatible with CDK, do not use) Remove duplicate S3 Region |
| 1.1.1 | 2024-01-03 | #33924 | (incompatible with CDK, do not use) Add new ap-southeast-3 AWS region |
| 1.1.0 | 2023-06-02 | #26942 | Support schema evolution |
| 1.0.2 | 2023-04-20 | #25366 | Fix default catalog to be hive_metastore |
| 1.0.1 | 2023-03-30 | #24657 | Fix support for external tables on S3 |
| 1.0.0 | 2023-03-21 | #23965 | Added: Managed table storage type, Databricks Catalog field |
| 0.3.1 | 2022-10-15 | #18032 | Add SSL=1 to the JDBC URL to ensure SSL connection. |
| 0.3.0 | 2022-10-14 | #15329 | Add support for Azure storage. |
| | 2022-09-01 | #16243 | Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields) |
| 0.2.6 | 2022-08-05 | #14801 | Fix multiply log bindings |
| 0.2.5 | 2022-07-15 | #14494 | Make S3 output filename configurable. |
| 0.2.4 | 2022-07-14 | #14618 | Removed additionalProperties: false from JDBC destination connectors |
| 0.2.3 | 2022-06-16 | #13852 | Updated stacktrace format for any trace message errors |
| 0.2.2 | 2022-06-13 | #13722 | Rename to "Databricks Lakehouse". |
| 0.2.1 | 2022-06-08 | #13630 | Rename to "Databricks Delta Lake" and add field orders in the spec. |
| 0.2.0 | 2022-05-15 | #12861 | Use new public Databricks JDBC driver, and open source the connector. |
| 0.1.5 | 2022-05-04 | #12578 | In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging. |
| 0.1.4 | 2022-02-14 | #10256 | Add -XX:+ExitOnOutOfMemoryError JVM option |
| 0.1.3 | 2022-01-06 | #7622 #9153 | Upgrade Spark JDBC driver to 2.6.21 to patch Log4j vulnerability; update connector fields title/description. |
| 0.1.2 | 2021-11-03 | #7288 | Support Json additionalProperties. |
| 0.1.1 | 2021-10-05 | #6792 | Require users to accept Databricks JDBC Driver Terms & Conditions. |
| 0.1.0 | 2021-09-14 | #5998 | Initial private release. |