Databricks with BioStudio

Databricks

Databricks is a unified analytics platform built on Apache Spark, designed for large-scale data processing. It integrates data engineering, data science, and business analytics into one environment. Delta Lake adds reliability with ACID transactions, while MLflow manages the machine learning lifecycle. The platform supports massive data volumes and complex workloads with scalability, offering collaborative notebooks and real-time co-editing to facilitate teamwork. Databricks integrates with AWS, Azure, Google Cloud, and various databases, providing enterprise-level security and compliance. It handles infrastructure management and is versatile for data warehousing, ETL, machine learning, and real-time analytics.
Databricks integrates seamlessly with Amazon S3, offering a robust environment for big data processing and analytics. It can read and write data directly to and from S3, allowing users to utilize S3 as a cost-effective and scalable storage solution. Databricks also allows you to mount S3 buckets to the Databricks file system (DBFS), making it easier to access and manage data stored in S3 as if it were part of the local file system.
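As a sketch of the mounting workflow described above, the snippet below shows how an S3 bucket is typically mounted to DBFS from inside a Databricks notebook. The bucket name and mount point are illustrative placeholders, and `dbutils` is only defined inside Databricks notebooks:

```python
def s3_mount_source(bucket_name: str) -> str:
    """Build the s3a:// URI that Databricks expects as a mount source."""
    return f"s3a://{bucket_name}"

# Inside a Databricks notebook, where `dbutils` is predefined
# (bucket and mount point below are placeholders):
#
# dbutils.fs.mount(
#     source=s3_mount_source("my-example-bucket"),
#     mount_point="/mnt/my-example-bucket",
# )
# display(dbutils.fs.ls("/mnt/my-example-bucket"))
```

Once mounted, the bucket's contents can be read and written under `/mnt/...` as if they were part of the local file system.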

BioStudio integration with Databricks

BioStudio supports integration with Databricks, enabling users to leverage its capabilities for data exploration and job execution. BioStudio offers various tools to streamline these processes. Currently, BioStudio supports:

  1. Mounting Cloud Object Storage: BioStudio can mount cloud object storage from AWS, GCP, and Azure on Databricks, syncing data from the user’s workspace.
  2. SQL Warehouse Connection: BioStudio also supports connections to SQL warehouses for optimized data querying and analysis.
  3. Cluster Connection: Users can connect to Databricks clusters for scalable computing.

Set up the Databricks connection with BioStudio.

                         +=============================+
                         |      User's Workspace       |
                         +=============================+
                                     |
                                     | Data Sync
                                     v
                         +=============================+
                         |   BioStudio Integration     |
                         |        with Databricks      |
                         +=============================+
                           |            |            |
                           |            |            |
     +---------------------+            |            +---------------------+
     |                                  |                                  |
     v                                  v                                  v
+-----------------+             +-----------------+             +-----------------+
| Cloud Object    |             | SQL Warehouse   |             | Databricks      |
| Storage (AWS,   |             | Connection      |             | Clusters        |
| GCP, Azure)     |             |                 |             |                 |
|                 |             |                 |             |                 |
+-----------------+             +-----------------+             +-----------------+
       |                               |                                |
       |                               |                                |
       v                               v                                v
+---------------------+       +-----------------+             +-----------------+
|  Amazon S3          |       | SQL Data        |             | EC2 Instances   |
|                     |       | Querying/       |             | (Configured     |
|                     |       | Analysis        |             | for HPC with    |
|                     |       |                 |             | FSx for Lustre) |
+---------------------+       +-----------------+             +-----------------+
       |
       | Connected via FSx for Lustre
       |
       v
+---------------------+
| High-Performance    |
| Parallel Workflows  |
| (SLURM/SGE Paradigm)|
+---------------------+

🔆 This bucket is associated with Databricks.


Protocol


🔆 Mount S3 bucket.

  1. Log in to BioStudio.
  2. Click the Cloud Storage icon.


Protocol

  1. Click Create Connection.
  2. BioStudio connects to the S3 bucket and mounts it.


Protocol

  1. The mount point can be viewed by opening the System Terminal.


Protocol


Protocol

  1. Create a file at the mount point.


Protocol

  1. The file is now available in Databricks.


Protocol

Databricks connection with BioStudio Visual Studio Code.

🔆 Cluster Connection
A complete Databricks workspace, including its resources, can be connected to BioStudio using an access token.

  1. Generate a token to connect to BioStudio by clicking Settings.


Protocol

  1. Select user -> Developer -> Access tokens -> Generate new token.


Protocol

  1. Open the VS Code application from BioStudio.


Protocol

  1. Configure the Databricks connection.


Protocol

A Databricks connection profile (typically stored in ~/.databrickscfg) looks like:

[Bioturing-databrick]
host = https://dbc-454984-e221564448e.cloud.databricks.com
token = dap2165411316rdtffdbvrdftgyrtfhgrd321654646

  1. Provide the host name.


Protocol

  1. Edit Databricks profile.


Protocol

  1. Provide all values.


Protocol

  1. The configuration has been saved.
  2. Click the Databricks extension -> Configure Databricks to check connectivity.


Protocol

  1. Select saved profile.


Protocol

  1. The connection succeeded.


Protocol

  1. The user connection is now available in Databricks.

Protocol

BioStudio connection with Databricks SQL warehouses.

🔆 SQL Warehouses

  • A SQL warehouse is needed to perform operations such as executing queries.
  1. Start the SQL warehouse server.


Protocol


Protocol

  1. The SQL warehouse server has started.


Protocol

  1. We will use Python to execute queries against the SQL warehouse.


Protocol

  1. Generate a token to execute queries from BioStudio.

  2. Create a conda environment / kernel: select Kernels -> fill in all the values.


Protocol


Protocol

  1. Check that the environment is ready to use.


Protocol

  1. Activate the environment:
ub-lalit-e33d9fbdb4e96d0@colabdev-868c4b58bb-ctb8k:~$ conda env list
# conda environments:
#
databrick                /data/ub-lalit-e33d9fbdb4e96d0/.conda/envs/databrick
databrick-sqlwarehose     /data/ub-lalit-e33d9fbdb4e96d0/.conda/envs/databrick-sqlwarehose
base                     /miniconda/user

ub-lalit-e33d9fbdb4e96d0@colabdev-868c4b58bb-ctb8k:~$ conda activate databrick-sqlwarehose
(databrick-sqlwarehose) ub-lalit-e33d9fbdb4e96d0@colabdev-868c4b58bb-ctb8k:~$ 
  1. Install the Databricks SQL Connector for Python library on your development machine by running:
pip install databricks-sql-connector
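With the connector installed, a minimal query sketch looks like the following. The hostname, warehouse ID, and token below are placeholders to be replaced with your own workspace values:

```python
def warehouse_http_path(warehouse_id: str) -> str:
    """HTTP path for a Databricks SQL warehouse, derived from its ID."""
    return f"/sql/1.0/warehouses/{warehouse_id}"


def run_query(server_hostname: str, access_token: str, warehouse_id: str, query: str):
    """Execute a query against a SQL warehouse and return all rows."""
    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(
        server_hostname=server_hostname,
        http_path=warehouse_http_path(warehouse_id),
        access_token=access_token,
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            return cursor.fetchall()


# Example usage (placeholders -- substitute real workspace values):
# rows = run_query(
#     server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",
#     access_token="dapiXXXXXXXXXXXXXXXX",
#     warehouse_id="0123456789abcdef",
#     query="SELECT 1 AS ok",
# )
# print(rows)
```

This is the same pattern used in the notebook steps below; only the query string changes.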


Protocol

  1. Create a new notebook.


Protocol

  1. Select Kernel.


Protocol

  1. Run the notebook.


Protocol

BioStudio connection with Databricks Compute.

🔆 Compute

  1. Create a cluster.


Protocol


Protocol

  • The cluster is now ready.


Protocol

  1. Attach the cluster to BioStudio.


Protocol

  1. The cluster is connected.


Protocol

  1. Select a kernel to run code.


Protocol
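With the cluster attached, code can also be run against it from BioStudio through Databricks Connect. The sketch below assumes the `databricks-connect` package is installed (its version must match the cluster's runtime); the host, token, and cluster ID are placeholders:

```python
def normalize_host(host: str) -> str:
    """Databricks Connect expects the workspace URL without a trailing slash."""
    return host.rstrip("/")


def get_remote_spark(host: str, token: str, cluster_id: str):
    """Create a Spark session bound to a remote Databricks cluster."""
    # pip install databricks-connect (version matching the cluster runtime)
    from databricks.connect import DatabricksSession

    return DatabricksSession.builder.remote(
        host=normalize_host(host),
        token=token,
        cluster_id=cluster_id,
    ).getOrCreate()


# Example usage (placeholders -- substitute real workspace values):
# spark = get_remote_spark(
#     "https://dbc-xxxxxxxx.cloud.databricks.com/",
#     "dapiXXXXXXXXXXXXXXXX",
#     "0123-456789-abcdefgh",
# )
# spark.range(5).show()
```

Once the session is created, Spark code in the notebook executes on the remote cluster rather than locally.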

  1. The Databricks power tool extension is also available.


Protocol