Databricks Connect
Nuvolos now offers a VSCode application with Python 3.9, R 4.2, and Databricks Connect (`databricks-connect`) pre-installed. From this application, you can submit Spark jobs to Databricks-hosted Spark clusters. To configure the connection to Databricks, you will need a personal access token, which is not available in the Databricks Community Edition.
First, create a "Databricks 10.4 LTS + Py39 + R 4.2" application in Nuvolos:

Start the new application, open a terminal, and configure your Databricks connection with the `databricks-connect configure` command. You will need the URL of your Databricks cluster and your personal access token. You can test your connection with the command `databricks-connect test`.
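If the test passes, you can also verify the connection from Python by running a trivial job on the cluster. A minimal sketch, assuming `databricks-connect configure` has already been run:

```python
from pyspark.sql import SparkSession

# getOrCreate() picks up the databricks-connect configuration
spark = SparkSession.builder.getOrCreate()

# A trivial job executed on the remote Databricks cluster
print(spark.range(100).count())  # expected output: 100
```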
To run the example, please install the `slugify` Python package with the following command:

```bash
conda install -y -c conda-forge python-slugify
```
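The `slugify` function is used below to turn the CSV's human-readable column headers into Spark-friendly identifiers. For illustration, using two column names from the dataset in the example:

```python
from slugify import slugify

print(slugify("Model Year", separator="_"))        # model_year
print(slugify("Electric Utility", separator="_"))  # electric_utility
```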
Once you have configured the Databricks connection, you can try the following simple example to create a Databricks table and run a SQL query on the table:
```python
import pandas as pd
from pyspark.sql import SparkSession
from slugify import slugify

spark = SparkSession.builder.getOrCreate()
# Speed up pandas <-> Spark DataFrame conversion with Arrow
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Use Washington State Electric Vehicle Population data
df = pd.read_csv("https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD")
# Normalize the CSV headers into Spark-friendly column names
df.columns = [slugify(c, separator="_") for c in df.columns]
# Drop columns not needed for this example
df = df.drop(columns=["vehicle_location", "electric_utility"])

# Create a Spark DataFrame and persist it as a Databricks table
df = spark.createDataFrame(df)
df.write.mode("overwrite").saveAsTable("ev_data")

# Run a SQL query against the new table
spark.sql("select make, model, count(*) as registered from ev_data group by make, model order by registered desc").show(10)
```
You will see a result like:

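Because `saveAsTable` persists the data through the Databricks metastore, the table remains available to later sessions and notebooks on the cluster. A minimal sketch of reading it back, and dropping it once you no longer need it:

```python
# Assumes the SparkSession from the example above
ev = spark.table("ev_data")
print(ev.count())

# Clean up the example table when done
spark.sql("drop table if exists ev_data")
```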
The `sparklyr` package is also pre-installed in the application, which allows you to connect to Databricks Spark clusters configured with `databricks-connect`.
You can run the following R script to submit a simple job to your Databricks cluster:
```r
library(sparklyr)
library(dplyr)

# Connect via the Spark home configured by databricks-connect
databricks_connect_spark_home <- system("databricks-connect get-spark-home", intern = TRUE)
sc <- spark_connect(method = "databricks", spark_home = databricks_connect_spark_home)

# Copy the built-in mtcars dataset to the cluster
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Aggregate on the cluster, then collect and print the result
summary_tbl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg, na.rm = TRUE),
    mean_hp = mean(hp, na.rm = TRUE)
  )
print(as_tibble(summary_tbl), n = 10)

spark_disconnect(sc)
```
You should see an output like:

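Note that `sparklyr` translates the `dplyr` verbs into Spark SQL that runs on the Databricks cluster; only the collected summary tibble is transferred back to your R session.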