Apache Airflow
For researchers who require scheduled workflows, Nuvolos supports Airflow as a self-service application. Airflow runs inside a JupyterLab application, making it easy to edit Airflow DAG files, install packages and use the Nuvolos filesystem for data processing.
The JupyterLab application is collaborative, so DAGs can be worked on simultaneously by multiple users in a "Google Docs"-like fashion.
DAGs should be created as Python files in the /files/airflow/dags folder; refer to the Airflow documentation for an example.
Create a new Python file named /files/airflow/dags/tutorial.py and copy in the contents of the tutorial DAG from the Airflow tutorial.
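For orientation, an abridged sketch of that tutorial DAG is shown below; it only illustrates the overall shape of a DAG file, while the full example with templating and documentation lives in the Airflow tutorial:

```python
# Abridged sketch of the Airflow tutorial DAG, for illustration only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "tutorial",
    description="A simple tutorial DAG",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Two Bash tasks: print the date, then sleep for a few seconds.
    t1 = BashOperator(task_id="print_date", bash_command="date")
    t2 = BashOperator(task_id="sleep", bash_command="sleep 5", retries=3)

    t1 >> t2  # t2 runs only after t1 has succeeded
```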
Click on the Airflow tab and select the All DAGs filter on the UI; the DAG should show up on the list as on the screenshot below. It can take up to a minute for the DAG to appear, because Airflow periodically scans the Python files in the /files/airflow/dags folder for new DAG definitions.
Click on the slider toggle next to the tutorial DAG name to enable the DAG and start the first execution.
You should quickly see that the DAG has executed successfully: a 1 appears in a green circle in the Runs column.
Airflow Connections and Variables can be configured on the Airflow UI.
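A task can then read those values at run time. In the sketch below, my_api_key and my_database are placeholder names that you would first create under Admin → Variables and Admin → Connections:

```python
# Minimal sketch: reading an Airflow Variable and Connection from a task.
# "my_api_key" and "my_database" are placeholders defined on the Airflow UI.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.hooks.base import BaseHook
from airflow.models import Variable


@dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
def config_example():
    @task
    def show_config():
        api_key = Variable.get("my_api_key")
        conn = BaseHook.get_connection("my_database")
        print(f"Using key {api_key} against {conn.host} as {conn.login}")

    show_config()


config_example()
```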
Airflow on Nuvolos uses a CeleryExecutor backend, so tasks can be executed in parallel.
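In practice this means that tasks with no dependencies on each other can run at the same time; the illustrative DAG below fans out three independent extract calls before combining their results:

```python
# Illustrative sketch of a DAG whose independent tasks can run concurrently
# under the CeleryExecutor; DAG and task names are examples only.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2023, 1, 1), catchup=False)
def parallel_example():
    @task
    def extract(source: str) -> str:
        return f"data from {source}"

    @task
    def combine(a: str, b: str, c: str) -> None:
        print(a, b, c)

    # The three extract calls do not depend on each other, so the scheduler
    # is free to run them in parallel on the Celery workers.
    combine(extract("a"), extract("b"), extract("c"))


parallel_example()
```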
To install packages used in DAGs, simply open a JupyterLab terminal and pip / conda / mamba install the required package. Please refer to the Install a software package chapter of our documentation for detailed instructions.
Task execution, scheduler and DAG bag update logs can be found in the /files/airflow/logs folder.
The following example illustrates how to create a DAG that downloads CSV data from an API, saves the data as a compressed Parquet file and uploads the data as a Nuvolos table.
Airflow will use the database credentials of the user starting the application.
Create a new Airflow application in your working instance and start the application.
Once Airflow starts, open a new terminal tab and run the following commands to install package dependencies:
mamba install -y --freeze-installed -c conda-forge pandas-datareader
mamba install -y --freeze-installed -c conda-forge pyarrow
Once the setup is complete, save the following script as the file /files/airflow/dags/csv_to_nuvolos.
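The sketch below shows one possible shape for this DAG; it uses a FRED series fetched via pandas-datareader as a stand-in for "CSV data from an API" and assumes the nuvolos connector package for the table upload, so adapt the source, paths and table name to your own setup:

```python
# Sketch of the csv_to_nuvolos DAG: fetch data, write compressed Parquet,
# upload to a Nuvolos table. Data source, paths and table name are examples.
import os
from datetime import datetime

import pandas as pd
import pandas_datareader.data as web
from airflow.decorators import dag, task

PARQUET_PATH = "/files/airflow/data/gdp.parquet"  # illustrative location


@dag(
    dag_id="csv_to_nuvolos",
    schedule_interval=None,  # triggered manually from the Airflow UI
    start_date=datetime(2023, 1, 1),
    catchup=False,
)
def csv_to_nuvolos():
    @task
    def download() -> str:
        # FRED stands in for "an API serving CSV data"; replace with your API.
        df = web.DataReader("GDP", "fred", start=datetime(2010, 1, 1)).reset_index()
        os.makedirs(os.path.dirname(PARQUET_PATH), exist_ok=True)
        # pyarrow (installed above) writes the compressed Parquet file.
        df.to_parquet(PARQUET_PATH, compression="gzip")
        return PARQUET_PATH

    @task
    def upload(path: str) -> None:
        # Assumed Nuvolos connector API; it uses the database credentials
        # of the user who started the Airflow application.
        from nuvolos import get_connection, to_sql

        df = pd.read_parquet(path)
        con = get_connection()
        to_sql(df=df, name="GDP_DATA", con=con, if_exists="replace", index=False)

    upload(download())


csv_to_nuvolos()
```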
Save the file; a new DAG should show up on the Airflow tab within a couple of seconds. Click on the slider toggle next to the csv_to_nuvolos DAG name to enable the DAG:
Click on the blue "play" icon to trigger the execution of the DAG. Click on the name of the DAG to see the progress:
When all steps complete successfully, they show up dark green in Airflow. You can now check the resulting table in the Tables view:
Airflow is now also available bundled with VSCode, which makes developing DAGs easier. To use Airflow with VSCode, select the latest version of the "Airflow + Code-server + Python" app from the Airflow application type:
Next, when the application launches, open the Command Palette with Ctrl + Shift + P (Command + Shift + P on a Mac) and type "Airflow" to select the Airflow: Show Airflow command:
This command will open Airflow in a new VSCode tab:
To install additional Python dependencies, open a Terminal in VSCode and install the package with mamba install -y -c conda-forge --freeze-installed <package_name>.