Create datasets


Purpose

Dataset spaces are meant to store curated, high-quality and well-documented data. Any steps prior to reaching this state are supposed to be performed in other (research) environments; only final versions are to be distributed to a dataset space.

Preparing data for a dataset

Every data harvesting, cleaning and curation workflow is different and should be designed on a case-by-case basis. Below we provide general principles that harmonise well with the data layout in Nuvolos.

Some important guidelines to consider when designing your dataset:

  1. One dataset should map to one space.

  2. If a dataset has multiple sub-databases (such as topical sub-databases, for example Health Indicators, Development Indicators, etc.), you may consider populating multiple instances in the same dataset space.

  3. Create vintages of your data to distinguish different point-in-time states of the same data.

  4. Table names need to be unique in a single vintage, but may be the same across multiple instances.

  5. When distributing from multiple instances, name clashes may occur, so avoid overlapping table names when designing a dataset.

  6. In the database layout, Organization and Space together form a Database, and Instance and Snapshot together form a Schema. This means that all instances of a space are stored in the same database, but each vintage is stored in a different schema (a sketch follows this list).

  7. We suggest only distributing completely scrubbed and final versions of data to a dataset space; intermediate states should be kept in research instances. It is not possible to modify data stored in dataset spaces (they are essentially "read-only" once data has been populated).
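
To make point 6 concrete, the sketch below composes a fully qualified table reference from hypothetical organisation, space, instance and snapshot names and queries it through the Nuvolos Python connector (see Access data from applications). The names and the get_connection() helper are assumptions here; look up your actual identifiers via Find database and schema path.

```python
# A minimal sketch, assuming the Nuvolos Python connector is available in the
# application. All names below are hypothetical; use "Find database and schema
# path" to look up your real identifiers.
import pandas as pd
from nuvolos import get_connection  # assumption: Nuvolos Python connector

# Organisation + Space -> database, Instance + Snapshot (vintage) -> schema
database = '"my_org/health_dataset"'          # hypothetical organisation/space
schema = '"health_indicators/2023_vintage"'   # hypothetical instance/snapshot

con = get_connection()
df = pd.read_sql(f'SELECT * FROM {database}.{schema}."COUNTRY_STATS" LIMIT 10', con)
print(df.head())
```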

A generic flow would require the following steps:

  1. Create the target dataset as a dataset space. Choose the appropriate visibility option (public for open datasets, private for datasets with stringent access control requirements). Activate tables in the dataset space.

  2. Create a research space, an application and the appropriate data harvesting code. Activate tables in the research space.

  3. Pull data from your source and insert the raw data into the Scientific Data Warehouse (SDW); see the sketch after this list.

  4. Clean and manipulate data to reach the desired layout and quality either via in-memory or in-SDW procedures or a combination of both.

  5. Document data.

  6. Distribute the data along with its documentation to the dataset space.

  7. Create a vintage in the dataset space by using the snapshot feature.
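
As a rough illustration of steps 3 and 4, the sketch below pulls a CSV from a hypothetical URL, performs a minimal in-memory clean-up with pandas and writes the result into the SDW. The URL, column and table names, and the connector helpers used are assumptions; adapt them to your own source and application.

```python
# A minimal sketch of steps 3-4, assuming a pandas in-memory clean-up and the
# Nuvolos Python connector's get_connection()/to_sql() helpers; the source URL,
# column names and table name are illustrative assumptions.
import pandas as pd
from nuvolos import get_connection, to_sql  # assumption: Nuvolos Python connector

# Step 3: pull raw data from the source (hypothetical public CSV)
raw = pd.read_csv("https://example.org/health_indicators.csv")

# Step 4: clean in memory to reach the desired layout and quality
clean = (
    raw.dropna(subset=["country_code", "year"])
       .rename(columns=str.upper)
)

# Insert the cleaned result into the Scientific Data Warehouse of the research space
con = get_connection()
to_sql(df=clean, name="HEALTH_INDICATORS", con=con, if_exists="replace", index=False)
```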

Documenting your data

Data documentation in Nuvolos may comprise the following artefacts:

  • Descriptive documentation in PDF or markdown format stored on the filesystem, along with ERDs and other representations if necessary.

  • Column comments for short-hand information on column values (unit of measurement, high-level description).

  • Table comments for short-hand information on tables.

You can add table and column comments programmatically by sending the appropriate SQL statements, as linked previously. If you would rather use a GUI solution, you can always use the interface on the tables view to edit the description of a column or table.
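
As an illustration, the sketch below sends such comment statements over a connection obtained from the Nuvolos Python connector. The table and column names are hypothetical, and the exact comment syntax depends on the warehouse dialect, so verify it against the data access guides before relying on it.

```python
# A minimal sketch, assuming the Nuvolos Python connector returns a DB-API
# connection with a cursor, and a Snowflake-style COMMENT syntax; the table
# and column names are hypothetical.
from nuvolos import get_connection  # assumption: Nuvolos Python connector

con = get_connection()
cur = con.cursor()

# Short-hand description of the table itself
cur.execute(
    "COMMENT ON TABLE HEALTH_INDICATORS IS "
    "'Annual country-level health indicators, one row per country and year'"
)

# Short-hand description of a column (unit of measurement, high-level meaning)
cur.execute(
    "COMMENT ON COLUMN HEALTH_INDICATORS.LIFE_EXPECTANCY IS "
    "'Life expectancy at birth, in years'"
)

cur.close()
```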

Non self-service datasets

In the case of non self-service datasets, we perform the steps above as part of our professional data services. The intermediate states are generally not available in your organisation as a research space; however, the end result is always stored with the appropriate rights in dataset spaces.