How to upload your curated data to a dataset space
Dataset spaces are meant to store curated, high-quality, well-documented data. Any steps required to reach this state should be performed in other (research) environments, and only final versions should be distributed to a dataset space.
Preparing data for a dataset
Every data harvesting, cleaning, and curation workflow is different and should be designed on a case-by-case basis. Below we provide general principles that harmonize well with the data layout in Nuvolos.
Some important guidelines to consider when designing your dataset:
One dataset should map to one space.
If a dataset has multiple sub-databases (such as topical sub-databases, for example Health Indicators, Development Indicators, etc.), you may consider populating multiple instances in the same dataset space.
Create vintages of your data to differentiate different point-in-time states of the same data.
Table names need to be unique within a single vintage, but the same name may be used across multiple instances.
When distributing from multiple instances, name clashes may occur, so avoid overlapping names whenever possible when designing a dataset.
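As a sketch of the last guideline, the helper below detects table names that appear in more than one instance before distribution. The function name and the mapping shape (instance name to a set of table names) are illustrative assumptions, not a Nuvolos API:

```python
from itertools import combinations

def find_name_clashes(instance_tables):
    """Given a mapping of instance name -> set of table names,
    return table names that appear in more than one instance,
    together with the instances involved."""
    clashes = {}
    for (a, tables_a), (b, tables_b) in combinations(instance_tables.items(), 2):
        for name in tables_a & tables_b:  # tables present in both instances
            clashes.setdefault(name, set()).update({a, b})
    return clashes
```

For example, two topical instances that both define an `indicators` table would be flagged, so the clash can be resolved while the dataset is still being designed.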
In the database layout, Organization and Space together form a Database, while Instance and Snapshot together form a Schema. This means that all instances of a space are stored in the same database, but each vintage is stored in a separate schema.
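This layout can be illustrated with a small helper that builds a fully qualified table reference. The `database`, `schema`, and `table` arguments are placeholders: on Nuvolos the database corresponds to (Organization, Space) and the schema to (Instance, Snapshot), and the actual identifiers are generated by the platform.

```python
def qualified_name(database, schema, table):
    """Build a fully qualified SQL table reference of the form
    "database"."schema"."table". The identifiers are placeholders;
    Nuvolos generates the real database and schema names."""
    def quote(ident):
        # Double up embedded double quotes, per SQL quoted-identifier rules.
        return '"' + ident.replace('"', '""') + '"'
    return ".".join(quote(part) for part in (database, schema, table))
```

Because every vintage lives in its own schema, switching vintages only changes the middle component of the reference, while the database component stays fixed for all instances of the space.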
We suggest distributing only completely scrubbed, final versions of data to a dataset space; intermediate states should be kept in research instances. Data stored in dataset spaces cannot be modified (it is essentially "read-only" once populated).
A generic flow would require the following steps:
Create target dataset as a dataset space. Choose the appropriate visibility option (public for open datasets and private for datasets with stringent access control requirements). Activate tables in the dataset.
Create a research space, an application and the appropriate data harvesting code. Activate tables in the research space. Once the data reaches its final, curated state, distribute it from the research space to the dataset space.
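A minimal sketch of a final "scrub" one might run in the research space before distributing. The function name, the row representation (a list of dicts), and the chosen checks are all illustrative assumptions; real curation workflows will differ case by case:

```python
def finalize_rows(rows, key):
    """Deduplicate rows on a key column and verify the key is present
    in every row -- a minimal final check before distributing data
    from a research instance to a dataset space."""
    seen = set()
    clean = []
    for row in rows:
        k = row.get(key)
        if k is None:
            raise ValueError(f"row missing key column {key!r}: {row}")
        if k in seen:
            continue  # keep the first occurrence only
        seen.add(k)
        clean.append(row)
    return clean
```

Checks like these belong in the research space: once the result is distributed to the dataset space it can no longer be modified.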
You can add table and column comments programmatically by sending the appropriate SQL statements, as linked previously. If you would rather use a GUI, you can always edit the description of a column or table directly in the tables view.
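As a sketch of the programmatic route, the helper below generates ANSI-style `COMMENT ON` statements for a table and its columns. The function name and arguments are illustrative, and the exact comment syntax accepted by the underlying data layer should be verified against the linked documentation before running the output:

```python
def comment_statements(table, table_comment=None, column_comments=None):
    """Generate ANSI-style COMMENT statements documenting a table
    and its columns. Verify the exact syntax supported by the
    data layer before executing these."""
    def lit(text):
        # Escape single quotes for a SQL string literal.
        return "'" + text.replace("'", "''") + "'"
    stmts = []
    if table_comment:
        stmts.append(f"COMMENT ON TABLE {table} IS {lit(table_comment)};")
    for col, comment in (column_comments or {}).items():
        stmts.append(f"COMMENT ON COLUMN {table}.{col} IS {lit(comment)};")
    return stmts
```

Generating the statements from a metadata dictionary like this keeps the documentation of a vintage reproducible, instead of relying on one-off manual edits.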
Non-self-service datasets
For non-self-service datasets, we perform the steps above as part of our professional data services. The intermediate states are generally not available in your organization as a research space; however, the end result is always stored in dataset spaces with the appropriate access rights.