NUVOLOS
Sign In
  • Getting Started
    • Introduction to Nuvolos
    • Documentation structure
    • Nuvolos basic concepts
      • Organisational hierarchy
      • Applications
      • Distribution
      • Data integration
      • Snapshots
      • Background tasks
    • Navigate in Nuvolos
    • Quickstart tutorials
      • Research
      • Education (instructor)
      • Education (student)
  • Features
    • Applications
      • Application resources
      • Sessions
        • Session Logs
      • Install a software package
      • Create a persistent .bashrc
      • Automatic code execution
      • Long-running applications
      • Troubleshooting applications
      • New applications or licenses
      • Configuring applications
      • Exporting applications
      • Add-ons
        • MariaDB add-on
        • PostgreSQL add-on
        • OpenSearch add-on
        • MongoDB add-on
        • Redis add-on
        • PostGIS add-on
        • Rclone mount add-on
        • Neo4j add-on
    • File system and storage
      • File navigator
      • Large File Storage
      • Preview files
      • Mount Dropbox
      • Access S3 buckets with RClone
      • Access remote files with SSHFS
      • Access files on SharePoint Online
    • Object distribution
      • Distribution strategies
      • The distributed instance
    • Snapshots
      • Create a snapshot
      • Restore a snapshot
      • Delete a snapshot
    • Database integration
      • Create datasets
      • View tables
      • Build queries
      • Upload data
      • Access data from applications
        • Set up ODBC drivers
        • Obtain tokens for data access
        • Find database and schema path
      • DBeaver integration
    • Environment variables and secrets
    • Searching
      • Page
      • Find an application
      • Find an organisation
      • Find a space
      • Find an instance
      • Find a state
    • Video library
    • Nuvolos CLI and Python API
      • Installing the CLI
      • Using the CLI
  • User Guides
    • Research guides
      • Inviting a reviewer
      • GPU computation
    • Education guides
      • Setting assignments
        • Programmatical assignment handling
      • Documenting your course
      • Setting up group projects
        • Collaborative application editing
      • Configuring student applications
      • Archiving your course
      • Student guides
        • Joining a course
        • Working on assignments
        • Leaving a course
    • Application-specific guides
      • JupyterLab
      • RStudio
      • VSCode
      • Stata
      • MATLAB
      • Terminal
      • Terminal [tmux]
      • Apache Airflow
      • Apache Superset
      • D-Wave Inspector
      • MLFlow
      • Databricks Connect
      • Dynare.jl
      • CloudBeaver
      • InveLab
      • Overleaf
      • Metabase
      • DNDCv.CAN
      • OpenMetaData
      • Uploading data to the Large File Storage
    • Data guides
      • Setting up a dataset on Nuvolos
      • Importing data on Nuvolos
      • A complete database research workflow (Matlab & RStudio)
      • Accessing data as data.frames in R
      • Working with CRSP and Compustat
      • Working with the S&P 500®
  • Pricing and Billing
    • Pricing structure
    • Resource pools and budgets
    • Nuvolos Compute Units (NCUs)
  • Administration
    • Roles
      • Requesting roles
    • Organisation management
    • Space management
      • Invite to a space
      • Revoke a space user
      • HPC spaces
      • Resting spaces
    • Instance management
      • Invite to an instance
    • Enabling extra services
    • Monitoring resource usage
      • NCU Limits and Capacities
  • Reference
    • Application reference
      • InveLab
        • Dataset selection
        • Modules
          • Time-series visualisation
          • Moment estimation
          • Mean-variance frontiers
          • Frontiers
          • Dynamic strategy
          • Portfolio analysis
          • Performance analysis
          • Benchmarking
          • Carry trade strategies
          • Risk measures
          • Conditional volatility
          • Replication
          • Factor factory
          • Factor tilting
          • Valuation
    • Glossary
  • FAQs
    • FAQs
    • Troubleshooting
      • Login troubleshooting
        • I forgot my email address
        • I forgot my identity provider
        • I can't log in to Nuvolos
        • I forgot my password
        • I haven't received the password reset email
        • I haven't received the invitation email
      • Application troubleshooting
        • I can't see an application
        • I can't start an application
        • I can't create an application
        • I can't delete an application
        • I can't stop a running application
        • JupyterLab 3 troubleshooting
        • Spyder 3.7 troubleshooting
      • Administration troubleshooting
        • I can't see a space
        • I can't create a space
        • I can't delete a space
        • I can't invite admins to my space
        • I can't see an instance
        • I can't create an instance
        • I can't delete an instance
        • I can't invite users to an instance
        • I can't see distributed content in my instance
        • I can't see a snapshot
        • I can't create a snapshot
        • I can't delete a snapshot
        • I can't revoke a user role
        • I can't upload a file
        • I can't delete a file
        • I can't invite students to my course
      • Content troubleshooting
        • I can't find my files in my Linux home
        • I can't find my files among the Workspace files
        • I restored a snapshot by mistake
Powered by GitBook
On this page
  • Getting access
  • Accessing data in Nuvolos
  • About Compustat and CRSP
  • Linking Compustat and CRSP
  • Merging by CUSIP
  • Merging by CRSP/COMPUSTAT Merged Database
  • A worked example using the RStudio app
  • An alternative solution

Was this helpful?

  1. User Guides
  2. Data guides

Working with CRSP and Compustat

PreviousAccessing data as data.frames in RNextWorking with the S&P 500®

Last updated 1 year ago

Was this helpful?

Getting access

In order to get access to a dataset, the organisation managers can provide information on whether the dataset is currently made available in Nuvolos.

Accessing data in Nuvolos

If CRSP and Compustat datasets are available to your organisation, please see the relevant documentation pages on accessing data in Nuvolos applications.

About Compustat and CRSP

CRSP Stock and Indexes dataset: The CRSP US Stock Databases contain daily and monthly market and corporate action data for over 32,000 active and inactive securities with primary listings on the NYSE, NYSE American, NASDAQ, NYSE Arca and Bats exchanges and include CRSP broad market indexes. CRSP databases are characterised by their comprehensive corporate action information and highly accurate total return calculations. COMPUSTAT North America: Compustat North America (Standard & Poor’s (McGraw-Hill)) is a database of U.S. and Canadian fundamental and market information on more than 24,000 active and inactive publicly held companies. It provides more than 300 annual and 100 quarterly Income Statement, Balance Sheet, Statement of Cash Flows and supplemental data items. If applicable there is a quarterly update.

Linking Compustat and CRSP

Both datasets assign non-volatile unique identifiers to securities and companies. CRSP tracks the "lifecycle" of a security, through corporate events such as mergers, acquisitions and rebrandings. To match these identifiers, the link table approach is often used. A link table associates unique identifiers of two different sources.

CRSP identifiers:

  • PERMNO and PERMCO. These are unique and permanent identifiers, they are never reassigned and do not change during the life history of a stock.

  • CUSIP: a nine-digit numeric (e.g., 037833100 for Apple) or nine-character alphanumeric (e.g., 38259P508 for Google) code that identifies a North American financial security. CRSP keeps 8-digits and 9-digits versions of CUSIP. In the 8-digits version, the last digit is removed. A stock's CUSIP can change over time but is never reassigned.

  • Ticker: unique identifier assigned to a stock traded on a particular exchange. Tickers can be reassigned.

COMPUSTAT identifiers:

  • GVKEY: a unique and permanent retrieval key assigned to each company (issue, currency, index) in the COMPUSTAT database. GVKEYs do not change, and they are not reassigned.

  • CUSIP: COMPUSTAT keeps the full 9-digits version of CUSIP.

  • Ticker.

CRSP/COMPUSTAT link generation approaches:

  1. Using the CUSIP identifier, present by default in both datasets.

  2. Using an identifier linking table provided by the CRSP-COMPUSTAT Merged Database.

Merging by CUSIP

Merging CRSP and COMPUSTAT using CUSIP, the following steps are required:

  • From the CRSP database, get stock identifiers from table NAME_HISTORY. This table contains information on identifiers such as PERMNO and CUSIP, as well as other codes. The start date (column NAMEDT) and end date (column NAMEENDDT) define the period for which a particular stock identifier was valid. If NAMEENDDT is not available, this means that the stock data in NAME_HISTORY is still accurate until the most recent date in the dataset.

  • From COMPUSTAT, select a table that contains the CUSIP identifiers to merge it with the table NAME_HISTORY in CRSP. Merging based on the CUSIP identifier requires dates from COMPUSTAT and CRSP to fall within the time interval defined by NAMEDT and NAMEENDDT.

Merging by CRSP/COMPUSTAT Merged Database

The vendor database CRSP-COMPUSTAT Merged provides a link table (LINK_HISTORY) that can be used to merge CRSP with COMPUSTAT with the following information:

  • CCMID: Compustat's permanent identifier, either GVKEY for companies or GVKEYX for indexes.

  • LPERMNO: CRSP PERMNO link during link period.

  • LPERMCO: CRSP PERMCO link during link period.

  • LINKTYPE: A code to describe the type of link between CRSP and COMPUSTAT.

  • LINKDT: first effective date of the link.

  • LINKENDDT: last effective date of the link. If LINKENDDT is null, this means that the link is still effective.

Merging CRSP and COMPUSTAT using LINK_HISTORY, the following steps are required:

  • Choose between CRSP or COMPUSTAT as the first dataset to merge with CRSP-COMPUSTAT. If merging CRSP with CRSP-COMPUSTAT, then the join is done based on PERMNO. If merging COMPUSTAT with CRSP-COMPUSTAT, then the join is done based on GVKEY. When merging the first dataset with CRSP-COMPUSTAT, both identifiers of CRSP (PERMNO) and COMPUSTAT (GVKEY) will be present in the output table.

  • Merge the resulting table from the previous step with the second dataset. If the first dataset was CRSP, then merge it with COMPUSTAT on GVKEY. Otherwise, merge it with CRSP on PERMNO.

    At all times ensure that the dates from CRSP and COMPUSTAT fall within the LINKDT and LINKENDDT range.

A worked example using the RStudio app

The code examples below are meant for providing a starting point but may need to be adjusted depending on the research question. All researchers should review and verify that the code they use is in line with their research questions.

In the following, it will be assumed that the tables have been distributed from the Compustat North America dataset to the current state of the instance of the example. To distribute data, follow the instructions detailed here.

Designing the query

The suggested workflow is to use the query editor on the UI first to design the query. This step is not compulsory, however, it is considered best practice.

The query

In the example. the following tables were distributed to the work instance's current state:

Compustat:

  1. FUNDA (Fundamentals Annual)

CSRP:

  1. MSF (Monthly price information)

CCM (The merged database):

  1. CCMXPF_LINKTABLE

The first subquery - fundamentals from Compustat:

We select a couple of fundamentals from FUNDA along with the important identifiers GVKEY and DATADATE. It is important to note that the result of a naive query might contain some duplicate GVKEY-DATADATE combinations that need to be deduplicated.

A simple query that does basic deduplication is the following:

SELECT DF.GVKEY, DF.DATADATE, DF.ACCO, DF.AJEX, DF.CURCD, DF.RANK_IN_KEY FROM 
(SELECT DISTINCT GVKEY, DATADATE, ACCO, AJEX, CURCD, ROW_NUMBER() OVER (PARTITION BY GVKEY, DATADATE ORDER BY DATADATE) AS RANK_IN_KEY 
FROM FUNDA T
WHERE FYEAR >= 2000 AND FYEAR <= 2010) DF
WHERE DF.RANK_IN_KEY = 1 OR (DF.RANK_IN_KEY > 1 AND DF.AJEX IS NOT NULL)

An alternative would be to use R to run a simple query and have R run the deduplication. This is syntactically simpler, but at this point, the database engine cannot be used to combine large tables, so the R application needs to handle the computational and memory burden:

# Load the pipe operator for more readable code
library(magrittr)

# Standard practice to access data in Nuvolos from R inside the Nuvolos app
# The nuvolos library is pre-installed on Nuvolos R applications.
conn <- nuvolos::get_connection()
result_data <- dbGetQuery(conn,"SELECT GVKEY, DATADATE, ACCO, AJEX, CURCD FROM FUNDA WHERE FYEAR >= 2000 AND FYEAR <= 2010")

# Simple deduplicate using dplyr::distinct
result_data_dedup <- result_data %>% dplyr::distinct(GVKEY, DATADATE, .keep_all = TRUE)

The second subquery - using the linking table:

The following is a simplified possible query of the linking logic. The first table named INP is just the result of the previously presented query. This is joined together based on GVKEY with the linking table. Some care is needed, due to the fact that a link between a GVKEY and PERMNO can be transient - hence the linking happens only for timestamps that fall in the linking period. More complicated logic can be implemented using overlap intervals. For the appropriate choice of LINKTYPE, consult the dataset documentation, the present version is a standard choice.

SELECT  INP.*, LT.LPERMNO, LT.LINKPRIM, LT.LINKDT, LT.LINKENDDT FROM
    (SELECT DF.GVKEY, DF.DATADATE, DF.ACCO, DF.AJEX, DF.CURCD, DF.RANK_IN_KEY FROM 
        (SELECT DISTINCT GVKEY, DATADATE, ACCO, AJEX, CURCD, ROW_NUMBER() OVER (PARTITION BY GVKEY, DATADATE ORDER BY DATADATE) AS RANK_IN_KEY 
        FROM FUNDA T
        WHERE FYEAR >= 2000 AND FYEAR <= 2010) DF
    WHERE DF.RANK_IN_KEY = 1 OR (DF.RANK_IN_KEY > 1 AND DF.AJEX IS NOT NULL) ) INP
INNER JOIN CCMXPF_LINKTABLE LT
ON LT.GVKEY = INP.GVKEY AND 
    (INP.DATADATE >= LT.LINKDT OR LT.LINKDT IS NULL) AND 
    (INP.DATADATE <= LT.LINKENDDT OR LT.LINKDT IS NULL)
WHERE LT.LINKTYPE IN ('LU', 'LC')

Based on flavour and use-case, additional deduplication might be necessary as there might be multiple PERMNO matches for a GVKEY. This is easiest to be done using the previously presented syntax in R, however, this involves the drawback of not being able to run later join operations using the database engine.

Putting it together with CRSP

The last step of the merging process is to take a CRSP table with PERMNO identifiers and join it using the result of the previous query. The MSF table contains monthly pricing information and the bid/ask average PRC is queried in this toy example.

SELECT FUNDLINK.*, MSF.PRC FROM
        (SELECT  INP.*, LT.LPERMNO, LT.LINKPRIM, LT.LINKDT, LT.LINKENDDT FROM
        (SELECT DF.GVKEY, DF.DATADATE, DF.ACCO, DF.AJEX, DF.CURCD, DF.RANK_IN_KEY FROM 
            (SELECT DISTINCT GVKEY, DATADATE, ACCO, AJEX, CURCD, ROW_NUMBER() OVER (PARTITION BY GVKEY, DATADATE ORDER BY DATADATE) AS RANK_IN_KEY 
            FROM FUNDA T
            WHERE FYEAR >= 2000 AND FYEAR <= 2010) DF
        WHERE DF.RANK_IN_KEY = 1 OR (DF.RANK_IN_KEY > 1 AND DF.AJEX IS NOT NULL) ) INP
    INNER JOIN CCMXPF_LINKTABLE LT
    ON LT.GVKEY = INP.GVKEY AND 
        (INP.DATADATE >= LT.LINKDT OR LT.LINKDT IS NULL) AND 
        (INP.DATADATE <= LT.LINKENDDT OR LT.LINKDT IS NULL)
    WHERE LT.LINKTYPE IN ('LU', 'LC') ) FUNDLINK
INNER JOIN MSF
ON MSF.PERMNO = FUNDLINK.LPERMNO AND YEAR(MSF.DATE) = YEAR(FUNDLINK.DATADATE) AND MONTH(MSF.DATE) = MONTH(FUNDLINK.DATADATE)

To minimize the memory footprint of an application, it is thus suggested to run the above query in R in the following manner:

# Load the pipe operator for more readable code
library(magrittr)

# Standard practice to access data in Nuvolos from R inside the Nuvolos app
# The nuvolos library is pre-installed on Nuvolos R applications.
conn <- nuvolos::get_connection()
result_data <- dbGetQuery(conn,"SELECT FUNDLINK.*, MSF.PRC FROM 
        (SELECT  INP.*, LT.LPERMNO, LT.LINKPRIM, LT.LINKDT, LT.LINKENDDT FROM
        (SELECT DF.GVKEY, DF.DATADATE, DF.ACCO, DF.AJEX, DF.CURCD, DF.RANK_IN_KEY FROM 
            (SELECT DISTINCT GVKEY, DATADATE, ACCO, AJEX, CURCD, ROW_NUMBER() OVER (PARTITION BY GVKEY, DATADATE ORDER BY DATADATE) AS RANK_IN_KEY 
            FROM FUNDA T
            WHERE FYEAR >= 2000 AND FYEAR <= 2010) DF
        WHERE DF.RANK_IN_KEY = 1 OR (DF.RANK_IN_KEY > 1 AND DF.AJEX IS NOT NULL) ) INP
    INNER JOIN CCMXPF_LINKTABLE LT
    ON LT.GVKEY = INP.GVKEY AND 
        (INP.DATADATE >= LT.LINKDT OR LT.LINKDT IS NULL) AND 
        (INP.DATADATE <= LT.LINKENDDT OR LT.LINKDT IS NULL)
    WHERE LT.LINKTYPE IN ('LU', 'LC') ) FUNDLINK
INNER JOIN MSF
ON MSF.PERMNO = FUNDLINK.LPERMNO AND YEAR(MSF.DATE) = YEAR(FUNDLINK.DATADATE) AND MONTH(MSF.DATE) = MONTH(FUNDLINK.DATADATE);")

### Additional operations on the result set

An alternative solution

Compared to the previous route, it is possible to directly link CRSP and COMPUSTAT vendor tables without using any derived entities such as MSF or FUNDA.

The below query provides an example of doing this, it is assumed that TIME_SERIES_DAILY_PRIMARY, LINK_HISTORY and CO_AFND1 tables are available in the working instance's current state.

For the new regularly updated "Compustat Global for <your institution>" datasets, we have created the COMPANY_ANNUAL_DATA_ITEMS table which joins the split Compustat company annual fundamentals tables CO_AFND1 and CO_AFND2 for your convenience.

You can replace CO_AFND1 with COMPANY_ANNUAL_DATA_ITEMS in the examples below to use all data item columns from A to Z without needing to join CO_AFND1 with CO_AFND2.

SELECT TS.CALDT AS DATE, TS.KYPERMNO AS PERMNO, FN.GVKEY, ABS(TS.PRC) AS PRC, FN.EPSFX AS EPS
FROM
LINK_HISTORY AS L
INNER JOIN TIME_SERIES_DAILY_PRIMARY AS TS
ON L.LPERMNO = TS.KYPERMNO
LEFT JOIN CO_AFND1 AS FN
ON FN.GVKEY = L.CCMID AND
FN.DATAFMT='STD' AND FN.CONSOL='C' AND FN.POPSRC='D' AND
FN.DATADATE BETWEEN TO_DATE(L.LINKDT::VARCHAR, 'YYYYMMDD') AND TO_DATE(NULLIF(L.LINKENDDT::VARCHAR, '99999999'),'YYYYMMDD') AND
FN.DATADATE = TS.CALDT
WHERE L.LPERMNO = 57913
ORDER BY DATE;

In order to run this query and assign it to a data.frame object in R, the following snippet can be used.

# Standard practice to access data in Nuvolos from R inside the Nuvolos app
# The nuvolos library is pre-installed on Nuvolos R applications.
conn <- nuvolos::get_connection()
result_data <- dbGetQuery(conn,"SELECT TS.CALDT AS DATE, TS.KYPERMNO AS PERMNO, FN.GVKEY, ABS(TS.PRC) AS PRC, FN.EPSFX AS EPS
    FROM
    LINK_HISTORY AS L
    INNER JOIN TIME_SERIES_DAILY_PRIMARY AS TS
    ON L.LPERMNO = TS.KYPERMNO
    LEFT JOIN CO_AFND1 AS FN
    ON FN.GVKEY = L.CCMID AND
    FN.DATAFMT='STD' AND FN.CONSOL='C' AND FN.POPSRC='D' AND
    FN.DATADATE BETWEEN TO_DATE(L.LINKDT::VARCHAR, 'YYYYMMDD') AND TO_DATE(NULLIF(L.LINKENDDT::VARCHAR, '99999999'),'YYYYMMDD') AND
    FN.DATADATE = TS.CALDT
    WHERE L.LPERMNO = 57913
    ORDER BY DATE;")

The Compustat North America dataset is no longer updated in Nuvolos. Please use the "Compustat Global for <your institution>" if you wish to work with data newer than December 2020 and refer to the described after this example.

here
alternative solution