Working with CRSP and Compustat
Getting access
In order to get access to a dataset, the organisation managers can provide information on whether the dataset is currently made available in Nuvolos.
Accessing data in Nuvolos
If CRSP and Compustat datasets are available to your organisation, please see the relevant documentation pages here on accessing data in Nuvolos applications.
About Compustat and CRSP
CRSP Stock and Indexes dataset: The CRSP US Stock Databases contain daily and monthly market and corporate action data for over 32,000 active and inactive securities with primary listings on the NYSE, NYSE American, NASDAQ, NYSE Arca and Bats exchanges and include CRSP broad market indexes. CRSP databases are characterised by their comprehensive corporate action information and highly accurate total return calculations. COMPUSTAT North America: Compustat North America (Standard & Poor’s (McGraw-Hill)) is a database of U.S. and Canadian fundamental and market information on more than 24,000 active and inactive publicly held companies. It provides more than 300 annual and 100 quarterly Income Statement, Balance Sheet, Statement of Cash Flows and supplemental data items. If applicable there is a quarterly update.
Linking Compustat and CRSP
Both datasets assign non-volatile unique identifiers to securities and companies. CRSP tracks the "lifecycle" of a security, through corporate events such as mergers, acquisitions and rebrandings. To match these identifiers, the link table approach is often used. A link table associates unique identifiers of two different sources.
CRSP identifiers:
PERMNO and PERMCO. These are unique and permanent identifiers, they are never reassigned and do not change during the life history of a stock.
CUSIP: a nine-digit numeric (e.g., 037833100 for Apple) or nine-character alphanumeric (e.g., 38259P508 for Google) code that identifies a North American financial security. CRSP keeps 8-digits and 9-digits versions of CUSIP. In the 8-digits version, the last digit is removed. A stock's CUSIP can change over time but is never reassigned.
Ticker: unique identifier assigned to a stock traded on a particular exchange. Tickers can be reassigned.
COMPUSTAT identifiers:
GVKEY: a unique and permanent retrieval key assigned to each company (issue, currency, index) in the COMPUSTAT database. GVKEYs do not change, and they are not reassigned.
CUSIP: COMPUSTAT keeps the full 9-digits version of CUSIP.
Ticker.
CRSP/COMPUSTAT link generation approaches:
Using the CUSIP identifier, present by default in both datasets.
Using an identifier linking table provided by the CRSP-COMPUSTAT Merged Database.
Merging by CUSIP
Merging CRSP and COMPUSTAT using CUSIP, the following steps are required:
From the CRSP database, get stock identifiers from table
NAME_HISTORY
. This table contains information on identifiers such asPERMNO
andCUSIP
, as well as other codes. The start date (columnNAMEDT
) and end date (columnNAMEENDDT
) define the period for which a particular stock identifier was valid. IfNAMEENDDT
is not available, this means that the stock data inNAME_HISTORY
is still accurate until the most recent date in the dataset.From COMPUSTAT, select a table that contains the
CUSIP
identifiers to merge it with the tableNAME_HISTORY
in CRSP. Merging based on theCUSIP
identifier requires dates from COMPUSTAT and CRSP to fall within the time interval defined byNAMEDT
andNAMEENDDT
.
Merging by CRSP/COMPUSTAT Merged Database
The vendor database CRSP-COMPUSTAT Merged provides a link table (LINK_HISTORY
) that can be used to merge CRSP with COMPUSTAT with the following information:
CCMID: Compustat's permanent identifier, either
GVKEY
for companies orGVKEYX
for indexes.LPERMNO: CRSP
PERMNO
link during link period.LPERMCO: CRSP
PERMCO
link during link period.LINKTYPE: A code to describe the type of link between CRSP and COMPUSTAT.
LINKDT: first effective date of the link.
LINKENDDT: last effective date of the link. If
LINKENDDT
is null, this means that the link is still effective.
Merging CRSP and COMPUSTAT using LINK_HISTORY, the following steps are required:
Choose between CRSP or COMPUSTAT as the first dataset to merge with CRSP-COMPUSTAT. If merging CRSP with CRSP-COMPUSTAT, then the join is done based on
PERMNO
. If merging COMPUSTAT with CRSP-COMPUSTAT, then the join is done based onGVKEY
. When merging the first dataset with CRSP-COMPUSTAT, both identifiers of CRSP (PERMNO
) and COMPUSTAT (GVKEY
) will be present in the output table.Merge the resulting table from the previous step with the second dataset. If the first dataset was CRSP, then merge it with COMPUSTAT on
GVKEY
. Otherwise, merge it with CRSP onPERMNO
.At all times ensure that the dates from CRSP and COMPUSTAT fall within the LINKDT and LINKENDDT range.
A worked example using the RStudio app
The code examples below are meant for providing a starting point but may need to be adjusted depending on the research question. All researchers should review and verify that the code they use is in line with their research questions.
In the following, it will be assumed that the tables have been distributed from the Compustat North America dataset to the current state of the instance of the example. To distribute data, follow the instructions detailed here.
The Compustat North America dataset is no longer updated in Nuvolos. Please use the "Compustat Global for <your institution>" if you wish to work with data newer than December 2020 and refer to the alternative solution described after this example.
Designing the query
The suggested workflow is to use the query editor on the UI first to design the query. This step is not compulsory, however, it is considered best practice.
The query
In the example. the following tables were distributed to the work instance's current state:
Compustat:
FUNDA
(Fundamentals Annual)
CSRP:
MSF
(Monthly price information)
CCM (The merged database):
CCMXPF_LINKTABLE
The first subquery - fundamentals from Compustat:
We select a couple of fundamentals from FUNDA
along with the important identifiers GVKEY
and DATADATE
. It is important to note that the result of a naive query might contain some duplicate GVKEY
-DATADATE
combinations that need to be deduplicated.
A simple query that does basic deduplication is the following:
An alternative would be to use R to run a simple query and have R run the deduplication. This is syntactically simpler, but at this point, the database engine cannot be used to combine large tables, so the R application needs to handle the computational and memory burden:
The second subquery - using the linking table:
The following is a simplified possible query of the linking logic. The first table named INP
is just the result of the previously presented query. This is joined together based on GVKEY
with the linking table. Some care is needed, due to the fact that a link between a GVKEY
and PERMNO
can be transient - hence the linking happens only for timestamps that fall in the linking period. More complicated logic can be implemented using overlap intervals. For the appropriate choice of LINKTYPE
, consult the dataset documentation, the present version is a standard choice.
Based on flavour and use-case, additional deduplication might be necessary as there might be multiple PERMNO
matches for a GVKEY
. This is easiest to be done using the previously presented syntax in R, however, this involves the drawback of not being able to run later join operations using the database engine.
Putting it together with CRSP
The last step of the merging process is to take a CRSP table with PERMNO identifiers and join it using the result of the previous query. The MSF
table contains monthly pricing information and the bid/ask average PRC
is queried in this toy example.
To minimize the memory footprint of an application, it is thus suggested to run the above query in R in the following manner:
An alternative solution
Compared to the previous route, it is possible to directly link CRSP and COMPUSTAT vendor tables without using any derived entities such as MSF
or FUNDA
.
The below query provides an example of doing this, it is assumed that TIME_SERIES_DAILY_PRIMARY,
LINK_HISTORY
and CO_AFND1
tables are available in the working instance's current state.
For the new regularly updated "Compustat Global for <your institution>" datasets, we have created the COMPANY_ANNUAL_DATA_ITEMS table which joins the split Compustat company annual fundamentals tables CO_AFND1 and CO_AFND2 for your convenience.
You can replace CO_AFND1 with COMPANY_ANNUAL_DATA_ITEMS in the examples below to use all data item columns from A to Z without needing to join CO_AFND1 with CO_AFND2.
In order to run this query and assign it to a data.frame object in R, the following snippet can be used.
Last updated