Welcome to the Hoberg-Phillips Data Library
***** IMPORTANT NOTE: The data on this page (known as ETNIC, the "Embeddings-Based TNIC Database") are part of a preliminary release of a new, more informative version of the TNIC database. Click here for an important note on the reasons for the preliminary period, plans for the final release, and proper research use and citations. *****
Data coverage is 1989 to 2023
Data Provided by Gerard Hoberg and Gordon Phillips
ETNIC: Embeddings-Based TNIC Industry Classifications (ETNIC) data
Embeddings-based TNIC pairwise industry classifications (ETNIC) were developed in Hoberg and Phillips (2024) (see below). ETNIC uses doc2vec embedding technology to improve the statistical power of the baseline TNIC data originally developed in Hoberg and Phillips (2016). A second important extension in ETNIC is greater cross-sectional coverage: ETNIC covers all firms in Compustat that have a link to a 10-K on the SEC EDGAR website, whereas baseline TNIC covers only publicly traded firms and thus requires an observation to be present in both the CRSP and Compustat databases. This extension offers more research flexibility, but researchers wishing to limit their analysis to publicly traded firms on CRSP will need to filter the data themselves.
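The CRSP filter mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the column names (year, gvkey1, gvkey2, score) and the set of CRSP-linked firm-years are hypothetical, so check the Readme for the actual file layout and build the firm-year set from your own CRSP-Compustat link.

```python
# Minimal sketch: keep only ETNIC pairs in which both firms are in CRSP that year.
# Column names and the crsp_firm_years set are assumptions for illustration.

def filter_to_crsp(pairs, crsp_firm_years):
    """Return the pairs whose two firms both appear in crsp_firm_years.

    pairs: iterable of dicts with keys "year", "gvkey1", "gvkey2", "score"
    crsp_firm_years: set of (gvkey, year) tuples built from a CRSP-Compustat link
    """
    return [
        row for row in pairs
        if (row["gvkey1"], row["year"]) in crsp_firm_years
        and (row["gvkey2"], row["year"]) in crsp_firm_years
    ]

# Toy data standing in for rows read from a downloaded ETNIC file.
pairs = [
    {"year": 2020, "gvkey1": 1001, "gvkey2": 1002, "score": 0.31},
    {"year": 2020, "gvkey1": 1001, "gvkey2": 1003, "score": 0.12},
    {"year": 2021, "gvkey1": 1002, "gvkey2": 1003, "score": 0.27},
]

# Hypothetical firm-years that have a CRSP record.
crsp_firm_years = {(1001, 2020), (1002, 2020), (1002, 2021)}

public_pairs = filter_to_crsp(pairs, crsp_firm_years)
```

Matching on firm-year rather than firm alone keeps a pair only in the years when both firms are actually in CRSP.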
I. [Baseline ETNIC-3 Data] Download the baseline version of the embeddings-based ETNIC database (the standard ETNIC version used in most research projects). This version has a granularity consistent with three-digit SIC codes (we refer to this database as ETNIC-3 data). [Download ETNIC-3 Data] [View Readme for ETNIC-3 Data]
II. [Larger ETNIC-2 Data] Download a larger version of the embeddings-based ETNIC database, which has a granularity consistent with two-digit SIC codes (we refer to this database as ETNIC-2 data). It contains more pairs because it is a coarser industry classification. [Download ETNIC-2 Data] [View Readme for ETNIC-2 Data]
III. [Complete ETNIC-All Data] Download the complete version of the embeddings-based ETNIC database (the files are much larger; advanced users only). This version is referred to as "ETNIC-All" and contains all pairwise similarity scores for all firms in the database (including pairs of firms not in the same industry). [Download ETNIC-All Data] [View Readme for ETNIC-All Data]
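One way to picture how a coarser classification yields more pairs is to apply a similarity cutoff to all-pairs scores. The sketch below is not the authors' procedure: it assumes that a classification is formed by keeping pairs at or above a score cutoff, and the function names, target shares, and scores are hypothetical.

```python
# Minimal sketch: derive a score cutoff that keeps a target share of all pairs.
# The cutoff rule and all numbers here are assumptions for illustration only.

def threshold_for_share(scores, target_share):
    """Return the cutoff that keeps about target_share of pairs, highest scores first."""
    if not 0 < target_share <= 1:
        raise ValueError("target_share must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    keep = max(1, round(target_share * len(ranked)))
    return ranked[keep - 1]

def pairs_above(pairs, cutoff):
    """Keep the pairs whose similarity score is at or above the cutoff."""
    return [p for p in pairs if p["score"] >= cutoff]

# Toy all-pairs scores for one year.
all_pairs = [
    {"gvkey1": 1001, "gvkey2": 1002, "score": 0.9},
    {"gvkey1": 1001, "gvkey2": 1003, "score": 0.5},
    {"gvkey1": 1002, "gvkey2": 1003, "score": 0.3},
    {"gvkey1": 1002, "gvkey2": 1004, "score": 0.1},
]
scores = [p["score"] for p in all_pairs]

# A finer classification keeps a smaller share of pairs than a coarser one.
fine_cutoff = threshold_for_share(scores, 0.5)
coarse_cutoff = threshold_for_share(scores, 0.75)
fine_pairs = pairs_above(all_pairs, fine_cutoff)
coarse_pairs = pairs_above(all_pairs, coarse_cutoff)
```

In this toy setup the coarser cutoff is lower, so every pair kept by the finer classification is also kept by the coarser one.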
* The following studies provided the key innovations behind the creation of these data:
- Scope, Scale and Competition: The 21st Century Firm - Gerard Hoberg and Gordon Phillips, 2024, forthcoming at Journal of Finance
- Using Representation Learning and Web Text to Identify Competitors - Gerard Hoberg, Craig Knoblock, Gordon Phillips, Jay Pujara, Zhiqiang Qiu and Louiqa Raschid, 2024, working paper
- Text-Based Network Industries and Endogenous Product Differentiation - Gerard Hoberg and Gordon Phillips, 2016, Journal of Political Economy, 124 (5), 1423-1465