*****************************************************
TOP LEVEL NOTE: FIC industry classifications are a restricted version of the more general TNIC database.  Two key restrictions are imposed on TNIC to derive FIC.  First, we force the industry classification to be transitive: if firms A and B are in the same industry, and B and C are in the same industry, then A and C must be as well.  This is suboptimal, as transitivity is not the way relatedness works in the real world.  Transitivity can be viewed as a constraint that makes the classification noisier, although it is a property that might otherwise be desirable.  Most other classifications, such as SIC, NAICS, and GICS, are also transitive and thus force some firms into clusters where they do not uniquely belong.  The second restriction is that each FIC cluster has a "location" or "industry description" that is fixed in time.  This is also the case for SIC, NAICS, GICS, etc.  One must hold product locations fixed for transitive industries if they are to be time-consistent.  ***** Details are in Hoberg and Phillips (2016 JPE), referenced below.  Please read the discussion of TNIC vs FIC in that paper before using FIC industries. ***** It is critical to do so because FIC industries ARE SUBOPTIMAL as far as signal.  To use them is to ignore a substantial amount of power.  Thus we do not recommend using FIC and only include it for completeness.  We recommend using TNIC industries for almost all applications.
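
To make the transitivity restriction concrete, here is a minimal Python sketch.  The firms, links, and industry codes are hypothetical and are not taken from the database.  It only illustrates that a pairwise (TNIC-style) relation need not be transitive, whereas any classification that gives each firm a single industry code is transitive by construction.

```python
from itertools import permutations

def is_transitive(firms, related):
    """Return True if 'related' is transitive over every ordered triple of firms."""
    for a, b, c in permutations(firms, 3):
        if related(a, b) and related(b, c) and not related(a, c):
            return False
    return True

firms = ["A", "B", "C"]

# Hypothetical pairwise links: B is related to both A and C,
# but A and C are not related to each other.
links = {frozenset(("A", "B")), frozenset(("B", "C"))}

def pairwise(x, y):
    return frozenset((x, y)) in links

# Hypothetical single-code classification: one industry code per firm.
codes = {"A": 1, "B": 1, "C": 2}

def same_code(x, y):
    return codes[x] == codes[y]

print(is_transitive(firms, pairwise))   # False
print(is_transitive(firms, same_code))  # True
```

In the single-code case, firm B has to be placed with A or with C, so one of its two pairwise links is lost.  This is the sense in which a transitive classification discards information.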


****** FYI on most recent 2022 update: In this update, in addition to extending the database forward to fiscal years ending in 2021, we also improved the linking to Compustat gvkeys, resulting in 1% more observations in each year relative to older versions.  We also used better parsing technology to improve the quality of the Item 1 text extracted from some 10-Ks (we thank Christopher Ball at metaHeuristica.com).  We tested these improvements using standard tests from HP2016 (referenced below) and find a modest improvement in signal power, indicating that this version is improved relative to prior versions.


****** NOTE: Please read the technical descriptions below before using the data.  


General Info:

This file accompanies the FIC industry database and describes where the data comes from,
the papers that should be cited when providing academic references, and some very important technical details regarding its usage.
Please read the technical details in full before using this data.  These details are critically important to ensure proper usage.

**************************************************************************************************************
**************************************************************************************************************
********************************************** Background ****************************************************
********************************************** Background ****************************************************
********************************************** Background ****************************************************
**************************************************************************************************************
**************************************************************************************************************

For an extensive description of this data, please read the data and methodology sections of the studies noted below.  Here is a 
brief description.

This data is based on web crawling and text parsing algorithms that process the text in the business descriptions of 10-K annual 
filings on the SEC Edgar website from 1996 to present.  These product descriptions are legally required to be accurate, as Item 101 
of Regulation S-K legally requires that firms describe the significant products they offer to the market, and these descriptions 
must also be updated and representative of the current fiscal year of the 10-K.  We merge each firm's text product description to 
the CRSP/COMPUSTAT universe using the central index key (CIK) [We thank the Wharton Research Data Service (WRDS) for providing us 
with an expanded historical mapping of SEC CIK to COMPUSTAT gvkey, as the base CIK variable in COMPUSTAT only contains current links].  
Our resulting database is based on all publicly traded firms (domestic firms traded on either NYSE, AMEX, or NASDAQ) for which we have 
COMPUSTAT and CRSP data.

We calculate our firm-by-firm pairwise similarity scores by parsing the product descriptions from the firm 10-Ks and forming word vectors 
for each firm.  We then compute a continuous measure of product similarity for every pair of firms in our sample in each year (a pairwise 
similarity matrix).  This is done using the cosine similarity method, applied after basic screens that eliminate common words 
(see studies noted below).  For any two firms i and j, we thus have a product similarity, which is a real number in the 
interval [0,1] describing how similar the words used by firms i and j are.  For any given year, if there are 5000 firms, this yields 
((5000*5000)-5000)/2 = 12,497,500 pairwise similarities (the elements below the diagonal of a square matrix).
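
As a rough illustration of the calculation described above, the following Python sketch computes a cosine similarity between two word lists and counts the distinct firm pairs.  It assumes simple word-count vectors and skips the common-word screens; the exact vector construction used for the database is described in the papers below, and the word lists here are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(words_i, words_j):
    """Cosine similarity of two word-count vectors; lies in [0, 1] for counts."""
    vec_i, vec_j = Counter(words_i), Counter(words_j)
    dot = sum(vec_i[w] * vec_j[w] for w in vec_i.keys() & vec_j.keys())
    norm_i = math.sqrt(sum(c * c for c in vec_i.values()))
    norm_j = math.sqrt(sum(c * c for c in vec_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_i * norm_j)

def n_pairs(n_firms):
    """Number of distinct firm pairs, i.e. entries below the matrix diagonal."""
    return n_firms * (n_firms - 1) // 2

# Hypothetical product-description words for two firms.
firm_i = ["software", "cloud", "database"]
firm_j = ["software", "cloud", "hardware"]

similarity = cosine_similarity(firm_i, firm_j)  # two shared words out of three each
pairs = n_pairs(5000)
print(round(similarity, 4), pairs)
```

Here n_pairs(5000) reproduces the count of 12,497,500 given above.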

The FIC classification is based on a clustering algorithm that groups firms together to maximize within-industry similarity while targeting
N industries (the procedure is described in the papers below).  To maintain the fixed location property of other fixed industry 
classifications such as SIC or NAICS, the industries are constructed using the 1997 data alone, and the same set of industries is then 
held fixed over time.  We use 1997 as this is the earliest year for which we have full coverage in Edgar.  The clustering algorithm is 
run over the subset of firms excluding conglomerates in order to identify pure-play product markets accurately.  
The attached file includes FIC-500, FIC-400, FIC-300, FIC-200, FIC-100, FIC-50 and FIC-25 industries.  
Because the algorithm adjusts industry memberships after each iteration, a classification labelled as having N industries might in fact 
have one or two fewer industries.  This is a natural result of the clustering algorithm and should not be viewed as problematic.  

One last note is that although we fix the classifications based on 1997 data, we assign all firms to this fixed set of industries for the full
length of our sample.  Firms are evaluated each year, so a firm's industry assignment can change from year to year.   
For example, we use firm i's 2003 10-K to assign it to one of the N fixed-location 1997 industries in 2003.  This is done for each year.  
Because firm 10-Ks can change over time while the industries are held fixed, a given firm's industry assignment can change as 
its 10-K evolves.  This is analogous to the possibility that a firm can move from one SIC code to another over time.  Hence, our FIC industries 
are designed to offer the same properties as other fixed classifications like SIC and NAICS, but with FREQUENT updating based on how firms' 
product descriptions change over time.  Note that all fixed classifications miss out on the enhanced flexibility offered by TNIC industries.  
If your analysis can benefit from time-varying industry locations, or from full knowledge of how similar firms and industries are 
to one another (see papers below), please use the TNIC industry data that is also now available on the web.

**************************************************************************************************************
**************************************************************************************************************
********************************************** Citations *****************************************************
********************************************** Citations *****************************************************
********************************************** Citations *****************************************************
**************************************************************************************************************
**************************************************************************************************************

This data is the result of a large research project initiated in early 2006 by Gerard Hoberg and Gordon Phillips.
The intent of the project is to better understand the role of industry, product market competition, and relatedness 
through the product market.  The data in its current state is the result of innovations described in the following
two papers.  As such, both should be cited when using this data for the purpose of academic research.

Product Market Synergies and Competition in Mergers and Acquisitions: A Text-Based Analysis
Gerard Hoberg and Gordon Phillips, 2010, Review of Financial Studies 23 (10), 3773-3811.

Text-Based Network Industries and Endogenous Product Differentiation
Gerard Hoberg and Gordon Phillips, Journal of Political Economy (October 2016), 124 (5) 1423-1465.

**********************************************************************************************************************
**********************************************************************************************************************
********************************************** Technical Details *****************************************************
********************************************** Technical Details *****************************************************
********************************************** Technical Details *****************************************************
**********************************************************************************************************************
**********************************************************************************************************************

Please read the following carefully to ensure proper usage of this data.

Technical Note 1) Because our own research reveals that firms and industries move considerably within the product space over time, we view
TNIC industries as far more informative and useful than fixed classifications, including SIC, NAICS, and even these FIC industries.  Please 
also read the final paragraph of the Background section above, which makes clear that TNIC data is needed to derive more 
economic content about how similar firms are within an industry, or how similar they are across industries.

Technical Note 2) Each file contains a gvkey, a year, and industry codes for the 100, 200, 300, 400, and 500 classifications.  Note that 
we have already done the merge to COMPUSTAT, so you do not have to repeat it.  The data contained here is not lagged.  For convenience, the year field 
in this database is the Compustat calendar year, obtained as the first four digits of the YYYYMMDD datadate variable.  Consider, for example, a COMPUSTAT firm with a 
fiscal year ending on September 30, 1997.  The 1997 records for this firm's gvkey in the file fic_data.txt are based on the product 
description in the 10-K report associated with this 9/30/1997 fiscal year.  Because this data is merged by fiscal year end, the industry assignments 
in this file can be viewed as time-synchronous with the COMPUSTAT data for the corresponding fiscal year end.
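
The year convention above can be sketched in a few lines of Python.  The rows, the column names (for example fic200 and sales), and the values are hypothetical and chosen only for illustration; check the actual header of fic_data.txt before adapting this.

```python
def datadate_year(datadate):
    """Year key for matching: the first four digits of a YYYYMMDD datadate."""
    return int(str(datadate)[:4])

# Hypothetical COMPUSTAT-style rows for one firm with a September fiscal year end.
compustat_rows = [
    {"gvkey": 1001, "datadate": 19970930, "sales": 250.0},
    {"gvkey": 1001, "datadate": 19980930, "sales": 275.0},
]

# Hypothetical FIC rows keyed by gvkey and year (the column name is an assumption).
fic_rows = [
    {"gvkey": 1001, "year": 1997, "fic200": 59},
    {"gvkey": 1001, "year": 1998, "fic200": 59},
]

# Index the FIC rows by (gvkey, year) for a same-year, unlagged match.
fic_by_key = {(row["gvkey"], row["year"]): row for row in fic_rows}

merged = []
for row in compustat_rows:
    key = (row["gvkey"], datadate_year(row["datadate"]))
    match = fic_by_key.get(key)
    if match is not None:
        merged.append({**row, "fic200": match["fic200"]})

print(merged)
```

Because the data is not lagged, any lag that a research design requires must be applied by the user after this match.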

Technical Note 3) Please be aware that these industries are formed using single-segment 1997 firm data.  Firms are then assigned to the 
classifications using their product descriptions in any given year, each firm going to the industry to which it is most similar.  Thus firms are 
REASSIGNED every year and can switch industries over time.
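
The yearly reassignment can be pictured with the Python sketch below.  It assumes that each fixed industry is summarized by a single word vector and that "most similar" means highest cosine similarity; both are simplifying assumptions made for illustration, and the actual procedure is described in the papers above.  All words and codes are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_industry(firm_vector, industry_vectors):
    """Return the code of the fixed industry most similar to the firm."""
    return max(industry_vectors,
               key=lambda code: cosine(firm_vector, industry_vectors[code]))

# Hypothetical fixed industry vectors (formed once, then held constant).
industry_vectors = {
    1: {"software": 1.0, "cloud": 1.0},
    2: {"oil": 1.0, "drilling": 1.0},
}

# Hypothetical word vectors for one firm's 10-K in two different years.
firm_by_year = {
    2003: {"software": 1.0, "hardware": 1.0},
    2010: {"oil": 1.0, "pipeline": 1.0},
}

assigned = {year: assign_industry(vec, industry_vectors)
            for year, vec in firm_by_year.items()}
print(assigned)  # {2003: 1, 2010: 2}
```

The industries never move in this sketch; only the firm's own description changes, which is what drives the switch from one code to the other.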

Technical Note 4) The actual numbers associated with an industry assignment (e.g., industry #59 in the FIC-200 classification) do not have economic 
content beyond their use to identify which firms are in the same industry.  For example, industry #58 is not more closely related to industry #59
than is industry #22.  However, the #59 is important because it tells you that all gvkeys assigned to #59 are in the same industry.  
To get different levels of coarseness, analogous to using SIC-4 or SIC-3, please use different levels of FIC coarseness, as we provide 7 such 
levels (25, 50, 100, 200, 300, 400, 500).
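
In practice this means industry codes should only be compared for equality, never by numerical distance.  The short Python sketch below uses hypothetical gvkeys and FIC-200 codes to group firms into peer sets and to count the industries actually present.

```python
from collections import defaultdict

# Hypothetical (gvkey, FIC-200 code) assignments for a single year.
assignments = [(1001, 59), (1002, 59), (1003, 58), (1004, 22)]

# Group gvkeys by industry code; the code serves as a label only.
peers = defaultdict(list)
for gvkey, code in assignments:
    peers[code].append(gvkey)

code_of = dict(assignments)

def same_industry(gvkey_a, gvkey_b):
    """Two firms are industry peers only if their codes are identical."""
    return code_of[gvkey_a] == code_of[gvkey_b]

print(peers[59])                  # [1001, 1002]
print(same_industry(1001, 1002))  # True
print(same_industry(1002, 1003))  # False, even though 58 and 59 are adjacent
print(len(peers))                 # number of industries present in this sample
```

Counting the distinct codes in the full file in the same way shows how many industries a given classification actually contains, which may be slightly fewer than its label suggests.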