Weixin Liang (1), Scott Elrod (2), Daniel A. McFarland (3), James Zou (1, 4)
1. Department of Computer Science, Stanford University, 350 Jane Stanford Way, Stanford, CA 94305, USA.
2. Office of Technology Licensing, Stanford University, Stanford, CA 94305, USA.
3. Graduate School of Education, Stanford University, Stanford, CA 94305, USA.
4. Department of Biomedical Data Science, Stanford University, 350 Jane Stanford Way, Stanford, CA 94305, USA.
Abstract
This article systematically investigates technology licensing by Stanford University. We analyzed all the inventions marketed by Stanford's Office of Technology Licensing (OTL) between 1970 and 2020, comprising 4,512 inventions from 6,557 inventors. We quantified how the innovation landscape at Stanford changed over time and examined factors that correlate with commercial success. We found that the most profitable inventions are predominantly licensed by inventors' own startups, inventions have involved larger teams over time, and the proportion of female inventors has tripled over the past 25 years. We also identified linguistic features in how the inventors and the OTL describe the inventions that significantly correlate with the invention's future revenue. Interestingly, inventions with more adjectives in their abstracts have worse net income. Our study opens up a new perspective for analyzing the translation of research into practice and commercialization using large-scale computational and linguistic analysis.
The role of American research universities has evolved and expanded in recent decades. While the traditional mission of universities has long been to educate young people and to discover and transmit new disciplinary knowledge, today, many universities have added technological invention and commercialization as part of their core mission. This is manifested by changes in some universities’ mission statements, the proliferation and enlargement of offices of technology licensing (OTLs), the increase in the number of invention disclosures, patents, and licenses, and changes in tenure and promotion criteria to encourage the commercialization of university-generated knowledge. As universities are playing an increasingly central role in technological inventions that drive economic growth, many technology-reliant companies have reduced their budgets for internal research and development, opting to lean more heavily on collaborations with universities [7, 8, 9, 10].
With the increased emphasis on university technology transfer, many universities in the US have established OTLs. An OTL serves as a mediator between the suppliers of innovations (university scientists) and those who can potentially commercialize them (i.e., industry), centralizing university inventions and facilitating their commercialization through licensing to existing firms or to inventors’ startup companies. As examples, technologies that Stanford University’s OTL has commercialized include the recombinant DNA technology that helped to jump-start the biotechnology industry, internet search engines (e.g., Google PageRank), functional antibodies, and music synthesizers.
The activities of OTLs have important economic and policy implications since licensing agreements and university-based startups can result in additional revenue for the university, employment opportunities, and local economic and technological spillovers through the stimulation of additional research and development (R&D) investment and job creation [13, 14, 15]. To incentivize university scientists, universities typically share licensing income with the inventors and the inventors’ departments. For example, Stanford’s royalty-sharing policy is to allocate a third of the net income to the inventors, a third to the inventors’ departments, and a third to the inventors’ schools.
We focus our analysis on the Stanford OTL because it is one of the most active and impactful technology transfer centers. The Stanford OTL has long been regarded as a canonical model for many universities in both the US and abroad. Established in 1970, Stanford’s OTL is one of the older OTLs. Stanford has one of the most successful technology transfer programs, which has contributed to substantial commercial activity. According to the 2020 annual survey of the Association of University Technology Managers (AUTM), Stanford ranks among the top five universities in the US across each of the key technology transfer performance metrics, including license income received, invention disclosures, US patents issued, and startup companies formed.
Despite the increasing importance of technology licensing, systematic data science analysis of university technology transfer is not common. Much of the previous research focuses on a few particular inventions at specific universities. For example, Colyvas et al. investigated the early formation and the institutionalization process of Stanford’s OTL. Another line of research focuses on the determinants of licensing. Shane et al. analyzed early MIT inventions issued during the 1980–1996 period and found that university inventions are more likely to be licensed when patents are effective. Using the same data, Dechenaux et al. explored the effect of appropriability on commercialization of inventions, and Shane et al. found that new ventures whose founders have direct and indirect relationships with venture investors are most likely to receive venture funding. Huang et al. performed a systematic analysis of life sciences patents at MIT from 1983 to 2017. They include a number of outcome measurements that are unique to the biopharmaceutical industry, such as Orange Book citations, drug candidates discovered, and US drug approvals. Other works focused on faculty members’ decisions on invention disclosure [26, 27, 28, 29, 30, 31], showing that faculty decisions to disclose are shaped by their perceptions of the benefits of patent protection and the historical structure and mission of the university.
Several works performed cross-university studies at a coarser granularity to measure the efficiency of university technology transfer. For example, Thursby et al. found that the rise in university technology transfer is the result of a greater willingness of university researchers to patent their inventions and an increase in outsourcing of R&D by firms via licensing. Subsequent work showed that higher percentages of royalty shares for faculty members and the age of OTLs are associated with greater licensing income.
Beyond the US, studies on the efficiency of university technology transfer have also been conducted in other countries, including the UK, Spain, and Italy. In sum, this line of research suggests that the key impediments to better university technology transfer performance tend to be organizational: incentives, relating to both pecuniary and non-pecuniary rewards such as credit toward tenure and promotion; the staffing and compensation practices of the OTLs; and university culture, milieu of entrepreneurism, and group norms.
More broadly, previous research has investigated factors that drive scientific innovation. Although scientific innovation is widely accepted to be highly uncertain and unpredictable, previous research found that scientific projects that posit unexpected relationships between domains receive greater attention and are more richly rewarded than projects that explore more commonplace connections [44, 45, 46, 47]. Although external factors such as the overall funding landscape and economic conditions could also affect scientific innovation, research teams are the engines of modern science. The growth in the prevalence and size of teams has been one of the most universal trends across all areas of scientific and scholarly investigation. Prior experimental and observational studies reveal that demographic diversity benefits innovation [50, 51, 52]. Smaller teams have tended to disrupt science and technology with new ideas and opportunities, whereas larger teams have tended to develop existing ones. Another line of research investigated patent-to-paper citations to assess the route from public research to economic and social impact, which highlights the importance of basic research and public research [55, 56, 57, 58, 59].
Large-scale computational analysis of university technology transfer by OTLs has been limited in the literature.
This gap motivates our comprehensive computational analysis leveraging the unprecedented data of 4,512 marketed inventions from 6,557 inventors at Stanford since the founding of its OTL in 1970. These are the inventions that Stanford’s OTL prioritized for marketing from among the 13,485 inventions disclosed during that period.
In our analysis, we focus on (1) quantifying how the innovation landscape at Stanford evolved over 50 years and (2) examining the factors that correlate with commercial success. We organize our analysis by first quantifying the holistic trends of invention at Stanford over time. We then analyze the inventors driving the innovations: their demographics, team composition, and the effects of licensing by inventor startups. Finally, a particularly interesting aspect of inventions is how they are publicly marketed, which is also a key role of the OTL. Therefore, we further analyze linguistic features in how an invention is presented in its title and abstract, as these semantic footprints enable us to gain insights into what the OTL believes is important to highlight. Our study opens up a new perspective for analyzing the translation of research into practice and commercialization by using fine-grained, large-scale computational and linguistic analysis.
Results
Overview of the Stanford inventions data
Our study leverages unprecedented data access to 4,512 marketed inventions, corresponding to 6,557 inventors from Stanford between 1970 and 2020, provided to us by the Stanford OTL for analysis. The number of marketed inventions increased rapidly from 1980 (4 inventions per year) to 2010 (250 inventions per year) and plateaued in the 2010s (Figure 1A). The rise of the internet greatly facilitated marketing, contributing to the large increase. Following the convention of the OTL, we use net income, which is defined as the total licensing income minus the cumulative expense (e.g., patent application and litigation costs) as a measure of the outcome of an invention. The total net income of the inventions for all years considered is $581 M, and the average net income is $0.13 M. Overall, most inventions have a negative net income, and only 20% of inventions in this dataset have produced positive net income (Figure 1A).
Figure 1
Overview of the Stanford inventions data
(A) Number of inventions by year that Stanford’s Office of Technology Licensing marketed. The color of the stacked bar chart indicates whether the cumulative net income (until June 31, 2021) is positive.
(B) Categories with the highest average net income across years. The numbers in parentheses indicate the average net income. Cell colors indicate the root category (teal: electronics, blue: biology, green: chemistry).
(C and D) Overrepresented keywords of (C) above-median income inventions (net income above the median for the same year) and (D) below-median income inventions. We identified words with the greatest log likelihood ratio of appearing in above-median invention keywords versus below-median invention keywords. The size of each term in the word cloud corresponds to its log likelihood ratio.
(E) Visualization of the sub-categories and the collaborative relationship among them. Each node represents a sub-category, and the edge is defined as the percentage of overlapping inventions that the two sub-categories share. Node color indicates the root category. Intra-category edges are colored using the color of the root category. Inter-category edges are colored gray.
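The keyword comparison in panels (C) and (D) can be illustrated with a small sketch. The exact estimator behind the log likelihood ratio is not specified here, so the version below is an assumption: it compares the smoothed share of inventions carrying a keyword in each group, and the function name, the additive smoothing constant, and the toy keyword lists are all hypothetical.

```python
import math
from collections import Counter


def keyword_log_ratios(above_docs, below_docs, smoothing=1.0):
    """Log ratio of a keyword's smoothed document share in two groups.

    above_docs / below_docs: one list of keywords per invention.
    smoothing: additive constant (an assumption) that avoids log(0)
    for keywords seen in only one group.
    """
    # Count each keyword at most once per invention.
    above = Counter(k for doc in above_docs for k in set(doc))
    below = Counter(k for doc in below_docs for k in set(doc))
    n_above, n_below = len(above_docs), len(below_docs)
    ratios = {}
    for keyword in set(above) | set(below):
        p_above = (above[keyword] + smoothing) / (n_above + 2 * smoothing)
        p_below = (below[keyword] + smoothing) / (n_below + 2 * smoothing)
        ratios[keyword] = math.log(p_above / p_below)
    return ratios


# Toy keyword lists, invented for illustration only.
above_median = [["therapeutic", "genomics"], ["therapeutic"], ["imaging"]]
below_median = [["optical", "photonics"], ["optical"], ["imaging"]]
ratios = keyword_log_ratios(above_median, below_median)
top_keyword = max(ratios, key=ratios.get)
```

Positive values mark keywords enriched in the above-median group, negative values mark keywords enriched in the below-median group, and a keyword that is equally common in both groups scores zero.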
Each invention is assigned to one or more categories (e.g., “biophysics”) and keywords (e.g., “Alzheimer disease”) by the OTL. The categories with the highest average net income changed across the years (Figure 1B). Before 2000, the top net income categories were all in electronics, and after 2000, the top net income categories were in biology and chemistry. Since the net income is cumulative across time, recent inventions have a lower net income compared with older inventions because they had less time to accumulate income. We also identified the keywords that had the greatest log likelihood ratio of appearing in above-median inventions (net income above the median for the same year) versus below-median inventions (Figure 1C), and vice versa (Figure 1D).
Words enriched in above-median income inventions tend to be life sciences terms, such as therapeutic and genomics. In contrast, words enriched in below-median income inventions tend to be associated with physical sciences, such as optical and photonics.
One invention can be assigned to multiple categories if it is relevant to different domains. For example, 17 medical imaging inventions disclosed in 2020 are assigned to both the radiology subcategory (under the biology category) and the computer vision subcategory (under the engineering category). Categories that co-occur in many inventions suggest that there is fruitful interdisciplinary research between them. We visualize the interaction relationship among different categories as a network (Figure 1E). There are substantial interactions between subcategories of biology and chemistry and between subcategories of engineering and electronics. Subcategories of materials science have diverse interactions with biology, chemistry, and engineering.
In all of the following analyses, to control for net income change over time, we use the net income rank, i.e., the normalized rank of the net income among the inventions with the same disclosure year. We also control for categorical differences by adding the categories as control variables in all linear regression analyses.
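As a minimal sketch of the net income rank, the snippet below ranks each invention against the others disclosed in the same year and rescales the result to the interval from 0 to 1. The tie handling (mid-rank) and the normalization by n − 1 are assumptions made for illustration, and the records are invented.

```python
from collections import defaultdict


def net_income_rank(records):
    """Normalized rank of net income within each disclosure year.

    records: list of (invention_id, disclosure_year, net_income) tuples.
    Returns {invention_id: rank in [0, 1]}; 0.5 corresponds to the median.
    """
    by_year = defaultdict(list)
    for inv_id, year, income in records:
        by_year[year].append((inv_id, income))

    ranks = {}
    for items in by_year.values():
        n = len(items)
        for inv_id, income in items:
            if n == 1:
                ranks[inv_id] = 0.5  # a lone invention is its own median
                continue
            lower = sum(1 for _, other in items if other < income)
            ties = sum(1 for _, other in items if other == income) - 1
            # Mid-rank for ties (an assumption), scaled to [0, 1].
            ranks[inv_id] = (lower + 0.5 * ties) / (n - 1)
    return ranks


# Invented records: (id, disclosure year, cumulative net income in $K).
records = [
    ("a", 1990, -5.0),
    ("b", 1990, 0.0),
    ("c", 1990, 120.0),
    ("d", 2015, -2.0),
    ("e", 2015, -1.0),
]
ranks = net_income_rank(records)
```

In this toy data, inventions "c" and "e" both receive the top rank within their own year even though their raw net incomes differ greatly, which is the point of normalizing by disclosure year: recent inventions are not penalized for having had less time to accumulate income.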
Inventor demographics analysis
The proportion of female inventors has tripled from 6.5% in 1995 to 19.7% in 2020 (Figure 2A). The increase remains significant after controlling for categories (p = 7.9E−07; Table S5). However, despite such a rapid increase, overall, females are still underrepresented: the percentage of female inventors is consistently lower than the percentage of female faculty at Stanford by a large margin. For example, in 2019, the percentage of female faculty at Stanford was 30%, while only 20% of the inventors were female. A caveat here is that certain disciplines (e.g., medicine, engineering) can be more likely to file inventions than other disciplines.
Figure 2
Inventor demographics analysis
(A) The percentage of female Stanford faculty and invention authors over the past 25 years.
(B) The number of authors per invention across different categories over time.
Beyond involving more female inventors, inventions also involved larger teams over time. For example, the average number of inventors per invention under the biology category increased from 2.47 in 1980–2000 to 3.29 in 2015–2020 (Figure 2B). Such an increase is consistent across different categories (p = 1.7E−30; Table S6), indicating that the invention environment at Stanford is increasingly collaborative. In addition, we found that the inventions from teams of only first-time inventors have a higher net income than other inventions (p = 3.1E−02; Table S7). This highlights the importance of being open to first-time inventors.
Self-licensing
Stanford’s entrepreneurial culture is also reflected in our data. Around 20% of the inventions were licensed by the inventor's own startups, which we refer to as “self-licensing.” Overall, the self-licensing rate increases over time (p = 4.0E−02; Figure 3A; Table S8). The interesting peak of the self-licensing rate in 1995–1999 might be related to the dot-com bubble. We also found that inventions with high net income are predominantly self-licensing inventions (Figure 3B). For example, all inventions that have generated more than $10 M net income are self-licensed, and the self-licensing rate for the inventions with $1–$10 M net income is 59%. In contrast, the self-licensing rate for inventions with less than $10 K net income is 16%. After controlling for categories and years, self-licensing is still strongly associated with higher net income (p = 4.1E−08; Table S9). This finding is consistent with previous research showing that startups with direct connections to the university tend to be more successful than otherwise similar startups. In addition, the self-licensing rate is higher in the biology category (p = 3.0E−05) and lower in the electronics category (p = 6.4E−04).
Figure 3
Self-licensing (inventions licensed by the inventor's own startups)
(A) The fraction of inventions licensed by inventor startups over time.
(B) The fraction of inventions in each net income group that the inventors license. The sample sizes for each net income category are: <$10 K: 3,776 inventions; $10–$100 K: 465 inventions; $100 K–$1 M: 212 inventions; $1–$10 M: 56 inventions; ≥$10 M: 5 inventions.
Linguistic analysis on OTL marketing
An important role of the Stanford OTL is to market the researchers’ inventions to potentially interested companies. Marketing is typically initiated through a marketing abstract created by the OTL that describes the invention to the public. Therefore, to gain insights into OTL marketing, we analyze two main questions. (1) How have marketing abstracts changed over the years? (2) Which linguistic features in the marketing abstracts are associated with the commercial outcome of the invention? Similar text analysis techniques have been applied to scientific innovation studies in the literature.
The marketing abstracts have changed substantially over the years. The average length of the marketing abstracts has increased by about two-thirds, from 144 words in 1980–1990 to 241 words in 2015–2020 (Figure 4A). The increase remains statistically significant after controlling for categories (p = 7.7E−42; Tables S1 and S2). Interestingly, the titles of the marketing abstracts are also getting longer (p = 6.3E−19; Table S1), and the fraction of adjectives in them has grown roughly 10×, from 1% in 1980–1990 to 12% in 2015–2020 (p = 2.4E−52; Figure 4B; Table S3). This might suggest that inventions are becoming increasingly specialized, which would require longer text and more adjectives to describe them.
Figure 4
Linguistic analysis on OTL marketing
(A) The average length of OTL marketing abstracts and inventors' abstracts over time.
(B) The average fraction of adjectives in titles over time.
(C) The correlation between the occurrence of each adjective in the marketing abstract and net income rank. Shown here are adjectives with p <0.05. Font size indicates the frequency of the word. Text color indicates the correlation coefficient with net income rank after controlling for categories: red indicates negative correlation, and blue indicates positive correlation.
(D) Machine-learning classifiers with the marketing abstracts as inputs to predict whether the net income of an invention will be above the median net income of the inventions of the same disclosure year. TF-IDF, the classifier using term frequency-inverse document frequency features; BERT, the state-of-the-art text classifier that utilizes deep learning to provide contextual features for each word. Category baseline: only using category tags of each invention as inputs. Shown are receiver operating characteristic (ROC) curves on the hold-out test set. A classifier using TF-IDF features achieves a 0.71 area under the receiver operating characteristic (AUROC) on the hold-out test set.
Beyond the temporal changes, we also identified linguistic features in how the OTL describes the inventions that significantly correlate with the invention’s future revenue. We found that inventions with longer marketing abstracts (p = 2.2E−04) or more adjectives in the marketing abstracts (p = 1.4E−05) are associated with worse net income (Table S4). Interestingly, we found that words like “novel” (p = 3.57E−08), “significant” (p = 2.00E−03), and “effective” (p = 8.51E−03) correlate negatively with the net income, even after controlling for categories. These adjectives remain statistically significant after adjusting for multiple hypothesis testing with a false discovery rate of 0.05.
In contrast, only a few adjectives correlate positively with net income.
Analyses of inventors’ abstracts show similar results: the length of inventors’ abstracts has also substantially increased over time (p = 6.3E−19) after controlling for categories (Table S1). In addition, both the length of the inventors’ abstracts (p = 1.3E−02) and the fraction of adjectives (p = 2.7E−05) correlate negatively with net income (Table S4). We further investigated the correlation between net income and the usage of each adjective in the inventors’ abstracts (Figure S1A). Similarly, we found that in inventors’ abstracts, adjectives like “significant” (p = 4.04E−04), “novel” (p = 2.38E−02), and “effective” (p = 3.27E−03) also correlate negatively with the net income, even after controlling for categories. These adjectives also remain statistically significant after adjusting for multiple hypothesis testing with a false discovery rate of 0.05. One possible explanation is that for more incremental inventions, inventors tend to write longer abstracts and use more adjectives to highlight their novelties and advantages over existing technologies, and the writing of marketing abstracts by OTLs might be influenced by the inventors’ abstracts.
Finally, to quantify the distinction between the marketing abstracts of above-median income inventions and those of below-median income inventions, we trained machine-learning classifiers to predict whether the net income rank is above 0.5, i.e., whether the net income is above the median for the same disclosure year (Figure 4D). A classifier using term frequency-inverse document frequency (TF-IDF) features achieves a 0.71 area under the receiver operating characteristic (AUROC) on the hold-out test set. The BERT classifier, which utilizes deep learning to provide contextual features for each word, achieves a higher AUROC of 0.76.
In contrast, the baseline classifier that takes category annotations as input only achieves a low 0.57 AUROC score, suggesting that the linguistic patterns we identify here are not driven by different styles of presenting different categories of inventions. Experiments on inventors’ abstracts show similar results (Figure S1B). This suggests that the abstracts of above-median income inventions have clearly distinguishing textual features beyond categorical differences.
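To make the AUROC figures concrete, the sketch below computes the metric from its pairwise definition: the probability that a randomly chosen above-median invention receives a higher classifier score than a randomly chosen below-median one, with ties counted as one half. This is an illustrative stdlib implementation, not the evaluation code used in the study, and the labels and scores are invented.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for above-median inventions, 0 otherwise.
    scores: classifier scores, higher meaning "more likely above median".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented scores for six held-out inventions.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
value = auroc(labels, scores)
```

Under this definition, a score of 0.5 corresponds to random guessing and 1.0 to perfect separation, which gives a scale for reading the 0.57, 0.71, and 0.76 values reported above.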
Discussion
This paper provides a systematic and quantitative characterization of the technology licensing pipeline at Stanford between 1970 and 2020, with 4,512 inventions from 6,557 inventors. Our analysis characterizes how the innovation landscape at Stanford changed over time: the top-income invention categories shifted from electronics to life sciences after 2000. The inventions might also be increasingly specialized, as indicated by the substantial increase in the length of both the titles and the abstracts for describing them.
Our demographic analysis suggests that inventions involved larger teams over time across all categories. The proportion of female inventors has tripled over the past 25 years, though they are still underrepresented. This finding is consistent with previous research findings on the gender gap in patenting. Proactive efforts can be taken to support diverse faculty members in translating their research to industry. Our analysis also highlights the important role of inventors in commercializing research: the most profitable inventions are predominantly licensed by inventors’ own startups, and such self-licensing practices are also becoming increasingly popular over time. This finding is consistent with previous research showing that startups with direct connections to the university tend to be more successful than otherwise similar startups. Several other papers have also shown evidence for a positive relationship between faculty involvement and commercialization outcomes. Overall, the self-licensing rate increases over time, and there is an interesting peak of the self-licensing rate in 1995–1999 that might be related to the dot-com bubble.
An important role of the Stanford OTL is to market the researchers’ inventions to potentially interested companies. A primary channel for this marketing is the marketing abstract that the OTL provides to describe the invention to the public.
Our linguistic analysis identified features in how the OTL describes the inventions that significantly correlate with the invention’s future revenue. Interestingly, inventions with more adjectives in the marketing abstracts are associated with worse net income. Adjectives like “novel,” “effective,” and “significant” in the marketing abstracts correlate negatively with the net income, even after controlling for categories and year. One possible explanation is that for more incremental inventions, inventors tend to write longer abstracts and use more adjectives to highlight their novelties and advantages over existing technologies, and the writing of marketing abstracts by the OTL might be influenced by the inventors’ abstracts. Furthermore, the strong predictive performance at discriminating both the author and marketing abstracts of inventions with above-median versus below-median income, after controlling for categories, exemplifies their substantial linguistic difference and opens up new possibilities for further linguistic analysis. Future work includes incorporating further semantic analysis of the abstracts to measure the scientific novelty of the inventions.
The findings of this study have to be considered in light of some limitations. First, invention licensing via OTLs represents only one facet of the transfer of technology from university to industry, though it is an important facet. Second, we primarily focus on net income as the outcome metric because it is straightforward to quantify and is a key metric in the OTL’s own assessment. However, it is important to note that licensing income does not completely capture impact, and pursuing licensing income is not the ultimate goal of the Stanford OTL. The third limitation concerns the observational nature of our study.
Although we have been careful in controlling for confounders like category and year in our statistical models, the results should not be interpreted as causal but rather as statistical associations. Finally, while our data focus on a single university, Stanford University, this is an important case study because Stanford is a leading center of innovation. Our findings also provide insights into the academic-industry partnership of Silicon Valley since many technologies and startups from Stanford are commercialized there. More work is needed to study the technology licensing at other universities with different entrepreneurial environments.
Experimental procedures
Resource availability
Lead contact
Further information and requests for code and data should be directed to and will be fulfilled by the lead contact, James Zou (jamesz@stanford.edu).
Materials availability
This study did not generate any physical materials.
Materials and methods
Stanford inventions data
Metadata for the subset of 4,512 inventions that were prioritized for marketing, corresponding to 6,557 inventors from Stanford between 1980 and 2020, were provided to us by the Stanford OTL for analysis. Many of the inventions were web marketed, which partly explains the rapid increase in the number of inventions in the 1990s. The OTL receives invention disclosures from Stanford faculty, staff, and students. Generally, faculty notify the OTL of their invention discoveries and delegate to the university all rights to negotiate licenses on their behalf. After receiving the invention disclosures, the OTL evaluates the commercial potential of the invention and, if it is a patentable subject matter, decides whether to file a patent. Our dataset contains inventions protected by both patents and copyright. For each invention, we have data on the title, name of inventors, the abstract, keywords, category tags, and disclosure date. We also have access to the cumulative revenue and the cumulative expense, from which we can derive the cumulative net income of each invention, which provides a measure of the impact of each invention. The net income in our dataset is calculated before sharing it with inventors and inventors' departments and schools. Given that we aimed to focus on the impact of each invention, we considered each invention (i.e., docket) as our unit of analysis. For more information about the Stanford OTL, we refer interested readers to https://otl.stanford.edu/. For examples of Stanford inventions, we refer interested readers to http://techfinder.stanford.edu/
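The derivation of net income from the two cumulative figures can be sketched as follows. The `Docket` class, its field names, and the dollar amounts are hypothetical and chosen only to illustrate the arithmetic; the final line checks the average reported in the Results ($581 M over 4,512 inventions) against its rounded value.

```python
from dataclasses import dataclass


@dataclass
class Docket:
    """One invention (docket); field names are hypothetical."""
    docket_id: str
    cumulative_revenue: float  # total licensing income, in dollars
    cumulative_expense: float  # e.g., patent application and litigation costs

    @property
    def net_income(self) -> float:
        # Computed before any sharing with inventors, departments, or schools.
        return self.cumulative_revenue - self.cumulative_expense


def share_positive(dockets):
    """Fraction of dockets whose cumulative net income is above zero."""
    return sum(d.net_income > 0 for d in dockets) / len(dockets)


# Invented figures for five dockets.
dockets = [
    Docket("D1", 250_000.0, 40_000.0),
    Docket("D2", 0.0, 12_000.0),
    Docket("D3", 0.0, 8_000.0),
    Docket("D4", 1_500.0, 9_000.0),
    Docket("D5", 0.0, 500.0),
]
positive_share = share_positive(dockets)

# Consistency check on the reported totals, in millions of dollars.
average_net_income_musd = round(581 / 4512, 2)
```

In this toy set only one docket in five has recovered its costs, mirroring the skew described in the Results, where most inventions carry a negative net income.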
Categorization for inventions
The original Stanford invention dataset contains category tags only for a subset of 1,700 inventions, annotated by a third-party marketing platform. Therefore, we trained a machine-learning model to propagate category annotations to all inventions. The input of the categorization model is the keywords and the title of each invention. For each category tag, we trained a binary classifier by fine-tuning a BERT deep neural language model to predict whether an invention belongs to this category or not. Using a held-out test set, we found that our categorization models achieve high classification performance: the AUC is larger than 0.8 for most category tags and larger than 0.7 for all category tags. The categorization model is implemented using PyTorch 1.4.0. Since the categorization models we trained were accurate (AUC > 0.8 for most category tags), we used the full sample (4,512 inventions) along with the predicted category tags for our analyses.
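As an illustration of the per-tag evaluation described above, the following minimal sketch computes the AUC of one binary classifier per category tag on held-out data. It is not the code used in this study: the function names, the tag names, and the toy scores are hypothetical, and the AUC is computed directly from its pairwise definition rather than with a library routine.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive example outscores
    a random negative example (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def per_tag_auc(true_labels, predicted_scores):
    """Evaluate each tag's binary classifier separately.

    Both arguments map a tag name to a list with one entry per
    held-out invention (0/1 labels and classifier scores)."""
    return {tag: roc_auc(true_labels[tag], predicted_scores[tag])
            for tag in true_labels}


# Hypothetical held-out labels and scores for two made-up tags.
held_out_labels = {"tag_a": [1, 0, 1, 0], "tag_b": [1, 1, 0, 0]}
held_out_scores = {"tag_a": [0.9, 0.2, 0.4, 0.6], "tag_b": [0.8, 0.7, 0.3, 0.1]}
aucs = per_tag_auc(held_out_labels, held_out_scores)
```

Reporting one AUC per tag, rather than a single pooled number, makes it possible to state both the typical and the worst-case performance across tags.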
Statistical analysis and multiple hypothesis testing
Data processing, statistical testing, and visualization were performed using Python v.3.7. We conducted a series of statistical tests using the Python package statsmodels (https://www.statsmodels.org/stable/index.html). We supply p values as a tool for interpretation; we maintain the convention of 0.05 as the threshold for statistical significance. To correct for multiple hypothesis testing, we applied the Benjamini-Hochberg procedure at a false discovery rate of alpha = 0.05 using the statsmodels package. Our plots were generated using the matplotlib Python package.
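For readers unfamiliar with the Benjamini-Hochberg procedure, a minimal pure-Python sketch follows. It is an illustration rather than the code used in this study; in statsmodels the same correction is available through `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"`. The function name and the example p values below are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if it is rejected
    when controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject every hypothesis ranked at or below k, even if its own
    # p value exceeded its own threshold.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p values from four tests.
decisions = benjamini_hochberg([0.01, 0.04, 0.03, 0.20], alpha=0.05)
```

Note the step-up behavior: with p values of 0.03, 0.03, 0.03, and 0.20, the first three are all rejected because the third-ranked value satisfies 0.03 <= (3/4) * 0.05, although none of them would pass the threshold for rank 1.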
Linguistic analysis: Predicting net income from both the author and marketing abstracts
We hypothesized that there are linguistic differences in how the inventors and the OTL describe the inventions. We performed linguistic analysis on inventors' abstracts (the abstracts of the invention disclosures written by university scientists), marketing abstracts (the abstracts rewritten by the Stanford OTL’s marketing team for the audience of business and legal professionals), and the final invention titles edited by the Stanford OTL’s marketing team. We split the author and marketing abstracts under consideration (inventions between 1980 and 2020) into training and test sets in an 80%/20% ratio. Using the scikit-learn (sklearn) Python package, we fitted a TF-IDF featurizer on the training data and then featurized both the training and test data. Finally, we trained a logistic regression model on these features and evaluated the area under the ROC curve (AUC) on the test set. We also experimented with the BERT deep neural language model, implemented using PyTorch 1.4.0.
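The pipeline described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the toy abstracts are invented, the binary target (here called `high_income`) is an assumed encoding of net income that this section does not specify, and all hyperparameters are library defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def fit_and_score(abstracts, high_income, seed=0):
    """Train TF-IDF + logistic regression and return the test-set AUC."""
    # 80%/20% train/test split; stratifying keeps both classes in the test set.
    x_train, x_test, y_train, y_test = train_test_split(
        abstracts, high_income, test_size=0.2,
        random_state=seed, stratify=high_income)
    featurizer = TfidfVectorizer()
    # Fit the featurizer on training text only, then apply it to both splits.
    train_features = featurizer.fit_transform(x_train)
    test_features = featurizer.transform(x_test)
    model = LogisticRegression(max_iter=1000)
    model.fit(train_features, y_train)
    # Score each test abstract with the predicted probability of class 1.
    scores = model.predict_proba(test_features)[:, 1]
    return roc_auc_score(y_test, scores)


# Invented toy data: 10 abstracts per class (assumed binary target).
toy_abstracts = ([f"method for imaging tissue sample {i}" for i in range(10)]
                 + [f"novel efficient robust device {i}" for i in range(10)])
toy_labels = [1] * 10 + [0] * 10
toy_auc = fit_and_score(toy_abstracts, toy_labels)
```

Fitting the featurizer on the training split alone avoids leaking vocabulary and document-frequency statistics from the test abstracts into the model.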
Linguistic analysis: Adjectives
Part-of-speech tagging assigns each word in a text to a particular part of speech (e.g., adjective, noun, pronoun, verb) based on both its context and its definition. We used the Python package spaCy (https://spacy.io/), an industrial-strength natural language processing toolkit, to perform part-of-speech tagging and identify adjectives.
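A minimal sketch of adjective counting with spaCy is given below. It is illustrative only: the function names are hypothetical, the choice of the `en_core_web_sm` model is an assumption (this section does not name the model used), and whether adjectives are best measured as a raw count or as a fraction of words is likewise left open, so the helper returns both.

```python
# Coarse tags that do not correspond to words.
NON_WORD_TAGS = {"PUNCT", "SPACE"}


def adjective_stats(pos_tags):
    """Given coarse part-of-speech tags (one per token), return the number
    of adjectives and their fraction among word tokens."""
    words = [tag for tag in pos_tags if tag not in NON_WORD_TAGS]
    n_adjectives = sum(tag == "ADJ" for tag in words)
    fraction = n_adjectives / len(words) if words else 0.0
    return n_adjectives, fraction


def pos_tags_for(text, model_name="en_core_web_sm"):
    """Tag a text with spaCy and return its coarse part-of-speech tags.

    Assumption: the named model has been installed separately. spaCy is
    imported here so that adjective_stats can be used without it."""
    import spacy
    nlp = spacy.load(model_name)
    return [token.pos_ for token in nlp(text)]
```

With spaCy and a model installed, `adjective_stats(pos_tags_for(abstract))` yields the statistics for one abstract; for a large corpus, loading the model once and reusing it would be preferable to loading it on every call.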
Demographic analysis: Stanford faculty gender data
We used the official percentages of female faculty at Stanford over the years from the Faculty Demographics reports authored by the Stanford Office of Faculty Development, Diversity, and Engagement, which are publicly available at https://facultydevelopment.stanford.edu/data-reports/faculty-demographics. We follow the method used by previous research66, 67 for gender identification from names. This method has recently been validated on a dataset of scientist names extracted from the WoS database. The gender of each inventor is inferred from their name using the Python package gender-guesser (https://pypi.python.org/pypi/gender-guesser/). Previous research shows that the gender-guesser package achieves the lowest misclassification rate and minimizes bias. The validation performed by Santamaría and Mihaljević (2019) limited misclassification to 1.5% for European names, 3.6% for African names, and 6.4% for Asian names.
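The name-based inference can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: the function names are hypothetical, and collapsing the package's "mostly_male" and "mostly_female" labels into "male" and "female", while excluding ambiguous and unknown names from the denominator, is an assumed convention that this section does not specify.

```python
# gender-guesser returns one of: "male", "female", "mostly_male",
# "mostly_female", "andy" (androgynous), or "unknown".
# Assumption: "mostly_*" labels are merged into the main categories.
COLLAPSED = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
}


def collapse_label(raw_label):
    """Map a raw gender-guesser label to "male", "female", or "unknown"."""
    return COLLAPSED.get(raw_label, "unknown")


def female_share(raw_labels):
    """Fraction of female names among those with an inferred gender,
    or None if no name could be identified."""
    collapsed = [collapse_label(label) for label in raw_labels]
    identified = [label for label in collapsed if label != "unknown"]
    if not identified:
        return None
    return identified.count("female") / len(identified)


def guess_labels(first_names):
    """Look up raw labels with gender-guesser (imported here so that the
    helpers above can be used without the package installed)."""
    import gender_guesser.detector as gender_detector
    detector = gender_detector.Detector()
    return [detector.get_gender(name) for name in first_names]
```

Computing `female_share(guess_labels(names))` separately for each disclosure year would give a time series of the kind compared against the faculty percentages.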