New Approaches To Describing Research Investments

bridges vol. 37, May 2013 / Feature Articles

By Julia Lane, David Newman, and Rebecca F. Rosen

"How much should a nation spend on science? What kind of science? How much from private versus public sectors? Does demand for funding by potential science performers imply a shortage of funding or a surfeit of performers? ... A new "science of science policy" is emerging, and it may offer more compelling guidance for policy decisions and for more credible advocacy."

John H. Marburger III, "Wanted: Better Benchmarks." Science, May 20, 2005

Marburger's questions still resonate today. What research is being funded? Which agencies are funding what research? How is research funding in different areas changing over time? Answering these questions is fundamentally important for policy makers, science researchers, and the general public. Yet traditional, highly manual science policy methods make such answers impossible. We must do better, as science is changing rapidly and nations rely increasingly on research to solve economic and societal problems. In addition, tight national budgets require better management and evidence-based accountability.

One way we can do better is by using science (Lane, 2010). New scientific technologies, such as natural language processing and topic modeling, can be used to provide fresh perspectives on research investments and portfolios and their contributions to scholarship and innovation, both nationally and internationally. Many of these technologies are open source, so they can be customized in flexible ways. They are also robust, so their results can be understood by a well-established and mature scientific community.

In this report, we explain how one such approach – topic modeling – was used to describe scientific research portfolios in France and the United States. The approach emerged from work commissioned by NSF's Advisory Subcommittee of the Advisory Committees for the Social, Behavioral, and Economic (SBE) Sciences and the Computer and Information Science and Engineering (CISE) Directorates to identify techniques and tools that could characterize a specific set of proposal and award portfolios (National Science Foundation, 2011).


What is Topic Modeling?

Topic modeling is a computational method that automatically learns a set of topics describing a corpus of text documents. The topic model is an unsupervised methodology: it learns topics directly from the text data and does not require dictionaries, thesauri, or other ontologies. It evolved out of earlier techniques such as Latent Semantic Indexing. Underlying the topic model is the assumption that each individual document exhibits only a small number of topics. The model learns from large-scale patterns of terms that tend to co-occur in documents (Newman, Asuncion, Smyth, & Welling, 2009).

Topic modeling (also referred to as Latent Dirichlet Allocation or LDA) is widely considered the state-of-the-art method for automatically extracting semantic content from collections of text documents. The topic model simultaneously learns a set of topics that describe a collection of documents, and a short list of topics associated with each document in the collection. The automatically learned topics are focused distributions of terms that convey some theme or subject area; for scientific literature most of the topics correspond well to research areas. For example, the table below lists sample topics learned from a collection of scientific proposal abstracts. The learned topics are listed on the left and the human-provided label on the right, to emphasize that the topics preceded the labels. These examples show how learned topics are able to capture the themes of research disciplines.

Table 1. Sample topics learned from a collection of scientific proposal abstracts

| Learned topic | Human label |
| --- | --- |
| policy public government policies organization local agencies maker private ... | PUBLIC POLICY |
| black gravitational hole holes wave relativity waves gravity general ... | RELATIVITY |
| mobile wireless network spectrum communication devices radio access ... | WIRELESS NETWORKS |
| plant insect species ant interaction pollinator host pollination flower ... | BIOLOGY |
| user software end interface tool web interfaces interactive task need ... | SOFTWARE |
| data dimensional algorithm matrix dimension analysis reduction sparse ... | MATHEMATICS |
| chemistry organic reaction synthesis synthetic method compound chiral ... | ORGANIC CHEMISTRY |
| genes gene genetic expression function gene_expression regulatory specific ... | GENE EXPRESSION |
| economic labor worker income effect inequality household job policy market ... | ECONOMICS |
| material matter electron quantum physics transition properties states theoretical ... | QUANTUM PHYSICS |
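To make the idea of learning topics from co-occurrence concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling on a toy corpus. It is an illustration only, not the software used for the portfolios described in this article: the function name `fit_lda`, the hyperparameter values, and the four tiny "documents" (built from words in the table above) are all assumptions made for the example.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (theta, top_terms): theta[d][k] is
    the estimated share of topic k in document d, and top_terms[k] lists the
    words most often assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k overall
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Resample: topics already common in this document and
                # already associated with this word are more likely.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_terms = [
        [word for word, count in counts.most_common(5) if count > 0]
        for counts in topic_word
    ]
    return theta, top_terms


# Hypothetical toy corpus, not real proposal abstracts.
docs = [
    "black hole gravitational wave relativity gravity".split(),
    "gravitational wave black hole relativity".split(),
    "mobile wireless network spectrum radio".split(),
    "wireless network mobile radio spectrum communication".split(),
]
theta, top_terms = fit_lda(docs, num_topics=2)
for k, terms in enumerate(top_terms):
    print(k, " ".join(terms))
```

The sampler never sees a dictionary or a label; the only signal is which words appear together in the same documents. Production systems work with far larger corpora and typically rely on optimized or distributed implementations.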


The topic model also gives each individual document a reduced-dimensionality representation: a short list of topics, with weights, that characterizes that document specifically. Table 2 illustrates this capability. The document shown, on multimodal interfaces, is neatly characterized by a short list of topics (the top four topics are listed, with their percent allocations).


Table 2. Mix of topics in one document
Document title: Multimodal Interface for Retrieval of Perceptually and Semantically Similar Biomedical Images

| Allocation | Topic terms |
| --- | --- |
| 14% | information search web text retrieval document semantic user content tool |
| 12% | imaging image medical method tomography reconstruction inverse optical |
| 12% | visual environment object virtual information eye world movement human |
| 10% | user software end interface tool web interfaces interactive task need computer |
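Percent allocations like those in Table 2 can be read as the share of a document's words assigned to each topic. The sketch below shows that calculation under stated assumptions: the helper `top_topics`, the topic ids, and the 50-token assignment list are hypothetical, chosen only so that the resulting shares match the 14/12/12/10 percent pattern above.

```python
from collections import Counter


def top_topics(token_topics, n=4):
    """Return the n most frequent topics in a document with their share of tokens."""
    counts = Counter(token_topics)
    total = sum(counts.values())
    return [(topic, count / total) for topic, count in counts.most_common(n)]


# Hypothetical per-token topic assignments for one 50-token document.
# Topic ids are made up; 26 tokens are spread thinly over other topics.
token_topics = [7] * 7 + [3] * 6 + [12] * 6 + [5] * 5 + list(range(100, 126))

for topic, share in top_topics(token_topics):
    print(f"({share:.0%}) topic {topic}")
```

The listed shares do not add up to 100 percent because only the top four topics are shown; the remaining words belong to topics with small individual weights.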


Topic Modeling to Describe Research Portfolios

Because topic allocations are additive, a set of documents can also be characterized by a mixture or histogram of topics. Using this approach, topics provide a way to understand a set of documents, for example, a collection of scientific abstracts or research descriptions within a research portfolio. By aggregating over several different time intervals, one could measure (topically) how sets of documents within a research portfolio change over time. In general, it is possible to aggregate over any available metadata.
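As a rough sketch of this kind of aggregation, the code below averages per-document topic mixes within each group, where the group is any metadata value such as award year. The function name `portfolio_mix_by_group`, the topic labels, and the three sample documents are assumptions for illustration; equal weighting of documents is also an assumption, and weighting by length or award size would be an equally plausible choice.

```python
from collections import defaultdict


def portfolio_mix_by_group(doc_mixes, groups):
    """Average per-document topic mixes within each group (for example, award year)."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for mix, group in zip(doc_mixes, groups):
        sizes[group] += 1
        for topic, share in mix.items():
            sums[group][topic] += share
    return {
        group: {topic: total / sizes[group] for topic, total in topics.items()}
        for group, topics in sums.items()
    }


# Hypothetical topic mixes for three documents and the year of each.
doc_mixes = [
    {"RELATIVITY": 0.8, "MATHEMATICS": 0.2},
    {"WIRELESS NETWORKS": 1.0},
    {"RELATIVITY": 0.5, "WIRELESS NETWORKS": 0.5},
]
years = [2010, 2010, 2012]

by_year = portfolio_mix_by_group(doc_mixes, years)
for year in sorted(by_year):
    print(year, by_year[year])
```

Swapping `years` for a list of agencies, programs, or institutions gives the same histogram along a different dimension.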

Topic modeling is an alternative to possible preexisting schemes for classification or categorization, but can also be used along with those preexisting schemes. Topic modeling creates a unified topic basis for a wide variety of analyses. The topic representation provides an immediate structure for comparing, contrasting, and combining text documents. Topics are a convenient basis for both querying and reporting, as well as a useful basis for visualizations, both for computing relations between documents and for annotating and color coding visualizations. Since each document in a collection contains multiple topics, researchers can use cluster analyses to identify aggregate research areas within a portfolio and to visualize the growth and decline of these research communities over time.
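One simple way to compute relations between documents on a topic basis is to compare their topic mixes directly. The sketch below uses cosine similarity, which is a common choice but not necessarily the measure used in the work described here; the function names, document ids, and topic mixes are hypothetical.

```python
import math


def cosine_similarity(mix_a, mix_b):
    """Cosine similarity between two sparse topic mixes (dicts of topic -> share)."""
    dot = sum(share * mix_b.get(topic, 0.0) for topic, share in mix_a.items())
    norm_a = math.sqrt(sum(share * share for share in mix_a.values()))
    norm_b = math.sqrt(sum(share * share for share in mix_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_mix, corpus_mixes, n=3):
    """Return the ids of the n documents whose topic mix is closest to query_mix."""
    scored = [
        (cosine_similarity(query_mix, mix), doc_id)
        for doc_id, mix in corpus_mixes.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:n]]


# Hypothetical topic mixes keyed by document id.
corpus_mixes = {
    "A": {"RELATIVITY": 0.9, "MATHEMATICS": 0.1},
    "B": {"WIRELESS NETWORKS": 0.7, "SOFTWARE": 0.3},
    "C": {"RELATIVITY": 0.6, "QUANTUM PHYSICS": 0.4},
}
query = {"RELATIVITY": 1.0}
print(most_similar(query, corpus_mixes, n=2))
```

The same pairwise similarities could feed a cluster analysis or a layout algorithm for visualization.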

The following list suggests the broad range of questions one could answer using the topic modeling approach – much like the questions raised in our opening quotation:

  1. How much does research agency X spend on nanotechnology and nanomaterials?
  2. How have funded research areas changed between 2010 and 2012?
  3. What documents are similar to this set of documents?
  4. What is the topical makeup of documents categorized as X?
  5. What documents are related to a given keyword query?
  6. What topics do document sets X and Y have in common?
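Question 1 can be approached by splitting each award's funding across topics in proportion to the award's topic mix and then summing by agency. This proportional split is an assumption made for the sketch, not a method stated in this article, and the agencies, amounts, and topic labels below are invented.

```python
from collections import defaultdict


def spending_by_topic(awards):
    """Sum award amounts per (agency, topic), splitting each award by its topic mix."""
    totals = defaultdict(float)
    for award in awards:
        for topic, share in award["topic_mix"].items():
            totals[(award["agency"], topic)] += award["amount"] * share
    return dict(totals)


# Hypothetical awards; amounts and topic mixes are made up.
awards = [
    {"agency": "X", "amount": 100000.0,
     "topic_mix": {"NANOMATERIALS": 0.6, "ORGANIC CHEMISTRY": 0.4}},
    {"agency": "X", "amount": 50000.0,
     "topic_mix": {"NANOMATERIALS": 0.2, "QUANTUM PHYSICS": 0.8}},
    {"agency": "Y", "amount": 80000.0,
     "topic_mix": {"NANOMATERIALS": 0.5, "QUANTUM PHYSICS": 0.5}},
]

totals = spending_by_topic(awards)
print(totals[("X", "NANOMATERIALS")])
```

Adding a year field to the grouping key would address question 2 in the same way.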

We have used topic modeling to describe research portfolios within and between funding agencies, both domestic and international. The R&D Dashboard was developed to visualize National Institutes of Health (NIH) and National Science Foundation (NSF) grants by geographic location and by research topic; a video of its use appears below. The NIH/NSF topics were based on a collection of titles and abstracts from all NSF and NIH grants from 2000 to 2009. This collection included over 100,000 NSF grants and over 600,000 NIH grants.


R&D Dashboard from Christina Jones on Vimeo.


Working with the French Institut National du Cancer (INCa), we described their portfolio using a topic model that was trained on all publicly available NIH abstracts (Talley et al., 2011). The results of applying that model to the 316 INCa awards demonstrate how funding agencies can use topic models to describe their portfolios in an international context.
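Applying an already trained model to new documents means keeping the learned topics fixed and estimating only each new document's topic mix. The sketch below does this with a simple EM-style "fold-in" procedure; the article does not say which inference method was actually used, so this is one plausible approach. The function name, the two tiny topics, and their word probabilities are hypothetical.

```python
def infer_topic_mix(tokens, topic_word_probs, iterations=50, smoothing=0.01):
    """Estimate a new document's topic mix while keeping the topics fixed.

    topic_word_probs[k] maps each word to its probability under topic k.
    Words unknown to every topic are ignored. smoothing must be positive so
    that no topic's share collapses to exactly zero.
    """
    num_topics = len(topic_word_probs)
    known = [
        word for word in tokens
        if any(probs.get(word, 0.0) > 0.0 for probs in topic_word_probs)
    ]
    theta = [1.0 / num_topics] * num_topics
    if not known:
        return theta  # nothing to learn from; fall back to a uniform mix

    for _ in range(iterations):
        expected = [0.0] * num_topics
        for word in known:
            # How strongly each topic explains this word, given the current mix.
            weights = [
                theta[k] * topic_word_probs[k].get(word, 0.0)
                for k in range(num_topics)
            ]
            total = sum(weights)
            for k in range(num_topics):
                expected[k] += weights[k] / total
        # Re-estimate the mix from the expected token counts per topic.
        theta = [
            (count + smoothing) / (len(known) + num_topics * smoothing)
            for count in expected
        ]
    return theta


# Hypothetical fixed topics, standing in for a model trained elsewhere.
topic_word_probs = [
    {"chemotherapy": 0.5, "cancer": 0.3, "oncology": 0.2},
    {"wireless": 0.6, "network": 0.4},
]
new_doc = "cancer chemotherapy trial oncology".split()
theta = infer_topic_mix(new_doc, topic_word_probs)
print([round(share, 3) for share in theta])
```

Because the topics are not re-learned, documents from different agencies or years end up described in the same topic vocabulary, which is what makes them directly comparable.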

We also investigated how the NIH funds topic area 428 (chemotherapy, cancer, oncology), the topic with the highest coverage in the INCa portfolio. Figure 1 provides a visual overview of the NIH portfolio in topic 428, generated with the publicly available Web site. The first box identifies similar phrases in the NIH context (adjuvant chemotherapy, etc.), co-occurring topics, and similar topics. It also shows tags that occur commonly in proposals that have topic 428 as the main topic, such as NIH Concurrent Keywords (chemotherapy, cancer patient), NIH CRISP codes (neoplasm/cancer chemotherapy), and PubMed MeSH codes (antineoplastic combined chemotherapy).

Figure 1. Visual overview of the NIH portfolio in topic 428 (chemotherapy, cancer, oncology).

INCa grants and NIH awards were compared by matching terms in titles and abstracts of funded projects. Four of the top 10 areas funded by INCa are among the top 40 NIH areas; details of the NIH-comparable awards can be accessed through the publicly available Web site.

At the National Science Foundation, a topic model with 1,000 topics was generated from the corpus of research proposals submitted between 2000 and 2010; proposals through 2012 were then analyzed with the same topic model. We worked with the Engineering directorate to visualize how their research funding in different areas was changing over time, which enabled them to identify the organizations across NSF that were funding research in their top topics. A video of the results of that activity is available:


Portfolio Explorer COV Module from Christina Jones on Vimeo.



The topic modeling approach is intuitively appealing and reflects one of the many ways in which science is being used to describe research investments. Many other approaches are being developed: for example, a major research university consortium – the Committee on Institutional Cooperation – has been developing new ways of answering the sets of questions identified above. * Importantly, more countries are building the types of data systems that will permit more scientific approaches for informing science policy.



Talley, E. M., D. Newman, D. Mimno, B. W. Herr II, H. M. Wallach, G. A. P. C. Burns, A. G. M. Leenders, and A. McCallum. "Database of NIH grants using machine-learned categories and graphical clustering." Nature Methods (2011): 443-44.

Lane, J. "Let's make science metrics more scientific." Nature 464(7288) (2010): 488-89.

National Science Foundation. Report to the Advisory Committees of the Directorates of Computer and Information Science and Engineering and Social, Behavioral and Economic Sciences. National Science Foundation (2011).

Newman, D., A. Asuncion, P. Smyth, and M. Welling. "Distributed Algorithms for Topic Models." Journal of Machine Learning Research 10 (2009): 1801-28.


 *See, for example, Username=starmetrics Password: iC3mc26#