
Allen Riddell

  • riddella@indiana.edu
  • Luddy Hall 2124
  • (812) 856-4730
  • Home Website
  • Assistant Professor
    School of Informatics and Computing

Field of study

  • Comparative Media Studies, Sociology of Literature, Data Science, Social Informatics, Digital Preservation, Machine Learning

Education

  • Ph.D., Program in Literature, Duke University, 2013
  • M.S. in Statistics, Duke University, 2013
  • B.A. in Comparative Literature with Honors, Stanford University, 2004

Research interests

  • Allen Riddell is an Assistant Professor in the School of Informatics, Computing, and Engineering. His research explores applications of modern statistical methods in the humanities and allied social sciences, with interests in the sociology of literature, publishing history, comparative media studies, library digitization, and text mining. Prior to coming to Indiana University, Riddell was a Neukom Fellow at the Neukom Institute for Computational Science and the Leslie Center for the Humanities at Dartmouth College.

Professional Experience

  • Assistant Professor of Information Science, School of Informatics, Computing, and Engineering, Indiana University, 2016-present
  • William H. Neukom Class of 1964 Fellow, Neukom Institute for Computational Science, Dartmouth College, 2013-2016
  • Postdoctoral Fellow, Leslie Center for the Humanities, Dartmouth College, 2013-2016
  • Visiting Fellow, eHumanities Group, Royal Netherlands Academy of Arts and Sciences, 2015

Awards

  • Katherine Goodman Stern Fellowship, Duke University (full tuition), 2012-2013
  • Summer Research Fellowship, Duke University, 2011
  • Alliance of Digital Humanities Organizations (ADHO) Bursary Award, 2011
  • Foreign Language and Area Studies Summer Fellowship (Chinese), 2008
  • James B. Duke Fellowship, Duke University, 2007-2011
  • University Scholars Fellowship, Duke University (full tuition), 2007-2008
  • Information Science + Information Studies Project Grant, Duke University, 2007

Representative publications

Stan: A probabilistic programming language (2017)
Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt ...
Journal of Statistical Software, 76 (1)

Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.14.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can also be called from the command line using the cmdstan package, through R using the rstan package, and through Python using the pystan package. All three interfaces support sampling and optimization-based inference with diagnostics and posterior analysis. rstan and pystan also provide access to log probabilities, gradients, Hessians, parameter transforms, and specialized plotting.
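To give a concrete sense of the workflow the abstract describes, here is a minimal sketch of fitting a toy model through the PyStan interface it mentions. The model, the simulated data, and the sampler settings are invented for illustration and assume the PyStan 2-era API (pystan.StanModel and its sampling method); none of this is drawn from the paper itself.

    # Minimal sketch (not from the paper): fit a toy normal model via PyStan 2.
    import numpy as np
    import pystan

    model_code = """
    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real mu;
      real<lower=0> sigma;
    }
    model {
      mu ~ normal(0, 10);
      sigma ~ cauchy(0, 5);
      y ~ normal(mu, sigma);   // likelihood contributes to the log density sampled by NUTS
    }
    """

    y = np.random.normal(loc=1.0, scale=2.0, size=50)            # simulated observations
    model = pystan.StanModel(model_code=model_code)               # compile the Stan program
    fit = model.sampling(data={"N": len(y), "y": y}, iter=2000, chains=4)
    print(fit)                                                    # posterior summaries for mu and sigma

The same Stan program could equally be run from the command line with cmdstan or from R with rstan, as the abstract notes; only the calling code changes.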

How to read 22,198 journal articles: Studying the history of German studies with topic models (2014)
Allen Beye Riddell
Distant Readings: Topologies of German Culture in the Long Nineteenth Century, 91-114

In the past decade, research libraries have digitized their holdings, making a vast collection of scanned books, newspapers, and other texts conveniently accessible. While these collections present obvious opportunities for historical research, the task of exploring the contents of thousands of texts presents a challenge. This chapter introduces a family of methods, often called topic models, that can be used to explore very large collections of texts. Researchers using these methods may be found not only in computer science, statistics, and computational linguistics but also increasingly in the human and social sciences in fields such as women’s history, political science, history of science, and classical studies. This introduction uses a topic model to explore a particular corpus, a collection of 22,198 journal articles and book reviews from four US-based …

The Supreme Court and the judicial genre (2017)
Michael A Livermore, Allen B Riddell and Daniel N Rockmore
Ariz. L. Rev., 59, 837

The US Supreme Court is a singular institution within the American judiciary. It is uniquely powerful, sitting atop a judicial hierarchy in which lower courts are legally bound by its pronouncements. Its unique institutional features include a small docket, the ability to select the cases that it will decide, and a tradition of sitting as a whole, rather than in panels. The Court also plays a unique role in American political and social life: in the last several decades alone, its decisions have spurred social movements, reshaped core cultural institutions, and settled presidential elections. Although as an institution the Supreme Court is distinctive, it remains recognizably a court. The Supreme Court shares certain rituals with other US judicial institutions, such as the black robe and gavel. It also shares many procedures with other courts, including adversarial hearings and restrictions on ex parte contacts. Perhaps most important …

Zero-shot style transfer in text using recurrent neural networks (2017)
Keith Carlson, Allen Riddell and Daniel Rockmore
arXiv preprint arXiv:1711.04731,

Zero-shot translation is the task of translating between a language pair where no aligned data for the pair is provided during training. In this work we employ a model that creates paraphrases which are written in the style of another existing text. Since we provide the model with no paired examples from the source style to the target style during training, we call this task zero-shot style transfer. Herein, we identify a high-quality source of aligned, stylistically distinct text in Bible versions and use this data to train an encoder/decoder recurrent neural model. We also train a statistical machine translation system, Moses, for comparison. We find that the neural network outperforms Moses on the established BLEU and PINC metrics for evaluating paraphrase quality. This technique can be widely applied due to the broad definition of style which is used. For example, tasks like text simplification can easily be viewed as style transfer. The corpus itself is highly parallel with 33 distinct Bible versions used, and human-aligned due to the presence of chapter and verse numbers within the text. This makes the data a rich source of study for other natural language tasks.

Bending the law: geometric tools for quantifying influence in the multinetwork of legal opinions (2018)
Greg Leibon, Michael Livermore, Reed Harder, Allen Riddell and Dan Rockmore
Artificial Intelligence and Law, 26 (2), 145-167

Legal reasoning requires identification through search of authoritative legal texts (such as statutes, constitutions, or prior judicial opinions) that apply to a given legal question. In this paper, using a network representation of US Supreme Court opinions that integrates citation connectivity and topical similarity, we model the activity of law search as an organizing principle in the evolution of the corpus of legal texts. The network model and (parametrized) probabilistic search behavior generate a PageRank-style ranking of the texts that in turn gives rise to a natural geometry of the opinion corpus. This enables us to then measure the ways in which new judicial opinions affect the topography of the network and its future evolution. While we deploy it here on the US Supreme Court opinion corpus, there are obvious extensions to large evolving bodies of legal text (or text corpora in general). The model is a proxy for …

Agenda formation and the US supreme court: A topic model approach (2016)
Michael A Livermore, A Riddell and Daniel Rockmore
Arizona Law Review, 1 (2),

This paper exploits a relatively new approach to quantitative text analysis—topic modeling—to examine the subject matter of Supreme Court decisions, and in particular to analyze how the semantic content produced by the Court differs from the published decisions of the US Appellate Courts. To conduct this analysis, we fit a topic model to the joint corpus of decisions (Supreme Court plus Appellate Court). The topic model enables a quantitative measurement of differences in semantic content between three corpora: the Supreme Court decisions, the Appellate Court decisions, and Appellate Court cases selected for review. We develop new methods to estimate these differences over time. We reach two findings. First, the Supreme Court has become substantially more semantically idiosyncratic in recent decades, as measured by the use of the topic distribution within a decision as a predictor of the authoring court …

A simple topic model (mixture of unigrams) (2012)
Allen B Riddell
July 22

NB: This is an extended version of the appendix of my paper exploring trends in German Studies in the US between 1928 and 2006. In that paper I used a topic model (Latent Dirichlet Allocation); this tutorial is intended to help readers understand how LDA works. Topic models typically start with two banal assumptions. The first is that in a large collection of texts there exist a number of distinct groups (or sources) of texts. In the case of academic journal articles, these groups might be associated with different journals, authors, research subfields, or publication periods (e.g., the 1950s and 1980s). The second assumption is that texts from different sources tend to use different vocabulary. If we are presented with an article selected from one of two different academic journals, one dealing with literature and another with archeology, and we are told only that the word “plot” appears frequently in the article, we would be wise …
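As a rough illustration of the two assumptions this tutorial describes, the sketch below simulates documents from a mixture of unigrams: each document draws a single source (topic), and every word in that document is then drawn from that source's vocabulary distribution. The vocabulary and probabilities are made up for this example, echoing the literature-versus-archaeology scenario, and are not taken from the tutorial.

    # Illustrative mixture-of-unigrams generator; vocabulary and probabilities are invented.
    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["plot", "novel", "character", "excavation", "pottery", "site"]
    topic_word = np.array([
        [0.35, 0.30, 0.25, 0.03, 0.03, 0.04],   # source 0: a literature journal
        [0.05, 0.03, 0.02, 0.30, 0.30, 0.30],   # source 1: an archaeology journal
    ])
    topic_prior = np.array([0.5, 0.5])           # how often each source produces a document

    def generate_document(n_words=8):
        """Draw one source per document, then draw every word from that source."""
        z = rng.choice(len(topic_prior), p=topic_prior)
        words = rng.choice(vocab, size=n_words, p=topic_word[z])
        return z, " ".join(words)

    for _ in range(3):
        print(generate_document())

Inference in a topic model runs this generative story in reverse: given only the words, it recovers plausible sources and their vocabulary distributions.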

Public domain rank: identifying notable individuals with the wisdom of the crowd (2015)
Allen B Riddell
ACM. 10

Identifying literary, scientific, and technical works of enduring interest is challenging. Few are able to name significant works across more than a handful of domains or languages. This paper introduces an automatic method for identifying authors of notable works throughout history. Notability is defined using the record of which works volunteers have made available in public domain digital editions. A significant benefit of this bottom-up approach is that it also provides a novel and reproducible index of notability for all individuals with Wikipedia pages. This method promises to supplement the work of cultural organizations and institutions seeking to publicize the availability of notable works and prioritize works for preservation and digitization.

Bending the Law (2016)
Greg Leibon, Michael A Livermore, Reed Harder, Allen Riddell and Daniel Rockmore
Available at SSRN 2740136,

Legal reasoning requires identification, through search, of authoritative legal texts (such as statutes, constitutions, or prior judicial decisions) that apply to a given legal question. In this paper we model law search as an organizing principle in the evolution of the corpus of legal texts and apply that model to US Supreme Court opinions. We examine the underlying navigable geometric and topological structure of the Supreme Court opinion corpus (the "opinion landscape") and quantify and study its dynamic evolution. We realize the legal document corpus as a geometric network in which nodes are legal texts connected in a weighted and interleaved fashion according to both semantic similarity and citation connection. This network representation derives from a stylized generative process that models human-executed search via a probabilistic agent that navigates between cases according to these legally relevant features. The network model and (parametrized) probabilistic search behavior give rise to a PageRank-style ranking of the texts (already implemented in a pilot version on a publicly accessible website) that can be compared to search results produced by human researchers. The search model also gives rise to a natural geometry through which we can measure change in the network. This enables us to then measure the ways in which new judicial decisions affect the topography of the network and its future evolution. While we deploy it here on the US Supreme Court opinion corpus, there are obvious extensions to larger evolving bodies of legal text (or text corpora in general). The model is a proxy for the way in …
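The PageRank-style ranking mentioned in this abstract can be illustrated generically with power iteration over a weighted network. The toy edge weights and damping factor below are invented stand-ins for the paper's combination of citation and semantic-similarity connections; this is a sketch of the general ranking idea, not the paper's actual parameterization.

    # Generic PageRank-style ranking by power iteration; weights are illustrative only.
    import numpy as np

    # Toy weighted adjacency among five hypothetical opinions (rows: from, columns: to).
    W = np.array([
        [0.0, 0.6, 0.2, 0.0, 0.2],
        [0.3, 0.0, 0.4, 0.3, 0.0],
        [0.1, 0.5, 0.0, 0.2, 0.2],
        [0.0, 0.3, 0.3, 0.0, 0.4],
        [0.2, 0.2, 0.2, 0.4, 0.0],
    ])

    def pagerank(W, damping=0.85, tol=1e-10, max_iter=1000):
        """Stationary distribution of a searcher who follows weighted edges with probability `damping`."""
        n = W.shape[0]
        P = W / W.sum(axis=1, keepdims=True)     # row-normalize to transition probabilities
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            r_next = (1 - damping) / n + damping * (P.T @ r)
            if np.abs(r_next - r).sum() < tol:
                break
            r = r_next
        return r_next

    print(pagerank(W))   # higher values ~ texts the modeled search process visits more often

In the paper's setting the edge weights encode both citation links and topical similarity, so the resulting scores reflect how prominently each opinion sits in the search landscape rather than citation counts alone.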

Demography of Literary Form: Probabilistic Models for Literary History (2013)
A Riddell
Ph.D. thesis, Duke University.

Digitization of library collections has made millions of books, newspapers, and academic journal articles accessible. These resources present an opportunity for historians interested in identifying patterns in cultural production that emerge over the space of decades or even centuries. For example, considerable interest has been expressed in studying the emergence, decline, and transmission across national and linguistic boundaries of literary form in the tens of thousands of novels published in Europe in the eighteenth and nineteenth centuries. Navigating such a large collection of texts, however, requires the use of quantitative methods rarely used in literary studies. Single, direct reading of even a thousand texts exceeds the time and resources available to most historians. This dissertation demonstrates the application of probabilistic models of texts in the study of literary history. The major finding of the dissertation is that regularities previously identified by literary historians can be captured by probabilistic models. Following the first chapter, “How to Read 22,198 Journal Articles: Studying the History of German Studies Using Topic Models,” which introduces representations of texts used in the dissertation, chapter 3, “Inferring Novelistic Genre in the English Novel, 1800-1836,” and chapter 4, “Networks of Literary Production,” illustrate the contribution probabilistic models of novelistic production are positioned to make to long-standing questions in literary history. Both chapters are concerned with the detection and description of empirical regularities in surviving nineteenth-century …

Evaluating prose style transfer with the Bible (2018)
Keith Carlson, Allen Riddell and Daniel Rockmore
Royal Society open science, 5 (10), 171920

In the prose style transfer task a system, provided with text input and a target prose style, produces output which preserves the meaning of the input text but alters the style. These systems require parallel data for evaluation of results and usually make use of parallel data for training. Currently, there are few publicly available corpora for this task. In this work, we identify a high-quality source of aligned, stylistically distinct text in different versions of the Bible. We provide a standardized split, into training, development and testing data, of the public domain versions in our corpus. This corpus is highly parallel since many Bible versions are included. Sentences are aligned due to the presence of chapter and verse numbers within all versions of the text. In addition to the corpus, we present the results, as measured by the BLEU and PINC metrics, of several models trained on our data which can serve as baselines for future …

Reassembling the English novel, 1789-1919 (2018)
Allen Riddell and Michael Betancourt
arXiv preprint arXiv:1808.00382,

Sociologically-inclined literary history foundered in the 20th century due to a lack of inclusive bibliographies and biographical databases. Without a detailed accounting of literary production, numerous questions proved impossible to answer. The following are representative: How many writers made careers as novelists? Are there unacknowledged precursors or forgotten rivals to canonical authors? To what extent is a writer's critical or commercial success predictable from their social origins? In the last decade library digitization and the development of machine-readable datasets have improved the prospects for data-intensive literary history. This paper offers two analyses of bibliographic data concerning novels published in the British Isles after 1789. First, we estimate yearly rates of new novel publication between 1789 and 1919. Second, using titles of novels appearing between 1800 and 1829, we resolve a dispute concerning occupational gender segregation in novel subgenres. We show that the remarkable growth in the number of men novelists after 1815 was not concentrated in particular subgenres.

Readers and their roles: Evidence from readers of contemporary fiction in the Netherlands (2018)
Allen Riddell and Karina van Dalen-Oskam
PLoS ONE, 13 (7), e0201157

Reading serves many ends. Some readers report that works of fiction provide an imaginative escape from the rigors of life, others report reading in order to be intellectually challenged. While various characterizations of readers’ engagement with prose fiction have been proposed, few have been checked using representative samples of readers. Our research reports on reader self-descriptions observed in a representative sample of 501 adults in the Netherlands. Reader self-descriptions exhibit regularities, with certain self-descriptions predicting others. Contrary to existing theories which posit two types of readers characterized by non-overlapping concerns (identifying readers and distanced readers), we find that while some readers attend to plot structure or read in order to be intellectually challenged, reader self-descriptions overlap more than received theories predict. We hypothesize that some readers have cultivated more reading techniques than others, with educated or experienced readers tending to report deriving additional experiences from reading.

Studying Literary History with Latent Feature Models (2013)
Allen Beye Riddell

Novelistic genres—such as gothic novels, epistolary novels, and Bildungsromane—were an abiding feature of literary production in the nineteenth century. Their appearance, disappearance, and transmission across national and linguistic boundaries continue to be an object of interest for scholars in literary history and sociology of culture. This thesis considers two non-parametric latent feature models of a corpus of British literary fiction and compares the models’ representations with the judgments of literary historians. I find that the models agree with expert classifications of novelistic genre better than chance. This thesis contributes to efforts to validate latent feature models against human judgments and offers further confirmation that probabilistic models of text collections can support historical scholarship.

How to read 16,700 journal articles: studying German Studies with topic models (2012)
Allen B Riddell

For those interested in learning more, I want to point to the following resources. These vary in their assumptions of background knowledge; I’ve tried to put the more introductory material first within each section.
