Information Systems Research, 2023

Who’s Watching TV?

Jessica Clark, Jean-Francois Paiement, Foster Provost

Abstract

This work addresses the problem of “user disambiguation”: estimating the likelihood that each member of a small group is using a shared account or device. The main focus is on television set-top box (STB) viewership data in multiperson households, in which it is impossible to tell with certainty which household members watch what. The first main contribution is formulating user disambiguation as a predictive problem. The second is a solution for estimating the likelihood that each individual in a multiperson household watches each TV segment. Kernel theories from the marketing, economics, and sociology literatures inform the design of our method, which learns priors for viewership in single-person households and then adapts them to the specifics of each multiperson household’s viewership history. Finally, we formalize two ad hoc heuristics currently used in industry (and research) for estimating the audience composition of STB data and conduct a comparative analysis using simulated data, real large-scale viewership data, and a fully labeled panel-based data source. We find that our method has superior performance and practical value. The proposed solution has implications for advertisers, for researchers seeking a better understanding of TV viewership, and for anyone using data generated by shared devices or accounts. A major TV provider has deployed the method in its TV ad-targeting system. No personally identifiable information was gathered or used in conducting this study; to the extent any data were analyzed, they were anonymized and/or aggregated, consistent with the carrier’s privacy policy.
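The abstract describes the method only at a high level. As a loose, hypothetical illustration of the "learn priors, then adapt to household history" idea, the EM-style sketch below estimates per-member viewing probabilities; all names, the genre-level representation, and the blended update rule are our assumptions, not the paper's actual algorithm.

    import numpy as np

    def disambiguate(prior, watched, n_iter=10):
        """Hypothetical sketch: estimate P(member m watched segment s).

        prior:   (n_members, n_genres) genre propensities learned from
                 single-person households with matching demographics.
        watched: (n_segments,) genre index of each segment in the
                 multiperson household's STB log.
        """
        affinity = prior.copy()
        for _ in range(n_iter):
            # Responsibility of each member for each watched segment,
            # normalized so each segment's probabilities sum to 1.
            seg_scores = affinity[:, watched]
            resp = seg_scores / seg_scores.sum(axis=0, keepdims=True)
            # Adapt propensities toward this household's own history,
            # blended with the single-person-household priors.
            for g in range(prior.shape[1]):
                mask = watched == g
                if mask.any():
                    affinity[:, g] = 0.5 * prior[:, g] + 0.5 * resp[:, mask].mean(axis=1)
        return resp

    # Toy household: member 0 skews sports, member 1 skews drama.
    prior = np.array([[0.7, 0.2, 0.4],
                      [0.4, 0.9, 0.4]])
    watched = np.array([0, 0, 1, 2, 2])  # genre of each logged segment
    print(disambiguate(prior, watched).round(2))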


Health Policy and Technology, 2023

Addressing algorithmic bias and the perpetuation of health inequities: An AI bias aware framework

Ritu Agarwal, Margret Bjarnadottir, Lauren Rhue, Michelle Dugas, Kenyon Crowley, Jessica Clark, Gordon Gao

Abstract

The emergence and increasing use of artificial intelligence and machine learning (AI/ML) in healthcare practice and delivery is being greeted with both optimism and caution. We focus on the nexus of AI/ML and racial disparities in healthcare: an issue that must be addressed if the promise of AI to improve patient care and health outcomes is to be realized equitably for all populations. We unpack the challenge of algorithmic bias that may perpetuate health disparities. Synthesizing research from multiple disciplines, we describe a four-step analytical process used to build and deploy AI/ML algorithms and solutions, highlighting both the sources of bias and methods for mitigating it. Finally, we offer recommendations for advancing the pursuit of fairness.
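As a concrete, hypothetical illustration of the evaluation step in such a process, one common diagnostic compares error rates across demographic groups; the sketch below computes an equal-opportunity gap (the spread in true-positive rates). The metric choice and all names are ours, not prescriptions from the paper.

    import numpy as np

    def tpr_gap(y_true, y_pred, group):
        """Equal-opportunity gap: spread of true-positive rates across groups."""
        tprs = []
        for g in np.unique(group):
            mask = (group == g) & (y_true == 1)   # positives in group g
            tprs.append(y_pred[mask].mean())      # fraction correctly flagged
        return max(tprs) - min(tprs)

    y_true = np.array([1, 1, 0, 1, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 1, 0])
    group  = np.array(["a", "a", "a", "b", "b", "b"])
    print(tpr_gap(y_true, y_pred, group))  # 0.5: TPR is 0.5 for a, 1.0 for b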


Who Are You and What Are You Selling? Creator-Based and Product-Based Racial Cues in Crowdfunding

Lauren Rhue and Jessica Clark

Abstract

The display of personal information in crowdfunding campaigns is vital for facilitating trust, and this information often communicates the racial identity of the fundraiser. We study the relationship between those racial cues and crowdfunding success. Using data on more than 100,000 projects gathered from Kickstarter.com, we categorize racial cues as creator-based versus product-based. For each category, we derive racial cues in two different mediums: photographic versus textual. We use propensity score matching to estimate the effects of racial identity across racial groups, categories, and mediums. We find that the category of racial cues is associated with crowdfunding success. Projects with either creator-based or product-based cues of African-American identity have lower success rates. In contrast, creator-based cues of Asian identity are associated with lower success, whereas product-based cues are associated with increased success. Furthermore, when product-based and creator-based cues are misaligned, the outcomes more closely follow those associated with product-based cues, suggesting that backers are more attuned to product attributes. Our results also suggest that racial anonymity is associated with higher success rates than either African-American or Asian racial cues. Our study contributes to the understanding of racial identity on digital platforms across multiple contexts, mediums, and racial groups.
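For readers unfamiliar with the estimation strategy, the sketch below shows propensity score matching in its simplest form: 1-nearest-neighbor matching on an estimated propensity score. It is a generic illustration with synthetic data; the paper's actual covariates, matching procedure, and outcome definitions differ.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    def psm_att(X, treated, outcome):
        """Average treatment effect on the treated via 1-NN matching
        on the propensity score P(treated | X)."""
        ps = LogisticRegression(max_iter=1000).fit(X, treated)
        scores = ps.predict_proba(X)[:, 1].reshape(-1, 1)
        # Match each treated unit to the control with the closest score.
        nn = NearestNeighbors(n_neighbors=1).fit(scores[treated == 0])
        _, idx = nn.kneighbors(scores[treated == 1])
        matched = outcome[treated == 0][idx.ravel()]
        return (outcome[treated == 1] - matched).mean()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    treated = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
    outcome = 2.0 * treated + X[:, 0] + rng.normal(size=500)
    print(psm_att(X, treated, outcome))  # near the true effect of 2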


Unsupervised dimensionality reduction versus supervised regularization for classification from sparse data

Jessica Clark and Foster Provost

Abstract

Unsupervised matrix-factorization-based dimensionality reduction (DR) techniques are popularly used for feature engineering with the goal of improving the generalization performance of predictive models, especially with massive, sparse feature sets. Often DR is employed for the same purpose as supervised regularization and other forms of complexity control: exploiting a bias/variance tradeoff to mitigate overfitting. Contradicting this practice, there is consensus among existing expert guidelines that supervised regularization is a superior way to improve predictive performance. However, these guidelines are not always followed for this sort of data, and it is not unusual to find DR used with no comparison to modeling with the full feature set. Further, the existing literature does not take into account that DR and supervised regularization are often used in conjunction. We experimentally compare binary classification performance using DR features versus the original features under numerous conditions: 97 binary classification tasks, 6 classifiers, 3 DR techniques, and 4 evaluation metrics. Crucially, we also vary the methodologies used to tune and evaluate key hyperparameters. We find a clear but nuanced result. With state-of-the-art hyperparameter selection, applying DR does not add value beyond supervised regularization and can often diminish performance. However, if regularization is not done well (e.g., one simply uses the default regularization parameter), DR does perform relatively better, but these approaches yield lower performance overall. These latter results help explain why practitioners may continue to use DR without undertaking the necessary comparison to the original features. In light of the main results, however, this practice seems generally wrongheaded if the goal is to maximize generalization performance.
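The core comparison can be reproduced in miniature: tune the regularization parameter of a classifier on the full sparse feature set, then repeat on SVD-reduced features. The sketch below uses synthetic sparse data and an arbitrary hyperparameter grid; it illustrates the experimental design only, not the paper's datasets or full protocol.

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.metrics import roc_auc_score

    # Synthetic massive, sparse feature matrix with a weak planted signal.
    X = sparse_random(2000, 5000, density=0.01, format="csr", random_state=0)
    signal = np.asarray(X[:, :50].sum(axis=1)).ravel()
    y = (signal > np.median(signal)).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    grid = {"C": [0.01, 0.1, 1, 10]}  # tune regularization; don't use the default

    def tuned_auc(F_tr, F_te):
        clf = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                           scoring="roc_auc").fit(F_tr, y_tr)
        return roc_auc_score(y_te, clf.predict_proba(F_te)[:, 1])

    svd = TruncatedSVD(n_components=100, random_state=0).fit(X_tr)
    print("original features:", tuned_auc(X_tr, X_te))
    print("SVD features:     ", tuned_auc(svd.transform(X_tr), svd.transform(X_te)))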


Mining Massive Fine-Grained Behavior Data to Improve Predictive Analytics

David Martens, Foster Provost, Jessica Clark, and Enric Junque de Fortuny

Abstract

Organizations increasingly have access to massive, fine-grained data on consumer behavior. Despite the hype over big data and the success of predictive analytics, only a few organizations have incorporated such fine-grained data, in nonaggregated form, into their predictive analytics. This paper examines the use of massive, fine-grained data on consumer behavior, specifically payments to a very large set of particular merchants, to improve predictive models for targeted marketing. The paper details how using this different sort of data can substantially improve predictive performance, even in an application to which predictive analytics has been applied for years. One of the most striking results has important implications for managers considering the value of big data. Using a real-life dataset of 21 million transactions by 1.2 million customers, along with 289 other variables describing those customers, the results show no appreciable improvement from moving to big data when using traditional structured data. In contrast, when using fine-grained behavior data, there continues to be substantial value to increasing the data size across the entire range of the analyses. This suggests that larger firms may have substantially more valuable data assets than smaller firms when using their transaction data for targeted marketing.
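As a minimal, hypothetical sketch of how such fine-grained behavior data are typically encoded, each customer becomes a sparse binary vector over the merchants they have paid, which a linear model can consume directly; the tiny arrays below stand in for the paper's 21 million transactions.

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.linear_model import LogisticRegression

    # Stand-in transaction log: (customer_id, merchant_id) pairs.
    customers = np.array([0, 0, 1, 2, 2, 2])
    merchants = np.array([3, 7, 7, 1, 3, 9])
    y = np.array([1, 0, 1])  # one target label per customer

    n_cust, n_merch = customers.max() + 1, merchants.max() + 1
    X = csr_matrix((np.ones_like(customers), (customers, merchants)),
                   shape=(n_cust, n_merch))
    X.data[:] = 1  # binarize repeat payments to the same merchant

    model = LogisticRegression(max_iter=1000).fit(X, y)
    # A learning curve over increasing numbers of customers would exhibit the
    # paper's headline pattern: value keeps growing with more behavior data.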