"Mining Massive Fine-Grained Behavior Data to Improve Predictive Analytics"

David Martens, Foster Provost, Jessica Clark, and Enric Junque de Fortuny

Abstract

Organizations increasingly have access to massive, fine-grained data on consumer behavior. Despite the hype over big data and the success of predictive analytics, only a few organizations have incorporated such fine-grained data, in a nonaggregated manner, into their predictive analytics. This paper examines the use of massive, fine-grained data on consumer behavior—specifically payments to a very large set of particular merchants—to improve predictive models for targeted marketing. The paper details how using this different sort of data can substantially improve predictive performance, even in an application for which predictive analytics has been applied for years. One of the most striking results has important implications for managers considering the value of big data. Analysis of a real-life dataset of 21 million transactions by 1.2 million customers, together with 289 other variables describing these customers, shows no appreciable improvement from moving to big data when using traditional structured data. In contrast, when using fine-grained behavior data, increasing the data size continues to add substantial value across the entire range of the analyses. This suggests that larger firms may have substantially more valuable data assets than smaller firms when using their transaction data for targeted marketing.
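For readers who want a concrete picture of how such fine-grained behavior data can be fed into a predictive model, below is a minimal sketch in Python with scikit-learn. It is not the paper's actual pipeline; the file names and column names (customer_id, merchant_id, responded) are hypothetical, and the specific model choice is illustrative.

```python
# A minimal sketch (not the paper's code) of encoding fine-grained payment data
# as a sparse customer-by-merchant matrix and using it in a targeting model.
# File names and column names ("customer_id", "merchant_id", "responded") are hypothetical.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

transactions = pd.read_csv("transactions.csv")   # one row per payment
labels = pd.read_csv("labels.csv")               # one row per customer: responded 0/1

# Encode each customer as a sparse binary vector over the merchants they paid.
cust = transactions["customer_id"].astype("category")
merch = transactions["merchant_id"].astype("category")
X = csr_matrix(
    (np.ones(len(transactions)), (cust.cat.codes, merch.cat.codes)),
    shape=(len(cust.cat.categories), len(merch.cat.categories)),
)
X = (X > 0).astype(float)                        # presence/absence of each merchant

# Align response labels with the row order of the customer-by-merchant matrix
# (assumes every customer in the transactions appears in the labels file).
y = labels.set_index("customer_id").loc[cust.cat.categories, "responded"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC with fine-grained behavior features: {auc:.3f}")
```

To trace how predictive value grows with data size, as in the paper's analyses, one could repeat this fit on nested subsamples of customers and plot AUC against the number of customers used.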


Working paper available here

"Matrix-Factorization-Based Dimensionality Reduction in the Predictive Modeling Process: A Design Science Perspective"

Jessica Clark and Foster Provost

Abstract

Dimensionality Reduction (DR) is frequently employed in the predictive modeling process with the goal of improving the generalization performance of models. This paper takes a design science perspective on DR. We treat it as an important business analytics artifact and investigate its utility in the context of binary classification, with the goal of understanding its proper use and thus improving predictive modeling research and practice. 

Despite DR's popularity, we show that many published studies fail to undertake the necessary comparison to establish that it actually improves performance. We then conduct an experimental comparison between binary classification with and without matrix-factorization-based DR as a preprocessing step on the features. In particular, we investigate DR in the context of supervised complexity control. These experiments utilize three classifiers and three matrix-factorization-based DR techniques, and measure performance on a total of 26 classification tasks.
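The kind of comparison described above can be sketched briefly with scikit-learn. This is only an illustrative setup, not the authors' experimental protocol: the example dataset, the use of TruncatedSVD as the matrix-factorization technique, the number of components, and the regularization grid used for complexity control are all assumptions made for the sketch.

```python
# Illustrative sketch of the comparison described above (not the authors' exact protocol):
# logistic regression with and without SVD-based DR as a preprocessing step, with the
# L2 regularization strength (complexity control) chosen by cross-validation.
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Any binary classification task works here; this public text dataset is just an example.
data = fetch_20newsgroups_vectorized(subset="all")
mask = (data.target == 0) | (data.target == 1)          # reduce to a binary task
X, y = data.data[mask], (data.target[mask] == 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}        # L2 complexity control

# (a) Original feature set with L2-regularized logistic regression.
full = Pipeline([("clf", LogisticRegression(max_iter=2000))])
# (b) Matrix-factorization-based DR (SVD) as a preprocessing step, then the same classifier.
reduced = Pipeline([("svd", TruncatedSVD(n_components=100, random_state=0)),
                    ("clf", LogisticRegression(max_iter=2000))])

for name, pipe in [("full features", full), ("SVD + LR", reduced)]:
    search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)
    auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```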

We find that DR is generally not beneficial for binary classification. Its effect depends on problem difficulty: the more difficult the problem, the more DR is able to improve performance, while it diminishes performance on easier problems. However, this relationship depends on complexity control: DR's benefit is eliminated completely when state-of-the-art methods are used for complexity control.

The wide variety of experimental conditions allows us to dig more deeply into when and why the different forms of complexity control are useful. We find that L2-regularized logistic regression models trained on the original feature set have the best performance in general. The relative benefit provided by DR is increased when using a classifier that incorporates feature selection; unfortunately, the performance of these models, even with DR, is lower in general. We compare three matrix-factorization-based DR algorithms and find that none does better than using the full feature set, but of the three, SVD has the best performance.

The results in this paper should be broadly useful for researchers and industry practitioners who work in applied data science. In particular, they emphasize the design science principle that adding design elements to the predictive modeling process should be done with attention to whether they add value.


Working paper available here

"Who Gets Started on Kickstarter? Demographic Variations in Crowdfunding Success"

Lauren Rhue and Jessica Clark

Abstract

Crowdfunding platforms like Kickstarter are expected to "democratize" funding by increasing the availability of capital to traditionally underrepresented groups, but there is conflicting evidence about racial disparities in success rates. This paper contributes to the information systems literature on crowdfunding by examining the racial dynamics of the Kickstarter platform. In particular, we study three (sometimes conflicting) cues that allow potential backers to infer race: the fundraiser photo, the project photo, and the textual content of the project description. We create a novel data set comprising project characteristics; the race of project and fundraiser photo subjects, determined using facial recognition software; and the full text of project descriptions.

Our analysis yields three main findings. First, there are substantial differences in the textual content of project descriptions across racial groups. It is possible to predict with high accuracy the race of those associated with a Kickstarter project based only on the words in the description. Second, projects with Black fundraisers or subjects in project photos face significantly lower success rates, even when controlling for observable project characteristics and the textual content of project descriptions. Third, we address cases in which the racial cues are not aligned. Race in the fundraiser photo has a greater effect on success probability than does race in the project photo, and visually identifiable race in general has a greater effect on success probability than does textually identifiable race.

This work expands information systems theory on crowdfunding, identity, and discrimination; utilizes novel "big data" techniques; and yields empirical results that demonstrate bias in an important online platform. It has important practical implications both for online platform designers and for users of those platforms.


"Who's Watching TV?"

Jessica Clark, Jean-Francois Paiement, and Foster Provost

Abstract

Understanding the demographics of TV shows' audiences is of vital concern to advertisers and other stakeholders. Such knowledge is traditionally learned from data sources such as Nielsen, which measure individuals' viewership using small, opt-in panels and report aggregate numbers. Massive viewership data available at the individual Set-Top Box (STB) level has led to new estimation methods, but there is a crucial weakness in how viewers are measured: it is impossible to tell with certainty which person is watching TV in a multi-person household. This work introduces and formulates the problem of estimating which person is watching, which to our knowledge has not been addressed in the existing literature. We develop a novel framework for estimating the likelihood that each member of a multi-person household is watching. The method leverages characteristics of both multi-instance learning and domain adaptation by adapting probabilities learned from single-person STBs to the multi-person STB setting. A core difficulty of the problem is that there are no ground-truth labels indicating who is actually watching; therefore, we derive a set of tasks at which models must succeed in order to demonstrate that they address the core problem of interest. Two current state-of-the-art heuristic methods fail on at least two of these necessary tasks, whereas the novel method we develop succeeds at all of them. The solution thus has implications for researchers interested in understanding television viewing behavior by individuals and groups, as well as broad applications within the television advertising industry and in any situation where multiple people share the same device or account but individual-level inferences are desired. A major TV provider plans to deploy this method in its TV ad-targeting system. No personally identifiable information (PII) was gathered or used in conducting this study. To the extent any data was analyzed, it was anonymous and/or aggregated data, consistent with the carrier's privacy policy.
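To make the core idea concrete, below is a deliberately simplified sketch (Python/pandas) of adapting viewing behavior observed in single-person households to score the members of a multi-person household. It is not the framework developed in the paper; the demographic groups, program IDs, and column names are invented for illustration, and the simple frequency-based scoring stands in for the paper's multi-instance-learning and domain-adaptation approach.

```python
# Illustrative sketch only (not the authors' model): using viewing behavior observed
# in single-person households, where the viewer's identity is known, to score which
# member of a multi-person household is more likely to be watching a given program.
# The demographic groups, program IDs, and column names are hypothetical.
import numpy as np
import pandas as pd

# Viewing events from single-person STBs (the viewer's demographic group is known).
single = pd.DataFrame({
    "demo_group": ["adult_f", "adult_m", "teen",    "adult_f", "teen",    "adult_m"],
    "program_id": ["news",    "news",    "cartoon", "drama",   "cartoon", "drama"],
})

# For each demographic group, estimate a smoothed distribution over programs watched
# in single-person households (add-one smoothing handles unseen group/program pairs).
counts = pd.crosstab(single["demo_group"], single["program_id"])
rates = (counts + 1).div((counts + 1).sum(axis=1), axis=0)

def member_scores(household_members, program_id):
    """Return a normalized score per member for who is likely watching `program_id`."""
    probs = np.array([rates.loc[m].get(program_id, 1e-6) for m in household_members])
    return dict(zip(household_members, probs / probs.sum()))

# A two-person household tunes to "cartoon": the teen gets most of the probability mass.
print(member_scores(["adult_f", "teen"], "cartoon"))
```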

Working paper available here