"Mining Massive Fine-Grained Behavior Data to Improve Predictive Analytics"
David Martens, Foster Provost, Jessica Clark, and Enric Junque de Fortuny
Organizations increasingly have access to massive, fine-grained data on consumer behavior. Despite the hype over big data and the success of predictive analytics, only a few organizations have incorporated such fine-grained data, in nonaggregated form, into their predictive analytics. This paper examines the use of massive, fine-grained data on consumer behavior—specifically payments to a very large set of particular merchants—to improve predictive models for targeted marketing. The paper details how using this different sort of data can substantially improve predictive performance, even in an application for which predictive analytics has been applied for years. One of the most striking results has important implications for managers considering the value of big data. Using a real-life dataset of 21 million transactions by 1.2 million customers, as well as 289 other variables describing these customers, the results show that there is no appreciable improvement from moving to big data when using traditional structured data. In contrast, when using fine-grained behavior data, there continues to be substantial value to increasing the data size across the entire range of the analyses. This suggests that larger firms may have substantially more valuable data assets than smaller firms when using their transaction data for targeted marketing.
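The contrast between the two data representations can be illustrated with a minimal sketch (not the paper's code; the merchant names and transaction log are hypothetical): fine-grained behavior data yields one feature per distinct merchant, whereas a traditional structured representation collapses the same log into a handful of aggregates.

```python
# Illustrative sketch: fine-grained (one binary feature per merchant) vs.
# aggregated (a single count per customer) representations of payment data.
from collections import defaultdict

# hypothetical transaction log: (customer_id, merchant)
transactions = [
    ("c1", "grocer_a"), ("c1", "bookstore_b"), ("c1", "grocer_a"),
    ("c2", "airline_c"), ("c2", "grocer_a"),
    ("c3", "bookstore_b"), ("c3", "cafe_d"),
]

# fine-grained representation: dimensionality grows with the merchant set
merchants = sorted({m for _, m in transactions})
visited = defaultdict(set)
for cust, merch in transactions:
    visited[cust].add(merch)
fine_grained = {
    cust: [1 if m in visited[cust] else 0 for m in merchants]
    for cust in sorted(visited)
}

# traditional aggregate representation: one summary number per customer
aggregated = {cust: len(visited[cust]) for cust in visited}

print(len(merchants))        # number of fine-grained features
print(fine_grained["c1"])    # sparse binary merchant vector for one customer
print(aggregated["c1"])      # the same customer's aggregate feature
```

In a real setting the merchant dimension runs into the hundreds of thousands, which is what makes the fine-grained representation keep gaining value as data size grows.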
"Matrix-Factorization-Based Dimensionality Reduction in the Predictive Modeling Process: A Design Science Perspective"
Jessica Clark and Foster Provost
Dimensionality Reduction (DR) is frequently employed in the predictive modeling process with the goal of improving the generalization performance of models. This paper takes a design science perspective on DR. We treat it as an important business analytics artifact and investigate its utility in the context of binary classification, with the goal of understanding its proper use and thus improving predictive modeling research and practice.
Despite DR's popularity, we show that many published studies fail to undertake the necessary comparison to establish that it actually improves performance. We then conduct an experimental comparison between binary classification with and without matrix-factorization-based DR as a preprocessing step on the features. In particular, we investigate DR in the context of supervised complexity control. These experiments utilize three classifiers and three matrix-factorization-based DR techniques, and measure performance on a total of 26 classification tasks.
We find that DR is generally not beneficial for binary classification. Its effect depends on problem difficulty: the more difficult the problem, the more DR is able to improve performance, while it diminishes performance on easier problems. However, this relationship also depends on complexity control: DR's benefit is eliminated completely when state-of-the-art methods for complexity control are used.
The wide variety of experimental conditions allows us to dig more deeply into when and why the different forms of complexity control are useful. We find that L2-regularized logistic regression models trained on the original feature set have the best performance in general. The relative benefit provided by DR increases when using a classifier that incorporates feature selection; unfortunately, the performance of these models, even with DR, is generally lower. We compare three matrix-factorization-based DR algorithms and find that none does better than using the full feature set, but of the three, SVD has the best performance.
The results in this paper should be broadly useful for researchers and industry practitioners who work in applied data science. In particular, they emphasize the design science principle that adding design elements to the predictive modeling process should be done with attention to whether they add value.
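As a concrete illustration of the preprocessing step the abstract evaluates, the following is a minimal sketch (an assumption-laden illustration, not the paper's pipeline) of SVD-based dimensionality reduction: a feature matrix is projected onto its top-k right singular vectors, and a classifier would then be trained on the reduced matrix instead of the original features.

```python
# Minimal sketch of matrix-factorization-based DR via truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))   # 100 examples, 20 original features

def svd_reduce(X, k):
    """Project X onto its top-k singular directions (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T              # shape: (n_examples, k)

X_reduced = svd_reduce(X, k=5)
print(X.shape, X_reduced.shape)      # (100, 20) (100, 5)
```

The paper's comparison amounts to training the same classifier on `X` versus `X_reduced` (over many tasks and several factorization methods) and asking whether the reduction ever pays off once complexity control is handled well.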
"Who Gets Started on Kickstarter? Demographic Variations in Crowdfunding Success"
Lauren Rhue and Jessica Clark
Crowdfunding platforms like Kickstarter are expected to "democratize" funding by increasing the availability of capital to traditionally underrepresented groups, but there is conflicting evidence about racial disparities in success rates. This paper contributes to the information systems literature on crowdfunding by examining the racial dynamics in the Kickstarter platform. In particular, we study three (sometimes conflicting) cues that allow potential backers to infer race: fundraiser photo, project photo, and textual content of project description. We create a novel data set comprising project characteristics; the race of project and fundraiser photo subjects, determined using facial recognition software; and the full text of project descriptions.
Our analysis results in three main findings. First, there are substantial differences in the textual content of project descriptions across racial groups. It is possible to predict with high accuracy the race of those associated with a Kickstarter project based only on the words in the description. Second, projects with Black fundraisers or subjects in project photos face significantly lower success rates, even when controlling for observable project characteristics and the textual content of project descriptions. Third, we address cases in which the racial cues are not aligned. Race in the fundraiser photo has a greater effect on success probability than does race in the project photo; visually identifiable race in general has a greater effect on success probability than does textually identifiable race.
The results expand information systems theory on crowdfunding, identity, and discrimination, utilize novel "big data" techniques, and yield empirical results that demonstrate bias in an important online platform. This work has important practical implications both for online platform designers and for users of said platforms.
"Who's Watching TV?"
Jessica Clark, Jean-Francois Paiement, and Foster Provost