Regression Models for Trait Prediction

Alice Bezett

Alice Bezett is one of our resident data scientists here at Bannerconnect. She digs deep into our vast quantities of first, second and third party data for the insights that give our clients’ programmatic advertising the edge to stand out from the crowd. She comes to us from a background in finance and academia but has found her passion for statistics renewed in programmatic. Read more from Alice on how a career in programmatic has strengthened her love of statistics here and how she found the most important factors in a campaign here

Recently, we in the data department have been playing with a fun project – can we predict different traits of a user based on their browsing patterns and habits? This could mean predicting whether you’re male or female, or simply identifying your general interests. This could be useful for creating audiences for campaigns, although (as always) the deeper value lies in the insight it gives us into how people browse the internet. Personally, I find the way the internet has become pervasive in our lives fascinating: within my own lifetime I’ve gone from looking up formulas in textbooks in the library to googling them. I’d like to think that the way I browse media is unique – we’re all snowflakes after all. But what if I’m just acting in exactly the same way as every other woman in the 30-40 age category? Finding these commonalities is the subject of our research, which can help us predict your traits from the content you browse.

We have a very large pool of anonymous data, in which we can track a single user over their internet journey. We actually know nothing about these users, but perhaps we can make broad, sweeping statements? For example, we might assume that if you spend all your time on Formula 1 websites, you really like cars (probably a male-dominated hobby), and if you’re dominating food forums, perhaps you have a strong interest in cooking – which could be perceived as more woman-centered. This thinking is, of course, deeply flawed and plays into all the stereotypes I’d prefer to avoid – especially as my cultural heritage makes an interest in rugby mandatory (am I a man now?). However, if our aim was to predict sports lovers, or those with a strong interest in homewares, rather than gender, then perhaps our method gives a reasonable start.

Now, the much more interesting part is: can we study a person’s general behaviour on unrelated sites, such as news or weather sites, and predict their interest in either cars or cooking? There could be subtle differences in how people with different interests use general sites – such as reading the lifestyle section of a news site versus the sports section. Do car lovers check the weather more often than food lovers? Or at different times of day, or for different lengths of time?

This is where making a model to predict behaviours becomes useful. A good model should include the relevant features necessary to make the prediction of whether you like cars or cooking and be able to “choose” for itself which features are the most important to our problem.

Our first step in creating this model is to investigate and decide which model we should use, out of all the different models available to us. We’ll focus on just two options: logistic or linear regression. The way I have asked the question (cars: yes/no, cooking: yes/no) actually implies that we should use a logistic regression: this is characterized by a yes/no variable as the outcome (as we covered in my previous blog). This is different from a linear regression, where the outcome is a continuous variable, such as the height of a child or the cost of a house, which can take any of a range of values. In each case, we suppose that the outcome (cooking or not, or the height of a child) can be directly related to the features in our model.
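To make the logistic case concrete, here is a minimal sketch of fitting one. Everything below is synthetic: the feature (share of pageviews spent on a sports section) and the coefficients are made up purely for illustration, and the fit uses plain gradient descent rather than any particular library.

```python
import numpy as np

# Synthetic illustration of a logistic regression: the outcome is a
# yes/no variable (likes cars or not), linked to a feature through the
# logistic (sigmoid) function. Feature and coefficients are invented.
rng = np.random.default_rng(0)
n = 1000
sports_share = rng.uniform(0, 1, n)      # hypothetical feature
logit = 4 * sports_share - 2             # "true" linear predictor
p_true = 1 / (1 + np.exp(-logit))
likes_cars = rng.binomial(1, p_true)     # observed yes/no outcome

# Fit by gradient descent on the log-loss.
X = np.column_stack([np.ones(n), sports_share])
w = np.zeros(2)
for _ in range(10_000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 1.0 * X.T @ (p - likes_cars) / n

print(w)  # recovered intercept and slope, near the true (-2, 4)
```

The key point is the sigmoid link: the model’s output is always a probability between 0 and 1, which is what makes it natural for yes/no outcomes.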

For me, this choice of which regression to use has always been rather set in stone: the building blocks of regression have several assumptions that must be met, such as the distribution of data and errors, and the types of data present. To violate those assumptions in my mind seriously challenges the validity of the resulting model. In fact, this is how we are usually taught statistics. However, it seems rules are no fun unless someone is breaking them, which led me to the blog of Paul von Hippel.

Paul is an associate professor at the University of Texas, with a background in data science and finance – and a strong history of using the methods he advocates. The main idea in his blog is that in certain situations, a linear model can work just fine in place of a logistic one. Broadly, this holds when the predicted event has a moderate chance of occurring: if you expect a probability of between 20% and 80%, a linear model can work just as well as the logistic model. That is clearly the case in the example we’re discussing here, where we’d expect the traits we’re looking at (cars and cooking) to be reasonably common.

An even deeper analysis of the errors involved is given by Hellevik, who shows across a variety of examples that the error from using a linear model in place of a logistic one is very small at moderate probabilities; his paper makes particularly good reading on this topic.
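We can check this claim numerically. The sketch below (synthetic data, invented coefficients) fits both a linear probability model and a logistic model to the same data, where the true probabilities stay in roughly the 20%-80% band, and compares the fitted probabilities:

```python
import numpy as np

# Numerical check of the moderate-probability argument: when the true
# probabilities stay roughly between 20% and 80%, a linear probability
# model tracks the logistic model closely. All data is synthetic.
rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(0, 1, n)
p_true = 1 / (1 + np.exp(-(2.4 * x - 1.2)))   # stays within ~0.23-0.77
y = rng.binomial(1, p_true)

# Linear probability model: ordinary least squares of y on x.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
p_linear = A @ beta

# Logistic model, fitted by gradient descent on the log-loss.
w = np.zeros(2)
for _ in range(10_000):
    p = 1 / (1 + np.exp(-A @ w))
    w -= 1.0 * A.T @ (p - y) / n
p_logit = 1 / (1 + np.exp(-A @ w))

# Largest disagreement between the two fitted probability curves.
print(np.max(np.abs(p_linear - p_logit)))
```

In this regime the two fitted curves differ by only a couple of percentage points at most; the disagreement only becomes serious once probabilities approach 0 or 1, where the sigmoid curves away from any straight line.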

The obvious question from this discussion then becomes: why would we even bother using the “wrong” model? If there is no particular advantage, aren’t we just taking the risk that our analysis is wrong? There are actually two reasons why we might want to take the step of using a linear model.

  1. The first is that linear regression is much easier to understand and to explain to others. This means our model can be grasped by a wide variety of people, and questions such as which variable matters most can be answered at a glance.
  2. The second much larger advantage is that linear regression is much easier to solve computationally – saving us a lot of time and computational power. With our mountain of data, this can make a very large difference, and give quite an advantage to running these models at scale!
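To put a rough number on that second point, the sketch below (synthetic data, numpy only, sizes chosen arbitrarily) fits both models to the same design matrix. The linear model is a single closed-form solve of the normal equations, while the logistic model has no closed-form solution and needs an iterative fit – here a few steps of Newton’s method, each of which costs about as much as the entire linear fit:

```python
import time
import numpy as np

# Rough cost comparison on synthetic data: one closed-form solve for
# linear regression versus an iterative fit for logistic regression.
rng = np.random.default_rng(2)
n, k = 200_000, 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
true_w = rng.normal(size=k)
y = rng.binomial(1, 1 / (1 + np.exp(-0.2 * (X @ true_w))))

# Linear regression: one solve of the normal equations.
t0 = time.perf_counter()
beta = np.linalg.solve(X.T @ X, X.T @ y)
t_linear = time.perf_counter() - t0

# Logistic regression: Newton's method, one weighted solve per step.
t0 = time.perf_counter()
w = np.zeros(k)
for _ in range(8):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (p - y)
    hess = (X * (p * (1 - p))[:, None]).T @ X
    w -= np.linalg.solve(hess, grad)
t_logistic = time.perf_counter() - t0

print(f"linear: {t_linear:.3f}s, logistic: {t_logistic:.3f}s")
```

The exact timings depend on the machine and the solver, but the structural point stands: the iterative fit repeats the expensive matrix work once per iteration, which adds up quickly when running many models over a mountain of data.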

This topic is something we’re currently playing with, and it has so far made for some very interesting trials. Look out for an update when we find out whether we can accurately predict an interest in Formula 1… hopefully quickly. Just not as quickly as those cars.
