As mentioned in the last post, accurately measuring all user journeys is a challenge. The tracking solution needs to integrate all channel clicks (and preferably views), both from cookies that have converted at some point and from cookies that haven’t. Such a system needs to set cookies, analyze referrers and combine purchase data with channel-click data on a cookie-id (i.e. user-id) basis. The quality of the attribution model depends heavily on spending sufficient time validating the tracking implementation. Is every campaign and channel interaction being tracked? Are any interactions double counted? Comparing the aggregated numbers with stats from Google Analytics or other web analytics tools is useful. But from painful experience, I can only recommend testing manually as much as possible. This includes creating fake user journeys, for example by clicking on one’s own Google ads and test-purchasing a product, and then validating that these actions have been tracked correctly and with the right timestamps.
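To make the validation step more concrete, here is a minimal sketch of how click logs and purchase logs could be joined on the cookie id and checked for double counting. The log layout (tuples of cookie id, channel and ISO timestamp), the sample rows and the function names are all assumptions for illustration, not the format of any particular tracking tool.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (cookie_id, channel, ISO timestamp)
clicks = [
    ("c1", "display", "2024-01-01T10:00:00"),
    ("c1", "affiliate", "2024-01-02T09:30:00"),
    ("c1", "paid_search", "2024-01-03T18:45:00"),
    ("c1", "paid_search", "2024-01-03T18:45:00"),  # suspicious exact duplicate
    ("c2", "display", "2024-01-01T11:00:00"),
]
# Hypothetical purchase log: (cookie_id, ISO timestamp)
purchases = [("c1", "2024-01-03T19:00:00")]


def find_duplicates(click_rows):
    """Return click rows that appear more than once (possible double counting)."""
    counts = Counter(click_rows)
    return [row for row, n in counts.items() if n > 1]


def build_journeys(click_rows, purchase_rows):
    """Group de-duplicated clicks per cookie, ordered by time, and flag converters."""
    converted = {cookie for cookie, _ in purchase_rows}
    journeys = defaultdict(list)
    # ISO timestamps in the same format sort correctly as plain strings.
    for cookie, channel, ts in sorted(set(click_rows), key=lambda r: (r[0], r[2])):
        journeys[cookie].append((ts, channel))
    return {
        cookie: {"clicks": steps, "converted": cookie in converted}
        for cookie, steps in journeys.items()
    }


duplicates = find_duplicates(clicks)
journeys = build_journeys(clicks, purchases)
```

Running a fake journey through the real tracking setup and comparing it against such a joined view is one way to spot missing or duplicated interactions. Note that this sketch keeps cookies without a purchase, since non-converting journeys are needed for modeling later on.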

For developing an attribution model the conversion event is our binary target variable (customer bought or signed up). The variables about the channel interaction (e.g. the number of clicks on “paid search”) are the covariates that predict the value of the target variable.
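As a rough illustration of this setup, each journey can be turned into one row of click counts per channel plus a 0/1 conversion flag. The channel list and the helper name `to_feature_row` are assumptions for this sketch.

```python
from collections import Counter

# Assumed set of channels; a real setup would derive this from the tracking data.
CHANNELS = ["display", "affiliate", "paid_search"]


def to_feature_row(channel_clicks, converted):
    """Return (covariates, target): click counts per channel and a 0/1 flag."""
    counts = Counter(channel_clicks)
    x = [counts.get(channel, 0) for channel in CHANNELS]
    y = 1 if converted else 0
    return x, y


# Two made-up journeys: one converting, one not.
example_journeys = [
    (["display", "affiliate", "paid_search"], True),
    (["display", "display"], False),
]
dataset = [to_feature_row(clicks, conv) for clicks, conv in example_journeys]
```

Click counts are only one possible encoding. Recency, order of channels or view counts could be added as further covariates in the same way.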

This way the modeling problem becomes a regular classification problem. These types of problems are very common in statistical modeling and machine learning, which allows us to apply proven processes and methods from these fields. These include using training and test sets, cross-validation and a set of metrics for measuring model quality. Common metrics for evaluating the quality of predictive models with a binary target variable are AUC (area under the ROC curve) or a pseudo-R-squared.
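For readers who want to see what these metrics look like in practice, below is a small sketch of a hold-out split and an AUC computation using only the standard library. In a real project one would more likely rely on an established library. The function names and the default test fraction are assumptions.

```python
import random


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
train_rows, test_rows = train_test_split(list(range(8)))
```

The pairwise AUC above is quadratic in the number of rows, so it is only meant to show the idea. The model would be fitted on the training rows and the AUC reported on the held-out test rows.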

Since we aim at interpreting the model’s coefficient estimates as the effect of each channel interaction, we need to make sure that we have relatively robust estimators. This requires variable selection and eliminating multicollinearity, e.g. through measuring variance inflation factors.
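One way to compute variance inflation factors is to take the diagonal of the inverse correlation matrix of the covariates. The sketch below does this with the standard library only. It assumes each covariate is a list of numbers with non-zero variance, and the helper names are made up for this example.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix of the given covariate columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        m, s = mean(col), pstdev(col)  # assumes s > 0 (no constant columns)
        standardized.append([(v - m) / s for v in col])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in standardized]
        for ci in standardized
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of covariate j is the j-th diagonal entry of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Two made-up covariates with a correlation of 0.8.
vifs = variance_inflation_factors([[1, 2, 3, 4], [1, 3, 2, 4]])
```

A VIF of 1 means a covariate is uncorrelated with the others. Which threshold counts as too high is a judgment call. Values above 5 or 10 are often quoted as rules of thumb.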

Some seem to achieve good results with bagged logistic regression with regards to having robust estimators (see this paper). Generally speaking, different modeling techniques can be used, such as SVMs, random forests or artificial neural nets.

Once we have a sufficiently good and robust model in place, we can go ahead and “score” through every customer journey:

This customer journey started off with a display ad click, followed by a click on an affiliate site. The last click right before the conversion was a paid search click. For every stage, we apply the model and calculate the probability that the customer converts. We assign the incremental change of conversion probability to the latest channel. So in the “Research & Compare” phase of this particular journey the affiliate channel generated an increase in conversion probability of 0.6 percentage points, so affiliate will be assigned 0.6% of this conversion. This approach takes into account the previous channel interactions. It dynamically calculates channel effects individually per user journey instead of using the same channel weights across all user journeys.
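The scoring step could look roughly like the sketch below. It assumes a logistic model on click counts with made-up parameters (`INTERCEPT` and `COEF` are purely illustrative, not fitted values), and it assumes that the first interaction is measured against the model's prediction for a journey with no interactions at all.

```python
import math

# Hypothetical logistic-regression parameters, chosen only for illustration.
INTERCEPT = -6.0
COEF = {"display": 0.5, "affiliate": 0.6, "paid_search": 0.9}


def conversion_probability(click_counts):
    """Predicted conversion probability for a dict of click counts per channel."""
    z = INTERCEPT + sum(COEF[channel] * n for channel, n in click_counts.items())
    return 1.0 / (1.0 + math.exp(-z))


def incremental_credits(journey):
    """Walk the journey in order and credit each step with the change in probability."""
    counts = {}
    previous = conversion_probability(counts)  # assumed baseline: no interaction yet
    credits = []
    for channel in journey:
        counts[channel] = counts.get(channel, 0) + 1
        current = conversion_probability(counts)
        credits.append((channel, current - previous))
        previous = current
    return credits


credits = incremental_credits(["display", "affiliate", "paid_search"])
```

Because the model is re-applied after every interaction, the credit for the same channel differs from journey to journey, depending on what came before it.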

In this particular case, the overall probability for a conversion event “caused” by the user journey seems rather low (2%). If necessary, the whole conversion or revenue value could be distributed proportionally to the incremental probability increases, so affiliate would receive 33.33% of the conversion for this customer journey.
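The proportional distribution can be sketched as a simple normalization. Only the affiliate increment of 0.6 percentage points comes from the example above. The display and paid search increments below are assumed values, picked so that affiliate ends up with a third of the total.

```python
def proportional_shares(credits, conversion_value=1.0):
    """Scale per-channel probability increments so they sum to the full value."""
    totals = {}
    for channel, increment in credits:
        totals[channel] = totals.get(channel, 0.0) + increment
    total = sum(totals.values())
    if total <= 0:
        # Journeys without a net positive increase need a separate rule.
        raise ValueError("increments must sum to a positive value")
    return {ch: conversion_value * inc / total for ch, inc in totals.items()}


# Increments as probabilities; display and paid_search are assumed numbers.
journey_credits = [("display", 0.004), ("affiliate", 0.006), ("paid_search", 0.008)]
shares = proportional_shares(journey_credits)
revenue_shares = proportional_shares(journey_credits, conversion_value=90.0)
```

Passing an order value as `conversion_value` distributes revenue instead of a single conversion. How to treat steps with a negative increment is left open here and would need an explicit decision.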

As a result of this process, we have the number of conversions and the sum of revenues attributed to each channel. These figures are then set against channel costs to calculate CPOs (cost per order) and ROMI (return on marketing investment), which guide the reallocation of channel budgets. In the next post, I’ll discuss how these steps could be integrated into one marketing attribution system and reporting tool. I will also go into testing optimized budget allocations and taking into account device switches and cookie deletions.
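As a small illustration of the final step, the sketch below computes CPO and ROMI per channel from attributed figures. It assumes CPO is cost divided by attributed orders and ROMI is (attributed revenue minus cost) divided by cost. Other definitions, for example based on margin rather than revenue, are also in use. All numbers are made up.

```python
def channel_kpis(attributed_orders, attributed_revenue, cost):
    """Return CPO and ROMI per channel; None where a ratio is undefined."""
    kpis = {}
    for channel, spend in cost.items():
        orders = attributed_orders.get(channel, 0.0)
        revenue = attributed_revenue.get(channel, 0.0)
        kpis[channel] = {
            "cpo": spend / orders if orders > 0 else None,
            "romi": (revenue - spend) / spend if spend > 0 else None,
        }
    return kpis


# Made-up attributed totals and spend per channel.
kpis = channel_kpis(
    attributed_orders={"affiliate": 50.0, "display": 0.0},
    attributed_revenue={"affiliate": 3000.0, "display": 0.0},
    cost={"affiliate": 1000.0, "display": 500.0},
)
```

Attributed orders can be fractional, since each conversion is split across the channels in its journey.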