Infer's predictive models are built from a number of factors, or signals, that are likely to influence future behavior or results. Infer uses two main criteria to decide whether to include a given factor in your model build.
=> Is the data leaky? Leaky data refers to fields in the training data that, when included, make the model appear more predictive than it really is. It is important to identify these signals up front, especially when evaluating multiple vendors, because they can make the model look great on the training set even though it will not perform well on live data. We look for a few things when identifying leaky signals in your model:
- Is the field editable by sales reps: An example of a field like this is purchase order information. This field is usually filled out by a rep immediately before the opportunity is closed and after payment is received, so its value is only known once the outcome is effectively decided. Therefore, even though a positive value in this field is highly predictive of a closed-won opportunity, it should not be included in your model.
- Is the data in your field specifically time sensitive: An example of this is the created date of the lead. Since we are looking at all of your historical data in the training set, leads with older create dates have a better chance of having converted simply because they have had more time to be nurtured by sales or marketing. The model would therefore weight older create dates more heavily. However, once we start scoring live leads, the net new leads that come into your system would automatically be scored lower based on their create date alone, which isn't necessarily accurate. This is a leaky signal that shouldn't be included in your model.
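The create-date effect described above can be illustrated with a minimal Python sketch. It is not Infer's implementation: the function name `conversion_rate_by_age`, the 180-day cutoff, and the toy leads are all hypothetical and chosen only to show how older leads can look better in historical data for no reason other than age.

```python
from datetime import date


def conversion_rate_by_age(leads, as_of, cutoff_days):
    """Split leads by age at `as_of` and return (older_rate, newer_rate).

    Each lead is a dict with a `created` date and a `converted` bool.
    A rate is None when its group is empty.
    """
    older, newer = [], []
    for lead in leads:
        age_days = (as_of - lead["created"]).days
        group = older if age_days >= cutoff_days else newer
        group.append(lead["converted"])

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return rate(older), rate(newer)


# Made-up leads for illustration only (not real customer data).
toy_leads = [
    {"created": date(2023, 1, 10), "converted": True},
    {"created": date(2023, 2, 5), "converted": True},
    {"created": date(2023, 3, 1), "converted": False},
    {"created": date(2023, 11, 20), "converted": False},
    {"created": date(2023, 12, 1), "converted": False},
    {"created": date(2023, 12, 15), "converted": True},
]

older_rate, newer_rate = conversion_rate_by_age(
    toy_leads, as_of=date(2024, 1, 1), cutoff_days=180
)
print(f"older leads: {older_rate:.2f}, newer leads: {newer_rate:.2f}")
```

In this toy set the older group converts at a higher rate than the newer group. A model trained with create date as a feature could learn that gap and then penalize every brand-new lead at scoring time.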
=> Is there enough coverage in this field across all of your leads? The coverage of a specific field refers to the share of leads across your entire database that have a value in that field. Even if a signal is powerful in predicting whether a lead is a good or bad fit, it won't be useful to you if it only applies to a limited number of leads in the system. We ask ourselves the following:
- Is the integrity of this data maintained in every field: If you have editable fields that are not filled in consistently, they are not likely to be great signals to include, even if they would otherwise have strong predictive power (for example, lead source).
- Are there fields that only populate on a small number of leads: An example of this would be a free trial field. The likelihood of a lead converting or closing after this action is probably high. However, if it only affects a handful of leads out of your entire database, then there may not be enough coverage to prevent the inverse (leads who didn't request a free trial) from negatively affecting the scores of leads that are still good fits.
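As a rough illustration of the coverage check, the sketch below computes the share of leads with a non-empty value in each field and lists the fields that fall under a threshold. The helper names, the field names, and the 0.5 threshold are assumptions made for this example; the article does not state a specific cutoff.

```python
def field_coverage(leads, field):
    """Return the fraction of leads with a non-empty value in `field`."""
    if not leads:
        return 0.0
    # None and the empty string both count as "not filled in".
    filled = sum(1 for lead in leads if lead.get(field) not in (None, ""))
    return filled / len(leads)


def low_coverage_fields(leads, fields, min_coverage):
    """Return the fields whose coverage is below `min_coverage`."""
    return [f for f in fields if field_coverage(leads, f) < min_coverage]


# Made-up leads for illustration only.
toy_leads = [
    {"industry": "Software", "free_trial": True},
    {"industry": "Retail", "free_trial": None},
    {"industry": "", "free_trial": None},
    {"industry": "Finance"},  # free_trial is missing entirely
]

industry_cov = field_coverage(toy_leads, "industry")
trial_cov = field_coverage(toy_leads, "free_trial")
sparse = low_coverage_fields(toy_leads, ["industry", "free_trial"], min_coverage=0.5)
print(industry_cov, trial_cov, sparse)
```

Here the hypothetical `free_trial` field is populated on only one of four leads, so it is flagged as low coverage even though a free trial request may be a strong signal for the leads that have it.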