A practical guide to assessing vendor data and deciding whether to use it to enrich and improve your models
I have served as VP of Data Science, AI and Research at two public companies over the past five years. In both roles, AI was central to the company's core product, and we partnered with data vendors to enrich our data with relevant features that improved model performance. After several failed experiences with data vendors, I wrote this post to help you save time and money when testing a new vendor.
Caveat: this process shouldn't begin until you have a very clear picture of the business metrics for your model and have invested significant time optimizing the model itself. Working with a data vendor for the first time is usually a lengthy process (weeks, often months) and can be very expensive (some data vendors I've worked with cost tens of thousands of dollars per year; others cost millions of dollars per year when operating at scale).
Because this is usually a big investment, don't start the process unless you can clearly articulate how the go/no-go decision will be made. Read that sentence again, because this is the #1 mistake I've seen. In my case, that always meant converting every decision input into dollars.
For example, your model performance metric might be the PR-AUC of a classification model that predicts fraud. Assume the new data increases PR-AUC from 0.90 to 0.92. That is a big improvement from a data science perspective, but the enrichment costs 25 cents per call. To determine whether it's worth it, you need to translate the incremental PR-AUC into profit. This stage can take time and requires a good understanding of your business model: how exactly does an increase in PR-AUC translate into increased revenue or profit for your company? This is not a trivial exercise for most data scientists.
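To make this concrete, here is a back-of-the-envelope sketch in Python. Every number in it (transaction volume, average fraud loss, the share of extra fraud the lift lets you catch, the per-call price) is a made-up placeholder, and the hard, business-specific step of converting a PR-AUC lift into "extra fraud caught" is simply assumed:

```python
# Back-of-the-envelope translation of a metric lift into dollars.
# All numbers below are made-up placeholders -- plug in your own business figures.

monthly_transactions = 2_000_000   # transactions scored per month
avg_fraud_loss = 120.0             # average loss per missed fraudulent transaction ($)
extra_fraud_caught_rate = 0.0004   # assumed share of transactions the lift lets you catch
cost_per_call = 0.25               # vendor enrichment cost per scored transaction ($)

incremental_benefit = monthly_transactions * extra_fraud_caught_rate * avg_fraud_loss
enrichment_cost = monthly_transactions * cost_per_call
roi = incremental_benefit / enrichment_cost

print(f"Benefit: ${incremental_benefit:,.0f}/mo, "
      f"cost: ${enrichment_cost:,.0f}/mo, ROI: {roi:.2f}x")
```

With these particular placeholder numbers the enrichment loses money despite the metric lift, which is exactly why the translation into dollars has to happen before the go/no-go decision.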
This post won't cover every aspect of selecting a data vendor (for example, we won't discuss negotiating a contract), but it will highlight the key aspects expected of you as a data science leader.
If you're a decision maker and your company operates at scale, there's a good chance you'll get cold emails from vendors on a regular basis. While a random vendor may have value, it's usually best to consult with industry experts and understand which data vendors are commonly used in that industry. When dealing with data, network effects and economies of scale are huge, so the biggest and most well-known vendors usually bring more value. Don't trust a vendor that offers a solution for every problem or industry. And remember, the most valuable data is usually the data that is hardest to collect and create, not the data that can easily be gathered online.
A few points to cover when starting an initial conversation:
- Who are their customers? How many large customers do they have in your industry?
- Costs (at least the order of magnitude): this alone can be a reason to abandon the deal early.
- Time travel capabilities: Does the vendor have the technical ability to go back in time and reproduce the data as it existed at a past snapshot? This is important when performing historical proofs of concept (more on this later).
- Technical constraints: Latency (pro tip: always look at p99 or other higher percentiles rather than averages), uptime SLAs, etc.
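On the latency point, here is a quick sketch of why tail percentiles matter more than averages; the latency samples below are simulated, not from any real vendor:

```python
import numpy as np

# Simulated vendor response times in milliseconds (heavy-tailed, as is typical).
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=3.5, sigma=0.6, size=10_000)

print(f"mean: {latencies_ms.mean():.0f} ms")              # looks comfortable
print(f"p99:  {np.percentile(latencies_ms, 99):.0f} ms")  # what your slowest calls actually see
```

The mean can look perfectly acceptable while the p99 blows past your SLA, so ask vendors for tail percentiles explicitly.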
Assuming the vendor checks the boxes on the key points above, you're ready to plan a proof-of-concept test. You need a benchmark model with clear evaluation metrics that can be translated into business metrics. Your model needs a training set and an out-of-time test set (and possibly one or more validation sets). Typically you'll send the relevant features from the training and test sets along with timestamps, so the vendor can merge in their data as it existed at those points in time (time travel). You can then retrain the model with the vendor's features added and evaluate the difference on the out-of-time test set.
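Below is a minimal sketch of what such a POC evaluation can look like, assuming hypothetical pandas DataFrames and column names (`record_id` as the join key, `label` as the target, numeric vendor features, and `average_precision_score` standing in for PR-AUC):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score


def enrich(df: pd.DataFrame, vendor: pd.DataFrame) -> pd.DataFrame:
    # Left-join the vendor's time-travelled features; rows the vendor can't cover
    # stay as NaN, so coverage gaps remain visible instead of silently dropping data.
    return df.merge(vendor, on="record_id", how="left")


def pr_auc(train: pd.DataFrame, test: pd.DataFrame, features: list[str]) -> float:
    # Train on the training window only, then score the out-of-time test window.
    model = HistGradientBoostingClassifier().fit(train[features], train["label"])
    scores = model.predict_proba(test[features])[:, 1]
    return average_precision_score(test["label"], scores)


# Usage (with your own out-of-time splits and feature lists):
# baseline = pr_auc(train, test, base_features)
# enriched = pr_auc(enrich(train, vendor), enrich(test, vendor),
#                   base_features + vendor_features)
```

The important part is that the comparison happens on the out-of-time test set, with the same model family and the same splits, so the only thing that changes is the presence of the vendor's features.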
Ideally, you should not share your target variables with the vendor. The vendor may ask to receive the target variables in order to "tune" their features, train a custom model, perform feature selection, or otherwise tailor the features to your use case. If you do share your target variables, make sure you share them only for the training set, never for the test set.
If the above paragraph made you feel uneasy, good. Vendors are always eager to demonstrate the value of their data, and this is especially true for smaller vendors (where every deal can make a big difference).
One of the worst experiences I had with a vendor was a few years ago. A new data vendor had just raised a Series A, generated a lot of hype, and promised highly relevant data for one of our models. It was a new product that lacked relevant data, and we thought this would be a good way to get it off the ground, so we went ahead and started a POC. During the POC, the vendor's data improved our AUC on the training set from 0.65 to 0.85. On the test set, it completely failed; the model was wildly overfitting the training set. After we discussed this with the vendor, they asked for the target variable of the test set to analyze the situation. The vendor put a senior data scientist on the job and asked for a second iteration. We waited a few weeks for new data to be collected (to use as a new, unseen test set). Again, the vendor dramatically improved the AUC on the new training set, and again it failed on the test set. Needless to say, we didn't go ahead.
- Set a high ROI threshold:
Start by calculating the ROI: estimate the incremental net benefit generated by the enriched model and compare it to the enrichment costs. Most projects require a solidly positive return, and here you should set a higher threshold than usual, because there is plenty of room for issues that erode the return (data drift, a staged rollout, limited usage in some segments, etc.). At times I have required a financial return of at least 5x the enrichment cost as a minimum threshold for moving forward with a vendor, as a buffer against data drift, potential overfitting, and uncertainty in the ROI point estimate.
- Partial enrichment:
The ROI of the overall model may not be sufficient, but some segments may show a much higher lift than others. In that case it may be best to split the model in two and enrich only those segments. For example, say you are running a classification model to identify fraudulent payments; the new data may yield a high ROI in Europe but not in other regions.
- Gradual enrichment: If you have a classification model, consider splitting the decision into two phases.
- Phase 1 – Run the existing model
- Enrich only the observations that are close to the decision threshold (or above it, depending on your use case); all observations far from the threshold are decided in Phase 1.
- Phase 2 – Run a second model to refine the decision
This approach is very useful, especially when dealing with imbalanced data: it reduces costs by enriching only a small subset of observations while capturing most of the lift. It is less useful if the second model can drastically change the decision. For example, if an order that looks very safe can later be flagged as fraudulent by the enriched data, you will need to enrich most (if not all) of the data to capture that lift. Enriching in two phases can also roughly double your latency, since you are running two similar models in sequence, so carefully consider how to optimize the tradeoff between latency, cost, and performance lift. The sketch below illustrates the idea.
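Here is a minimal sketch of a two-phase decision flow; the function names, thresholds, and the vendor `enrich_record` call are hypothetical placeholders:

```python
from typing import Callable

# Made-up decision band around the fraud threshold: below LOW is clearly safe,
# above HIGH is clearly risky, in between is borderline and worth enriching.
LOW, HIGH = 0.05, 0.30
ENRICHED_THRESHOLD = 0.20  # made-up threshold for the second, enriched model


def two_phase_decision(record: dict,
                       base_model: Callable[[dict], float],
                       enriched_model: Callable[[dict], float],
                       enrich_record: Callable[[dict], dict]) -> bool:
    """Return True if the record is flagged as fraud."""
    # Phase 1: score every record with the existing (cheap) model.
    score = base_model(record)
    if score < LOW:
        return False   # clearly safe: decide without paying for enrichment
    if score > HIGH:
        return True    # clearly risky: decide without paying for enrichment

    # Phase 2: only borderline records pay the vendor cost (and latency) and get re-scored.
    enriched = enrich_record(record)  # vendor API call
    return enriched_model(enriched) > ENRICHED_THRESHOLD
```

The width of the borderline band is where the latency/cost/lift tradeoff lives: a wider band captures more of the lift but enriches more records and adds a second model call to more decisions.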
Working effectively with a data vendor can be a long and tedious process, but it can significantly improve your model's performance. I hope this guide helps you save time and money. Happy modeling!