You want optimal customer segments or product recommendation lists to turn your customer data into maximum value. However, these problems are so hard that no algorithm can guarantee the optimal solution without checking every possible solution. Because you cannot check them all – it would take years – most likely you iterate a stochastic algorithm a few times and pick the best solution among them. A cloud of uncertainty hangs over you: Are the segments good enough? Are they off by far? How much value is at stake? There ought to be a way to find out.
We understand your frustration because we deal with these questions every day at Infobalen and in our academic research. We always wanted to resolve how many times we need to run our clustering algorithms but focused on algorithm development and applications instead. Finally, after repeatedly receiving questions about this problem from other researchers, we decided to give it a go.
In our recent preprint Exploring the solution landscape enables more reliable network community detection, we introduce a procedure to determine the minimum number of searches required for a good result given a user-specified resolution length that sets the accepted accuracy.
The procedure can operate on any distance measure between solutions with a quality score and therefore applies to a broad class of problems. While the specific problem at hand may suggest a particular type of distance measure, a simple and interpretable one is a good start. In the article, we recommend a weighted version of the Jaccard distance. With customers clustered in segments, the distance between two solutions is the weighted average fraction of customers that best-matching segments do not have in common. Accordingly, we can translate an accepted accuracy of at most one percent non-matching customers to a resolution length of 0.01 between similar solutions.
The procedure holds out some solutions, clusters the remaining ones into clusters with similar solutions with a fast and deterministic algorithm, and then tests how many of the holdout solutions fit in the solution clusters.
We continue searching for new solutions until, say, 95 percent of the holdout solutions fit in the solution clusters. In each solution cluster, the solution with the highest quality forms the center, and the user-specified resolution sets its boundary: only solutions within the resolution length from the center belongs to the cluster. In this way, you can require more accurate solutions by using a smaller resolution length.
The fast and deterministic solution-clustering algorithm does not require a prespecified number of clusters and evades the ambiguities that come with multiple solutions. It consists of the following three steps:
Order all solution from highest to lowest quality. Let the highest quality solution form solution-cluster center 1. Repeat for not yet clustered solutions: Pick the one with the highest quality and assign it to the first of the m solution-cluster centers that it is closer to than the resolution length. If no such solution-cluster center exists, let it form solution-cluster center m + 1. For example, in the schematic solution landscape in the figure below, the solution-clustering algorithm first lets the highest quality solution marked with a big filled square form the center of solution cluster 1. In order of decreasing solution quality, it then assigns the remaining solutions marked with filled squares to the same cluster before it lets the one marked with a big filled circle, which does not fit in the same cluster with the given resolution length, form the center of solution cluster 2. Finally, it assigns the remaining solutions marked with filled circles to that cluster.
A schematic solution landscape projected into a two-dimensional space with isolines for quality score. Black circles and white squares represent two solution clusters, with the solutions distributed based on their distances. Open symbols represent holdout solutions. Because all holdout solutions marked with open symbols in the figure fit in these clusters, the solutions pass the test for the given resolution length. Therefore, we can pick the best solution among them and be confident in a statistical sense that the algorithm will not find a better solution that is significantly different from the best one identified so far.
With this procedure, you can adapt the number of searches to the problem at hand so that you do not waste time with too many searches when dealing with easier problems or miss high-quality solutions after too few searches when dealing with harder problems.
While saving time can be worth a lot, being confident that you are getting the most out of your optimization algorithm can be worth even more.
Add your comments to the LinkedIn post.