What Is Merchant Normalization?

Shoan Jain
Senior Data Scientist at Coupa's AI Center of Excellence

Shoan Jain is a Senior Data Scientist at Coupa's AI Center of Excellence. She joined Coupa through its acquisition of Deep Relevance, a startup that used AI to monitor for occupational fraud. She holds a master's degree in International and Development Economics from Yale University and a bachelor's degree in Economics from the Indian Institute of Technology, Kanpur.

Read time: 4 mins

Take a look at some sample entries for merchants in our expenses database:

  • "uber trip tnla7 help.u"
  • "uber us jan30 abtod"
  • "uber technologies"
  • "uber trip"
  • "uber"
  • "uber eats"

As human beings, it's easy for us to say that all of these merchants but the last are the same: Uber, the rideshare app. The last one, on the other hand, is Uber Eats, a food delivery app. Even if we could manually update the database to label these merchants "Uber" and "Uber Eats" respectively, that approach won't scale. In Coupa, we have over 4 million distinct merchant names (and growing) across tens of millions of expense lines.

Machine learning (ML) algorithms and our vast community data allow us to normalize these names at scale.

Now, you might be wondering why we want to normalize these merchant names. First, there is a lot of value in providing visibility into your spend with different merchants. That is, it is good to understand how much you are spending with Uber versus other merchants. Second, after we consolidate the spend by merchants, we can see which merchants might be better managed with pre-negotiated, contracted, and discounted prices.

Although the problem seems obvious and easy solutions might suggest themselves, it is quite difficult along three dimensions:

  1. By definition, we don't have much information on merchants. In Coupa (and in most procurement terminology), merchants are the vendors listed on expense reports. The first time the system sees anything about a merchant is when a user submits an expense report. In contrast, a supplier is a vendor that we've loaded into the system and that issues invoices to us. As such, suppliers get verified and carry information such as location, parent company, tax ID, and so on.
  2. The data entered for merchants is unstructured and messy. Since this data comes from expense reports, it enters the Coupa database through a variety of channels: credit card feeds, parsed email itineraries, receipt extraction, or manual entry by the user. This essentially means there is no standardized input.
  3. The scope and scale of this problem pose difficulties for even the best big-data platforms. There are tens of millions of line items to analyze, and the sheer number of merchants is large. Some merchants have over 500,000 expense lines, while over 2 million merchant names appear only once in the entire community (our challenge is to figure out when two or more of these rare merchants are actually the same).

We decided to tackle this big problem in multiple steps, starting with thorough data cleaning. In addition to standard text-cleaning procedures such as removing special characters and stop words, we also applied cleaning specific to this use case: removing transaction IDs (these occur frequently when merchants come in from credit card feeds) and replacing accented characters with their Latin-alphabet equivalents. This step is crucial, as the merchant data is highly unstructured and the merchant names are our only input. After cleaning, the number of distinct merchant names drops by almost a third.
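As a rough illustration, here is a minimal Python sketch of this kind of cleaning. The stop-word list and the transaction-ID heuristics below are assumptions made for the example, not Coupa's actual rules:

```python
import re
import unicodedata

# Illustrative stop words for merchant strings; the real list is an assumption.
STOP_WORDS = {"inc", "llc", "corp", "co", "ltd", "the"}

def clean_merchant_name(raw: str) -> str:
    """Normalize a raw merchant string from an expense line."""
    # Replace accented characters with their closest Latin-alphabet equivalents.
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Strip special characters, keeping only letters, digits, and spaces.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    tokens = []
    for tok in text.split():
        if tok in STOP_WORDS:
            continue
        # Heuristic: drop tokens that look like transaction IDs, i.e. long
        # digit runs or letter-digit mixes such as "tnla7" or "jan30".
        if re.fullmatch(r"\d{4,}", tok):
            continue
        if re.search(r"\d", tok) and re.search(r"[a-z]", tok):
            continue
        tokens.append(tok)
    return " ".join(tokens)

# The Uber strings from the start of the post move closer to a common form:
for raw in ["uber trip tnla7 help.u", "uber us jan30 abtod", "uber technologies"]:
    print(clean_merchant_name(raw))
# -> "uber trip help u", "uber us abtod", "uber technologies"
```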

To get a sense of the difficulty in automating the normalization of merchant names, we first tried standard clustering techniques to find merchants that were actually the same. However, this provided poor results.

So we moved on to more rigorous natural language processing (NLP) solutions involving fuzzy matching to determine the similarity between merchant name strings. Here, the size of the dataset and the length of the merchant names proved to be a hurdle: without additional processing, the fuzzy matching would have taken years to run. To reduce the number of comparisons, we came up with a combination of rules and an ML algorithm. This cut the run time to under 10 minutes, with good accuracy.
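The post doesn't detail those rules or the ML model, but blocking (only comparing names that share some cheap key) is a common way to prune pairwise comparisons. Here is a sketch assuming a simple first-token blocking rule and the standard library's SequenceMatcher as the fuzzy score; both are illustrative stand-ins:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def candidate_pairs(names):
    """Blocking: only compare names that share a first token, instead of
    scoring all O(n^2) pairs across millions of names."""
    blocks = defaultdict(list)
    for name in names:
        if name:
            blocks[name.split()[0]].append(name)
    for block in blocks.values():
        yield from combinations(block, 2)

names = ["uber trip", "uber technologies", "uber", "hilton hotels", "hilton"]
matches = [(a, b) for a, b in candidate_pairs(names) if similarity(a, b) > 0.6]
print(matches)  # e.g. [('uber trip', 'uber'), ('hilton hotels', 'hilton')]
```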

To improve accuracy, we layered a graph analysis on top, using the similarity between merchant names to build a graph in which similar names are connected by edges. Each connected component of this graph is then clustered together as a single merchant.
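A minimal sketch of that step, using networkx (an assumed library choice; the post doesn't name one), with the edges from the similarity step hand-coded for brevity:

```python
import networkx as nx

# Nodes are cleaned merchant names; edges connect pairs that the
# fuzzy-matching step judged similar.
graph = nx.Graph()
graph.add_edges_from([
    ("uber trip", "uber"),
    ("uber technologies", "uber"),
    ("hilton hotels", "hilton"),
])

# Each connected component becomes one merchant cluster. Note the effect of
# transitivity: "uber trip" and "uber technologies" end up in the same
# cluster even though they were never directly matched to each other.
for component in nx.connected_components(graph):
    print(sorted(component))
# Prints (in some order):
#   ['uber', 'uber technologies', 'uber trip']
#   ['hilton', 'hilton hotels']
```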

Finally, with our clusters of matching names, we need to find a single name to represent all the members. Again, while it seems straightforward for a human being to label "hilton hotels," "hilton," "hilton inn and suites," "hilton garden inn charles," and so on, as "Hilton Hotels," it is not so obvious for an algorithm. Here we let the data tell us which name users use most often for the merchant.
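A sketch of that last step, assuming we can count how often each raw name in a cluster appears on expense lines (the counts below are made up for illustration):

```python
from collections import Counter

# Hypothetical expense-line counts for one cluster; real counts would come
# from the community data.
cluster_counts = Counter({
    "hilton": 12400,
    "hilton hotels": 9800,
    "hilton inn and suites": 310,
    "hilton garden inn charles": 12,
})

# The cluster's display name is simply the variant users enter most often.
canonical_name, _ = cluster_counts.most_common(1)[0]
print(canonical_name)  # -> "hilton"
```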

This approach uses a beautiful combination of our community data (having access to a broad range of merchant names), instance-specific data (making use of all the data available to us), and our understanding of this specific problem.

Now that we have this foundation in place, we can normalize the merchants on all new expense lines on a daily basis, including those seen for the first time in the community, across all regions. And we continue to enhance and expand this work as we grow.