Finding Duplicate Invoices In-Flight with AI

Kiran Ratnapu
Director, Data Science, Coupa Software

As Director of Data Science, Kiran Ratnapu leads AI initiatives at Coupa’s AI COE, where his team explores and builds AI products using the vast community data. He founded Deep Relevance, a startup that used AI to monitor occupational fraud and was later acquired by Coupa. Kiran has over 20 years of experience applying AI in multiple domains, including Computational Genomics, Online Advertising, Finance, and Business Spend Management. He holds a Bachelor's degree in Mechanical Engineering from the Indian Institute of Technology, Bombay and an MBA from the MIT Sloan School of Management.

Applications of artificial intelligence (AI) permeate the Coupa platform. As with everything else in Coupa, AI has been thoughtfully applied to areas where it adds real value. One such area is financial fraud.

Detecting financial fraud can be challenging, costly, and time-consuming for organizations. However, with Coupa’s robust AI-powered fraud detection solution, Spend Guard, we are able to help customers catch fraud and errors in-flight, before invoices are even paid. Within Spend Guard, one of the many checks that our customers have found valuable is the detection of duplicate invoices.

Duplicate invoices occur when multiple invoices with slightly different attributes (invoice numbers, dates, and sometimes amounts) are submitted for the same goods or services. When they go undetected, multiple payments are processed, contributing to significant spend leakage. It is estimated that companies lose 0.5% of invoice spend to this problem. In the Coupa community alone, we have been detecting an average of $1.7 million in duplicate invoice spend per customer in their first year. At one large CPG customer, we found a duplicate invoice of $1.5 million within a few weeks of going live with Spend Guard, and this invoice had already been approved through their previous processes.

Detecting duplicate invoices is not as easy as it sounds. If you use traditional approaches, you will miss a lot of duplicates.

Traditional techniques rely on an exact match on a combination of Supplier + Invoice ID + Invoice Date + Amount. This check is easy to implement, but most real-world duplicate invoices go undetected with this logic, because duplicates can have different values in any of those fields. The differences arise primarily from errors made when invoice data is entered into the system from invoice images.
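
To make the limitation concrete, here is a minimal sketch of the exact-match check in Python. The Invoice fields and the sample values are hypothetical, chosen only for illustration, and are not Coupa's schema or data.

```python
from collections import defaultdict
from typing import NamedTuple


class Invoice(NamedTuple):
    supplier: str
    invoice_id: str
    invoice_date: str  # ISO date string, e.g. "2021-03-01"
    amount: float


def exact_duplicates(invoices):
    """Group invoices whose four key fields match exactly."""
    groups = defaultdict(list)
    for inv in invoices:
        key = (inv.supplier, inv.invoice_id, inv.invoice_date, inv.amount)
        groups[key].append(inv)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample data.
invoices = [
    Invoice("Boston Mutual Insurance", "INV2345", "2021-03-01", 1200.00),
    Invoice("Boston Mutual Insurance", "INV2345", "2021-03-01", 1200.00),   # verbatim copy
    Invoice("Boston Mutual Life 5569", "INV2345a", "2021-03-02", 1200.00),  # keyed-in variant
]

found = exact_duplicates(invoices)
```

Only the verbatim copy is grouped. The third record, which differs slightly in supplier name, invoice ID, and date, slips through.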

We take a different approach. We use a combination of more advanced logic and AI to help us find duplicate invoices. Our approach to this, broadly speaking, involves the following:

  1. First, we use AI to predict errors or duplicates on different invoice record fields such as supplier, invoice ID, and invoice description.
  2. Second, we use these predictions to cast a wide net and arrive at a large number of approximate duplicate candidates. 
  3. Third, we whittle down these probable candidates through a series of AI techniques. This includes detecting false positive scenarios like recurring invoices or installment payments. It also includes detecting similarities and differences in hard-to-compare fields, like invoice descriptions.
  4. Fourth, to keep innovating, we are also exploring using AI to extract fields directly from the invoice image and compare, to improve accuracy.
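
The wide-net and whittle-down steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Coupa's implementation: the standard-library difflib ratio stands in for the AI predictions, and the field names, thresholds, and sample records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(invoices, loose=0.5):
    """Step 2: cast a wide net with deliberately loose thresholds."""
    return [
        (a, b)
        for a, b in combinations(invoices, 2)
        if similarity(a["supplier"], b["supplier"]) >= loose
        and similarity(a["invoice_id"], b["invoice_id"]) >= loose
    ]


def refine(pairs, strict=0.8):
    """Step 3: keep only the pairs that survive a stricter check."""
    kept = []
    for a, b in pairs:
        same_amount = a["amount"] == b["amount"]
        if same_amount and similarity(a["invoice_id"], b["invoice_id"]) >= strict:
            kept.append((a["invoice_id"], b["invoice_id"]))
    return kept


# Hypothetical sample records.
invoices = [
    {"supplier": "Boston Mutual Insurance", "invoice_id": "INV2345", "amount": 1200.0},
    {"supplier": "Boston Mutual Life 5569", "invoice_id": "INV2345a", "amount": 1200.0},
    {"supplier": "Boston Mutual Insurance", "invoice_id": "INV2399", "amount": 310.0},
    {"supplier": "Acme Corp", "invoice_id": "9981", "amount": 1200.0},
]

pairs = candidate_pairs(invoices)
likely_duplicates = refine(pairs)
```

The loose pass keeps several pairs on purpose, and the stricter pass then drops those whose amounts or IDs diverge too much. Comparing every pair is fine for a sketch, but at scale some form of pre-grouping would be needed.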

We will elaborate a little more on a few of the above.

3 Examples of duplicate field detection

1. Suppliers

Consider two seemingly different suppliers, 'Boston Mutual Life 5569' vs. 'Boston Mutual Insurance,' that are indeed one supplier. Going beyond simple name match rules, we employ multiple signals like name, address, tax ID, and DUNS number. Approximate string matching techniques are employed for text signals like supplier name and address. Graph clustering algorithms are then used to robustly determine the duplicate suppliers with a level of confidence not possible with rule-based methods.
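
As a rough sketch of the idea, the snippet below links supplier records when a strong identifier agrees or the names are approximately equal, then treats the links as a graph and takes its connected components. The article does not name the specific matching or clustering algorithms, so difflib and union-find are stand-ins, and the records, the tax IDs, and the 0.6 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_supplier(a, b, name_threshold=0.6):
    """Link two records if the tax ID agrees or the names are close."""
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return True
    return name_similarity(a["name"], b["name"]) >= name_threshold


def cluster_suppliers(records):
    """Connected components of the 'same supplier' graph, via union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_supplier(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec["name"])
    return list(clusters.values())


# Hypothetical supplier records.
records = [
    {"name": "Boston Mutual Life 5569", "tax_id": "TAX-001"},
    {"name": "Boston Mutual Insurance", "tax_id": None},
    {"name": "BML Holdings", "tax_id": "TAX-001"},
    {"name": "Acme Office Supply", "tax_id": "TAX-002"},
]

clusters = cluster_suppliers(records)
```

Here 'Boston Mutual Insurance' and 'BML Holdings' are not directly similar, yet they land in one cluster because each links to 'Boston Mutual Life 5569'. That transitive grouping is what the graph step adds over pairwise rules.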

2. Invoice IDs

Examples of some commonly occurring duplicate invoice ID patterns are: ‘INV2345’ vs. ‘INV2345a’ vs. ‘INVb2345.’ If only the prefixes and suffixes differed, a rule-based system could detect similar invoice IDs. However, when the difference appears in the middle of the text, the number of possible combinations quickly becomes too large to handle with a rule-based system. Instead, we use AI-based (Lexical Sequence Similarity) techniques to detect these variations.
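
The article does not spell out the Lexical Sequence Similarity technique, so the sketch below uses a classic edit distance (Levenshtein) as a simple stand-in. It shows why a sequence-based measure handles a stray character anywhere in the ID, not just at the ends. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def id_similarity(a, b):
    """Case-insensitive similarity in [0, 1] based on edit distance."""
    a, b = a.upper(), b.upper()
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest


suffix_variant = id_similarity("INV2345", "INV2345a")
middle_variant = id_similarity("INV2345", "INVb2345")
unrelated = id_similarity("INV2345", "INV9871")
```

Both variants are one edit away from 'INV2345' and score the same, regardless of where the extra character sits, while the unrelated ID scores much lower.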

3. Invoice descriptions

Invoice description differences are not as common but are more involved. They can differ not just in the characters but in the meaning as well.

Let us consider these two invoice descriptions and try to match using different approaches: ‘Ink HY Cyan LC3029CS Brother’ vs. ‘Embellished connection-printer ink.’

  1. If we did an exact match, it would result in 0% (no match).
  2. A match based on common words would be around 25%.
  3. A popular AI Natural Language Processing (NLP) technique called ‘word2vec’ gives us a 45% match.
    1. This method matches the meanings of text, called semantic understanding, instead of characters or words. It is trained on a language corpus containing primarily Wikipedia, news articles, etc. and aims to understand the commonly spoken English language. But the language in the procurement world is anything but commonly spoken English!
  4. So, we turned to our community data and observed that we had access to a huge corpus of procurement language (>300 million documents and counting). This led us to build our own custom semantic AI models. For the same two descriptions, these custom semantic models yield a 77% match!
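
The first three approaches can be sketched as follows. The article does not say which word-overlap measure or which way of combining word vectors was used, so the choices here (overlap relative to the smaller word set, and cosine similarity of averaged word vectors) are assumptions. The toy embedding table contains made-up values that only make the sketch runnable. It is not the output of word2vec or of any Coupa model, and it will not reproduce the 45% or 77% figures.

```python
import math
import re


def tokens(text):
    """Lowercase word tokens; hyphens and punctuation act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def exact_match(a, b):
    return 1.0 if a == b else 0.0


def word_overlap(a, b):
    """Shared words divided by the size of the smaller word set."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def mean_vector(text, embeddings):
    """Average the vectors of the words that have an embedding."""
    vecs = [embeddings[t] for t in sorted(tokens(text)) if t in embeddings]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def semantic_similarity(a, b, embeddings):
    """Cosine similarity between the two averaged word vectors."""
    u, v = mean_vector(a, embeddings), mean_vector(b, embeddings)
    if u is None or v is None:
        return 0.0
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


a = "Ink HY Cyan LC3029CS Brother"
b = "Embellished connection-printer ink."

# Toy 2-D vectors with made-up values (assumption, not real model output).
toy_embeddings = {
    "ink": [0.9, 0.1],
    "cyan": [0.8, 0.2],
    "printer": [0.7, 0.3],
    "brother": [0.6, 0.4],
}

exact = exact_match(a, b)
overlap = word_overlap(a, b)
semantic = semantic_similarity(a, b, toy_embeddings)
```

With this tokenization the two descriptions share only the word 'ink', so the word-based score stays low. A semantic score depends entirely on the quality of the embeddings, which is why training them on procurement language matters.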

This shows the enormous value of combining the right business knowledge, the right community data, and AI.

How we reduce false positives

In any AI-based prediction model, you will have false positives (we think it is a duplicate invoice, but it actually isn’t) and false negatives (there is a duplicate invoice, but we don’t find it). In these models, we have to make a trade-off between false positives and false negatives.

For duplicate invoices, we prefer to err on the side of having more false positives. We think it is better for people to review a longer list that includes some invoices that aren’t duplicates than to miss true duplicates. The payoff can be high enough (like the $1.5 million invoice we mentioned above) to make it worthwhile to wade through some that aren’t problematic.
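
The trade-off can be illustrated with a small sketch that counts outcomes at two decision thresholds. The scores and labels below are invented for illustration and do not come from Coupa's models or data.

```python
def confusion_counts(scored_pairs, threshold):
    """Count true positives, false positives, and false negatives.

    scored_pairs: iterable of (score, is_true_duplicate) tuples.
    """
    tp = fp = fn = 0
    for score, is_dup in scored_pairs:
        flagged = score >= threshold
        if flagged and is_dup:
            tp += 1
        elif flagged:
            fp += 1
        elif is_dup:
            fn += 1
    return tp, fp, fn


# Hypothetical model scores with ground-truth labels.
scored_pairs = [
    (0.95, True), (0.85, True), (0.80, False), (0.70, True),
    (0.65, False), (0.60, False), (0.40, False),
]

strict = confusion_counts(scored_pairs, threshold=0.9)
lenient = confusion_counts(scored_pairs, threshold=0.6)
```

On this toy data, the strict threshold flags nothing wrongly but misses two true duplicates, while the lenient threshold catches all three at the cost of three extra invoices to review.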

However, we still put effort into reducing the false positives. To do this, we again used our large community database of invoice descriptions, along with feedback from users (learning what is and isn’t a duplicate), to help tune our AI algorithms. This helps, for example, to narrow our initial candidate duplicates as well as to identify recurring or installment invoices.
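
As one simplified illustration of the recurring-invoice case, the heuristic below treats a set of same-supplier, same-amount invoices as recurring when their dates follow a regular cadence. It is a rule-based stand-in for the tuned AI techniques described above, and the function name and its parameters are assumptions.

```python
from datetime import date


def looks_recurring(dates, min_count=3, min_gap_days=7, tolerance_days=3):
    """True if the dates form a regular cadence, such as monthly billing."""
    if len(dates) < min_count:
        return False
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    # Gaps must be reasonably long and nearly equal to one another.
    return min(gaps) >= min_gap_days and max(gaps) - min(gaps) <= tolerance_days


monthly = [date(2021, 1, 1), date(2021, 2, 1), date(2021, 3, 1), date(2021, 4, 1)]
back_to_back = [date(2021, 3, 1), date(2021, 3, 2), date(2021, 3, 3)]

monthly_is_recurring = looks_recurring(monthly)
back_to_back_is_recurring = looks_recurring(back_to_back)
```

Invoices billed on the first of each month pass the check and could be excluded from the duplicate list, while invoices entered on consecutive days do not and stay flagged for review.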

Continuing innovation with invoice image comparisons

Almost all duplicate issues stem from an information mismatch between the invoice image and the invoice record. Our approach so far has focused on invoice records. One new line of work tackles the tougher problem of comparing invoice image content, which would address the problem at the source. Here, our advanced, patent-pending Deep Learning AI for invoice extraction comes to our rescue. These AI models extract the contents with a high degree of accuracy, enabling a reliable comparison.

Hopefully, this peek inside one of our AI applications gives you a good feel for how the combination of our expertise in Business Spend Management, our community data, and the power of AI can bring real value.
