September 10, 2024 • 6 minute read

What We Learnt by Grinding Grocery Bills with LLMs

A fun experiment turned into a surprisingly valuable lesson for us. As part of building the hybrid workflow demo, we were working on understanding loan application documents—particularly images of them.

My conversations with Aravind and Vibhor revolved around three things: document extraction, accuracy, and the user experience.

Vibhor has written about the importance of good human interaction in one of his previous posts.

The Grocery Bill Experiment

One day, Aravind decided to toss his latest grocery bill (41 rows and 5 columns of data, or 205 data points) at a few large language models (LLMs). The idea was simple: can these LLMs accurately extract and understand this data?

[Image: the grocery bill]

Prompt: “Convert this itemized bill to a csv file with 5 columns: item, MRP, QTY, Price, Amount”

He tested the bill with three different LLMs.

Interestingly, repeated runs with the same LLM and the same prompt gave different results (e.g., ChatGPT gave 78% accuracy once and 85% the next time).
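Accuracy here means the fraction of cells in the extracted table that match the actual bill. A minimal sketch of that check might look like the following (the function name, file layout, and strict exact-match comparison are our assumptions, not the exact script Aravind used):

```python
import csv

def cell_accuracy(extracted_csv: str, verified_csv: str) -> float:
    """Fraction of cells in the LLM-extracted CSV that match the manually verified bill."""
    with open(extracted_csv, newline="") as f:
        extracted = list(csv.reader(f))
    with open(verified_csv, newline="") as f:
        verified = list(csv.reader(f))

    total = sum(len(row) for row in verified)  # 41 rows x 5 columns = 205 cells
    correct = 0
    for ext_row, true_row in zip(extracted, verified):
        for ext_cell, true_cell in zip(ext_row, true_row):
            if ext_cell.strip() == true_cell.strip():
                correct += 1
    return correct / total
```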

Are all LLM mistakes hallucinations?

When Aravind was testing the LLM outputs, he noticed that not all errors were created equal.

Throwing all these errors into the "hallucination" bucket doesn’t really help. Each type of mistake has its own quirks, and fixing them may require different approaches.

Why does this matter? If we treat every error the same, we’ll miss opportunities to improve the accuracy of LLMs. Understanding the nature of each mistake lets us apply more specific corrections and ultimately get better results.

Combining Results for Better Accuracy

Aravind didn't stop there. He applied a "better of three" approach: run the extraction through three LLMs and, for each value, keep the answer that most of them agree on.

This approach boosted accuracy to 99.5%: only one mistake across the 205 data points. This method might not work for every document type, though; different scenarios and document types call for testing other LLMs and combining their results with different algorithms.

This is an example of a vote-based ensemble.

Getting Technical – What Is an Ensemble?

Ensembling isn’t a new trick. It’s simply a method where multiple models—or even different versions of the same model—work together to improve results. By combining outputs from various models, we can boost accuracy and reliability, instead of relying on a single model’s prediction.

So, how does this work with LLMs? You can build an ensemble in a few ways: send the same prompt to several different models and vote on the answers, as Aravind did, or run the same model several times and combine its outputs.

Ensembling can make LLMs perform better by reducing errors or biases. But it also comes with a cost: running more models means more compute and more resources.
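To make the "better of three" idea concrete, here is a minimal sketch of a cell-level majority vote across several extracted tables (the function and the assumption that all tables share the same shape are ours; a real combination step may need to handle missing or extra rows):

```python
from collections import Counter

def majority_vote(tables: list[list[list[str]]]) -> list[list[tuple[str, float]]]:
    """For each cell, keep the value most of the extractions agree on,
    along with the fraction of extractions that agreed."""
    rows, cols = len(tables[0]), len(tables[0][0])
    combined = []
    for r in range(rows):
        combined_row = []
        for c in range(cols):
            votes = Counter(table[r][c].strip() for table in tables)
            value, count = votes.most_common(1)[0]
            combined_row.append((value, count / len(tables)))
        combined.append(combined_row)
    return combined

# e.g. best = majority_vote([chatgpt_table, model_b_table, model_c_table])
```

The agreement fraction that falls out of the vote doubles as a rough confidence score, which is exactly what the color coding in the next section builds on.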

Presentation Matters: Visualizing Confidence

Another key part of Aravind's experiment was figuring out how to present the extracted data to users. He played around with the idea of using color codes to indicate the confidence level of each result.

This simple design would allow users to easily identify which values they might need to double-check. A clean UX is essential in making sure that users feel confident in the data they’re interacting with.

[Image: LLM output for the grocery bill, with color-coded confidence]
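As one possible sketch, the agreement score from the vote could be bucketed into a traffic-light scheme (the thresholds and colors below are illustrative, not the ones from the demo):

```python
def confidence_color(agreement: float) -> str:
    """Map a cell's agreement score (0.0 to 1.0) to a display color."""
    if agreement >= 0.9:
        return "green"   # all (or nearly all) runs agreed: safe to trust
    if agreement >= 0.6:
        return "amber"   # partial agreement: worth a quick glance
    return "red"         # the runs disagreed: the user should verify this value
```

Each cell in the rendered table would then be tinted with confidence_color(agreement), so low-confidence values stand out at a glance.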

A More Complex Scenario

Now imagine a more complicated scenario—verifying income across bank statements, pay stubs, and tax returns. LLMs could extract and summarize this data, producing something like this:

“Mike Powell earned an average of $15,000 per month from May to July 2024, including a one-time bonus of $6,000 and rental income of $650 per month. Mike’s total income was $120,000 in 2022 and $126,000 in 2023. He has been with his current employer since April 2023, where his net pay increased by $1,900 per month.”

This was all deduced from pay stubs, 1099 forms, and bank statements. The challenge then becomes presenting these complex deductions in a way that feels intuitive to the user, especially when there are different confidence levels for each piece of information.

In the case of the grocery bill, a color-coded confidence scale was a simple but effective way to show users what to trust and what to verify. What would a good UX look like for presenting the statement above, with different confidence levels attached to different extractions and inferences?
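We don't have a definitive answer yet, but one possible starting point is to keep every extracted or inferred fact as a structured record with its own confidence and sources, so the UI can decorate the prose summary field by field (the field names and confidence values below are purely hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ExtractedFact:
    """One value pulled or inferred from the applicant's documents."""
    label: str           # e.g. "average monthly income (May-Jul 2024)"
    value: str           # e.g. "$15,000"
    confidence: float    # 0.0 to 1.0, e.g. the ensemble agreement score
    sources: list[str]   # documents that support the value

facts = [
    ExtractedFact("average monthly income (May-Jul 2024)", "$15,000", 0.92,
                  ["pay stubs", "bank statements"]),
    ExtractedFact("one-time bonus", "$6,000", 0.70, ["bank statements"]),
    ExtractedFact("rental income per month", "$650", 0.65, ["bank statements"]),
]
```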

The (U)X Factor

The key takeaway here is that while LLMs are great at extracting and interpreting data, the success of any AI tool depends on how easily users can interact with the results.

We think that designing a good UX, specific to the hybrid workflow use case, will be one of the most important factors for GenAI adoption in business workflows.

Write to us if you have interesting use cases for hybrid workflows and would like to discuss them with us.