Turning the tables: A benchmark for LLMs in data analysis

A new benchmark designed by U-M researchers reveals both the gaps and the potential of AI models in understanding tabular data.

Researchers in Computer Science and Engineering (CSE) at the University of Michigan have introduced a groundbreaking benchmark, called MMTU, designed to evaluate large language models (LLMs) on a wide range of table-related tasks. The initiative, led by H. V. Jagadish, the Edgar F. Codd Distinguished University Professor and Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science, and PhD student Junjie Xing, in collaboration with Yeye He and others at Microsoft Research, addresses a significant gap in how AI models that work with tabular data are evaluated.

Increasingly, LLMs are being used to assist in data processing and analysis tasks. As organizations, researchers, and everyday users collect and rely on more tabular data, from spreadsheets to complex databases, the potential of LLMs to automate, explain, and generate insights from this data continues to grow. However, their proficiency in handling data in table form is limited, and their full capabilities with complex table tasks remain largely unexplored. 

Current evaluations of LLM performance predominantly focus on limited table tasks, such as translating questions into database queries, answering questions directly from tables, or checking if a statement about a table is true. While such tasks are important, they barely scratch the surface of the many varied scenarios in which tables are used, with many crucial operations going untested in these narrow evaluations. The resulting lack of understanding of how—and how well—LLMs read, parse, and analyze tabular data places serious limitations on their utility in data processing and analysis applications.
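
One of these standard tasks, translating natural-language questions into database queries, can be illustrated with a minimal, self-contained Python sketch. The table, question, and query below are hypothetical examples of the task type, not items from MMTU or any existing benchmark.

    import sqlite3

    # A hypothetical table and question illustrating the text-to-SQL task:
    # the model is given the schema and the question, and must produce the query.
    question = "Which city had the highest sales in 2024?"

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, year INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Detroit", 2024, 1200.0), ("Ann Arbor", 2024, 1500.0), ("Lansing", 2023, 900.0)],
    )

    # The query an LLM would be expected to generate for the question above.
    predicted_sql = "SELECT city FROM sales WHERE year = 2024 ORDER BY amount DESC LIMIT 1"
    print(conn.execute(predicted_sql).fetchone())  # ('Ann Arbor',)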

MMTU addresses this by expanding the scope of existing evaluation tools and assessing LLM performance on previously overlooked table tasks, such as schema matching and column transformation. These tasks are crucial in a wide array of applications where data exists in diverse table formats across domains.

Table of the 25 tasks in the MMTU benchmark, grouped by task category (such as table transformation, table matching, and data cleaning), with each task's description, evaluation metric, source reference, and question count, totaling over 30,000 examples. Most have not traditionally been used to evaluate LLMs, expanding beyond standard tasks like NL-to-code, TableQA, and KB mapping.

Schema matching, for instance, involves aligning tables with disparate schemas to unify data from different sources, an essential task for data integration. Another featured task, column transformation, requires LLMs to convert data formats, such as turning dates from “MM/DD/YYYY” into “Month Day, Year,” or splitting a full name into first and last names, a step similarly crucial in preparing data for analysis. By evaluating such tasks, MMTU provides a comprehensive benchmark that reflects the complexity and diversity of table-related challenges.
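
To make the column transformation example concrete, the short pandas sketch below performs both conversions described above on a toy table. The column names and values are hypothetical, and this is only an illustration of the task type rather than code or data from the benchmark.

    import pandas as pd

    # Hypothetical input table with a date column and a full-name column.
    df = pd.DataFrame({
        "date": ["07/04/2023", "12/25/2022"],
        "full_name": ["Ada Lovelace", "Alan Turing"],
    })

    # Reformat dates from "MM/DD/YYYY" into "Month Day, Year".
    df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y").dt.strftime("%B %d, %Y")

    # Split full names into separate first- and last-name columns.
    df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

    print(df)

In the benchmark, the point is for the LLM itself to infer and carry out transformations like these; hand-written code of this kind simply shows the intended result.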

“We aim to bridge the gap between the natural language processing and database communities by providing a tool that facilitates a more holistic evaluation of AI models,” Jagadish explained. 

MMTU is powered by a comprehensive dataset that includes over 30,000 questions spanning 25 unique tasks, offering a robust framework for evaluating and improving LLM capabilities. The design of the benchmark draws on decades of research on tabular data, and the breadth of tasks it covers requires LLMs to demonstrate not only a grasp of basic concepts like reading values, but also advanced reasoning, manipulation, and coding skills.

A performance comparison of four models (GPT-4o, Llama 3.3 70B, o4-mini, and DeepSeek-R1) across all 10 task categories. As the graphic shows, while models performed reasonably well in areas like table matching and table join, they struggled in more complex tasks such as column transformation, table understanding, and column relationship, highlighting clear areas for improvement.

The researchers’ findings show that while newer reasoning models tailored for complex problem-solving outperform chat-based models, significant advancements are still needed across several task categories to meet professional data analysis requirements. For instance, although models excel at tasks like translating English questions into database code, they falter on more practical skills like cleaning up messy data, matching information from separate sources, or executing multi-step data transformations—critical steps in making data useful for real-world applications.

“Our benchmark reveals that while LLMs perform well on certain tasks, they struggle with others like data cleaning and knowledge-based mapping,” said Xing. “This helps narrow down areas of LLM performance that require further research and development.”

By setting a new standard for evaluating AI models across a broad and diverse suite of table-related tasks, MMTU represents a pivotal step toward developing LLMs capable of comprehensive data analysis. The benchmark enables future advances to better support users across fields, from healthcare to business, that depend on structured data processing. The MMTU dataset and code are publicly available via Hugging Face and GitHub, ensuring wide accessibility for ongoing research and application in this important area.
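
For readers who want to experiment, the dataset can likely be loaded with the Hugging Face datasets library along the lines of the sketch below; the dataset identifier shown is a placeholder, and the exact name is listed on the project's Hugging Face and GitHub pages.

    from datasets import load_dataset

    # Placeholder identifier -- substitute the actual MMTU dataset name
    # published by the authors on Hugging Face.
    mmtu = load_dataset("MMTU-benchmark/MMTU")

    # Inspect the available splits and a sample record.
    print(mmtu)
    first_split = next(iter(mmtu))
    print(mmtu[first_split][0])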
