Turning the tables: A benchmark for LLMs in data analysis

Researchers in Computer Science and Engineering (CSE) at the University of Michigan have introduced a groundbreaking benchmark, called MMTU, designed to evaluate large language models (LLMs) on a wide range of table-related tasks. The initiative, spearheaded by H. V. Jagadish, the Edgar F. Codd Distinguished University Professor and Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science, and PhD student Junjie Xing, in collaboration with Yeye He and others at Microsoft Research, addresses a significant gap in how AI models that work with tabular data are evaluated.
Increasingly, LLMs are being used to assist with data processing and analysis. As organizations, researchers, and everyday users collect and rely on ever more tabular data, from spreadsheets to complex databases, the potential of LLMs to automate, explain, and generate insights from this data continues to grow. Yet their proficiency with data in table form is limited, and their capabilities on complex table tasks remain largely untested.
Current evaluations of LLM performance focus predominantly on a narrow set of table tasks: translating natural-language questions into database queries, answering questions directly from tables, or verifying whether a statement about a table is true. While these tasks are important, they barely scratch the surface of the varied scenarios in which tables are used, and many crucial operations go untested in these narrow evaluations. The resulting lack of understanding of how, and how well, LLMs read, parse, and analyze tabular data seriously limits their utility in data processing and analysis applications.
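To make the first of these tasks concrete, a typical question-to-SQL test item pairs a natural-language question with the database query a model is expected to produce. The minimal Python sketch below illustrates the idea; the table, columns, and rows are illustrative and not drawn from any benchmark.

```python
import sqlite3

# Toy illustration of the question-to-SQL task: given the question below,
# a model is expected to produce the equivalent query.
question = "Which employees in the Sales department earn more than 80,000?"
predicted_sql = (
    "SELECT name FROM employees "
    "WHERE department = 'Sales' AND salary > 80000"
)

# Build a small in-memory table and check what the predicted query returns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Sales", 95000), ("Bob", "Sales", 60000), ("Cam", "HR", 90000)],
)
print(conn.execute(predicted_sql).fetchall())  # [('Ada',)]
```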
MMTU addresses this by expanding the scope of existing evaluation tools and assessing LLM performance on previously overlooked table tasks, such as schema matching and column transformation. These tasks are crucial in a wide array of applications, where data exists in diverse table formats across domains.

Schema matching, for instance, involves aligning tables with disparate schemas to unify data from different sources, an essential task for data integration. Another featured task, column transformation, requires LLMs to convert data formats, such as turning dates from “MM/DD/YYYY” into “Month Day, Year,” or splitting a full name into first and last names; such transformations are similarly crucial in preparing data for analysis. By evaluating such tasks, MMTU provides a comprehensive benchmark that reflects the complexity and diversity of table-related challenges.
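To make the column-transformation examples concrete, here is a minimal Python sketch of the two conversions just described; the function names and sample rows are illustrative and not taken from the benchmark itself. In the benchmark setting, of course, the model must work out such transformations on its own; the code simply shows what correct outputs look like.

```python
from datetime import datetime

def reformat_date(value: str) -> str:
    """Convert an 'MM/DD/YYYY' string into 'Month Day, Year'."""
    parsed = datetime.strptime(value, "%m/%d/%Y")
    # Format the day via parsed.day to avoid the platform-dependent %-d directive.
    return f"{parsed.strftime('%B')} {parsed.day}, {parsed.year}"

def split_full_name(value: str) -> tuple[str, str]:
    """Split 'First Last' into (first, last); extra tokens stay with the last name."""
    first, _, last = value.partition(" ")
    return first, last

rows = [("07/04/2021", "Ada Lovelace"), ("12/25/1999", "Alan Turing")]
for date, name in rows:
    print(reformat_date(date), "|", split_full_name(name))
# July 4, 2021 | ('Ada', 'Lovelace')
# December 25, 1999 | ('Alan', 'Turing')
```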
“We aim to bridge the gap between the natural language processing and database communities by providing a tool that facilitates a more holistic evaluation of AI models,” Jagadish explained.
MMTU is powered by a comprehensive dataset that includes over 30,000 questions spanning 25 unique tasks, offering a robust framework for evaluating and improving LLM capabilities. The design of the benchmark draws on decades of research on tabular data, and the breadth of tasks it covers requires LLMs to demonstrate not only a grasp of basic concepts like reading values, but also advanced reasoning, manipulation, and coding skills.

The researchers’ findings show that while newer reasoning models tailored for complex problem-solving outperform chat-based models, significant advances are still needed across several task categories to meet the demands of professional data analysis. For instance, although models excel at tasks like translating English questions into database queries, they falter on more practical skills such as cleaning up messy data, matching information from separate sources, and executing multi-step data transformations, all critical steps in making data useful for real-world applications.
“Our benchmark reveals that while LLMs perform well on certain tasks, they struggle with others like data cleaning and knowledge-based mapping,” said Xing. “This helps narrow down areas of LLM performance that require further research and development.”
By setting a new standard for evaluating AI models across a diverse and comprehensive suite of table-related tasks, MMTU represents a pivotal step toward developing LLMs capable of comprehensive data analysis. The benchmark enables future advances that better support users across fields, from healthcare to business, that depend on structured data processing. The MMTU dataset and code are publicly available via HuggingFace and GitHub, ensuring wide accessibility for ongoing research and application in this important area.
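For readers who want to explore the benchmark, loading a HuggingFace dataset generally looks like the sketch below. The repository identifier used here is an assumption; consult the project’s HuggingFace and GitHub pages for the exact name and available splits.

```python
# Minimal sketch of pulling the benchmark with the HuggingFace `datasets`
# library. The dataset identifier below is an assumption; check the project's
# HuggingFace page for the exact repository name and splits.
from datasets import load_dataset

mmtu = load_dataset("MMTU-benchmark/MMTU")  # hypothetical identifier
print(mmtu)  # inspect splits, task fields, and question counts
```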