Tabular Data Analysis Workshop @ VLDB 2025

Objectives

Data analysis is described as the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Performing such tasks over large and heterogeneous collections of tabular data, as found in enterprise data lakes and on the Web, is extremely challenging and an attractive research topic in data management, AI, and related communities. The goal of this workshop is to bring together researchers and practitioners from these diverse communities who work on addressing the fundamental research challenges of tabular data analysis and on building automated solutions in this space.

We aim to provide a forum for: a) exchange of ideas between two communities: 1) an active community of data management researchers working on data integration and on schema and data matching problems over tabular data, and 2) a vibrant community of researchers in the AI and Semantic Web communities working on the core challenge of matching tabular data to Knowledge Graphs as part of the ISWC SemTab Challenges; b) presentation of late-breaking results in emerging research areas such as table representation learning and its applications, the use of large language models (LLMs) for tabular data analysis, and the automation of data science pipelines that rely on tabular data; and c) discussion of real-world challenges in implementing industrial-scale tabular data analysis pipelines and data lake and data lakehouse solutions.

Call For Papers

Audience: Our workshop encourages participation from researchers in the data management, AI, and Semantic Web communities working on a wide range of problems relevant to tabular data analysis. We hope that the workshop will serve as a single reference point for researchers and practitioners working in this area and help form new collaborations. We also aim to provide a venue for industry researchers and practitioners who rely on tabular data analysis tasks to present use cases and discuss their needs in addressing real-world problems and building large-scale solutions.

Topics of Interest include but are not limited to:

Semantic Table Annotation
Automated Tabular Data Understanding
Using Large Language Models (LLMs) for Tabular Data Analysis
Exploratory Data Analysis over Tabular Data
Table Search in Data Lakes
Tabular Data Discovery
Metadata Management for Tabular Data Analysis
Data Augmentation with Tabular Data
Integration and Matching of Tabular Data
Knowledge Graph Construction and Completion with Tabular Data
Automated Discovery of ML Features from Tabular Data
ML Model Development with Tabular Data
Visualization and Interfaces for Tabular Data Analysis
Data Wrangling for Tabular Data Analysis
Deep Learning and Representation Learning for Tabular Data Analysis
Extraction and Analysis of Tabular Data from (HTML/PDF) Documents and Images
Analysis of Tabular Data on the Web (Web Tables)
Practical Applications of Tabular Data Analysis
Benchmarking and Evaluation Frameworks for Tabular Data Analysis

Submissions

Contributions to the workshop can take the form of technical papers, posters, or statements of interest addressing various aspects of tabular data analysis, as well as reports on SemTab Challenge participation. Long technical papers should be 8-10 pages long. Short technical papers should be no more than 4 pages long. Posters should not exceed 2 pages. References do not count towards the page limits mentioned above.

Submission site: https://cmt3.research.microsoft.com/TaDA2025
Submissions should follow the format outlined in the provided zipped LaTeX proceedings directory.

Reviews will be single-anonymous (not dual-anonymous). Authors of accepted papers will have the option to include their papers in the VLDB workshop proceedings. At least one co-author is expected to register for the VLDB 2025 conference and present the paper in person. Please visit the VLDB 2025 registration instructions for more information.
For camera-ready submission, the authors should prepare the following materials:

Camera-Ready Paper: your final revised paper in PDF format
(TaDA25_PaperId.pdf, replace "PaperId" with your assigned paper ID)
Completed Copyright Form: download from https://vldb.org/pvldb/vol13/VLDB_Copyright_License_Form.pdf
(TaDA25_PaperId_Copyright.pdf, replace "PaperId" with your assigned paper ID)

Submit both your camera-ready paper and the completed copyright form to https://cmt3.research.microsoft.com/TaDA2025

Important Dates

Submission deadline: ~~May 15, 2025~~ May 29, 2025
Notification of acceptance: ~~June 10, 2025~~ June 24, 2025
Camera-ready copy due: ~~July 1, 2025~~ July 15, 2025
Workshop Day: September 5, 2025

All times are Anywhere on Earth (AoE).
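
For authors converting these dates to their own time zone, the following is a minimal sketch rather than an official tool. It assumes that each deadline expires at 23:59:59 on the listed day (the page gives no time of day) and uses the standard UTC-12:00 offset for AoE. The helper name `aoe_deadline_utc` is hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone

# Anywhere on Earth (AoE) corresponds to the fixed offset UTC-12:00.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_deadline_utc(day: date) -> datetime:
    """Return the last second of `day` in AoE, expressed in UTC.

    Assumption: the deadline runs until 23:59:59 AoE on the given day.
    """
    end_of_day = datetime.combine(day, time(23, 59, 59), tzinfo=AOE)
    return end_of_day.astimezone(timezone.utc)


# Extended dates as listed above.
deadlines = {
    "Submission deadline": date(2025, 5, 29),
    "Camera-ready copy due": date(2025, 7, 15),
}

for label, day in deadlines.items():
    print(f"{label}: {aoe_deadline_utc(day).isoformat()}")
```

Under these assumptions, a deadline dated May 29 AoE closes at 11:59:59 UTC on May 30.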

Program

8:45 - 10:00

Session 1

8:45 - 9:00	Introduction
9:00 - 9:30	Keynote talk by Kavitha Srinivas *Do we need tabular foundation models?*
9:30 - 9:45	A Vision for SQL-Based Relational Deep Learning [pdf] Fahim Shahriar Khan, Ashraf Aboulnaga
9:45 - 10:00	From Features to Structure: Task-Aware Graph Construction for Relational and Tabular Learning with GNNs [pdf] Tamara Cucumides, Floris Geerts

10:00 - 11:00

Coffee Break + Poster Session

	Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models [pdf] Panagiotis Koletsis, Christos Panagiotopoulos, Georgios Papadopoulos, Vasilis Efthymiou
	Improving Column Type Annotation Using Large Language Models [pdf] Amir Babamahmoudi, Davood Rafiei, Mario Nascimento
	Query Plan Generation for Table Question Answering [pdf] Ivan Poddubny, Nikita Dorodnykh
	Table Header Recognition Based on Large Language Models [pdf] Ilya I. Okhotin, Nikita Dorodnykh
	TOPJoin: A Context-Aware Multi-Criteria Approach for Joinable Column Search [pdf] Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapanti, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, Kavitha Srinivas
	Evaluating SQL Selection/Projection over Table Embeddings [pdf] Mariam Mellouli, Paolo Papotti
	Optimizing Source Selection for Tuple-Value Discovery [pdf] Ahmad Fares, Georgia Troullinou, Silviu Maniu, Sihem Amer-Yahia
	Universal Embeddings of Tabular Data [pdf] Astrid Franz, Frederik Hoppe, Marianne Michaelis, Udo Göbel
	SemForest: Semantic-Aware Ontology Generation with Foundation Models [pdf] Guohui Guan, Sachin Konan, Larry Rudolph, Chang Ge

11:00 - 12:00

Session 2

11:00 - 11:15	StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation [pdf] Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz
11:15 - 11:30	Towards Fine-Grained Extraction of Scientific Claims from Heterogeneous Tables Using Large Language Models [pdf] Daniele Bertillo, Laks V.S. Lakshmanan, Paolo Merialdo, Divesh Srivastava
11:30 - 12:00	Keynote talk by Paolo Papotti *Reinforcement Learning to enable Reasoning LLMs for Text2SQL*

Keynote talk by Paolo Papotti
Title: Reinforcement Learning to enable Reasoning LLMs for Text2SQL
Abstract: The ability to interact with complex databases using natural language (NL) is a key step in democratizing data access, a long-standing goal in the enterprise world. While Large Language Models (LLMs) have shown remarkable promise in translating NL questions into SQL queries (Text2SQL), their performance stalls when faced with the complexities of real-world enterprise databases. This talk will report on a promising solution to enhance the reasoning capabilities of LLMs for this task. Our "Think2SQL" methodology investigates various strategies for improving LLM performance, including Zero-Shot Learning (ZSL), Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL). RL, using rewards crafted around SQL execution accuracy, significantly boosts the performance of small LLMs, achieving results comparable to those of much larger models on complex datasets. Finally, we will highlight the path forward for Text2SQL systems capable of navigating the nuances of human language, such as ambiguity, in a real-world enterprise context.
Bio: Paolo Papotti has been a Professor at EURECOM (France) since 2017 and has held a Chair of Artificial Intelligence at the 3IA Institute since 2024. He received his PhD from Roma Tre University (Italy) in 2007 and held research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research focuses on data management and NLP. He has authored more than 150 publications, and his work has been recognized with best paper awards (CIKM 2024, ISWC 2024), best demo awards (SIGMOD 2015, DBA 2020, SIGMOD 2022), and Google Faculty Research awards (2016, 2020).

Keynote talk by Kavitha Srinivas
Title: Do we need tabular foundation models?
Bio: Kavitha Srinivas is a Senior Research Scientist at IBM Research. Her current interests lie at the intersection of semantics and data management, including how to leverage LLMs for data discovery, predictive modeling, and table retrieval.

Organization

General Chairs:

Program Committee Chairs:

Proceedings and Publicity Chair:

Chuan Lei

Program Committee:

Aamod Khatiwada (Northeastern University)
Amine Mhedhbi (Polytechnique Montréal)
Anastasia Dimou (KU Leuven)
Andra-Denis Ionescu (TU Delft)
Christian Bizer (University of Mannheim)
Christos Diou (Harokopio University of Athens)
George Papadakis (University of Athens)
Haridimos Kondylakis (FORTH-ICS & University of Crete)
Ismael Sanz (Universitat Jaume I)
Kaustubh Beedkar (IIT Delhi)
Kavitha Srinivas (IBM Research)
Kostas Stefanidis (Tampere University)
Marco Mesiti (University of Milan)
Paolo Papotti (EURECOM)
Rafael Berlanga Llavori (Universitat Jaume I)
Roee Shraga (WPI)
Romila Pradhan (Purdue University)
Sajjadur Rahman (Megagon Labs)
You Wu (Google)
Yuval Moskovitch (University of Michigan)
Zezhou Huang (Columbia University)