Tabular Data Analysis Workshop @ VLDB 2024

Objectives

Data analysis is described as the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Performing such tasks over large and heterogeneous collections of tabular data, as found in enterprise data lakes and on the Web, is extremely challenging and an attractive research topic in data management, AI, and related communities. The goal of this workshop is to bring together researchers and practitioners in these diverse communities who work on addressing the fundamental research challenges of tabular data analysis and building automated solutions in this space.

We aim to provide a forum for:

a) Exchange of ideas between two communities: 1) an active community of data management researchers working on data integration and schema and data matching problems over tabular data, and 2) a vibrant community of researchers in the AI and Semantic Web communities working on the core challenge of matching tabular data to Knowledge Graphs as part of the ISWC SemTab Challenges.
b) Presentation of late-breaking results related to several emerging research areas such as table representation learning and its applications, automation of data science pipelines, and data lake and data lakehouse solutions.
c) Discussion of real-world data management challenges related to implementing industrial-scale tabular data analysis solutions.

Call For Papers

Audience: Our workshop encourages participation from researchers in the data management, AI, and Semantic Web communities working on a wide range of problems relevant to tabular data analysis. We hope that it will serve as a single reference point for researchers and practitioners working in this area and help form new collaborations. We also aim to provide a venue for industry researchers and practitioners who rely on tabular data analysis tasks to present use cases and discuss their needs in addressing real-world problems and building large-scale solutions.

Topics of interest include, but are not limited to:

Semantic Table Annotation
Automated Tabular Data Understanding
Exploratory Data Analysis over Tabular Data
Table Search in Data Lakes
Tabular Data Discovery
Tabular Data Discovery in Data Lakes
Tabular Data Discovery for Causal Inference
Metadata Management for Tabular Data Analysis
Data Augmentation with Tabular Data
Integration and Matching of Tabular Data
Knowledge Graph Construction and Completion with Tabular Data
Automated Discovery of ML Features from Tabular Data
ML Model Development with Tabular Data
Visualization and Interfaces for Tabular Data Analysis
Data Wrangling for Tabular Data Analysis
Deep Learning and Representation Learning for Tabular Data Analysis
Foundation Models for Tabular Data Analysis
Extraction and Analysis of Tabular Data from (HTML/PDF) Documents and Images
Analysis of Tabular Data on the Web (Web Tables)
Practical Applications of Tabular Data Analysis
Benchmarking and Evaluation Frameworks for Tabular Data Analysis

Call for late-breaking results: We are now ready to receive your poster paper submissions (2 pages, double-column) for late-breaking results until July 8, AoE. These submissions will go through a light review phase, and accepted submissions will have the opportunity to present a poster at the workshop (in-person attendees only), as well as a 5-minute recorded video presentation shared on the workshop website. To submit, use our regular submission website and pick "Poster / Statement of Interest" as the submission type.

Submissions

Contributions to the workshop can take the form of technical papers, posters, or statements of interest addressing various aspects of tabular data analysis, as well as reports on SemTab Challenge participation. Long technical papers should be 8-10 pages long. Short technical papers should be no more than 4 pages long. Posters should not exceed 2 pages. References do not count towards the page limits mentioned above.

Submission site: https://cmt3.research.microsoft.com/TaDA2024
Submissions should follow the double-column CEUR-ART template

Reviews will be anonymous (not dual anonymous). Authors of accepted papers will have the option to include their papers in the CEUR-ART proceedings of the workshop. At least one co-author is expected to register for the VLDB 2024 conference and present the paper in person. Please visit the VLDB 2024 registration instructions for more information.

Important Dates

Submission deadline: ~~May 9, 2024~~ May 30, 2024
Notification of acceptance: ~~June 10, 2024~~ June 28, 2024
Late breaking results (poster) submission deadline: July 8, 2024
Late breaking results (poster) notification: July 15, 2024
Camera-ready copy due: ~~June 28, 2024~~ August 5, 2024
Workshop Day: August 30, 2024

All times are Anywhere on Earth (AoE).

Program

9:00 - 10:30

Session 1: Tabular Data Discovery

9:00 - 9:10	Opening Remarks
9:10 - 9:35	ALT-GEN: Benchmarking Table Union Search using Large Language Models [pdf] Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller
9:35 - 10:00	Finding Support for Tabular LLM Outputs [pdf] Grace Fan, Roee Shraga, Renée J. Miller
10:00 - 10:15	Humboldt: Metadata-Driven Extensible Data Discovery [pdf] Alex Bäuerle, Çağatay Demiralp, Michael Stonebraker
10:15 - 10:30	Toward a Declarative Query Language for Machine Learning [pdf] Hasan Jamil

10:30 - 11:00

Coffee Break

11:00 - 12:30

Session 2: Keynote Session

11:00 - 12:00	Keynote talk by Shi Han and Haoyu Dong (Microsoft Research): Spreadsheet Intelligence and Data Analytics
12:00 - 12:15	Data Quality Management for Responsible AI in Data Lakes [pdf] Carolina Cortes, Camila Sanz, Lorena Etcheverry, Adriana Marotta
12:15 - 12:30	Large Language Models as Data Preprocessors [pdf] Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada

12:30 - 14:00

Lunch Break

14:00 - 15:20

Session 3: LLMs & Tabular Data

14:00 - 14:25	Schema Matching with Large Language Models: an Experimental Study [pdf] Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren
14:25 - 14:40	LLMs for Data Engineering on Enterprise Data [pdf] Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Matthias Urban, Anupam Sanghi, Carsten Binnig
14:40 - 14:55	Transform Table to Database Using Large Language Models [pdf] Zezhou Huang, Jia Guo, Eugene Wu
14:55 - 15:20	DEMA: Enhancing Causal Analysis through Data Enrichment and Discovery in Data Lakes [pdf] Kayvon Heravi, Saathvik Dirisala, Babak Salimi

15:20 - 16:00

Poster Session

16:00 - 17:10

Session 4: Machine Learning & Tabular Data

16:00 - 16:25	4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [pdf] Minjie Wang, Quan Gan, David Wipf, Zhenkun Cai, Ning Li, Jianheng Tang, Yanlin Zhang, Zizhao Zhang, Zunyao Mao, Yakun Song, Yanbo Wang, Jiahang Li, Han Zhang, Guang Yang, Xiao Qin, Chuan Lei, Muhan Zhang, Weinan Zhang, Christos Faloutsos, Zheng Zhang
16:25 - 16:40	Fast and Accurate Regional Effect Plots for Automated Tabular Data Analysis [pdf] Vasilis Gkolemis, Theodore Dalamagas, Eirini Ntoutsi, Christos Diou
16:40 - 16:55	GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [pdf] Han Zhang, Quan Gan, David Wipf, Weinan Zhang
16:55 - 17:10	Closing Remarks & Awards

Keynote by Shi Han and Haoyu Dong, Microsoft Research
Title: Spreadsheet Intelligence and Data Analytics
Abstract: This keynote will unveil cutting-edge technologies designed to tackle the major challenges in spreadsheet intelligence, encompassing areas such as detecting table ranges, analyzing table structures and sheet layouts, understanding data semantics, and recommending data presentations. Building on spreadsheet intelligence, the presentation will also highlight our research and engineering efforts in boosting the automation of data analytics to help Microsoft build technical leadership in the Business Intelligence market. Given the rise of Large Language Models (LLMs), we will also present our latest explorations into integrating LLMs with spreadsheet intelligence and data analytics.

Accepted Papers

Hasan Jamil. Toward a Declarative Query Language for Machine Learning [pdf]
Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller. ALT-GEN: Benchmarking Table Union Search using Large Language Models [pdf]
Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Matthias Urban, Anupam Sanghi, Carsten Binnig. LLMs for Data Engineering on Enterprise Data [pdf]
Vasilis Gkolemis, Theodore Dalamagas, Eirini Ntoutsi, Christos Diou. Fast and Accurate Regional Effect Plots for Automated Tabular Data Analysis [pdf]
Zezhou Huang, Jia Guo, Eugene Wu. Transform Table to Database Using Large Language Models [pdf]
Han Zhang, Quan Gan, David Wipf, Weinan Zhang. GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [pdf]
Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren. Schema Matching with Large Language Models: an Experimental Study [pdf]
Grace Fan, Roee Shraga, Renée J. Miller. Finding Support for Tabular LLM Outputs [pdf]
Minjie Wang, Quan Gan, David Wipf, Zhenkun Cai, Ning Li, Jianheng Tang, Yanlin Zhang, Zizhao Zhang, Zunyao Mao, Yakun Song, Yanbo Wang, Jiahang Li, Han Zhang, Guang Yang, Xiao Qin, Chuan Lei, Muhan Zhang, Weinan Zhang, Christos Faloutsos, Zheng Zhang. 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [pdf]
Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada. Large Language Models as Data Preprocessors [pdf]
Kayvon Heravi, Saathvik Dirisala, Babak Salimi. DEMA: Enhancing Causal Analysis through Data Enrichment and Discovery in Data Lakes [pdf]
Lorena Etcheverry, Adriana Marotta, Carolina Cortes, Camila Sanz. Data Quality Management for Responsible AI in Data Lakes [pdf]
Alex Bäuerle, Çağatay Demiralp, Michael Stonebraker. Humboldt: Metadata-Driven Extensible Data Discovery [pdf]

Organization

Organizing Committee:

Steering Committee:

Madelon Hulsebos (UC Berkeley)
Ernesto Jiménez-Ruiz (City, University of London)
Fatemeh Nargesian (University of Rochester)
Natasha Noy (Google)
Horst Samulowitz (IBM Research)

Program Committee:

Nora Abdelmageed (University of Jena)
Omar Benjelloun (Google)
Rafael Berlanga Llavori (Universitat Jaume I)
Carsten Binnig (TU Darmstadt)
Christian Bizer (University of Mannheim)
Anastasia Dimou (KU Leuven)
Christos Diou (Harokopio University of Athens)
Zezhou Huang (Columbia University)
Madelon Hulsebos (UC Berkeley)
Andra Ionescu (TU Delft)
Ernesto Jiménez-Ruiz (City, University of London)
Aamod Khatiwada (Northeastern University)
Udayan Khurana (IBM Research)
Haridimos Kondylakis (FORTH-ICS)
Marco Mesiti (University of Milan)
Renée Miller (Northeastern University)
George Papadakis (University of Athens)
Paolo Papotti (EURECOM)
Nhan Pham (IBM Research)
Horst Samulowitz (IBM Research)
Ismael Sanz (Universitat Jaume I)
Roee Shraga (WPI)
Kostas Stefanidis (Tampere University)
Gerhard Weikum (Max Planck Institute for Informatics)
You Wu (Google)