Objectives

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Performing such tasks over large and heterogeneous collections of tabular data, as found in enterprise data lakes and on the Web, is extremely challenging and an attractive research topic in data management, AI, and related communities. The goal of this workshop is to bring together researchers and practitioners from these diverse communities who work on addressing the fundamental research challenges of tabular data analysis and building automated solutions in this space.

We aim to provide a forum for: a) exchange of ideas between two communities: 1) an active community of data management researchers working on data integration and schema and data matching problems over tabular data, and 2) a vibrant community of AI and Semantic Web researchers working on the core challenge of matching tabular data to Knowledge Graphs as part of the ISWC SemTab Challenges; b) presentation of late-breaking results related to several emerging research areas such as table representation learning and its applications, automation of data science pipelines, and data lake and data lakehouse solutions; c) discussion of real-world data management challenges related to implementing industrial-scale tabular data analysis solutions.

Call For Papers

Audience: Our workshop encourages participation from researchers in the data management, AI, and Semantic Web communities working on a wide range of problems relevant to tabular data analysis. We hope that it will serve as a single reference point for researchers and practitioners working in this area and help form new collaborations. We also aim to provide a venue for industry researchers and practitioners who rely on tabular data analysis tasks to present use cases and discuss their needs in addressing real-world problems and building large-scale solutions.

Topics of Interest include but are not limited to:
  • Semantic Table Annotation
  • Automated Tabular Data Understanding
  • Exploratory Data Analysis over Tabular Data
  • Table Search in Data Lakes
  • Tabular Data Discovery
  • Tabular Data Discovery in Data Lakes
  • Tabular Data Discovery for Causal Inference
  • Metadata Management for Tabular Data Analysis
  • Data Augmentation with Tabular Data
  • Integration and Matching of Tabular Data
  • Knowledge Graph Construction and Completion with Tabular Data
  • Automated Discovery of ML Features from Tabular Data
  • ML Model Development with Tabular Data
  • Visualization and Interfaces for Tabular Data Analysis
  • Data Wrangling for Tabular Data Analysis
  • Deep Learning and Representation Learning for Tabular Data Analysis
  • Foundation Models for Tabular Data Analysis
  • Extraction and Analysis of Tabular Data from (HTML/PDF) Documents and Images
  • Analysis of Tabular Data on the Web (Web Tables)
  • Practical Applications of Tabular Data Analysis
  • Benchmarking and Evaluation Frameworks for Tabular Data Analysis
Call for late-breaking results: We are now accepting poster paper submissions (2 pages, double-column) for late-breaking results until July 8, AoE. These submissions will go through a light review phase. Authors of accepted submissions will have the opportunity to present a poster at the workshop (in-person attendees only) and to share a 5-minute recorded video presentation on the workshop website. To submit, use our regular submission website and pick "Poster / Statement of Interest" as the submission type.

Submissions

Contributions to the workshop can take the form of technical papers, posters, or statements of interest addressing various aspects of tabular data analysis, as well as reports on SemTab Challenge participation. Long technical papers should be 8-10 pages long. Short technical papers should be no more than 4 pages long. Posters should not exceed 2 pages. References do not count towards the page limits mentioned above.

Submission site: https://cmt3.research.microsoft.com/TaDA2024
Submissions should follow the double-column CEUR-ART template.

Reviews will be anonymous (not dual anonymous). Authors of accepted papers will have the option to include their papers in the CEUR-ART proceedings of the workshop. At least one co-author is expected to register for the VLDB 2024 conference and present the paper in person. Please visit the VLDB 2024 registration instructions for more information.

Important Dates

  • Submission deadline: May 30, 2024 (extended from May 9, 2024)
  • Notification of acceptance: June 28, 2024 (previously June 10, 2024)
  • Late breaking results (poster) submission deadline: July 8, 2024
  • Late breaking results (poster) notification: July 15, 2024
  • Camera-ready copy due: August 5, 2024 (previously June 28, 2024)
  • Workshop Day: August 30, 2024
All times are Anywhere on Earth (AoE).

Program

9:00 - 10:30 Session 1: Tabular Data Discovery
9:00 - 9:10 Opening Remarks
9:10 - 9:35 ALT-GEN: Benchmarking Table Union Search using Large Language Models [pdf]
Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller
9:35 - 10:00 Finding Support for Tabular LLM Outputs [pdf]
Grace Fan, Roee Shraga, Renée J. Miller
10:00 - 10:15 Humboldt: Metadata-Driven Extensible Data Discovery [pdf]
Alex Bäuerle, Çağatay Demiralp, Michael Stonebraker
10:15 - 10:30 Toward a Declarative Query Language for Machine Learning [pdf]
Hasan Jamil
10:30 - 11:00 Coffee Break
11:00 - 12:30 Session 2: Keynote Session
11:00 - 12:00 Keynote talk by Shi Han and Haoyu Dong, Microsoft Research
Spreadsheet Intelligence and Data Analytics
12:00 - 12:15 Data Quality Management for Responsible AI in Data Lakes [pdf]
Carolina Cortes, Camila Sanz, Lorena Etcheverry, Adriana Marotta
12:15 - 12:30 Large Language Models as Data Preprocessors [pdf]
Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
12:30 - 14:00 Lunch Break
14:00 - 15:20 Session 3: LLMs & Tabular Data
14:00 - 14:25 Schema Matching with Large Language Models: an Experimental Study [pdf]
Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren
14:25 - 14:40 LLMs for Data Engineering on Enterprise Data [pdf]
Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Matthias Urban, Anupam Sanghi, Carsten Binnig
14:40 - 14:55 Transform Table to Database Using Large Language Models [pdf]
Zezhou Huang, Jia Guo, Eugene Wu
14:55 - 15:20 DEMA: Enhancing Causal Analysis through Data Enrichment and Discovery in Data Lakes [pdf]
Kayvon Heravi, Saathvik Dirisala, Babak Salimi
15:20 - 16:00 Poster Session
16:00 - 17:10 Session 4: Machine Learning & Tabular Data
16:00 - 16:25 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on RDBs [pdf]
Minjie Wang, Quan Gan, David Wipf, Zhenkun Cai, Ning Li, Jianheng Tang, Yanlin Zhang, Zizhao Zhang, Zunyao Mao, Yakun Song, Yanbo Wang, Jiahang Li, Han Zhang, Guang Yang, Xiao Qin, Chuan Lei, Muhan Zhang, Weinan Zhang, Christos Faloutsos, Zheng Zhang
16:25 - 16:40 Fast and Accurate Regional Effect Plots for Automated Tabular Data Analysis [pdf]
Vasilis Gkolemis, Theodore Dalamagas, Eirini Ntoutsi, Christos Diou
16:40 - 16:55 GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [pdf]
Han Zhang, Quan Gan, David Wipf, Weinan Zhang
16:55 - 17:10 Closing Remarks & Awards

Keynote by Shi Han and Haoyu Dong, Microsoft Research
Title: Spreadsheet Intelligence and Data Analytics
Abstract: This keynote will unveil cutting-edge technologies designed to tackle the major challenges in spreadsheet intelligence, encompassing areas such as detecting table ranges, analyzing table structures and sheet layouts, understanding data semantics, and recommending data presentations. Based on spreadsheet intelligence, the presentation will also highlight our research and engineering efforts in boosting automation of data analytics to help Microsoft build technical leadership in the Business Intelligence market. In the trend of Large Language Models (LLMs), we will also present our latest explorations into integrating LLMs with spreadsheet intelligence and data analytics.

Accepted Papers

  • Hasan Jamil. Toward a Declarative Query Language for Machine Learning [pdf]
  • Koyena Pal, Aamod Khatiwada, Roee Shraga, Renée J. Miller. ALT-GEN: Benchmarking Table Union Search using Large Language Models [pdf]
  • Jan-Micha Bodensohn, Ulf Brackmann, Liane Vogel, Matthias Urban, Anupam Sanghi, Carsten Binnig. LLMs for Data Engineering on Enterprise Data [pdf]
  • Vasilis Gkolemis, Theodore Dalamagas, Eirini Ntoutsi, Christos Diou. Fast and Accurate Regional Effect Plots for Automated Tabular Data Analysis [pdf]
  • Zezhou Huang, Jia Guo, Eugene Wu. Transform Table to Database Using Large Language Models [pdf]
  • Han Zhang, Quan Gan, David Wipf, Weinan Zhang. GFS: Graph-based Feature Synthesis for Prediction over Relational Databases [pdf]
  • Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren. Schema Matching with Large Language Models: an Experimental Study [pdf]
  • Grace Fan, Roee Shraga, Renée J. Miller. Finding Support for Tabular LLM Outputs [pdf]
  • Minjie Wang, Quan Gan, David Wipf, Zhenkun Cai, Ning Li, Jianheng Tang, Yanlin Zhang, Zizhao Zhang, Zunyao Mao, Yakun Song, Yanbo Wang, Jiahang Li, Han Zhang, Guang Yang, Xiao Qin, Chuan Lei, Muhan Zhang, Weinan Zhang, Christos Faloutsos, Zheng Zhang. 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [pdf]
  • Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada. Large Language Models as Data Preprocessors [pdf]
  • Kayvon Heravi, Saathvik Dirisala, Babak Salimi. DEMA: Enhancing Causal Analysis through Data Enrichment and Discovery in Data Lakes [pdf]
  • Lorena Etcheverry, Adriana Marotta, Carolina Cortes, Camila Sanz. Data Quality Management for Responsible AI in Data Lakes [pdf]
  • Alex Bäuerle, Çağatay Demiralp, Michael Stonebraker. Humboldt: Metadata-Driven Extensible Data Discovery [pdf]

Organization

Organizing Committee: Steering Committee:
  • Madelon Hulsebos (UC Berkeley)
  • Ernesto Jiménez-Ruiz (City, University of London)
  • Fatemeh Nargesian (University of Rochester)
  • Natasha Noy (Google)
  • Horst Samulowitz (IBM Research)
Program Committee:
  • Nora Abdelmageed (University of Jena)
  • Omar Benjelloun (Google)
  • Rafael Berlanga Llavori (Universitat Jaume I)
  • Carsten Binnig (TU Darmstadt)
  • Christian Bizer (University of Mannheim)
  • Anastasia Dimou (KU Leuven)
  • Christos Diou (Harokopio University of Athens)
  • Zezhou Huang (Columbia University)
  • Madelon Hulsebos (UC Berkeley)
  • Andra Ionescu (TU Delft)
  • Ernesto Jiménez-Ruiz (City, University of London)
  • Aamod Khatiwada (Northeastern University)
  • Udayan Khurana (IBM Research)
  • Haridimos Kondylakis (FORTH-ICS)
  • Marco Mesiti (University of Milan)
  • Renée Miller (Northeastern University)
  • George Papadakis (University of Athens)
  • Paolo Papotti (EURECOM)
  • Nhan Pham (IBM Research)
  • Horst Samulowitz (IBM Research)
  • Ismael Sanz (Universitat Jaume I)
  • Roee Shraga (WPI)
  • Kostas Stefanidis (Tampere University)
  • Gerhard Weikum (Max Planck Institute for Informatics)
  • You Wu (Google)