Objectives

Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Performing such tasks over large and heterogeneous collections of tabular data, as found in enterprise data lakes and on the Web, is extremely challenging and an attractive research topic in data management, AI, and related communities. The goal of this workshop is to bring together researchers and practitioners in these diverse communities who work on addressing the fundamental research challenges of tabular data analysis and building automated solutions in this space.

We aim to provide a forum for: a) exchange of ideas between two communities: 1) an active community of data management researchers working on data integration and schema and data matching problems over tabular data, and 2) a vibrant community of researchers in the AI and Semantic Web communities working on the core challenge of matching tabular data to Knowledge Graphs as part of the ISWC SemTab Challenges; b) presentation of late-breaking results related to several emerging research areas such as table representation learning and its applications, automation of data science pipelines, and data lake and data lakehouse solutions; c) discussion of real-world data management challenges related to implementing industrial-scale tabular data analysis solutions.

Call For Papers

Audience: Our workshop encourages participation from researchers in the data management, AI, and Semantic Web communities working on a wide range of problems relevant to tabular data analysis. We hope that it will constitute a single reference point for researchers and practitioners working in this area and help form new collaborations. We also aim to provide a venue for researchers from industry and practitioners relying on various tabular data analysis tasks to present use cases and discuss their needs in addressing real-world problems and large-scale solutions.

Topics of Interest include but are not limited to:
  • Semantic Table Annotation
  • Automated Tabular Data Understanding
  • Exploratory Data Analysis over Tabular Data
  • Table Search in Data Lakes
  • Tabular Data Discovery
  • Tabular Data Discovery in Data Lakes
  • Tabular Data Discovery for Causal Inference
  • Metadata Management for Tabular Data Analysis
  • Data Augmentation with Tabular Data
  • Integration and Matching of Tabular Data
  • Knowledge Graph Construction and Completion with Tabular Data
  • Automated Discovery of ML Features from Tabular Data
  • ML Model Development with Tabular Data
  • Visualization and Interfaces for Tabular Data Analysis
  • Data Wrangling for Tabular Data Analysis
  • Deep Learning and Representation Learning for Tabular Data Analysis
  • Foundation Models for Tabular Data Analysis
  • Extraction and Analysis of Tabular Data from (HTML/PDF) Documents and Images
  • Analysis of Tabular Data on the Web (Web Tables)
  • Practical Applications of Tabular Data Analysis
  • Benchmarking and Evaluation Frameworks for Tabular Data Analysis

Call for late-breaking results: We are now ready to receive your poster paper (2-page, double-column) submissions for late-breaking results, until June 20, AoE. Those submissions will go through a light review phase, and accepted submissions will have the opportunity to give a short presentation at a workshop session, for in-person attendees only, as well as a 5-minute recorded video presentation, shared on the workshop website.

Submissions

Contributions to the workshop can take the form of technical papers, posters, or statements of interest addressing various aspects of tabular data analysis, as well as reports on SemTab Challenge participation. Long technical papers should be 8-10 pages long. Short technical papers should be no more than 4 pages long. Posters should not exceed 2 pages. References do not count towards the page limits mentioned above.

Submission site: https://cmt3.research.microsoft.com/TaDA2023
Submissions should follow the double-column CEUR-ART template.

Submissions will be single-blind. Authors of accepted papers will have the option to include their papers in the CEUR-ART proceedings of the workshop. At least one co-author is expected to register for the VLDB 2023 conference and present the paper in person.

Important Dates

  • Abstract submission deadline: May 23, 2023
  • Submission deadline: May 30, 2023 (extended from May 15, 2023)
  • Late breaking results (poster) submission deadline: June 20, 2023
  • Notification of acceptance: June 23, 2023
  • Camera-ready copy due: July 20, 2023

All times are Anywhere on Earth (AoE).

Program

10:20 - 10:30 Opening
10:30 - 11:20 Keynote talk - Renée Miller: From Discovery to Integration of Data Lake Tables
11:20 - 12:00 Session 1
11:20 - 11:40 Keti Korini, Christian Bizer. Column Type Annotation using ChatGPT [pdf]
11:40 - 11:50 Viet-Phi Huynh, Yoan Chabot, Raphael Troncy. Towards Generative Semantic Table Interpretation [pdf]
11:50 - 12:00 Aneta Koleva, Martin Ringsquandl, Volker Tresp. Adversarial Attacks on Tables with Entity Swap [pdf]
12:00 - 13:30 Lunch
13:30 - 14:20 Keynote talk - Alon Halevy: Personal Digital Data: Where LLMs Meet Structured Data
14:20 - 15:00 Session 2
14:20 - 14:30 Hamed Mirzaei, Davood Rafiei. Table Union Search with Preferences [pdf]
14:30 - 14:40 Vijay S Kumar, Varish Mulwad, Jenny Williams, Tim Finin, Sharad Dixit, Anupam Joshi. Knowledge Graph-driven Tabular Data Discovery from Scientific Documents [pdf]
14:40 - 14:50 Arif Usta, Semih Salihoglu. To Join or Not to Join: An Analysis on the Usefulness of Joining Tables in Open Government Data Portals [pdf]
14:50 - 15:00 Liane Vogel, Carsten Binnig. WikiDBs: A Corpus Of Relational Databases From Wikidata [pdf]
15:00 - 15:50 Poster session
15:00 - 15:05 Davood Rafiei, Arash Dargahi Nobari, Soroush Omidvartehrani. Discovering and Integrating Tabular Data [pdf]
15:05 - 15:10 Eva Chrysostomaki, Maria Stratigi, Vasilis Efthymiou, Kostas Stefanidis, Dimitris Plexousakis. Fair Sequential Group Recommendations in SQUIRREL Movies [pdf]
15:50 - 16:00 Closing and Awards

Keynote by Renée Miller, Northeastern University
Title: From Discovery to Integration of Data Lake Tables
Abstract: We have made tremendous strides in providing tools for data scientists to discover new tables that are useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called full disjunction, was proposed in the 1990’s, but there has been little progress in using it for data science to integrate tables culled from data lakes. In this talk, I will overview both ALITE, a method to integrate (possibly incomplete) tables using a new scalable implementation of full disjunction, and DIALITE, an open discovery system that lets users discover, integrate, then analyze a set of tables using discovery methods such as Starmie, a new table union search method. To evaluate our systems, we developed and shared three new benchmarks for integration that use real data lake tables. I will present open problems and challenges in developing and evaluating scalable table search and integration methods on real data.

The ALITE [1] work was led by Aamod Khatiwada in collaboration with Professors Roee Shraga of the Worcester Polytechnic Institute, Renée Miller, and Wolfgang Gatterbauer of Northeastern University in Boston. DIALITE [2] was also led by Aamod Khatiwada in collaboration with Professors Roee Shraga and Renée Miller. Starmie [3] was led by Grace Fan in collaboration with Megagon Labs researchers Jin Wang, Yuliang Li and Dan Zhang.
[1] Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller: Integrating Data Lake Tables. PVLDB. 16(4): 932-945 (2022).
[2] Aamod Khatiwada, Roee Shraga, Renée J. Miller: DIALITE: Discover, Align and Integrate Open Data Tables. ACM SIGMOD, 187-190 (2023).
[3] Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, Renée J. Miller: Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representations. PVLDB. 16(8): 1726-1739 (2023).

Keynote by Alon Halevy, Meta AI
Title: Personal Digital Data: Where LLMs Meet Structured Data
Abstract: The important question of how companies and organizations use our data has received a lot of attention in the technology and policy communities. An equally important question that deserves more focus going forward is how we, as individuals, can take advantage of the data we generate to improve our health, vitality, and productivity and our overall well-being. We create a variety of data throughout our days, including our photos, workout stats, locations we’ve been to, the stuff we buy online and the content we consume. Fusing all this data together enables us to build a fascinating timeline of our lives. To leverage these timelines in order to help us produce new satisfying experiences we need to be able to query our timelines in natural language and to share short summaries of it with external services.

This talk will start by motivating the work on fusing personal digital data including its potential pitfalls. I will then discuss multiple approaches to the problem of querying timelines, which is an application area that forces us to consider deeply how language models can be used to query data that is partially structured and partially not.

Accepted Papers

The joint proceedings of VLDB 2023 workshops are available here: https://ceur-ws.org/Vol-3462/.

  • Keti Korini, Christian Bizer. Column Type Annotation using ChatGPT [pdf]
  • Hamed Mirzaei, Davood Rafiei. Table Union Search with Preferences [pdf]
  • Liane Vogel, Carsten Binnig. WikiDBs: A Corpus Of Relational Databases From Wikidata [pdf]
  • Aneta Koleva, Martin Ringsquandl, Volker Tresp. Adversarial Attacks on Tables with Entity Swap [pdf]
  • Arif Usta, Semih Salihoglu. To Join or Not to Join: An Analysis on the Usefulness of Joining Tables in Open Government Data Portals [pdf]
  • Vijay S Kumar, Varish Mulwad, Jenny Williams, Tim Finin, Sharad Dixit, Anupam Joshi. Knowledge Graph-driven Tabular Data Discovery from Scientific Documents [pdf]
  • Viet-Phi Huynh, Yoan Chabot, Raphael Troncy. Towards Generative Semantic Table Interpretation [pdf]
  • Eva Chrysostomaki, Maria Stratigi, Vasilis Efthymiou, Kostas Stefanidis, Dimitris Plexousakis. Fair Sequential Group Recommendations in SQUIRREL Movies [pdf]
  • Davood Rafiei, Arash Dargahi Nobari, Soroush Omidvartehrani. Discovering and Integrating Tabular Data [pdf]

Organization

Organizing Committee: Steering Committee:
  • Haoyu Dong (Microsoft)
  • Shi Han (Microsoft)
  • Madelon Hulsebos (University of Amsterdam)
  • Chuan Lei (AWS)
  • Fatemeh Nargesian (University of Rochester)
  • Natasha Noy (Google)
  • Horst Samulowitz (IBM Research)
Program Committee:
  • Omar Benjelloun (Google)
  • Rafael Berlanga Llavori (Universitat Jaume I)
  • Jiaoyan Chen (The University of Manchester)
  • Peter Christen (The Australian National University)
  • Vassilis Christophides (ENSEA)
  • Vincenzo Cutrona (SUPSI)
  • Anastasia Dimou (KU Leuven)
  • Michael R. Glass (IBM Research AI)
  • Ekaterini Ioannou (Tilburg University)
  • Asterios Katsifodimos (TU Delft)
  • Udayan Khurana (IBM Research)
  • Hongrae Lee (Google)
  • Venkata Vamsikrishna Meduri (IBM Research, Almaden)
  • Marco Mesiti (University of Milan)
  • Renée Miller (Northeastern University)
  • Carina Negreanu (Microsoft Research)
  • George Papadakis (University of Athens)
  • Paolo Papotti (EURECOM)
  • Lucian Popa (IBM Almaden Research Center)
  • Ismael Sanz (Universitat Jaume I)
  • Roee Shraga (Northeastern University)
  • Kostas Stefanidis (Tampere University)
  • Raphael Troncy (EURECOM)
  • You Wu (Google)