

Time | Monday (Sep 30) | Tuesday (Oct 1) | Wednesday (Oct 2)
Place | Forum Digitale Technologien (still SDF) | IBI | IBI
Address | Salzufer 6, 10587 Berlin (entrance via Otto-Dibelius-Straße) | Dorotheenstraße 26, 10117 Berlin | Dorotheenstraße 26, 10117 Berlin
Registration Registration
09:00 LWDA Keynote
Fairness in Machine learning: From definitions to mechanisms
Dr. Isabel Valera (MPI-IS)
Parallel Track Sessions 4
Coffee Break
Coffee Break
Parallel Track Sessions 1
11:00 LWDA Keynote
Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs
Prof. Dr. Stefan Dietze (GESIS & HHU)
LWDA Keynote
Extending the data warehouse beyond analytics
Martin Grund (Amazon)
Lunch Break
LWDA Closing
Parallel Track Sessions 2
14:00 LWDA Opening (+SDF Info)
LWDA Keynote
SystemDS: An ML System for the End-to-End Data Science Lifecycle
Prof. Dr. Matthias Böhm (Graz University of Technology)
Coffee Break
Coffee Break Parallel Track Sessions 3
16:00 Joint LWDA Research Session
Community Meetings
17:00 Reception and Poster Session
18:00 Social Event

Workshop Schedules

LWDA Keynote Speakers

Matthias Böhm

SystemDS: An ML System for the End-to-End Data Science Lifecycle


Abstract: Machine learning (ML) applications profoundly transform our private lives and many domains such as health care, finance, transportation, media, logistics, production, and information technology itself. As motivation and background, we will first share lessons learned from building Apache SystemML for declarative, large-scale ML. SystemML compiles R-like scripts into hybrid runtime plans of local, in-memory operations on CPUs and GPUs, as well as distributed operations on data-parallel frameworks like Spark. This high-level specification simplifies the development of ML algorithms, but lacks support for important tasks of the end-to-end data science lifecycle and for users with different expertise. Setting out to overcome these limitations, we introduce SystemDS, a new open-source ML system that aims to support the end-to-end data science lifecycle from data integration, cleaning, and feature engineering, over efficient local, distributed, and federated ML model training, to deployment and serving. In this talk, we will present the preliminary system architecture including the language abstractions and underlying data model, as well as selected features such as fine-grained lineage tracing and its exploitation for model versioning, reusing intermediates, and debugging model training runs.

Bio: Matthias Boehm is a BMVIT-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the colocated Know-Center GmbH, Austria. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing. Matthias is a recipient of the 2016 VLDB Best Paper Award, a 2016 SIGMOD Research Highlight Award, and a 2016 IBM Pat Goldberg Memorial Best Paper Award.

Isabel Valera (plenary and FG-KDML)

Fairness in Machine learning: From definitions to mechanisms (slides)

Abstract: Machine learning models are now used to assist decision making in both online (e.g., spam filtering, product personalization) and offline (e.g., pretrial risk assessment, mortgage approvals) settings. However, as automated data analysis supplements and even replaces human supervision in decision making, there are growing concerns from civil organizations, governments, and researchers about potential unfairness of these algorithmic systems towards people from certain demographic groups (e.g., gender or ethnic groups). To address these concerns, the emerging field of ethical machine learning has proposed quantifiable notions of fairness as well as mechanisms for ensuring fair and unbiased algorithmic decision making. This talk summarizes the recent advances on how to ensure that the outcomes of such algorithmic decision making systems are fair, as well as the open challenges still to be addressed in this context.

Bio: Isabel Valera is a research group leader at the Max Planck Institute for Intelligent Systems (MPI-IS). Isabel obtained her PhD in 2014 and her MSc degree in 2012, both in Multimedia and Communications from the University Carlos III in Madrid, Spain. After her PhD, she worked at the MPI for Software Systems as a postdoctoral fellow, under the supervision of Dr. Manuel Gomez Rodriguez, and at the University of Cambridge as an associated researcher, under the supervision of Prof. Ghahramani. Her research focuses on the development of machine learning methods that are expressive enough to capture the complex statistical properties of real-world data; robust enough to provide accurate uncertainty estimates on these properties; and accountable, to ensure fairness and interpretability.

Stefan Dietze

Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs


Abstract: The need for sharing and reuse of research data has been widely acknowledged and has led to a multitude of data search platforms, registries as well as national and international initiatives. Whereas efforts often focus on improving discovery, retrieval and recommendation of popular research datasets, there is vast untapped potential in the form of dataset (references) hidden in unstructured Web sites or resources, as well as research data which can be mined from the (social) Web. In this context, crowd and artificial intelligence converge on the task of extracting machine-readable structured knowledge graphs about research data and resources, facilitated by techniques from information retrieval, NLP, (distributional) semantics, and machine learning. This talk will provide an overview of recent works in the aforementioned areas with applications in (but not limited to) the social sciences.

Bio: Stefan Dietze is a full professor for Data & Knowledge Engineering at the Institute for Computer Science at Heinrich-Heine-University Düsseldorf (UDUS), Scientific Director of the department Knowledge Technologies for the Social Sciences at GESIS – Leibniz Institute for the Social Sciences and an affiliated member at the L3S Research Center of the Leibniz University Hanover, Germany. His research interests are at the intersection of information retrieval, semantic technologies and artificial intelligence, and in particular, the extraction, fusion and search of knowledge and data on the Web in various application domains. Stefan's work has been published at major scientific venues, such as WWW/The Web Conf, SIGIR, or ISWC, where he also frequently serves as programme and organisation committee member.

Martin Grund

Extending the data warehouse beyond analytics

Abstract: Traditional data warehousing systems are seen as silos at the end of the data processing pipeline. Over the last few years, Amazon Redshift has become the largest data warehousing system in the cloud, with our customers storing tremendous amounts of data in it. Recently, data processing for analytics has started to evolve again, revealing new software engineering challenges. In this talk, we will share our perspective on the evolution of the cloud data warehousing system from tightly integrated storage and compute to being able to process exabytes of data every month. In addition, we will highlight certain new developments in the area of data management in the cloud.

Bio: Martin Grund is a principal engineer working for Amazon Web Services. For almost four years he has been working on Amazon Redshift, primarily on Amazon Redshift Spectrum, building systems to process exabytes of data from the data lake. Having just returned from California, he's now leading the Redshift development team in Berlin. Before joining Amazon in Palo Alto, he worked for Cloudera on Apache Impala in San Francisco. Martin spent two years at the University of Fribourg in Switzerland, funded by a research grant from SAP, and finished his PhD thesis on In-Memory Data Management in 2012 at the Hasso Plattner Institute in Potsdam.

Invited Speakers

Ziawasch Abedjan

A Holistic Approach for Effective Error Detection


Abstract: Data cleaning is one of the most time-consuming and tedious steps in data-driven tasks. Typically, it entails the identification of erroneous values and their correction. Effective error detection can significantly improve the subsequent correction step. Research in error detection has provided a variety of approaches, most of which require some prior knowledge about the dataset in order to set up and configure the approach with rules, sensitivity thresholds, or other parameters. Often these approaches only cover a certain type of error. Recently, novel machine learning techniques have been proposed to treat error detection as a classification task. These approaches still require large amounts of training data, scaling with the size of the dataset, to cover the variety of error types residing inside a dataset. In this talk, I will present our work in progress towards a holistic error detection system that significantly reduces the amount of required labels by leveraging label propagation techniques and meta-learning. In a nutshell, we leverage existing error detection techniques as feature generators. First, I discuss how manually configured off-the-shelf error detection techniques can be aggregated and automatically selected. Then I show how both approaches can be combined and refined for a configuration-free error detection system that only requires about 20 labeled tuples to outperform state-of-the-art techniques.

Bio: Ziawasch Abedjan is a junior professor and head of the “Big Data Management” (BigDaMa) Group at TU Berlin. Prior to that, Ziawasch was a postdoc at the “Computer Science and Artificial Intelligence Laboratory” at MIT working on various data integration problems. Ziawasch received his PhD from the Hasso Plattner Institute in Potsdam, Germany. He is a recipient of the Best Dissertation Award of the University of Potsdam, the 2014 CIKM Best Student Paper Award, and the 2015 SIGMOD Best Demo Award. His research is funded by the DFG, the Federal Ministry for Research and Education, and the Federal Ministry of Transport, Building and Urban Development.

Felix Biessmann (Track: FG-DB)

Data Quality in Machine Learning Production Systems


Abstract: Machine learning (ML) algorithms have become a standard technology in production software systems. This creates new challenges for the maintainers of software systems featuring ML components. While classical software systems can be tested before being put into production, such testing is difficult for machine learning systems: depending on the data ingested during the training or prediction phase, the behaviour of an ML system can differ. Thus, ensuring robust and reliable functioning of ML systems requires careful monitoring and improvement of various data quality aspects, which can be difficult to automate. This talk summarizes some recent work on leveraging ML technology for automating the measurement and improvement of data quality in the context of ML production systems and beyond.

Bio: Felix Biessmann obtained a BSc in Cognitive Science from the University of Osnabrück, an MSc in Neuroscience from the International Max Planck Research School, Tübingen, and a PhD in Machine Learning from TU Berlin. In 2013 he was appointed assistant professor for Machine Learning at Korea University, Seoul, before joining Amazon Research, Berlin in 2014. Since October 2018 he has been Professor for Machine Learning at Beuth University and the Einstein Center for Digital Future, Berlin. His research focusses on machine learning for biomedical applications, multimodal data integration and data quality improvement.

Anika Groß (Track: FG-DB)

Yet Another Matching Task: Link Reuse and Evolution in Data Integration Workflows

Abstract: Today, many new findings and decisions are based on the analysis and interpretation of very large data sets, and this holds for a multitude of scientific and industrial domains. To increase the benefit of data analytics, it is useful to combine data from a variety of different sources. This requires high-quality data integration, including the expensive and tedious matching of data and metadata objects from two or more sources. The generated links may further be used to integrate and merge knowledge from many different sources in complex knowledge graphs. Due to rapid development in most domains, many (already linked) data sources are continuously updated, i.e., they undergo a steady evolution. To improve data quality and avoid redundant effort, data integration workflows should make use of already existing and especially validated links between two or more sources, and prefer link reuse over redetermination. In this talk, I will present different approaches based on link reuse for data integration. I will further discuss the impact of evolution on existing link collections and integrated data sources, and outline open challenges for future investigation.


Bio: Since 2019, Anika Groß has been a Professor for Database Systems at Anhalt University of Applied Sciences in Saxony-Anhalt. She was a Postdoc in the Database Group at Leipzig University until 2017, and joined a data science team for strategic research and development of electric vehicles at Daimler AG in 2018. Anika studied bioinformatics at Martin-Luther-Universität Halle-Wittenberg and received her PhD in Computer Science at Leipzig University in 2014. Her research focusses on data integration for data science and analytics in various application domains such as medicine, environmental research and social sciences.

Joint LWDA Research Session

Monday, 16:00 - 17:00

Accepted Papers

DB Papers (full and short)

IR Papers

IR Talks (w/o papers)

KDML Papers (full and short)

KDML Talks (w/o papers)

WM Papers

WM Talks

BI Papers

BI Talks (w/o papers)