Program


Schedule



Start time	Monday (Sep 30)	Tuesday (Oct 1)	Wednesday (Oct 2)
Place	Forum Digitale Technologien (currently still SDF)	IBI	IBI
Address	Salzufer 6, 10587 Berlin (entrance via Otto-Dibelius-Straße)	Dorotheenstraße 26, 10117 Berlin	Dorotheenstraße 26, 10117 Berlin
08:30		Registration	Registration
09:00		LWDA Keynote: Fairness in Machine learning: From definitions to mechanisms, Dr. Isabel Valera (MPI-IS)	Parallel Track Sessions 4 (DB, KDML, WM)
10:15		Coffee Break	
10:30			Coffee Break
10:45		Parallel Track Sessions 1 (DB, KDML, BI, WM, IR)	
11:00			LWDA Keynote: Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs, Prof. Dr. Stefan Dietze (GESIS & HHU)
11:45			LWDA Keynote: Extending the data warehouse beyond analytics, Martin Grund (Amazon)
12:15		Lunch Break	
12:30			LWDA Closing
12:45	Registration		
13:30		Parallel Track Sessions 2 (DB, KDML, BI, WM, IR)	
14:00	LWDA Opening (+SDF Info)		
14:15	LWDA Keynote: SystemDS: An ML System for the End-to-End Data Science Lifecycle, Prof. Dr. Matthias Böhm (Graz University of Technology)		
14:45		Coffee Break	
15:30	Coffee Break	Parallel Track Sessions 3 (DB, KDML, BI, WM, IR)	
16:00	Joint LWDA Research Session		
16:45		Community Meetings	
17:00	Reception and Poster Session (DB, KDML, BI, WM, IR)		
18:00		Social Event	

LWDA Keynote Speakers

Matthias Böhm

SystemDS: An ML System for the End-to-End Data Science Lifecycle

Abstract: Machine learning (ML) applications profoundly transform our private lives and many domains such as health care, finance, transportation, media, logistics, production, and information technology itself. As motivation and background, we will first share lessons learned from building Apache SystemML for declarative, large-scale ML. SystemML compiles R-like scripts into hybrid runtime plans of local, in-memory operations on CPUs and GPUs, as well as distributed operations on data-parallel frameworks like Spark. This high-level specification simplifies the development of ML algorithms, but lacks support for important tasks of the end-to-end data science lifecycle and for users with different expertise. Setting out to overcome these limitations, we introduce SystemDS, a new open-source ML system that aims to support the end-to-end data science lifecycle from data integration, cleaning, and feature engineering, over efficient local, distributed, and federated ML model training, to deployment and serving. In this talk, we will present the preliminary system architecture including the language abstractions and underlying data model, as well as selected features such as fine-grained lineage tracing and its exploitation for model versioning, reusing intermediates, and debugging model training runs.

Bio: Matthias Boehm is a BMVIT-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the colocated Know-Center GmbH, Austria. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing. Matthias is a recipient of the 2016 VLDB Best Paper Award, a 2016 SIGMOD Research Highlight Award, and a 2016 IBM Pat Goldberg Memorial Best Paper Award.

Isabel Valera (plenary and FG-KDML)

Fairness in Machine learning: From definitions to mechanisms

Abstract: Machine learning models are used to assist decision making in both online (e.g., spam filtering, product personalization) and offline (e.g., pretrial risk assessment, mortgage approvals) settings. However, as automated data analysis supplements and even replaces human supervision in decision making, there are growing concerns from civil organizations, governments, and researchers about potential unfairness of these algorithmic systems towards people from certain demographic groups (e.g., gender or ethnic groups). To address these concerns, the emerging field of ethical machine learning has proposed quantifiable notions of fairness as well as mechanisms for ensuring fair and unbiased algorithmic decision making. This talk summarizes the recent advances on how to ensure that the outcomes of such algorithmic decision making systems are fair, as well as the open challenges still to be addressed in this context.

Bio: Isabel Valera is a research group leader at the Max Planck Institute for Intelligent Systems (MPI-IS). Isabel obtained her PhD in 2014 and her MSc degree in 2012, both in Multimedia and Communications from the University Carlos III in Madrid, Spain. After her PhD, she worked at the MPI for Software Systems as a postdoctoral fellow under the supervision of Dr. Manuel Gomez Rodriguez, and at the University of Cambridge as an associated researcher under the supervision of Prof. Ghahramani. Her research focuses on the development of machine learning methods that are expressive enough to capture the complex statistical properties of real-world data, robust enough to provide accurate uncertainty estimates on these properties, and accountable, ensuring fairness and interpretability.

Stefan Dietze

Beyond research data infrastructures: exploiting artificial & crowd intelligence towards building research knowledge graphs

Abstract: The need for sharing and reuse of research data has been widely acknowledged and has led to a multitude of data search platforms, registries, as well as national and international initiatives. Whereas efforts often focus on improving discovery, retrieval and recommendation of popular research datasets, there is vast untapped potential in the form of dataset (references) hidden in unstructured Web sites or resources, as well as research data which can be mined from the (social) Web. In this context, crowd and artificial intelligence converge on the task of extracting machine-readable structured knowledge graphs about research data and resources, facilitated by techniques from information retrieval, NLP, (distributional) semantics, and machine learning. This talk will provide an overview of recent works in the aforementioned areas with applications in (but not limited to) the social sciences.

Bio: Stefan Dietze is a full professor for Data & Knowledge Engineering at the Institute for Computer Science at Heinrich-Heine-University Düsseldorf (UDUS), Scientific Director of the department Knowledge Technologies for the Social Sciences at GESIS – Leibniz Institute for the Social Sciences, and an affiliated member at the L3S Research Center of the Leibniz University Hanover, Germany. His research interests are at the intersection of information retrieval, semantic technologies and artificial intelligence, and in particular, the extraction, fusion and search of knowledge and data on the Web in various application domains. Stefan's work has been published at major scientific venues, such as WWW/The Web Conf, SIGIR, or ISWC, where he also frequently serves as programme and organisation committee member.

Martin Grund

Extending the data warehouse beyond analytics

Abstract: Traditional data warehousing systems are seen as silos at the end of the data processing pipeline. Over the past few years, Amazon Redshift has become the largest data warehousing system in the cloud, with our customers storing tremendous amounts of data in it. Recently, data processing for analytics has started to evolve again, bringing new software engineering challenges. In this talk, we will share our perspective on the evolution of the cloud data warehousing system from tightly integrated storage and compute to being able to process exabytes of data every month. In addition, we will highlight certain new developments in the area of data management in the cloud.

Bio: Martin Grund is a principal engineer at Amazon Web Services. For almost four years, he has been working on Amazon Redshift, primarily on Amazon Redshift Spectrum, building systems to process exabytes of data from the data lake. Having just returned from California, he now leads the Redshift development team in Berlin. Before joining Amazon in Palo Alto, he worked for Cloudera on Apache Impala in San Francisco. Martin spent two years at the University of Fribourg in Switzerland, funded by a research grant from SAP, and finished his PhD thesis on In-Memory Data Management in 2012 at the Hasso Plattner Institute in Potsdam.

Invited Speakers

Ziawasch Abedjan

A Holistic Approach for Effective Error Detection

Abstract: Data cleaning is one of the most time-consuming and tedious steps in data-driven tasks. Typically, it entails the identification of erroneous values and their correction. Effective error detection can significantly improve the subsequent correction step. Research in error detection has provided a variety of approaches, most of which require some prior knowledge about the dataset in order to set up and configure the approach with rules, sensitivity thresholds, or other parameters. Often these approaches only cover a certain type of error. Recently, novel machine learning techniques have been proposed to treat error detection as a classification task. These approaches still require large amounts of training data, scaling with the size of the dataset, to cover the variety of error types residing inside a dataset. In this talk, I will present our work in progress towards a holistic error detection system that significantly reduces the amount of required labels by leveraging label propagation techniques and meta-learning. In a nutshell, we leverage existing error detection techniques as feature generators. First, I discuss how manually configured off-the-shelf error detection techniques can be aggregated and automatically selected. Then I show how both approaches can be combined and refined for a configuration-free error detection system that only requires about 20 labeled tuples to outperform state-of-the-art techniques.

Bio: Ziawasch Abedjan is Juniorprofessor and head of the “Big Data Management” (BigDaMa) Group at TU Berlin. Prior to that, Ziawasch was a postdoc at the “Computer Science and Artificial Intelligence Laboratory” at MIT, working on various data integration problems. Ziawasch received his PhD from the Hasso Plattner Institute in Potsdam, Germany. He is a recipient of the Best Dissertation Award of the University of Potsdam, the 2014 CIKM Best Student Paper Award, and the 2015 SIGMOD Best Demo Award. His research is funded by the DFG, the Federal Ministry for Research and Education, and the Federal Ministry of Transport, Building and Urban Development.

Felix Biessmann (Track: FG-DB)

Data Quality in Machine Learning Production Systems

Abstract: Machine learning (ML) algorithms have become a standard technology in production software systems. This creates new challenges for the maintainers of software systems featuring ML components. While classical software systems can be tested before being put into production, such testing is difficult for machine learning systems: depending on the data ingested during the training or prediction phase, the behaviour of an ML system can be different. Thus, ensuring robust and reliable functioning of ML systems requires careful monitoring and improvement of various data quality aspects, which can be difficult to automate. This talk summarizes some recent work on leveraging ML technology for automating the measurement and improvement of data quality in the context of ML production systems and beyond.

Bio: Felix Biessmann obtained a BSc in Cognitive Science from the University of Osnabrück, an MSc in Neuroscience from the International Max Planck Research School, Tübingen, and a PhD in Machine Learning from TU Berlin. In 2013 he was appointed to an assistant professorship for Machine Learning at Korea University, Seoul, before joining Amazon Research, Berlin in 2014. Since October 2018, he has been Professor for Machine Learning at Beuth University and the Einstein Center for Digital Future, Berlin. His research focusses on machine learning for biomedical applications, multimodal data integration and data quality improvement.

Anika Groß (Track: FG-DB)

Yet Another Matching Task: Link Reuse and Evolution in Data Integration Workflows

Abstract: Today, many new findings and decisions are based on the analysis and interpretation of very large data sets, and this holds for a multitude of scientific and industrial domains. To increase the benefit of data analytics, it is useful to combine data from a variety of different sources. This requires high-quality data integration, including the expensive and tedious matching of data and metadata objects from two or more sources. The generated links may further be used to integrate and merge knowledge from many different sources in complex knowledge graphs. Due to a rapid development in most domains, many (already linked) data sources are continuously updated, i.e., they undergo a steady evolution. To improve data quality and avoid redundant effort, data integration workflows should make use of already existing and especially validated links between two or more sources, and prefer link reuse over redetermination. In this talk, I will present different approaches based on link reuse for data integration. I will further discuss the impact of evolution on existing link collections and integrated data sources, and discuss open challenges for future investigation.

Bio: Since 2019, Anika Groß has been a Professor for Database Systems at Anhalt University of Applied Sciences in Saxony-Anhalt. She was a Postdoc in the Database Group at Leipzig University until 2017, and joined a data science team for strategic research and development of electric vehicles at Daimler AG in 2018. Anika studied bioinformatics at Martin-Luther-Universität Halle-Wittenberg and received her PhD in Computer Science at Leipzig University in 2014. Her research focusses on data integration for data science and analytics in various application domains such as medicine, environmental research and social sciences.

Joint LWDA Research Session

Monday, 16:00 - 17:00

Aleksandar Bojchevski and Stephan Günnemann: Adversarial Attacks on Node Embeddings via Graph Poisoning (long paper, 30 minutes)
Christian Zeyen, Lukas Malburg and Ralph Bergmann: Adaptation of Scientific Workflows by Means of Process-Oriented Case-Based Reasoning (30 minutes)

Accepted Papers

DB Papers (full and short)

Mohammad Mahdavi, Felix Neutatz, Larysa Visengeriyeva and Ziawasch Abedjan: Towards Automated Data Cleaning Workflows
Mark Lukas Möller, Meike Klettke and Uta Störl: Keeping NoSQL Databases up to date – Semantics of Evolution Operations and their Impact on Data Quality
Lan Jiang, Gerardo Vitagliano and Felix Naumann: A Scoring-based Approach for Data Preparator Suggestion
Steffi Scherzinger: Have your Students Build their own mini Hive in just eight Weeks
Peter K. Schwab, Maximilian Langohr, Jonas Röckl, Demian Vöhringer, Andreas M. Wahl and Klaus Meyer-Wegener: Query-Driven Enforcement of Rule-Based Policies for Data-Privacy Compliance
Triet Doan, Lena Wiese, Sven Bingert and Ramin Yahyapour: A Graph Database for Persistent Identifiers
Dennis Marten, Holger Meyer and Andreas Heuer: Database support for automotive analysis

IR Papers

Max Luebbering, Julian Kunkel and Patricio Farrell: What Company Does My News Article Refer to? Tackling Multi Class Problems With Topic Modeling
Hendrik Adam and Philipp Schaer: Information Extraction for Semi-structured Email Corpora
Ayan Bandyopadhyay, Linda Achilles, Thomas Mandl, Mandar Mitra and Sanjoy Kumar Saha: Identification of Depression Severity for Users of Online Platforms
Narges Tavakolpoursaleh, Johann Schaible and Stefan Dietze: Using Word Embeddings for Recommending Datasets based on Scientific Publications
Andreas Lommatzsch and Jonas Katins: An Information Retrieval-based Approach for Building Intuitive Chatbots for Large Knowledge Bases

IR Talks (w/o papers)

Satya Almaisan, Andreas Spitz and Michael Gertz: Word Embeddings for Entity-annotated Texts (Extended Abstract)
Dhruv Gupta and Klaus Berberich: GYANI: Structured Search and Analytics in Annotated Document Collections (Extended Abstract)
Mandy Neumann, Christopher Michels, Philipp Schaer and Ralf Schenkel: Prioritizing and Scheduling Conferences for Metadata Harvesting in dblp (Extended Abstract)

KDML Papers (full and short)

Daniyal Kazempour, Anna Beer, Oliver Schrüfer and Thomas Seidl: Clustering Trend Data Time-Series through Segmentation of FFT-decomposed Signal Constituents
Anna Beer, Nadine Sarah Schüler and Thomas Seidl: A Generator for Subspace Clusters
Daniyal Kazempour, Long Mathias Yan and Thomas Seidl: From Covariance to Comode in context of Principal Component Analysis
Maximilian Archimedes Xaver Hünemörder, Anna Beer, Daniyal Kazempour and Thomas Seidl: CODEC - Detecting Linear Correlations in Dense Clusters with Comedian-based PCA
Raphael Fischer, Nico Piatkowski and Katharina Morik: Parameter Sharing for Spatio-Temporal Process Models
Eduardo Brito, Bogdan Georgiev, Daniel Domingo-Fernández, Charles Hoyt and Christian Bauckhage: RatVec: A General Approach for Low-dimensional Distributed Vector Representations via Domain-specific Rational Kernels
Thomas Goerttler and Marius Kloft: Learning a Multimodal Prior Distribution for Generative Adversarial Nets
Florian Richter, Florian Wahl, Alona Sydorova and Thomas Seidl: k-process: Model-Conformance-based Clustering of Process Instances
Annika Pick, Tamas Horvath and Stefan Wrobel: Support Estimation in Frequent Itemset Mining by Locality Sensitive Hashing
Christian Bauckhage, Nico Piatkowski, Rafet Sifa, Dirk Hecker and Stefan Wrobel: A QUBO Formulation of the k-Medoids Problem
Noor Jamaludeen, Vishnu Unnikrishnan, Maya Sekeran, Majed Ali, Le Anh Trang and Myra Spiliopoulou: Assessing the reliability of crowdsourced labels via Twitter
Mirko Bunse and Katharina Morik: What Can We Expect from Active Class Selection?
Aissatou Diallo, Markus Zopf and Johannes Fürnkranz: Learning Analogy-Preserving Sentence Embeddings for Answer Selection
Karsten Tymann, Matthias Lutz, Patrick Palsbröker and Carsten Gips: GerVADER - A German adaptation of the VADER sentiment analysis tool for social media texts
Sebastian Wankerl, Gerhard Götz and Andreas Hotho: Solving Mathematical Exercises: Prediction of Student’s Success
Janina Sontheim, Florian Richter and Thomas Seidl: Temporal Deviations on Event Sequences
Christian Bauckhage, Rafet Sifa, Dirk Hecker and Stefan Wrobel: Max-Sum Dispersion via Quantum Annealing
Felix Gonsior, Nico Piatkowski and Katharina Morik: Another view on optimization as probabilistic inference
Sascha Mücke, Nico Piatkowski and Katharina Morik: Learning Bit by Bit: Extracting the Essence of Machine Learning

KDML Talks (w/o papers)

Christian Beyer, Vishnu Unnikrishnan, Eirini Ntoutsi and Myra Spiliopoulou: Extended Summary: Entity-Centric Stream Mining
Tobias Koopmann, Alexander Dallmann, Lena Hettinger, Thomas Niebler and Andreas Hotho: Extended Summary: On the right track! Analysing and predicting navigation success in Wikipedia
Aleksandar Bojchevski and Stephan Günnemann: Adversarial Attacks on Node Embeddings via Graph Poisoning
Nico Piatkowski: Hyper-Parameter-Free Generative Modelling with Deep Boltzmann Trees
Sibylle Hess, Wouter Duivesteijn, Philipp Honysz and Katharina Morik: The SpectACl of Nonconvex Clustering: A Spectral Approach to Density-Based Clustering
Pascal Welke, Tamas Horvath and Stefan Wrobel: Probabilistic and Exact Frequent Subtree Mining in Graphs Beyond Forests
Amal Saadallah, Florian Priebe and Katharina Morik: Drift-based Dynamic Ensemble Members Selection using Clustering for time series forecasting
Markus Ring, Daniel Schlör, Dieter Landes and Andreas Hotho: Extended Summary: Flow-based network traffic generation using Generative Adversarial Networks
Florian Seiffarth, Tamás Horváth and Stefan Wrobel: Maximal Closed Set and Half-Space Separations in Finite Closure Systems
Allan Sales, Leandro Balby Marinho and Adriano Veloso: Extended Summary: Media Bias Characterization in Brazilian Presidential Elections
Stefan Bloemheuvel, Benjamin Kloepper and Martin Atzmueller: Graph Summarization for Computational Sensemaking on Complex Industrial Event Logs
Parisa Shayan, Roberto Rondinelli, Menno van Zaanen and Martin Atzmueller: Descriptive Network Modeling and Analysis for Investigating User Acceptance in a Learning Management System Context

WM Papers

Lisa Grumbach and Ralph Bergmann: Towards Case-Based Deviation Management for Flexible Workflows
Patrick Klein, Lukas Malburg and Ralph Bergmann: FTOnto: A Domain Ontology for a Fischertechnik Simulation Production Factory by Reusing Existing Ontologies
Joachim Baumeister, Veronika Sehne and Carolin Wienrich: A Systematic View on Speech Assistants for Service Technicians
Hannes Reil and Michael Leyer: Auswirkung des Internet der Dinge auf das Wissen über Arbeitsprozesse von Mitarbeitern in KMUs
Andreas Korger and Joachim Baumeister: Case-Based Retrieval and Adaptation of Regulatory Documents and their Context
Katja Berčič: Towards a Census of Relational Data in Mathematics
Eric Rietzke, Ralph Bergmann and Norbert Kuhn: ODD-BP - an Ontology- and Data-Driven Business Process Model
Marcel Kolbe, Pascal Reuss, Jakob Michael Schoenborn and Klaus-Dieter Althoff: Conceptualization and Implementation of a Reinforcement Learning Approach Using a Case-Based Reasoning Agent in a FPS Scenario
Saliha Irem Besik and Johann-Christoph Freytag: Ontology-Based Privacy Compliance Checking for Clinical Workflows
Michael Kohlhase and Max Rapp: Context Graphs for Argumentation Logics
Viktor Eisenstadt and Klaus-Dieter Althoff: Overview of 4R CBR Cycle Modifications (Extended Version)

WM Talks

Patrick Klein, Lukas Malburg and Ralph Bergmann: Learning Workflow Embeddings to Improve the Performance of Similarity-Based Retrieval for Process-Oriented Case-Based Reasoning
Christian Zeyen, Lukas Malburg and Ralph Bergmann: Adaptation of Scientific Workflows by Means of Process-Oriented Case-Based Reasoning
Mirjam Minor, Alexander Herborn and Dierk Jordan: Case-based Data Masking for Software Test Management
Carsten Maletzki, Eric Rietzke, Lisa Grumbach, Ralph Bergmann and Norbert Kuhn: Utilizing Ontology-Based Reasoning to Support the Execution of Knowledge-Intensive Processes

BI Papers

Stefan Bregenzer: Is the V-Model XT ready for IT-projects applying CRISP-DM?
Jens Albrecht, Andreas Belger, Ralph Blum and Roland Zimmermann: Business Analytics on Knowledge Graphs for Market Trend Analysis
Kathrin Pfähler: IT-basierte Entscheidungsunterstützung für die Ersatzteilversorgung mit Additive Manufacturing
Sebastian Trinks: Edge Computing im Spannungsfeld der Smart Factory – Ein Status Quo

BI Talks (w/o papers)

Christoph Kollwitz: Entwicklung einer Typologisierung der Rollen von Big Data Analytics in Innovationsprozessen