<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div>
<div>*Description*</div>
<div> </div>
<div>A vital part of proposing new machine learning and data mining approaches is evaluating them empirically to allow an assessment of their capabilities. Numerous choices go into setting up such evaluations: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, what other techniques to compare to (if any), what metrics to evaluate, etc., and, last but not least, how to present and interpret the results.<br/>
Typically, one learns how to make those choices on the job, often by copying the evaluation protocols used in the existing literature - a procedure that can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions [1-5] and have occasionally called into question published results or the usability of published methods.</div>
<div> </div>
<div>Those studies consider different evaluation aspects in isolation, and the issue becomes even more complex because setting up an experiment introduces additional dependencies and biases: having chosen an evaluation metric with little bias can easily be undermined by choosing data that cannot be appropriately treated by one of the comparison techniques, for instance, and having carefully addressed both aspects is of little worth if the statistical test chosen does not allow significance to be assessed.<br/>
At a time of intense discussions about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML/PKDD encouraging researchers to make their work as reproducible as possible, we therefore feel that it is important to bring researchers together and discuss those issues on a fundamental level. In non-computational sciences, experimental design has been studied in depth, which has given rise to such principles as randomization, blocking, or factorial experiments. While these principles are usually not applied in machine learning and data mining, one desirable goal that arose during workshop discussions is the formulation of a checklist that allows one to quickly evaluate the experiment one is about to perform, and to identify and correct weaknesses. An important starting point of any such list has to be: “What question do we want to answer?”</div>
<div> </div>
<div>An issue directly related to the dataset choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data, whether they are reliable and diverse, and whether they correspond to realistic and/or challenging problem settings. This is of particular importance because our field is at a disadvantage compared to other experimental sciences: whereas there, data are collected (e.g. in social sciences), or generated (e.g. in physics), we often “only” use existing data.<br/>
<br/>
Finally, we want to emphasize the responsibility of the researchers to communicate their research as objectively as possible. We also want to highlight the critical role of the reviewers: the typical expectation of many reviewers seems to be that an evaluation should demonstrate that a newly proposed method is better than existing work. This can be shown on a few example datasets at most and is still not necessarily true in general. Rather, papers should show (and reviewers should appreciate) on what kind of data a new method works well and where it does not, and thereby in which respect it differs from existing work and is therefore a useful complement. A related topic is therefore also how to characterize datasets, e.g., in terms of their learning complexity [6], and how to create benchmark datasets, an essential tool for method development and assessment that has been adopted by other domains such as computer vision, IR, etc.</div>
<div> </div>
<div>*Topics*</div>
<div> </div>
<div>In this workshop, we mainly solicit contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g., by questioning published results or shining the spotlight on the characteristics of existing benchmark data sets.</div>
<div> </div>
<div>As such, topics include, but are not limited to:<br/>
- Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?<br/>
- Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation<br/>
- Propagation/amplification of data quality issues on the data mining results (also interplay between data and algorithms)<br/>
- Evaluation of unsupervised data mining (dilemma between novelty and validity)<br/>
- Evaluation measures<br/>
- (Automatic) data quality evaluation tools: What are the aspects one should check before starting to apply algorithms to given data?<br/>
- Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)<br/>
- Design guidelines for crowd-sourced evaluations<br/>
- Principled experimental workflows</div>
<div> </div>
<div>The workshop will feature a mix of invited speakers, a number of accepted presentations, and a panel discussion. Presentations will leave ample time for questions, since those contributions will be less technical and more philosophical in nature. The panel will discuss the current state and the areas that most urgently need improvement, as well as recommendations for achieving those improvements. Workshop submissions will be published in the CEUR-WS workshop series. An important objective of this workshop is a document synthesizing these discussions that we intend to publish at a more prominent venue.</div>
<div> </div>
<div>*Submission*</div>
<div> </div>
<div>Papers should be submitted as PDF, using the Springer LNCS style, available at https://www.springer.com/gp/computer-science/lncs/conference-proceedings-guidelines. Submissions should be limited to ten pages and submitted via Easychair at https://easychair.org/conferences/?conf=edml20.</div>
<div>Papers will be reviewed by at least two members of the Program Committee on the basis of technical quality, relevance, significance, and clarity. Submitting a paper to the workshop means that, if the paper is accepted, at least one author should present the paper at the workshop. Accepted papers will be published after the workshop with CEUR-WS or Springer.</div>
<div> </div>
<div>*Important dates*</div>
<div> </div>
<div>Submission deadline: June 09, 2020<br/>
Notification deadline: July 07, 2020<br/>
SDM early bird registration deadline: July 20, 2020<br/>
Camera ready: July 21, 2020<br/>
Conference dates: September 14-18, 2020</div>
<div> </div>
<div>*Organizers*</div>
<div> </div>
<div>Eirini Ntoutsi, Leibniz University Hannover & L3S Research Center, Germany, ntoutsi@kbs.uni-hannover.de<br/>
Erich Schubert, Technical University Dortmund, Germany, erich.schubert@cs.tu-dortmund.de<br/>
Arthur Zimek, University of Southern Denmark, zimek@imada.sdu.dk<br/>
Albrecht Zimmermann, University Caen Normandy, France, albrecht.zimmermann@unicaen.fr</div>
<div> </div>
<div>The workshop's website can be found at https://imada.sdu.dk/Research/EDML/2020/</div>
</div></div></body></html>