
Invited talks

Serge Abiteboul



Bio

Serge Abiteboul obtained his Ph.D. from the University of Southern California and a State Doctoral Thesis from the University of Paris-Sud. He has been a researcher at the Institut National de Recherche en Informatique et en Automatique (Inria) since 1982, a Directeur de Recherche CE since 2004, and a member of a research team located at ENS Paris since 2016. He is now Distinguished Affiliated Professor at the École Normale Supérieure de Cachan. He was a Lecturer at the École Polytechnique and a Visiting Professor at Stanford and Oxford. He was Chair Professor at the Collège de France in 2011-2012 and Francqui Chair Professor at Namur University in 2012-2013. He co-founded the company Xyleme in 2000. Serge Abiteboul received the ACM SIGMOD Innovation Award in 1998, the EADS Award from the French Academy of Sciences in 2007, the Milner Award from the Royal Society in 2013, and a European Research Council Fellowship (2008-2013). He became a member of the French Academy of Sciences in 2008 and a member of the Academy of Europe in 2011. He was a member of the Conseil national du numérique (2013-2016) and Chairman of the Scientific Board of the Société informatique de France (2013-2015). His research focuses mainly on data, information, and knowledge management, particularly on the Web. He founded and is an editor of the blog binaire.blogs.lemonde.fr.

The new frontier in data management: Ethics

Data management and analysis technology has made tremendous progress in the last fifty years, driven mostly by issues such as designing better data models and improving performance. Such classical issues can still be improved upon, but the current technology already fits the needs of most applications. We will argue that the next frontier is ethics. This technology, notably big data, holds incredible promise for improving people's lives, accelerating scientific discovery and innovation, and bringing about positive societal change. Yet, if not used responsibly, it can propel economic inequality, destabilize global markets, and affirm systemic bias. We will mention some of the issues that arise and some of the directions of research that open up.

Ronald Fagin



Bio

Ronald Fagin is an IBM Fellow at IBM Research – Almaden. IBM Fellow is IBM's highest technical honor. There are currently around 90 active IBM Fellows (out of around 400,000 IBM employees worldwide), and there have been only around 250 IBM Fellows in the over 50-year history of the program. Fagin received his B.A. in mathematics from Dartmouth College and his Ph.D. in mathematics from the University of California at Berkeley. He is a Fellow of IEEE, ACM, and AAAS (American Association for the Advancement of Science). He has co-authored four papers that won Best Paper Awards and three papers that won Test-of-Time Awards, all at major conferences. He was named Docteur Honoris Causa by the University of Paris. He won the IEEE Technical Achievement Award, the IEEE W. Wallace McDowell Award (the highest award of the IEEE Computer Society), and the ACM SIGMOD Edgar F. Codd Innovations Award (a lifetime achievement award in databases). He is a member of the US National Academy of Engineering and the American Academy of Arts and Sciences.

Applying database theory to practice

The speaker will talk about applying database theory to practice, with a focus on two IBM case studies. In the first case study, the practitioner initiated the interaction. This interaction led to the following problem. Assume that there is a set of “voters” and a set of “candidates”, where each voter assigns a numerical score to each candidate. There is a scoring function (such as the mean or the median), and a consensus ranking is obtained by applying the scoring function to each candidate’s scores. The problem is to find the top k candidates, while minimizing the number of database accesses. The speaker will present an algorithm that is optimal in an extremely strong sense: not just in the worst case or the average case, but in every case! Even though the algorithm is only 10 lines long (!), the paper containing the algorithm won the 2014 Gödel Prize, the top prize for a paper in theoretical computer science.
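For readers who want to see the idea concretely, here is a minimal Python sketch of the Threshold Algorithm from that Gödel Prize-winning paper by Fagin, Lotem, and Naor. The list-of-pairs representation and the dict-based random access are illustrative choices, not the paper's middleware interface; the essential point is the stopping rule: once k seen candidates score at least the aggregate of the scores last seen under sorted access, no unseen candidate can do better.

from statistics import mean

def threshold_algorithm(sorted_lists, k, agg=mean):
    # sorted_lists: one list per voter of (score, candidate) pairs, sorted by
    # score descending; every candidate is assumed to appear in every list.
    # agg must be monotone (e.g., mean, min, max) for the stopping rule to hold.
    index = [{c: s for s, c in lst} for lst in sorted_lists]  # random access
    seen = {}                                   # candidate -> aggregate score
    for depth in range(len(sorted_lists[0])):
        for lst in sorted_lists:                # one round of sorted access
            s, c = lst[depth]
            if c not in seen:                   # random access to other lists
                seen[c] = agg([idx[c] for idx in index])
        # No unseen candidate can beat the aggregate of the last seen scores.
        threshold = agg([lst[depth][0] for lst in sorted_lists])
        top = sorted(seen.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return sorted(seen.items(), key=lambda kv: -kv[1])[:k]

voters = [[(0.9, "a"), (0.8, "b"), (0.1, "c")],
          [(1.0, "b"), (0.7, "a"), (0.2, "c")]]
print(threshold_algorithm(voters, k=1))         # [('b', 0.9)]

The same sketch works for any monotone aggregation function, which is exactly the condition under which the algorithm's strong ("instance") optimality holds.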

The interaction in the second case study was initiated by theoreticians, who wanted to lay the foundations for “data exchange”, in which data is converted from one format to another. Although this problem may sound mundane, the issues that arise are fascinating, and this work made data exchange a new subfield, with special sessions in every major database conference.
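A tiny example may make the setting concrete. The sketch below applies a single, hypothetical source-to-target dependency, Emp(name, dept) -> there exists O such that Office(dept, O) and Works(name, O), using a minimal version of the chase procedure that data-exchange systems use to populate the target; the relation names and the dependency are invented for illustration. Each existentially quantified value becomes a "labeled null" in the target instance.

import itertools

def chase(emp_rows):
    # Oblivious chase over the single hypothetical dependency above.
    fresh = (f"N{i}" for i in itertools.count(1))  # labeled-null generator
    office, works = set(), set()
    for name, dept in emp_rows:
        null = next(fresh)        # the witness O invented for this Emp tuple
        office.add((dept, null))
        works.add((name, null))
    return office, works

office, works = chase([("Alice", "R&D"), ("Bob", "Sales")])
print(office)   # e.g. {('R&D', 'N1'), ('Sales', 'N2')} (set order may vary)
print(works)    # e.g. {('Alice', 'N1'), ('Bob', 'N2')}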

This talk will be completely self-contained, and the speaker will derive morals from the case studies.

Floris Geerts



Bio

Floris Geerts holds a research professor position at the University of Antwerp, Belgium. Before that, he held a senior research fellow position in the database group at the University of Edinburgh and a postdoc position in the data mining group at the University of Helsinki. He received his PhD in 2001 from the University of Hasselt, Belgium. His research interests include the theory and practice of databases, with a particular focus on query processing on big data and on data quality. He has received several best paper awards and was a recipient of the 2015 Alberto O. Mendelzon Test-of-Time Award (PODS 2015). He is an associate editor of ACM TODS, was general chair of EDBT/ICDT 2015, and is the PC chair of PODS 2017.

Bounded evaluation of database queries

Large datasets challenge the scalability of query evaluation: it may take hours, days, or even longer to get query answers on huge datasets. Similar issues arise when asking queries on small databases with only limited computational resources. One way out is to settle for approximate answers, hoping that these can be fetched more quickly while consuming fewer resources. However, when querying one's bank balance, one typically prefers an exact answer. In many cases, exact query answers are thus still needed.

Observe, however, that not all the data in a big database may be needed to answer a given query. Furthermore, if the amount of data actually needed is small and can be efficiently identified using indexes, query evaluation may become feasible. Queries for which this is possible are called scale independent. In this talk, I will survey various formalisms of scale independence and show how it can be made applicable to a large class of queries. Intuitively, scale independence can be guaranteed by imposing the "right" indexes on the underlying database: given indexes, statistical information, and a query, an efficient query plan can be obtained automatically. Furthermore, this query plan guarantees that only a bounded amount of data is accessed, yet exact answers are returned, which is precisely what we need. I will illustrate that scale-independent queries are quite common in practice and that they can indeed be answered efficiently on big datasets.
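As a toy illustration (not the planning machinery from the talk), consider asking for the names of the people a given user follows. If an index bounds the fan-out of the follows relation per user by some cap K, a guarantee the application must supply, the plan below touches at most 2K tuples no matter how large the database grows; the index names and the cap are assumptions made for this sketch.

def followees(x, follows_index, name_index, K=5000):
    # Index-only plan: one probe returning at most K tuples, then at most
    # K point lookups. Total access is bounded by 2K, independent of scale.
    targets = follows_index.get(x, [])        # <= K entries by assumption
    assert len(targets) <= K
    return [name_index[t] for t in targets]   # one point lookup per target

follows_index = {"alice": ["bob", "carol"]}   # src -> list of dst
name_index = {"bob": "Bob B.", "carol": "Carol C."}
print(followees("alice", follows_index, name_index))  # ['Bob B.', 'Carol C.']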

Carlo Zaniolo



Bio

Following his "Laurea in Ingegneria Elettronica" from the University of Padua, Carlo Zaniolo did his graduate work at the University of California, Los Angeles (UCLA), where he received an M.S. and a Ph.D. in Computer Science. Carlo had to interrupt his Ph.D. studies to serve his tour of duty in the Italian Army. Remarkably, that was the time in which he came up with his first results on relational database theory. These results were then extended and refined in his 1976 dissertation, which introduced multivalued dependencies and the use of null values and hypergraphs in the design of relational schemata. After his graduation, Carlo worked for several computer companies and research centers, including Burroughs Corporation (now Unisys), Sperry Research, Bell Labs, and the Microelectronics and Computer Technology Corporation (MCC). At MCC, Carlo led the Deductive Computing Laboratory and the development of an ambitious Datalog project known as LDL++. In 1991, Carlo returned to academe, and in fact to his alma mater, as a full professor in the Computer Science Department of the School of Engineering and Applied Science and as the N.E. Friedmann Chair in Knowledge Science. Carlo, the author of more than 250 refereed publications, now serves as the Director of the UCLA/SEAS Scalable Analytics Institute (ScAI). More information about Carlo and his research can be found in "Carlo Zaniolo Speaks Out on His Passion for Relational Databases and Logic," Distinguished Profiles by Marianne Winslett and Vanessa Braganholo, SIGMOD Record, Volume 45, Number 3, September 2016, Pages 29--34.

Scaling-Up Reasoning and Advanced Analytics for BigData: from Relational Database Systems to Apache Spark

Relational databases provided the fertile ground on which BigData has grown to become the pillar of the advanced data science applications that are transforming modern society as well as our computing field. In particular, the invention of parallel DBMSs using a shared-nothing architecture represents a key contribution that, when it was introduced by DBMS researchers and vendors in the 80s, was so far ahead of its time that its full potential was only realized thirty years later with the introduction of BigData systems such as Apache Spark. In this tutorial, I will briefly describe how Apache Spark has generalized Hadoop's MapReduce parallelism by supporting high-level languages such as Scala and Python, along with SQL queries, streaming data processing, and graph applications. The suitability of this open-source framework for supporting BigData systems has recently been confirmed by our BigDatalog system [1], which supports rule-based reasoning and advanced analytics on BigData. BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark [1] and multicore systems [2]. In particular, graph analytics written in BigDatalog outperform those written in GraphX, the native graph-processing framework of Spark. Indeed, BigDatalog provides a highly declarative and portable language with superior performance and scalability for a wide range of applications, including complex queries, graph applications, and several advanced KDD analytics that will be presented in the tutorial. Thus, after a short introduction to parallel BigData systems and Datalog, the tutorial will focus on (i) the architecture and implementation techniques used to achieve high levels of performance and parallelization in our Datalog system, and (ii) the crucial language extensions that made the support of BigData analytics possible, including the generalized least-fixpoint semantics that allowed the introduction of basic aggregates in recursive programs.
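To ground the discussion, here is a single-machine Python sketch of semi-naive evaluation for the textbook transitive-closure program, the canonical recursive Datalog example that systems like BigDatalog compile and distribute over Spark. This illustrates the evaluation strategy only; it is not BigDatalog's implementation.

# The recursive Datalog program being evaluated:
#   tc(X, Y) <- edge(X, Y).
#   tc(X, Y) <- tc(X, Z), edge(Z, Y).

def transitive_closure(edges):
    by_src = {}
    for z, y in edges:                       # index edge on its first column
        by_src.setdefault(z, set()).add(y)
    tc = set(edges)                          # facts from the base rule
    delta = set(edges)                       # new facts from the last round
    while delta:                             # iterate to the least fixpoint
        # Join only the delta with edge: the semi-naive optimization.
        new = {(x, y) for x, z in delta for y in by_src.get(z, ())} - tc
        tc |= new
        delta = new
    return tc

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]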

[1] Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, Carlo Zaniolo: Big Data Analytics with Datalog Queries on Spark. SIGMOD Conference 2016: 1135-1149.
[2] Mohan Yang, Alexander Shkapsky, Carlo Zaniolo: Scaling Up the Performance of More Powerful Datalog Systems on Multicore Machines. The VLDB Journal, December 2016: 1-20.