Data Management in Machine Learning

There have been many advances in the field of Machine Learning in recent years, but given how data-intensive the field is (especially deep learning), most of these advances fall short if the back-end infrastructure is not robust enough to handle large-scale data. While increases in compute power have helped advance the field, an equal amount of innovation has been necessary in how data is managed. The database community has been actively tackling the data management challenges that arise in large-scale machine learning systems.

In this post I will cover some of the recent advances in ML-oriented database systems and end with some open problems that still lack good solutions. The post is based mainly on this paper by Kumar et al., and is intended as a reference guide for interested readers to explore various popular systems that have been developed to serve the needs of large-scale ML. I will link relevant source pages and research papers for the systems I mention throughout this post.

Ever since the data mining boom of the late 1990s and early 2000s, the database community has been working on data management-oriented challenges in ML. A number of systems have therefore been developed for scalable and fast ML, some of which I will be referring to in this article.

Development of ML-oriented DB systems has proceeded mainly along three lines of work:

  1. Integrating ML algorithms and languages with existing data systems such as RDBMSs
  2. Adapting data management-inspired techniques such as query optimisation to ML workloads
  3. Combining data management and ML ideas to build systems that improve ML lifecycle-related tasks

ML in Data Systems

ML computations can be integrated with an RDBMS so that they run closer to where the data resides, avoiding the cost of moving data out to specialised ML toolkits. Listed below are some of the methods used to achieve this:

  1. Earlier scalable ML techniques exploited user-defined function (UDF) and user-defined aggregate (UDA) abstractions in data systems (a toy sketch of the UDA approach follows this list). Examples of such systems are ATLAS, in-RDBMS ML libraries such as Oracle Data Mining, and GLADE.
    1. ATLAS - DBMSs have long suffered from SQL’s lack of power and extensibility. ATLAS is a powerful database language and system that enables users to develop complete data-intensive applications in SQL by writing new aggregates and table functions in SQL rather than in procedural languages.
    2. GLADE - GLADE is a scalable distributed framework for large-scale data analytics. It arose out of the need to serve companies that wanted to apply statistics and machine learning methods to their data, since SQL and relational DB systems do not support such methods directly.
  2. Efforts have also been made to integrate ML with data systems by optimising ML over datasets that are logically the output of relational queries, especially joins. Orion is one such example, which introduced “factorised learning” to push generalised linear models through joins and avoid redundancy in ML computations (see the second sketch after this list).
  3. RDBMSs can also be used to support complex multi-relational ML models, an approach known as statistical relational learning. DeepDive is an example of such a system: it exploits the advanced join-processing capabilities of RDBMSs to scale inference, making it possible to apply such methods to large databases.
  4. To help users focus on the learning task at hand, some systems provide higher levels of abstraction that simplify the development and customisation of ML algorithms. These systems work by either generating SQL queries or generating jobs for data-parallel frameworks such as Spark or Hadoop. An example of such a system is Riot-DB.
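
To make the UDA idea in (1) concrete, here is a toy sketch using Python's built-in sqlite3 module (not ATLAS or GLADE themselves): a user-defined aggregate accumulates the sufficient statistics for a one-parameter least-squares fit in a single pass over a table, so the model is trained without the data ever leaving the database.

```python
import sqlite3

class SimpleLinRegUDA:
    """UDA accumulating sufficient statistics for fitting y ≈ w * x."""
    def __init__(self):
        self.sxy = 0.0  # running sum of x * y
        self.sxx = 0.0  # running sum of x * x

    def step(self, x, y):          # called once per row
        self.sxy += x * y
        self.sxx += x * x

    def finalize(self):            # called once at the end
        return self.sxy / self.sxx if self.sxx else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("linreg_w", 2, SimpleLinRegUDA)
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(i, 2.0 * i + 0.1) for i in range(1, 6)])
(w,) = conn.execute("SELECT linreg_w(x, y) FROM points").fetchone()
print(w)  # ≈ 2.0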
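
And here is a minimal sketch of the factorised-learning idea from (2), assuming a toy star schema (a fact table S with a foreign key into a dimension table R; all names are illustrative, and this is not Orion's actual implementation): aggregates that a GLM needs over the join, such as the sum of squares of a dimension feature, can be computed from per-key counts on S plus one pass over R, without materialising the join.

```python
import numpy as np
import pandas as pd

# fact table S (one row per example) and dimension table R (indexed by fk)
S = pd.DataFrame({"fk": [0, 0, 1, 2, 2, 2], "xs": [1., 2., 3., 4., 5., 6.]})
R = pd.DataFrame({"xr": [10., 20., 30.]})

# naive: materialise the join, which duplicates xr once per matching row
joined = S.join(R, on="fk")
naive = (joined["xr"] ** 2).sum()

# factorised: count rows per key on S, then a single weighted pass over R
counts = S["fk"].value_counts().sort_index().to_numpy()
factorised = float((counts * R["xr"].to_numpy() ** 2).sum())

assert np.isclose(naive, factorised)  # same statistic, no join materialised
```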

DB-Inspired ML Systems

This section covers systems and domain-specific languages (DSLs) inspired by databases, programming languages and high-performance computing.

  1. A number of state-of-the-art optimising compilers for ML algorithms, such as SystemML and OptiML, rely on simplification rewrites (see the first sketch after this list).
  2. Modern in-memory database systems often apply query compilation; in the same spirit, SystemML introduced a holistic framework for automatic rewrite identification and operator fusion, including the generation of sparsity-exploiting operators.
  3. Since many ML algorithms are iterative and perform repeated matrix-vector multiplications, systems like SystemML employ lightweight database compression techniques and execute linear algebra operations directly on the compressed matrix representation (see the second sketch after this list).
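
As an illustration of a simplification rewrite (a classic example from the linear algebra compiler literature, sketched here in plain NumPy): `trace(X @ Y)` materialises an n×n product only to read its diagonal, while the algebraically equivalent `sum(X * Y.T)` needs only an elementwise pass.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 500))
Y = rng.normal(size=(500, 500))

# naive: O(n^3) matrix product, just to read its diagonal
naive = np.trace(X @ Y)

# rewritten: trace(X @ Y) == sum(X * Y.T), an O(n^2) elementwise form
rewritten = np.sum(X * Y.T)

assert np.isclose(naive, rewritten)
```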
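
And here is a minimal sketch of executing linear algebra directly on compressed data, loosely inspired by SystemML's compressed linear algebra (the dictionary encoding below is a deliberate simplification, not the actual implementation): a low-cardinality column is stored as its distinct values plus integer codes, and a dot product is computed with one multiply per distinct value instead of one per row.

```python
import numpy as np

def compress_column(col):
    """Dictionary-encode a column: distinct values plus per-row codes."""
    values, codes = np.unique(col, return_inverse=True)
    return values, codes

def compressed_dot(values, codes, v):
    """dot(col, v) without decompressing: sum v within each code group,
    then do one multiply per distinct value."""
    grouped = np.bincount(codes, weights=v, minlength=len(values))
    return float(values @ grouped)

rng = np.random.default_rng(0)
col = rng.choice([0.0, 1.0, 2.5], size=1_000)   # low-cardinality feature
v = rng.normal(size=1_000)
values, codes = compress_column(col)
assert np.isclose(compressed_dot(values, codes, v), col @ v)
```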

ML Lifecycle Systems

This section covers advances in tasks such as model selection and model management, which apply database-oriented ideas such as declarativity, interactivity, and optimisation.

  1. Feature Engineering - Feature engineering is often the most time-consuming part of an applied ML project. Advances in this domain use database-inspired ideas such as indexing and partitioning to read only the parts of the data that matter, reducing access times; Zombie is one such implementation. KeystoneML provides libraries for certain forms of featurisation and optimises pipelines of ML operators over Spark. Hamlet applies statistical learning theory to exploit database dependencies, allowing entire features to be dropped before training without significantly affecting accuracy (a toy sketch of this idea follows the list).
  2. Model Selection and Management - Model selection is the process of finding the most appropriate ML model for the learning task at hand, and there have been several attempts to automate it (see the second sketch after this list). Longview integrates model management into a DBMS to automate certain aspects of model selection, ModelHub proposes a language and storage manager for managing the deep NNs common in computer vision, and ModelDB instruments ML libraries to capture and manage models. There are several cloud ML services as well, such as Microsoft's AzureML, which aims to simplify and manage end-to-end ML workflows.
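
Here is a toy sketch of the dependency-based idea behind Hamlet (heavily simplified: Hamlet's actual contribution is a learning-theoretic analysis of when dropping is safe, and the table and column names below are made up): if a functional dependency zip → city holds, the city column carries no information beyond zip and can be dropped before training.

```python
import pandas as pd

df = pd.DataFrame({
    "zip":  [10001, 10001, 94105, 94105, 60601],
    "city": ["NYC", "NYC", "SF", "SF", "Chicago"],  # zip -> city holds
    "y":    [1, 0, 1, 1, 0],
})

def fd_holds(df, key, dependent):
    """Check the functional dependency key -> dependent:
    every key value maps to exactly one dependent value."""
    return bool((df.groupby(key)[dependent].nunique() <= 1).all())

# if zip -> city, 'city' adds nothing beyond 'zip', so drop it pre-training
if fd_holds(df, "zip", "city"):
    df = df.drop(columns=["city"])
print(df.columns.tolist())  # ['zip', 'y']
```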
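
And a minimal scikit-learn sketch of the kind of search loop that model-selection systems automate, track, and store (the dataset and parameter grid here are purely illustrative, and none of the systems named above is involved):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# the search space a model-selection system would enumerate and log
grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```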

Open Problems

  1. Size and sparsity estimation: Many optimisation techniques require prior knowledge of the size and sparsity of matrices for cost comparisons and valid plan generation. However, this turns out to be non-trivial to infer for complex linear algebra programs, so principled techniques are needed for estimating the size and sparsity of these matrices (see the first sketch after this list).
  2. Convergence estimation: Most ML algorithms find a solution iteratively, converging towards the optimum over a number of steps. Knowing in advance how many steps an algorithm will take to converge would help with resource allocation and with deciding whether data re-organisations pay off, and it would also allow estimating how much progress has been made towards the solution (see the second sketch after this list).
  3. Adaptive query processing and storage: ML workloads change frequently as users explore features and models, so adaptive query processing and storage techniques are needed that can re-optimise plans and re-organise data on the fly rather than relying on a fixed upfront configuration.
  4. Integrating relational and linear algebra: A seamless integration of relational and linear algebra is needed so that users can easily mix data transformations for feature engineering, such as joins and aggregates, with the training and prediction of ML models.
  5. ML system benchmarks: Benchmarks that compare implementations of very large-scale machine learning systems would be very helpful. Some benchmarks already try to address this, such as this paper by Cai et al.; however, it covers only four platforms, including Spark and SimSQL. A more comprehensive survey would be highly beneficial.
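
For the sparsity-estimation problem in (1), here is a sketch of the simplest estimator for a matrix product (assuming nonzero positions are independent and uniformly distributed, which is exactly the assumption that breaks on skewed real-world data):

```python
import numpy as np

def estimate_product_sparsity(s_a, s_b, n):
    """Estimated nonzero fraction of C = A @ B, where A is m x n with
    nonzero fraction s_a and B is n x p with nonzero fraction s_b.
    c_ij is nonzero unless all n terms a_ik * b_kj vanish."""
    return 1.0 - (1.0 - s_a * s_b) ** n

rng = np.random.default_rng(1)
m = n = p = 500
s_a, s_b = 0.02, 0.05
A = rng.random((m, n)) * (rng.random((m, n)) < s_a)
B = rng.random((n, p)) * (rng.random((n, p)) < s_b)

actual = np.count_nonzero(A @ B) / (m * p)
print(f"estimated {estimate_product_sparsity(s_a, s_b, n):.4f}, actual {actual:.4f}")
# close here because the data is uniform; skewed matrices can be far off,
# which is why principled estimators remain an open problem
```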
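
And for convergence estimation in (2), a sketch of the a priori iteration bound available in the easiest case, gradient descent on a strongly convex quadratic (the bound uses the known curvature extremes mu and L; for real ML workloads these constants are unknown, which is what keeps the problem open):

```python
import numpy as np

rng = np.random.default_rng(0)
eigs = np.linspace(1.0, 50.0, 20)      # curvature spectrum: mu = 1, L = 50
A = np.diag(eigs)
mu, L = eigs.min(), eigs.max()

x = rng.normal(size=20)                # minimise f(x) = 0.5 * x^T A x
eps = 1e-6

# a priori bound: error contracts by (1 - mu/L) per step with step size 1/L
predicted = int(np.ceil(np.log(np.linalg.norm(x) / eps)
                        / -np.log(1.0 - mu / L)))

steps = 0
while np.linalg.norm(x) > eps:
    x -= (1.0 / L) * (A @ x)           # gradient step
    steps += 1
print(predicted, steps)                # bound vs. actual iteration count
```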