Tech Talk: Accelerating Data Science Through a Unified Machine Learning Platform

Many top technology companies encounter hurdles when building and deploying machine learning (ML) models. At the Spark + AI Summit, companies presented initial machine learning efforts that faced barriers between data science development and deployment. Data scientists utilized a plethora of open source tools to train machine learning models, and then each engineering team built one-off systems to deploy these models. This led to duplicate deployment platforms and excruciatingly long data science projects.

A common solution to this problem is moving the data science workflow onto an end-to-end machine learning platform. Uber’s Michelangelo and Airbnb’s Bighead are two examples of such platforms, whose overarching goals are to:

  • Accelerate the data science life cycle (from 3+ months to weeks or days)
  • Empower the data scientist
  • Ensure model deployment speed and quality
  • Enable real-time inference
  • Eliminate redundant work and hidden tech debt

Below I highlight some of the observed key ingredients in a successful end-to-end machine learning platform.

Moving machine learning off the desktop

Almost every presentation highlighted the use of containers, such as Docker, as a way to accelerate and empower data scientists. Specifically, a container’s ability to move a data scientist’s workflow from a desktop onto a server is an enormous benefit. Training locally limits the size of the training data to the desktop’s memory, whereas a containerized workflow can run on a server of any size, allowing training to scale. Additionally, because the container packages the model together with its dependencies, training in a container makes model deployment much easier to automate.

A commonality among many of the ML platforms is the utilization of notebooks for data exploration and training. Notebooks, such as Jupyter, allow processing to occur on the server, while providing a clean interface to view results in a browser. Additionally, notebooks enable users to share work easily, which speeds up data science development by allowing data scientists to leverage past work.

Another important benefit of moving training off the desktop is enhanced security. Training on a server ensures data only exists there, and eliminating local copies of potentially sensitive data reduces a company’s vulnerabilities.

Progressing from data wrangling to streamlined data management

Feature parity between training and deployment is a significant challenge for machine learning systems. For example, when data scientists engineer features in offline analytical systems, and then engineers develop the same features for production systems, the refactoring involved not only duplicates development effort, but can also introduce bugs, extending timeframes even further. Thus, having a centralized feature store is a major component of many of the ML platforms discussed at the conference.
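
To make the parity problem concrete, here is a minimal sketch in Python of the idea behind a shared feature definition. It is an illustration only, not how any platform presented at the conference implements it: the feature days_since_last_purchase, the record layout, and both builder functions are hypothetical names chosen for this example. The point is that the offline and online paths call the same function rather than reimplementing it.

```python
from datetime import date
from typing import Dict, List


def days_since_last_purchase(last_purchase: date, as_of: date) -> int:
    """Hypothetical feature, defined once and reused by both paths."""
    return max((as_of - last_purchase).days, 0)


def build_training_rows(history: List[Dict], as_of: date) -> List[Dict]:
    """Offline path: compute the feature for a batch of historical records."""
    return [
        {
            "user_id": record["user_id"],
            "days_since_last_purchase": days_since_last_purchase(
                record["last_purchase"], as_of
            ),
        }
        for record in history
    ]


def build_serving_row(record: Dict, as_of: date) -> Dict:
    """Online path: compute the same feature for a single request."""
    return {
        "user_id": record["user_id"],
        "days_since_last_purchase": days_since_last_purchase(
            record["last_purchase"], as_of
        ),
    }
```

Because both builders delegate to one definition, a change to the feature logic reaches training and serving together, which is the property a centralized feature store is meant to provide at platform scale.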

A feature store not only empowers data scientists to easily create training data frames, but it also speeds up deployment. Once a data scientist adds a feature to the platform, the feature will be available for offline or online services (depending on the feature’s configuration). Additionally, a centralized feature store eliminates redundant feature creation, and speeds up development by allowing the data scientist to reuse prior feature engineering. Some crucial components of successful feature generation in ML platforms are:

  • Feature metadata (such as owner, description, and SLA) encourages data scientists to reuse prior features because the metadata clearly defines the feature’s provenance.
  • Spark has become a foundational tool for feature generation because it lets data scientists and engineers create both offline and online jobs with the same framework.
    • Offline – create and schedule Spark jobs to feed the feature store from a data lake.
    • Online – real-time models cannot access a data lake in a performant manner. Thus, many of the platforms presented use a combination of precomputed and near-real-time features. Batch Spark jobs generally create the precomputed features, and a combination of data streams (such as Kafka) and Spark Streaming jobs generates the near-real-time features.
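
The components above can be sketched as a small in-memory feature store in Python. This is a toy under stated assumptions, not the API of Michelangelo, Bighead, or any real product: FeatureSpec, FeatureStore, and every method name are hypothetical, and plain dictionaries stand in for whatever storage a batch Spark job or a stream consumer would actually write to. It shows metadata attached to each feature, a per-feature online flag, and an online lookup that merges precomputed and near-real-time values.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class FeatureSpec:
    """Metadata that documents a feature's provenance (hypothetical schema)."""
    name: str
    owner: str
    description: str
    sla: str
    online: bool  # whether real-time models may request this feature


class FeatureStore:
    """Toy store: dicts stand in for the batch and streaming backends."""

    def __init__(self) -> None:
        self._specs: Dict[str, FeatureSpec] = {}
        self._precomputed: Dict[str, Dict[str, Any]] = {}
        self._near_real_time: Dict[str, Dict[str, Any]] = {}

    def register(self, spec: FeatureSpec) -> None:
        # Rejecting duplicates nudges users toward reusing existing features.
        if spec.name in self._specs:
            owner = self._specs[spec.name].owner
            raise ValueError(f"{spec.name!r} is already registered by {owner}")
        self._specs[spec.name] = spec

    def _require(self, name: str) -> FeatureSpec:
        if name not in self._specs:
            raise KeyError(f"unknown feature {name!r}")
        return self._specs[name]

    def write_precomputed(self, entity_id: str, name: str, value: Any) -> None:
        """Called by a scheduled batch job in this sketch."""
        self._require(name)
        self._precomputed.setdefault(entity_id, {})[name] = value

    def write_near_real_time(self, entity_id: str, name: str, value: Any) -> None:
        """Called by a stream consumer in this sketch."""
        self._require(name)
        self._near_real_time.setdefault(entity_id, {})[name] = value

    def get_online_features(self, entity_id: str, names: List[str]) -> Dict[str, Any]:
        """Merge both sources, preferring the fresher streamed value."""
        batch = self._precomputed.get(entity_id, {})
        fresh = self._near_real_time.get(entity_id, {})
        row: Dict[str, Any] = {}
        for name in names:
            if not self._require(name).online:
                raise ValueError(f"{name!r} is configured for offline use only")
            row[name] = fresh.get(name, batch.get(name))  # None if never written
        return row
```

A real-time model would call get_online_features with the feature names it was trained on; a feature that has been registered but never written comes back as None, so the caller decides how to handle missing values.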

+++++

While not all companies have end-to-end machine learning platforms, adopting even one of these key ingredients will help accelerate and empower data scientists. For example, at Intuit, developing and deploying models in Docker containers reduced model deployment timeframes by two-thirds while also enabling real-time inference.

What have you found to be the key components of a successful ML platform? Do you have tips for accelerating the data science life cycle?