Training Data for Machine Learning: Human Supervision from Annotation to Data Science

Sarkis, Anthony




Your training data has as much to do with the success of your data project as the algorithms themselves: most failures in deep learning systems relate to training data. But while training data is the foundation for successful machine learning, there are few comprehensive resources to help you ace the process. This hands-on guide explains how to work with and scale training data. Data science professionals and machine learning engineers will gain a solid understanding of the concepts, tools, and processes needed to:

  • Design, deploy, and ship training data for production-grade deep learning applications
  • Integrate with a growing ecosystem of tools
  • Recognize and correct new failure modes rooted in training data
  • Improve existing system performance and avoid development risks
  • Confidently use automation and acceleration approaches to more effectively create training data
  • Avoid data loss by structuring metadata around created datasets
  • Clearly explain training data concepts to subject matter experts and other stakeholders
  • Successfully maintain, operate, and improve your system




Anthony Sarkis is the lead engineer on Diffgram Training Data Management software and founder of Diffgram Inc. Prior to that, he was a software engineer at Skidmore, Owings & Merrill and co-founded

