This is the second of a set of book reading lists written from the point of view of a software engineer who wants to develop a basic knowledge of machine learning. In Part 1, we looked at some introductory books to the discipline. Herein, we'll look at some more hands-on programming books.
First, let's talk about Python.
When compiling this list, there was a real risk of it becoming "Machine Learning For Python Engineers". As of 2018, Python is the dominant language for applied machine learning and data science. This is no doubt due to decades of effort by the Python community on an ecosystem of libraries for numerics and data manipulation, which act as building blocks for data science and machine learning frameworks. While many frameworks are C++ under the hood (such as TensorFlow, Caffe2 and MXNet), Python is the medium through which they're used. As machine learning becomes a general discipline, I hope to see broader use of other languages for ML applications. In the meantime, Python is the most useful second language to pick up if you don't have it already.
On to the books.
Machine Learning Books for Developers
Deep Learning With Python by François Chollet. This covers neural networks via the popular Keras library designed by Chollet. The book makes a useful split between text and image examples and has a great practical overview of tensors, gradients and network structures for programmers in chapters 2 and 3. Deep Learning With Python is the best programming-centric introduction to deep learning I’ve read so far.
Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron. The first half of this book covers common supervised approaches - classification, regression, support vector machines, trees and forests - using scikit-learn. The second half covers neural networks and deep learning with TensorFlow. There are two things Hands-On Machine Learning does well. First, instead of trying to cover the gamut of techniques, it focuses on the ones that tend to get used (although the section on reinforcement learning could have been edited out). Second is the coverage throughout the book of important gotchas - chapter 2 discusses data preparation, chapter 3 briefly covers performance measurement and chapter 8 is given over entirely to dimensionality reduction. If you're coming at machine learning from a software background, these are good to be aware of.
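To make the performance measurement gotcha concrete, here is a minimal sketch in plain Python (no scikit-learn, so it runs anywhere). The labels and the "always negative" model are made up for illustration and are not taken from the book. It shows how accuracy can look healthy on imbalanced data while precision and recall reveal that the model never finds a positive example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    """Of the examples predicted positive, the fraction that really are."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Of the truly positive examples, the fraction the model found."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical, heavily imbalanced data: 5 positives out of 100 examples.
y_true = [1] * 5 + [0] * 95
# A "model" that always predicts the negative class.
always_negative = [0] * 100

print(accuracy(y_true, always_negative))   # 0.95
print(precision(y_true, always_negative))  # 0.0
print(recall(y_true, always_negative))     # 0.0
```

In practice you'd reach for the equivalent functions in a library such as scikit-learn rather than hand-rolling them; the hand-rolled version is only here to show what is being counted.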
Another candidate specifically for TensorFlow is TensorFlow for Deep Learning, which also uses Python examples and whose final edition was released this March. It's in the maybe bucket for now, as I'm still reading it 🙂
Advanced Analytics with Spark 2nd ed. by Sandy Ryza, Uri Laserson, Josh Wills and Sean Owen. Spark is a go-to framework in the data processing and analytics space. This works through a set of analytics and learning problems by pairing a domain with a technique (e.g. "anomaly detection in network traffic with K-means clustering") using Spark and its MLlib extension as the underlying framework. Except for one chapter, the example code is in Scala, the idiomatic language for Spark and its functional-style APIs. If you're more familiar with Java, or another alternative JVM language like Kotlin or Groovy, the code's still accessible and avoids advanced Scala features. Apart from the problem-centric style of discourse, this is a good book to get a feel for data pipelines and dataflow approaches to building learning systems, which are in wide use in the JVM's "post-Hadoop" world. A nice feature of the book is you can largely read the chapters out of order - it's a good book for dipping in and out of. Advanced Analytics with Spark does assume some knowledge of Spark's APIs, but nothing onerous, and O'Reilly, the publisher, has a tranche of books on Apache Spark - so if you want to learn Spark itself, there are good options to choose from.
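If the "domain plus technique" pairing sounds abstract, here is a rough sketch of the idea behind anomaly detection with K-means: cluster the data, then flag points that sit far from their nearest cluster centre. This is plain Python rather than the Scala and MLlib code the book uses, and the data points, starting centroids and distance threshold are all invented for illustration.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm, starting from the given centroids."""
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx, _ = nearest(p, centroids)
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (unchanged if empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids


def anomalies(points, centroids, threshold):
    """Points whose distance to the nearest centroid exceeds threshold."""
    return [p for p in points if nearest(p, centroids)[1] > threshold]


# Hypothetical 2-D features: two tight groups of "normal" traffic plus one oddball.
normal = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
outlier = (9.0, 1.0)
points = normal + [outlier]

# Starting centroids and threshold are arbitrary choices for this toy data.
centroids = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(anomalies(points, centroids, threshold=3.0))  # [(9.0, 1.0)]
```

Note that the outlier still drags its cluster's centre towards itself, which is one reason real pipelines spend effort on choosing the number of clusters, normalising features and picking the threshold.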
Hands-On Programming with R by Garrett Grolemund and Hadley Wickham. This book is a good intro to R and suitable as a companion to An Introduction to Statistical Learning mentioned in part 1. Here's the thing with R - it's a good interactive way of learning about the data science discipline itself, but you're unlikely to engineer online systems with it. To my mind, R and its ecosystem remain focused on statistics and pure data science workflows more than building machine learning applications, so if your focus is engineering applications, investing in Python may have a better payoff, especially in conjunction with Jupyter notebooks. If you want to go further into R as a tool for data analysis in its own right, then R for Data Science, also by Grolemund and Wickham, is a good option.
Data Mining: Practical Machine Learning Tools and Techniques 4th ed., Ian Witten, Eibe Frank, Mark Hall, Christopher Pal. I’ve bought every edition of this book since graduating with an AI degree many moons ago, and it's still a personal favourite of mine. The book's software has a Java focus, so if that's a primary language for you, it's worth a look, and it does a good job of providing an overview of what learning systems actually are and the core approaches. That said, this book needs some caveats. First, the new content in this edition on deep learning feels like a bolt-on, and not a good starting point - Deep Learning With Python will serve you better in that area. Second is the Weka toolkit used in the book - it's Java based, so useful on that front, but has not grown in popularity over time compared to scikit-learn, R's inbuilts, or Apache Spark, and should be viewed as a learning choice rather than a transferable skill.
What about Language X?
I'm conscious this list only really covers four languages - Java, Python, R and Scala. That's a narrow set, but it reflects current realities. For example, I can't make relevant book recommendations for Golang, Ruby or Rust (or even Apple CoreML), which is frustrating 🙁 The same goes for C++, which is interesting given how many underlying ML frameworks are written in C++. Likewise, there are interesting frameworks that don't have book coverage, examples being MXNet, Caffe2 and PyTorch. To be clear, my own bias is server side systems, so I tend to focus on what's predominant there - today that's Python, then the JVM.
What about Not-Books?
Finally, not everyone learns best through books! While I like to learn through reading, maybe more people learn better hands-on. Good news - there's a mountain of resources available online, from Kaggle to GitHub to awesome lists, to a constant stream of tutorial posts and recommendations on sites like Medium and Quora. All the frameworks mentioned, to their great credit, have online tutorials - here are a few of them:
- Apache Spark: MLlib Guide
- scikit-learn: machine learning and data science tutorials
- Keras: examples
- Apache MXNet/Gluon: tutorials
- Cascading: data science tutorials
- PyTorch: official tutorials
- Jupyter: try Jupyter in a browser
Another (fair) argument is that the machine learning application space is moving quickly, too quickly for books. Especially when it comes to programming, books are inevitably centred around frameworks, and so are much more likely to get outdated versus online media. I think this is true, but I believe the books in this list are stable. In any case, I hope this list was useful, and the suggestion in part 1 still stands - do look around for other options if these don't seem like the right ones for you.