Top 10 Quora Machine Learning Writers and Their Best Advice, Updated
Gain some insight on a variety of topics with select answers from Quora's current top machine learning writers. Advice on research, interviews, hot topics in the field, how to best progress in your learning, and more are all covered herein.
This post is based on Most Viewed Writers in Machine Learning, the 10 writers with the most answer views in the last 30 days, as retrieved on June 25, 2017.
Just so there is no confusion, please note that this post is "authored" by me, but none of the information contained herein -- from the questions to the answers -- has anything to do with me. I simply edited these informative responses together.
Machine learning. At Quora.
I don’t think it’s important to memorize formulae. In fact, I think it can even be counter-productive.
If you understand how a machine learning algorithm works, and I mean truly understand it on a low level, not just the high level intuition, you should be able to derive the formula yourself. In practice, this is something you would hardly ever need to do, as you can just look it up.
Memorizing a formula can give you the illusion that you understand the principles behind it.
Do you need [a local GPU rig]? If you are serious about studying DL, yes. Understanding an architecture or an algorithm and getting it to work are two very different stories; the only real way to acquire knowledge is to try things for yourself and analyze the results.
If you consider buying multiple cheap GPUs to learn how to work with them - don’t. If your framework supports distributed computation, it does everything in a painless way. If it doesn’t, this is not a task for a beginner, and generally a pain in the rearward.
For training modern architectures CPUs are not a substitute for a GPU in any way. I have a really damn good CPU and it would take weeks to train a network that I usually train overnight. A consumer-grade i5 (I don’t think that overpaying for i7 is a great idea) is even slower.
Excerpt from answer to: How does one prepare for a computer vision research scientist interview?
There was a light programming portion and basic questions about computer vision and machine learning for about half the positions. At the other half, there were no technical questions at all. Typically, if you have been coding yourself and have been attending conferences regularly, you shouldn’t need to prepare for this part. At most, you can brush up on your C++ in a couple of days if you really need to.
The two things they want to know are: (I) whether you can work as an independent researcher, and (II) whether your expected proportion of research work to software development work matches the position.
Excerpt from answer to: What are some problems or motivations of generating images using GAN?
You can use GANs to:
- Generate simulated training data and simulated training environments
- Fill in missing data
- Train a classifier with semi-supervised learning (where the classifier learns from both labeled and unlabeled data… and with GANs, also learns from completely imaginary data)
- Do supervised learning where the supervision signal says that any one of multiple correct answers is acceptable, instead of just having one specific answer you request for each training example
- Replace expensive simulations with statistical generation
- Sample from the posterior distribution of a generative model
- Learn embeddings that are useful for other tasks
Excerpt from answer to: What's trending in machine learning (outside of deep learning)?
I don’t know about trending, but I know of a powerful method, outside of mainstream ML, which is demonstrated to have tremendous flexibility, interpretability, and the advantage of relative ease of implementation in VLSI/FPGA hardware.
The easiest way to understand how a Volterra series works is that it is a series of digital filters estimated to perform a transformation from an input signal to an appropriate output. The shape, time delay, and number of convolution kernels (filters) comprise the features of the model that must be estimated in order to accurately predict the behavior of a complex system.
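As a rough illustration of the "series of digital filters" view, the sketch below evaluates a discrete Volterra series truncated at second order: a constant, a linear filter over delayed inputs, and a quadratic kernel over products of delayed inputs. The function name `volterra_output` and the kernel values are made up for this example, and the kernels are simply given here rather than estimated from data as the answer describes.

```python
def volterra_output(x, h0, h1, h2):
    """Output of a discrete Volterra series truncated at second order.

    x  : list of input samples
    h0 : constant offset
    h1 : first-order kernel, h1[k] weights x[n-k]
    h2 : second-order kernel, h2[k1][k2] weights x[n-k1] * x[n-k2]
    """
    memory = len(h1)
    y = []
    for n in range(len(x)):
        # Delayed inputs x[n], x[n-1], ...; samples before the start are zero.
        past = [x[n - k] if n - k >= 0 else 0.0 for k in range(memory)]
        linear = sum(h1[k] * past[k] for k in range(memory))
        quadratic = sum(
            h2[k1][k2] * past[k1] * past[k2]
            for k1 in range(memory)
            for k2 in range(memory)
        )
        y.append(h0 + linear + quadratic)
    return y


# Hypothetical kernels with a memory of two samples (illustrative values only).
h1 = [1.0, 0.5]
h2 = [[0.1, 0.0], [0.0, 0.0]]
signal = [1.0, 2.0, 0.0]
output = volterra_output(signal, 0.0, h1, h2)
```

With the second-order kernel set to zero this reduces to an ordinary FIR filter, which is one way to see the model as a nonlinear extension of linear filtering.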
Excerpt from answer to: What are some best practices for training machine learning models?
- You should pick an offline optimization metric that correlates as well as possible to the product objectives. Many times, a good proxy for the product objectives can be an online A/B test result or some other online metric.
- You can only know that a metric correlates well to online A/B tests by running different experiments and tracking offline metrics
- E.g. Metrics that tend to correlate well to ranking-related problems are recall@n, NDCG, or MRR (mean reciprocal rank)
- A good metric:
  - Should make it easy to compare different models
  - Should be as easy to understand and interpret as possible
- It is a good idea to track your metric(s) per user segment you care about (e.g. new users, stale users, very active users, locales, ...)
- Measure your metric on the test set (not training, not validation)
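For readers who have not met the ranking metrics named above, here is a minimal sketch of recall@n, MRR, and NDCG in plain Python. The function names and the toy data are illustrative, the ranked lists are assumed to contain no duplicate items, and the NDCG variant shown uses linear gains with a log2 discount, which is only one of several common definitions.

```python
import math


def recall_at_n(ranked, relevant, n):
    """Fraction of the relevant items that appear in the top n results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:n] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_n(ranked, gains, n):
    """NDCG@n with linear gains; gains maps item -> graded relevance."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in ranked[:n]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:n])
    return actual / ideal if ideal > 0 else 0.0


# Toy example: two queries with made-up item ids.
ranked_a, relevant_a = ["a", "b", "c", "d"], {"b", "d"}
ranked_b, relevant_b = ["x", "y"], {"x"}
recall = recall_at_n(ranked_a, relevant_a, 2)
mrr = mean_reciprocal_rank([(ranked_a, relevant_a), (ranked_b, relevant_b)])
ndcg = ndcg_at_n(["b", "a"], {"a": 3.0, "b": 2.0}, 2)
```

Computing these per user segment only requires grouping the `(ranked, relevant)` pairs by segment before averaging.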
Yes, there is what is called transfer learning, which you can use with almost any machine learning (ML) algorithm without retraining the whole system. For example, one can take a pre-trained network, add a simple classifier on top, and train only that classifier on the new training samples while keeping the pre-trained weights fixed. This works well in practice for related tasks.
However, transfer learning has limitations: for it to work well, we need to make sure the new samples have a distribution similar to that of the original samples.
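The "frozen network plus trainable classifier" recipe can be sketched without any deep learning framework. In the toy example below, `FROZEN_WEIGHTS` is a hypothetical stand-in for a pre-trained network (its values are invented, not taken from any real model), and only a small logistic-regression head is trained on the new samples. It is meant to show the division of labor, not a realistic training setup.

```python
import math

# Hypothetical stand-in for pre-trained layers: these weights are never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x):
    """Fixed feature extractor (one tanh layer) using the frozen weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on top of the frozen features."""
    feats = [frozen_features(x) for x in samples]
    weights, bias = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            error = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) - y
            # Gradient step on the head only; FROZEN_WEIGHTS is untouched.
            weights = [w - lr * error * fi for w, fi in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    """Classify x using the frozen features and the trained head."""
    f = frozen_features(x)
    return 1 if sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) >= 0.5 else 0


# Toy "new task": the label is 1 when the first coordinate exceeds the second.
samples = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
labels = [1, 1, 0, 0]
head_weights, head_bias = train_head(samples, labels)
predictions = [predict(x, head_weights, head_bias) for x in samples]
```

The head can only separate the new classes if the frozen features already carry the relevant information, which is the practical face of the distribution caveat above.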
Excerpt from answer to: In AI deep learning, who would you say are the top researchers after Hinton, Lecun, and Bengio?
This question is posed wrong. We all know now that Schmidhuber’s contribution is on par with, if not more important than, the contributions of Hinton, LeCun, and Bengio.
There are only two key ideas in DL:
- CNN (Fukushima-LeCun)
- LSTM (Schmidhuber)
Everything else, including Hinton’s and Bengio’s work, is secondary compared to these two. This is not to say that those are not important, no, they are super important in popularizing NN, but if you just talk about “original ideas” as the Nobel Prize committee always says, then it’s LeCun and the even earlier Fukushima, and Schmidhuber. If there were a Nobel Prize for DL, it should go to these people.
I don’t think that a MOOC is enough. You need to practice seriously. For example, try to reproduce the results obtained in a couple of papers of interest to you, compete in a Kaggle competition, etc. Then try to join an academic lab in which there are other students and researchers doing deep learning, either as a visitor/intern or a graduate student.
Excerpt from answer to: How do computational scientists decide which strategy to use for cross validation?
Let’s consider a 2 class problem and the same distribution for training and testing data.
K-fold cross-validation (CV) may fail if, when the folds are formed, the validation set contains no samples from the negative class, or the training set contains only positive samples. To avoid this, you may want to use stratified K-fold CV, which keeps the class proportions the same in the training and validation sets. Different 10-fold cross-validation experiments with the same learning method and dataset often produce different results because of random variation in how the folds are chosen. Stratification reduces the variation, but cannot eliminate it entirely.
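One simple way to build stratified folds is to shuffle the indices of each class separately and deal them out round-robin, so every fold receives its share of each class. The sketch below does that in plain Python; the helper name `stratified_k_fold` and the toy labels are assumptions for illustration, and library implementations may assign folds differently.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Return k lists of validation indices with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy imbalanced problem: 20 positives and 10 negatives, split into 5 folds.
labels = [1] * 20 + [0] * 10
folds = stratified_k_fold(labels, k=5)
# (number of positives, fold size) for each validation fold
summary = [(sum(labels[i] for i in fold), len(fold)) for fold in folds]
```

Changing `seed` changes which samples land in which fold but not the class counts per fold, which is why stratification reduces the run-to-run variation without removing it.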
Leave-one-out is better in that you get the maximum amount of data for training; however, the cost is the excessive amount of training required (for a dataset with 1000 samples, you have to train 1000 times). A highly dramatic situation may occur when, let’s say, the data is randomly generated with the two classes exactly balanced, so the best a classifier can do is predict the majority class, giving a 50% error rate. But in each fold of leave-one-out, the class opposite to the test instance is in the majority, and therefore the predictions will always be incorrect, leading to an estimated error rate of 100%. Leave-one-out cannot be stratified because there is only one sample to test.
Typically, 10 times 10-fold stratified CV is employed.
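The pathological case is easy to reproduce. The sketch below assumes exactly balanced labels (50 of each class) and a classifier that always predicts the majority class of its training set; both the helper names and the data are illustrative.

```python
from collections import Counter


def majority_class(labels):
    """Predict the most frequent label in the training set."""
    return Counter(labels).most_common(1)[0][0]


def leave_one_out_error(labels):
    """Leave-one-out error rate of the majority-class predictor."""
    errors = 0
    for i, held_out in enumerate(labels):
        # Train on everything except sample i.
        training = labels[:i] + labels[i + 1:]
        if majority_class(training) != held_out:
            errors += 1
    return errors / len(labels)


# Perfectly balanced labels: 50 of each class.
labels = [0] * 50 + [1] * 50
error_rate = leave_one_out_error(labels)
```

Removing the held-out sample always leaves 49 samples of its own class against 50 of the other, so the majority vote is wrong every time and `error_rate` comes out as 1.0 rather than the true 0.5.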