A few months ago I started a new job for the first time in 10 years, leaving my comfortable home at a government FFRDC for an exciting opportunity with the new data science and analytics team at FitnessKeeper. Even though I’ve been a huge fan of the RunKeeper app for years, it was a terrifying transition. Sure, joining a well-funded start-up with tens of millions of users is hardly the riskiest venture in the world, but to me—a 30-something academic with two kids (and as many mortgages), who spent the past decade in the calm and cozy world of national defense—it seemed like sheer lunacy.
Now, a few months later, my initial terror has been replaced by a daily cocktail of fun, excitement, and low-grade anxiety that is the standard fare of the start-up. Moreover, the challenges I face in my new work have helped me find a deeper appreciation for the lessons from my years in graduate school and government research. Here are a few of the things I’ve learned.
(1) You Can’t Do Inference Without Making Assumptions
In his textbook Information Theory, Inference, and Learning Algorithms, David J. C. MacKay (one of my favorite authors in machine learning and information theory) wrote what I consider the cardinal rule of data science: you cannot do inference without making assumptions. This mantra stuck with me over the years, even as the ascent of “big data” has brought with it fantastical promises of a world where our decisions are governed entirely by analytics and algorithms instead of instinct and preconception.
When working with data, you always have to make assumptions. Always. Even a calculation as innocuous as taking the arithmetic mean of a data set rests on an abundance of latent assumptions. Does the law of large numbers hold? Is the underlying process ergodic? Does the mean of the “true distribution” of the data even exist? Of course, before you can know whether any of these assumptions are valid, the first step is to know that you’re making them.
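As a minimal sketch of how much that last question matters: the sample mean of normally distributed data settles down as the sample grows, because the mean of the true distribution exists. For a Cauchy distribution it does not, so the sample mean never converges, no matter how much data you collect.

```python
import numpy as np

rng = np.random.default_rng(42)

# The law of large numbers only applies when the mean exists. A normal
# distribution has one, so sample means settle down as n grows; a
# Cauchy distribution has no mean, so its sample means never do.
running_means = {}
for name, sampler in [("normal", rng.standard_normal),
                      ("cauchy", rng.standard_cauchy)]:
    running_means[name] = [sampler(n).mean() for n in (100, 10_000, 1_000_000)]
    print(name, [round(m, 3) for m in running_means[name]])
```

Run it a few times with different seeds: the normal means hug zero ever more tightly, while the Cauchy means jump around arbitrarily.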
The types of data we choose to collect, the questions we choose to ask, and the techniques we choose to use for analysis all are functions of our own prior experiences and beliefs. Every known approach to inference demands that we assume something about the way our data behave. Some (like Bayesian statistics) make these assumptions explicit, while others (like machine learning) tend to hide them within black boxes. In short, the art of data science isn’t about “using data” to avoid making assumptions, it’s about judiciously applying a specific set of assumptions to extract reasonable and robust insights from your data.
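To make the “explicit assumptions” point concrete, here is a tiny beta-binomial sketch. In Bayesian statistics the assumption is written down as the prior, and the same data give different answers under different, clearly stated priors. The scenario (7 heads in 10 coin flips) is hypothetical.

```python
# Posterior for a Beta(a, b) prior after observing `heads` successes
# in `flips` trials is Beta(a + heads, b + flips - heads), whose mean
# is (a + heads) / (a + b + flips).
heads, flips = 7, 10
posterior_mean = {}
for name, (a, b) in [("uniform Beta(1,1) prior", (1, 1)),
                     ("skeptical Beta(50,50) prior", (50, 50))]:
    posterior_mean[name] = (a + heads) / (a + b + flips)
    print(f"{name}: posterior mean = {posterior_mean[name]:.3f}")
```

Neither answer is “what the data say”; each is what the data say given an assumption, and the Bayesian framing forces you to name it.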
(2) If It Seems Too Good (or Bad) to Be True, It Usually Is
Like most people in this line of work, I love cool statistics. Sifting through data sets day after day, we all want to uncover those few, Freakonomics-style gems that really blow people away, the kind that are so outrageous that people only believe them because the “numbers don’t lie”. Unfortunately, most of the time the numbers tell you exactly what you expect them to… and if they don’t, then it’s almost always because of a mistake.
In data science you have to be vigilant. As the U.S. Department of Homeland Security likes to say, “If you see something, say something.” Once you understand your data well enough, most results should be pretty unsurprising. Take note of anything that defies your expectations, and be suspicious of any results that seem too amazing. Chase down every lead you can, check and re-check your work, and try not to get too disappointed when an exciting new discovery turns out to be a bug in your Python script. Eventually, one of those rare, mind-blowing statistics won’t be a mistake—it’ll be a gem.
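One cheap way to “say something” when you see something is to encode your expectations as assertions that run before any analysis. The sketch below is hypothetical (the column names and the running-speed threshold are mine, not from any real pipeline), but a handful of checks like these catch many too-good-to-be-true results before they reach a slide deck.

```python
def sanity_check(distances_km, durations_min):
    """Reject activity records that defy basic physics before
    celebrating them as discoveries."""
    assert all(d >= 0 for d in distances_km), "negative distance"
    assert all(t > 0 for t in durations_min), "non-positive duration"
    # 45 km/h is faster than any human has ever sustained on foot
    speeds_kmh = [d / (t / 60) for d, t in zip(distances_km, durations_min)]
    assert max(speeds_kmh) < 45, "implausibly fast for a runner"

# plausible records pass quietly
sanity_check(distances_km=[5.0, 10.0], durations_min=[25.0, 52.0])
print("all records plausible")
```

When a check fires, the message points you at the suspicious record instead of letting it masquerade as a gem.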
(3) More Data Is Always Better, Except When It Isn’t
When it comes to data, modern information science argues that bigger is better. We live in a golden age where our ability to instrument, measure, and record every aspect of our daily lives has reached unprecedented levels, and the tech industry has shown us how it can use this data to provide products and services we never imagined.
There is obviously an allure to using every bit of data you have to move ever closer to (though of course never reach) the asymptotic holy land of perfect statistical comprehension. However, it’s important to recognize those situations when adding data causes more problems than it solves. Here are some examples:
- Adding more attributes to our data combinatorially increases the number of relationships between these attributes we may consider, something that statisticians refer to as the curse of dimensionality. Thus, even if we are in a position to deal with every user or record in our database, it’s often not a good idea to try to understand every attribute about them all at once (at least, not without performing some dimensionality-reduction techniques first).
- Practically speaking, it’s often easier to work with a smaller data set. There’s a lot to be said for starting off with a sample of a few thousand data points that you can really understand, and obviously it’s harder to do rapid analysis and prototyping if you have to worry about efficient parallel implementations of your code. If you can get away with trivial parallelization then go for it, but don’t let your desire to use all of your precious data right off the bat keep you from getting quick and potentially valuable insights.
- Finally, sometimes using all your data is a bad idea because it requires you to make bad assumptions. For example, when time is involved, stationarity is always an issue—data from a month or a year ago may be too different to be lumped in with data from today. Before you try to crank through a huge population all at once, think carefully about whether the data are similar enough for the kind of analysis you’re considering.
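The curse of dimensionality in the first bullet can be seen directly with a few lines of simulation: as the number of attributes grows, the distances between random points concentrate, so “near” and “far” neighbors become nearly indistinguishable and distance-based methods start to break down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Measure how spread out pairwise distances are, relative to their
# typical size, as dimensionality grows. The relative spread shrinks,
# meaning all points look roughly equidistant in high dimensions.
relative_spread = {}
for d in (2, 20, 200, 2000):
    points = rng.uniform(size=(500, d))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # from point 0
    relative_spread[d] = dists.std() / dists.mean()
    print(f"d={d:4d}  relative spread of distances = {relative_spread[d]:.3f}")
```

This is one reason dimensionality reduction is often a prerequisite rather than an optimization: it restores contrast between neighbors before you ask anything distance-based of the data.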
(4) Wisely and Slow; They Stumble that Run Fast
Even when I was doing academic and government research, I never felt like I had enough time to understand my results before I had to move on, and now that I’m in a start-up environment, the problem is ten times worse. Most of the time, my impulse is to take a big swing on the first pitch, ingesting a big pile of data and shooting for the most ambitious analysis that I dare, hoping that I’ll get lucky and everything will just work out. Of course, things rarely just “work out”.
There’s a reason why in Star Trek you hardly ever see anyone running through the corridors of the Enterprise—just like the real-world military, Starfleet teaches that in general it’s more dangerous to run than it is to move quickly yet cautiously. When you’re starting a project, set aside some time for exploration before you settle on a line of analysis. If you’re working with a statistical model, generate synthetic data that matches your model before you move on to the real stuff. While checking your code and validating your results may seem like luxuries you don’t have time for, they often end up saving you time in the long run.
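The synthetic-data check above can be sketched in a few lines. Generate data from a model whose parameters you chose yourself, run your fitting code, and confirm it recovers them; if it can’t recover a slope you picked, it won’t do better on real data. The numbers here are hypothetical, and the model is a deliberately simple linear one.

```python
import numpy as np

rng = np.random.default_rng(7)

# Known ground truth: the pipeline must recover these before it earns
# the right to touch real data.
true_slope, true_intercept = 3.0, -1.5
x = rng.uniform(0, 10, size=1_000)
y = true_slope * x + true_intercept + rng.normal(0, 0.5, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
print(f"recovered slope={slope:.2f}, intercept={intercept:.2f}")
```

The same idea scales to any statistical model you can sample from, and the few minutes it takes routinely saves hours of debugging on the real stuff.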
Most importantly, stop every once in a while and think, really think about what you’re doing. I am as guilty as anyone of heading down the rabbit hole, and it always stings when I realize I’ve wandered way too far. Don’t obsess over a particular technical challenge for hours (or worse, days) before giving yourself the chance to stop and realize there might be a better way to get the answers you want—or even that you should be seeking different answers altogether.
(5) A Little Bit of Knowledge Is a Dangerous Thing
The open-source movement has led to an unprecedented democratization of the tools for data analysis: people no longer have to blow a few grand on a Matlab license when they can get the same (or often better) functionality from Python or R for free. However, it also means that things like training a support vector machine or performing logistic regression—tasks that once required years of study in statistics, linear algebra, and machine learning—can now be attempted by anyone who can type pip install scikit-learn. Unfortunately, even the most well-written implementations of these techniques require quite a bit of expertise to use properly, and using them improperly is often worse than not using them at all.
When in doubt, always choose a simpler method that you actually understand over a more complex one that you don’t. Moreover, even if you do understand a more complex method, consider trying the simpler one—it might work nearly as well, and likely will be easier to implement and validate. Of course, don’t stress too much when you realize that you’re out of your depth, because if you’re doing any kind of research or development worth doing, that is exactly the depth where you will spend most of your time.
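A quick way to keep yourself honest here is to always run the simple method alongside the complex one. The sketch below (a toy synthetic classification problem, not real data) pits plain logistic regression against a kernel SVM; on many problems the interpretable baseline comes surprisingly close, and if the gap is small, the simpler model usually wins on implementability and validation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A toy problem: 500 samples, 20 features. Compare a simple,
# interpretable baseline against a more complex model via
# cross-validated accuracy.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = {}
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("SVM (RBF kernel)", SVC())]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f} mean accuracy")
```

Only when the complex model buys you a margin you actually need should it graduate from experiment to production.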
While I find all of these lessons helpful in my work, there is one more thing I try to keep in mind: have fun. Working with big piles of data is pretty awesome these days. If you’re like me and believe that with enough data and enough work just about any question can be answered, then you also know there’s never been a better time in human history to have at it.
Good set of empirical rules, I would say. Sometimes we need to be vigilant about data quality as well, and a lot of data science depends on domain knowledge.
I have always been a firm believer of this “Most importantly, stop every once in a while and think, really think about what you’re doing.”
Like the new blog and congrats on the move to FitnessKeeper, I’m a fan of their work as well.
One more rule to consider, especially in the world of fitness, is “feedback governs dynamics”. This is my mantra from the system dynamics world. Anything that is truly dynamic (e.g., nonlinear) is so because of feedback. Causality is often shunned in the world of statistics, but things that oscillate, diffuse, overshoot and collapse, etc. are everywhere.
Great post, and I’ll second the “domain knowledge” comment. You don’t need to be an expert but you need to know your space. Most people would double check if they thought a cookie dough recipe said to use two tablespoons of salt, but I’ve seen kids who lack the context just go with that. The same can be true of say the surface temperature of an airliner at cruising altitude or the speed of a runner descending a steep hill. It helps to know what is “reasonable”.