Is Your Code Good Enough to Call Yourself a Data Scientist?

Is your code good enough to be calling yourself a Data Scientist? Figure out how to determine the answer to this question... and gain some suggestions on ensuring that the answer is "yes!"

By Sheamus McGovern, ODSC Chair.

At what point in your career can you call yourself a data scientist?  Is it when you declare it on LinkedIn or receive your first paycheck?  What is data science anyway?

ODSC Europe keynote speaker Gael Varoquaux gave one of the best and most succinct descriptions of what data science is:

Data Science = Statistics + Code

In a touch of humor, he also quoted another wit who stated that “a data scientist is a statistician who lives in San Francisco”. Humor aside, Gael was spot on. I know a few of you will argue over the term statistics in that equation, and even Nate Silver has been quoted as saying, “I think data scientist is a sexed up term for a statistician.” That infinite-loop debate aside, feel free to throw in your own favorite term, be it AI, machine learning, etc. However, I’d like to focus on the right side of that equation: the code bit.

Good software development is much more than learning the syntax, data structures, and libraries of a programming language. It’s a set of disciplines and techniques.  Being a professional data scientist means having professional competency in both terms of that equation. You may be a master of models, but if your peers snicker at you and use phrases like ‘Common Law Feature’ or ‘Mad boyfriend/girlfriend bug’ about your code, then you’re in trouble.

Here are a few quick remedies and simple advice to get you coding like a pro.

Build some test cases!

We spend much of our time being data janitors, cleaning the data, and we hate it.  So much for the sexiest job of the 21st century lol! A Crowdflower survey reported that 57% of data scientists found cleaning and organizing data the least enjoyable part of their job.  Your features may not change, but the data flowing through certainly will as you apply your models.  Inevitably your code will change, or break, or your results will be off. Is it a data issue or a code issue? Your test cases can go a long way toward answering that question.
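To make that concrete, here is a minimal sketch of what a test case for a cleaning step might look like. The `clean_ages` function and its 0 to 120 valid range are hypothetical, invented purely for illustration; the point is only that a handful of plain assertions pin down how the code should treat messy input.

```python
import math

def clean_ages(raw_values):
    """Hypothetical cleaning step: parse ages, dropping missing or impossible values."""
    cleaned = []
    for value in raw_values:
        try:
            age = float(value)
        except (TypeError, ValueError):
            continue  # None, "abc", and other unparseable entries are dropped
        if math.isnan(age) or not 0 <= age <= 120:
            continue  # NaN and out-of-range ages are dropped (range is an assumption)
        cleaned.append(age)
    return cleaned

def test_clean_ages_drops_bad_records():
    assert clean_ages(["34", None, "abc", "-5", "200", 61, "nan"]) == [34.0, 61.0]

def test_clean_ages_handles_empty_input():
    assert clean_ages([]) == []

test_clean_ages_drops_bad_records()
test_clean_ages_handles_empty_input()
```

If these tests still pass on the day your results go wonky, the cleaning code is probably not the culprit and the data deserves a closer look.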

Code coverage

Now that you’ve built some unit, integration, or regression tests, it’s time to add a code coverage framework such as covr for R.  Code coverage is a measure of the extent to which your code has been exercised when you run your test cases.  My advice: forget about 100% coverage. Far less is fine as long as it covers the critical and messiest (i.e., brittle) code bits. If you do get wonky results, at least you know which parts of the code are suspect and which parts are solid (tested). Professional developers eventually come to realize there's a compromise between achieving 100% test coverage, being productive, and testing the code that matters.
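As a toy illustration of the idea (not a substitute for a real coverage tool), the sketch below uses Python's standard `sys.settrace` hook to record which lines of a function actually ran. The helper `executed_line_offsets` and the `impute` function are hypothetical names made up for this example.

```python
import sys

def executed_line_offsets(func, *args):
    """Run func(*args) and return the line offsets (relative to its def line) that executed."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under measurement.
        if frame.f_code is not code:
            return None
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return hit

def impute(value, default):  # hypothetical function under test
    if value is None:
        return default
    return value

only_happy_path = executed_line_offsets(impute, 5, 0)
with_missing_value = executed_line_offsets(impute, None, 0)
```

Here the single "happy path" call never reaches the `return default` line (offset 2), while the call with a missing value does. A coverage report flags exactly that kind of untested branch, which is usually where the brittle code hides.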

Code Quality

This one is a no-brainer, and it astounds me how even pro developers get lazy on this one.  Regardless of the programming language you develop your models in, there are many code quality tools out there.  Run one against your code and it will give you tips and pointers on improving your code quality, saving you countless hours of refactoring and debugging.  Think of these tools as peer code reviews without being embarrassed in front of your peers.  Tip: use them at the beginning and throughout a project if possible, and not just at the end. Duh...

Version control your data

If you need convincing on software version control, then you are a lost cause; never darken my door.  For professionals it’s not optional, it’s required. Since software version control is a well-established best practice, it’s curious that more data scientists don’t bother to version control their data, size and security issues aside.  Simple case: you have a CSV file and you commit it to git.  Commit another version and the changes are readily identifiable. At the risk of mentioning Rails on a data science blog, ActiveRecord data migrations are a great example of this done well, providing an audit trail. Relational and NoSQL databases provide numerous (but perhaps not obvious) ways to version control your data in addition to just storing and accessing it.
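To show what "readily identifiable" means for the CSV case, here is a small sketch that mimics a line-based diff between two versions of a file using Python's standard `difflib`. The `csv_changes` helper and the sample data are assumptions for illustration; git itself does this for you once the file is committed.

```python
import difflib

def csv_changes(old_text, new_text):
    """Return the removed ('-') and added ('+') rows between two versions of a CSV."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="v1",
        tofile="v2",
        lineterm="",
    ))
    body = diff[2:]  # skip the two "--- v1" / "+++ v2" header lines
    # Keep only changed rows; hunk markers start with "@@", context rows with a space.
    return [line for line in body if line.startswith(("+", "-"))]

# Made-up sample data: one score changed and one row was appended.
version_1 = "id,score\n1,0.91\n2,0.47\n"
version_2 = "id,score\n1,0.91\n2,0.52\n3,0.88\n"

changes = csv_changes(version_1, version_2)
```

With the sample data above, `changes` contains the one removed row and the two added rows, which is the same audit trail a `git diff` on the committed file would give you.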

Learn Some Patterns

Software design patterns are a ubiquitous and well-established part of software development.  The advantages of using a language’s shared libraries are many, including not having to code common tasks from scratch.  Data science design patterns are similar in that they offer a possible solution to a commonly occurring problem.  I say possible, since the pattern may have to be tweaked or reworked for your particular problem.  Information on data science design patterns is limited, but I expect them to become more popular over the next few years and to become as established and required as software design patterns.
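One candidate for such a pattern, offered here only as my own illustrative assumption, is the pipeline: small single-purpose steps chained into one reusable transformation. All the function names below are hypothetical, and as noted above you would likely have to tweak the steps for your own problem.

```python
from functools import reduce

def make_pipeline(*steps):
    """Compose single-argument steps into one callable, applied left to right."""
    def run(data):
        return reduce(lambda current, step: step(current), steps, data)
    return run

def drop_missing(rows):
    return [row for row in rows if row is not None]

def to_float(rows):
    return [float(row) for row in rows]

def min_max_scale(rows):
    if not rows:
        return []
    low, high = min(rows), max(rows)
    if high == low:
        return [0.0 for _ in rows]  # avoid dividing by zero on constant input
    return [(value - low) / (high - low) for value in rows]

# Each step stays small and testable; the pipeline documents the order they run in.
prepare = make_pipeline(drop_missing, to_float, min_max_scale)
scaled = prepare(["2", None, "4", "6"])
```

The design choice is that swapping, removing, or reordering a step is a one-line change, and each step can get its own test case.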

Here are a few more somewhat blatantly obvious points to get you on the pro path:

  • Commit your code and back it up.  The day you don’t and lose it is the day you will cry like a toddler on the first day of kindergarten.
  • Get a better code editor.  Not using Jupyter Notebooks?...what is wrong with you!!
  • Don’t be afraid to refactor your code.  Got it right the first time? Highly doubtful.
  • Comment your code. The code quality tools mentioned above will catch lots of things like potential bugs, uncommented code, etc., but they can’t determine how inane your comments are.  Your code tells a story that most likely someone else will read. Make it a good one.
  • Read the source.  We love open source data science libraries and they make us enormously productive.  However, have you ever looked at the source code? It’s tough going but do it now and again and I promise it will be insightful and highly educational.

As you advance in your career, you will put greater emphasis on connecting with fellow data scientists who are in the trenches and can guide you on which coding languages, tools, and practices they find useful.  Applied data science conferences such as ODSC West are an excellent way to accomplish and accelerate this goal.  ODSC events give you the opportunity to connect with your peers and learn the latest languages, tools, and topics associated with programming for data science. You also get to hear and learn from some of the top coders who brought you your favorite open source tools and libraries.
