Should Data Science become a Profession: Pro and Con

A Data Science Code of Professional Conduct can protect both consumers of data science and data scientists themselves. But is it useful and possible without a single professional body? Read the pro and con arguments and join the lively debate on this topic.

On April 10, 2013, Gregory Piatetsky-Shapiro (KDnuggets), Eric Siegel (Predictive Analytics World) and Michael Walker (Rose Business Technologies) discussed on a Google Hangout whether data science should be an independent profession with a code of professional conduct and self-regulation.

Regulation of data science is under consideration (read here and here) and Michael Walker argued that either data science becomes a profession and regulates itself or Congress will impose draconian regulations that defeat the purpose of data science: to make life, business and government better. He has drafted a "Data Science Code of Professional Conduct".

Michael made these arguments in support of data science as a profession:

1) Data science is at an early stage and needs to develop a "Canon" (a body of principles, rules, standards, or norms) of scientific methods, principles and best practices for practitioners. Data science incorporates and overlaps several disciplines (data mining, statistics, machine learning, cloud/high-performance computing, databases, visualization), is wide open for innovation, and requires guidance to ensure data science is used to make life, business and government better, and to prevent abuse. Ninety percent (90%) of the world's data has been produced in the past two years and will grow exponentially. How we extract meaning from all this data without creating "an illusion of reality" is important.

2) To protect both consumers of data science and data scientists from charlatans, illegal and unethical conduct and data science malpractice. A Data Science Code of Professional Conduct is needed to protect individuals' privacy and clients' confidential data, to prevent conflicts of interest, and to ensure data scientists have a duty to the greater good of society, and not just blind loyalty to the client.

3) Self-regulation versus imposed regulation. Either data science becomes a profession and regulates itself or Congress will impose both good and bad regulations. It is better for data scientists to architect and implement a regulatory scheme than to trust Congress to enact an appropriate regulatory structure that may defeat or limit the development of data science.

4) To create a check and balance against big government and big business using data science at the expense of the majority in society. Some argue that the internet, mobile smartphones and computers are a big spying machine that big government and business use to collect information on people, further eroding civil liberties. The potential for abuse is significant and the professionalization of data science can mitigate harms.

Reasons to oppose data science becoming a profession include:

1) Professions tend to create artificial barriers to entry causing artificially higher prices.

2) Professions tend to be self-serving at the expense of consumers.

3) Professions - after a period of time - tend to stifle innovation to protect vested interests.

Michael Walker argued that - on balance - the equities favor data science becoming a profession. He pointed out that in many disciplines like medical research, economics and psychology, data manipulation is common and the scientific method has not been honored resulting in decreased reputation and the eroding trust of society. Future data scientists need to preempt this outcome by not only honoring the traditional scientific method, but by developing new data science "canons" and scientific methods to liberate meaning from data without creating an illusion of reality.

Eric Siegel is agnostic about whether data science needs to become a profession. Mr. Siegel agreed that data science can be abused and that a code of professional conduct may be useful, and stated that a certification to establish a base level of competency may be prudent. He voiced concern over the civil liberties aspect of the use and potential abuse of data.

Gregory Piatetsky-Shapiro argued against data science becoming a profession. He noted that other established organizations, like ACM (computing professionals), are considering The Pledge of the Computing Professional, which touches upon many themes relevant to Data Science, and also pointed out that INFORMS has Analytics Certification programs.

He thinks these organizations will be adequate to develop data science.

Gregory asserted that while a code of professional conduct is a noble goal, it is meaningless without a central organization that promotes and enforces this goal, and currently data science is such a diverse field that a central organization is very unlikely. Just looking at current Data Science related meetings, we see meetings sponsored by research societies like ACM, IEEE, INFORMS, SIAM, commercial companies like O'Reilly, GigaOM, IEG, Big Data companies like IBM, SAS, EMC, and many others. It looks very unlikely that all these diverse interests will agree to a single organization to enforce any code of conduct. This view was shared by the majority of data scientists who took part in a recent KDnuggets Poll (March 2013) and were against a Data Science pledge.

Michael responded that data science is a new field that encompasses a variety of skill sets from different disciplines and desperately requires a professional body to develop canons that incorporate and blend scientific methods from a myriad of disciplines. The blend of scientific methods will create something new and relying on the scientific methods of math, statistics, computer engineering and others - alone - is not sufficient. Data science requires its own professional canons.

Michael also asserted that - while a majority of data scientists may not at this time favor a "pledge" - a large majority of data science consumers would likely favor hiring a data scientist who is certified and is required to honor a code of professional conduct - similar to certified public accountants, lawyers and physicians. Considering the significant damage data science malpractice can cause, Walker speculated that the market would favor certified, professionalized data scientists. Moreover, a professional code can protect data scientists from unethical and illegal client conduct.

Mr. Walker suggested that we should learn from other professions like law and medicine - adopt the good and remove the bad to mitigate the negatives of a profession. To earn and maintain trust and credibility, data science must follow traditional scientific methods, innovate new methods and follow a code of professional conduct.

Comments from around the web:

In Next Gen Market Research (NGMR) - The Best MR Networking Group on the Web!

Tom Anderson
Why is everyone so gaga for regulations??

Gregory Piatetsky-Shapiro
Most data scientists in KDnuggets Poll are against regulation, but the question to be debated is whether government will impose some regulation. I doubt it. However, there are professional societies that create certifications. But if data scientists are in big demand and get increasing power, what is the responsibility that comes with that power?

Tom Anderson
ZZZzzzzz Makes perfect sense that most are against. Useless certifications are usually only a benefit to those who charge dues for them.

What is your stance, for or against?

This very same question came up during the panels I participated in last year at both the Text Analytics Summit and Text Analytics World events. There too consensus among both panelists and attendees seemed to be that standardization etc. were a bad idea.

Gregory Piatetsky-Shapiro
I can see a lot of demand for technical, professional certificates, to recognize an achievement in education. INFORMS offers CAP, and many analytics outlets offer certificates.

I don't see any demand for a "non-technical" pledge or code of conduct.

Tom Anderson
The certs are probably ok for folks early on in their careers. Don't think there would be much demand among skilled practitioners who can point to the work they have already been doing.

Anyone with enough experience or a degree in associated fields would probably not go for it. The mere fact that those who are best in the field would not seek it out would serve to drive down the value of the cert.

INFORMS is a very interesting org which might have the credibility needed.

But if the MR industry gives any indication of what would happen, it would be that many different trade orgs and trainers of all types would all serve to further drive down the value of any cert. Why not just say on your resume (assuming you needed it) that you have taken X, Y and Z training courses/seminars? Whether cert or not, any employer worth their salt would need to confirm knowledge and skill later anyway.

From Research Methods and Analytics

chris jensen
My belief is that data should be open to all. There is so much hidden data, and probably always will be hidden data, such a shame for the human race...

Gene Shackman
I don't know if it SHOULD. I know that it won't. "data science" is too vague, too many different people with different backgrounds do it, there isn't any universally accepted definition, there is too much money involved so that too many people will want in, etc. So it's not likely to happen.

What is it that's really behind this question? I suspect that there is worry that too many untrained people do it, without really knowing all they should know. So, instead of regulations, we should encourage everyone involved to get training, and educate the public about statistics and data analysis.

Mark Biernbaum, PhD
I agree with Gene, in that this question is probably prompted by the droves of people doing data that have no real idea what they're doing (like in business, for example, and in particular, in marketing). I know that my sister got an MBA several years ago and was required to take 1 course in data, and that was it, regardless of the fact that her emphasis is marketing, which uses tons of data. I also know big-time data miners who could not tell you what the assumptions were that govern ANOVA testing - they live by the law of large numbers and don't think any of that applies to them.

Any credentialing program set up now would have to grandfather in thousands upon thousands upon thousands of individuals who have been using data their entire careers. For those being educated now, a credential option might not be bad, at least ensuring they know the basics. And the credentialing option might help future generations. But right now, so many would be grandfathered in, it makes little sense. Also, gaining agreement from professionals on what the credential should contain could be totally impossible, given the huge array of ways data is used in our society.