
KDnuggets, April 2020 (20:n15)

Has AI Come Full Circle? A data science journey, or why I accepted a data science job


Personal journeys in Data Science can vary greatly between individuals. Some are just getting started and wading into this vast ocean of opportunity, and others have been involved during its decades-long evolution as a professional field. This review of a longer journey can provide a broader perspective on how you might fit into this interesting career.



By Tom Khabaza, Experienced Data Miner.


I greatly enjoyed Admond Lee’s article in KDnuggets,  “Why Did I Reject a Data Science Job?” because I felt a strong sense of recognition.  I approved of Lee’s decision to choose the job over the title, perhaps because titles and terminology now seem to me so fluid after a long career.  Lee inspired me to think about my data science journey; perhaps KDnuggets readers would be interested in a longer perspective.

 

Prehistory

 

It’s a cliché to agonise about where to start.  My data science journey began in 1978 when I first laid hands on a computer, writing simple numerical algorithms in BASIC using a teletype connected to a mainframe.  Then came using BASIC, FORTRAN IV and assembler to collect and analyse experimental results; using early versions of Unix and enthusing about Bourne’s new shell; learning about object-oriented programming in Smalltalk from Byte magazine and getting excited about Stroustrup’s C++; and finally lecturing on machine learning and AI.  All these threads led to my career in data science.

 

Machine Learning Applications

 

My real data science career (except we called it “data mining”) started in 1992, working for ISL, a British AI start-up, implementing our first commercial application applying machine learning to data.  Back then, we coded in POP-11 and used its machine learning libraries, all of which would seem very familiar to today’s data scientists using Python.

After coding two or three applications for different clients, we wanted to make the process easier by creating re-usable modules.  These were also the early days of graphical user interfaces, and we thought that a visual way of connecting these modules might be helpful.  This was the origin of Clementine, invented by Colin Shearer; Clementine quickly sparked more commercial interest than the rest of our products put together.

 

Data Mining for Data Owners

 

Clementine was not just a useful tool for machine learning engineers; behind it was a vision: to enable business people to use machine learning with their data, in the same way that spreadsheets had enabled them to do simple business computing without programming.  We often referred to Clementine as “the spreadsheet of data mining.”  The goal of Clementine was to make machine learning applications possible without programming or deep knowledge of machine learning.

My job in the mid-1990s was to lead the consultancy around Clementine: to go into client organisations, understand their business issues and related data, and show them how to build a machine learning solution.  We provided training courses on how to use the software, but that’s quite different from knowing how to formulate an analytical solution to a business problem.

This was also the era of the CRISP-DM project: collaborating closely with a small number of end-users and suppliers, and more loosely with a wide range of organisations, we aimed to produce an industry-standard methodology for data mining.  We produced a distilled version of the process by which machine learning applications are created, and we offered a systematic answer to the question I’d been answering through ad-hoc consultancy: “how can I start from a business objective and produce a machine learning solution?”  CRISP-DM had special strengths in its approach to business goals, business evaluation and deployment of results, and the importance of business knowledge throughout the process.

However, my experience with clients had also revealed a glitch in our “democratisation” of machine learning.  Although Clementine applications could indeed be built without programming or deep knowledge of machine learning, this did not mean that they were as easy to build as a simple spreadsheet.  Applying machine learning to business problems requires conceptual leaps: this showed up in our training courses, where the machine learning aspect was always the hardest part for students to understand.  The transformation of data for analysis was also significant: each step was easy to understand, but a complex transformation process remains complex and is, in some ways, just like programming, even when using a higher-level toolkit.  Consider the difference between coding in assembler and Python; both are programming, but no one would choose assembler over Python without a good reason (except for fun).

 

Predictive Analytics & Mainstream Machine Learning

 

At the end of 1998, ISL was acquired by SPSS.  For Clementine as a product, this was a game-changer, opening up a global market and leading to a vast array of applications.  This was a period of expansion, during which the immense value of machine learning applications became apparent; we published case studies of industry-changing applications and ROI in the billions of dollars.  We applied Clementine to a wide range of data, including unstructured forms.  The term “Predictive Analytics” was coined: a more understandable phrase with a clearer link to potential benefits.  Machine learning was becoming mainstream, and the main vehicle for this was commercial software packages and their associated methodologies (this description applies equally well to SAS and to Clementine, the latter now renamed “SPSS Modeler”).

 

The Era of Data Science and Open Source

 

Over the last 10 years, the growth of mainstream machine learning has continued hand-in-hand with the growth of open source software.  The association of machine learning applications with open source was surprising to many in the world of commercial software, but perhaps it should not have been.  The high price of commercial software created a barrier, not so much for organisations adopting machine learning as for individuals learning the tools of the trade.  It’s much easier for a university to teach, or a curious individual to learn, Python than a commercial package, so it’s easier to hire people with this kind of knowledge.

On the other hand, the outstanding benefits to be gained from machine learning applications made commercial software a victim of its own success.  What does it matter if developing a predictive model takes a little longer if the ROI is 10,000%?  Compared to the benefits, both software and development costs were insignificant, and in any case, developing a machine learning model is only a small part of the complete solution.  More complex development processes and a lack of systematic methodology may slow down development and reduce the quality of results, but the returns are still enormous, and, well, everybody’s doing it.

 

Perspective

 

At first, I resisted the term “data science,” but I’ve come to accept it; I’ve seen buzzwords come and go, and like Lee, I think that what I do is more important than what it’s called.

I’ve spent most of the last 12 years as a freelance data scientist, helping organisations get better value from their data, sometimes using SPSS Modeler or SAS, but increasingly using SQL, R, and Python in line with the preferences of my clients.  Developing a model in a coding language may be more cumbersome than using a higher-level tool, but I still enjoy coding, just as I did at the very start.

So it doesn’t bother me that AI applications are now built using tools like those of 30 years ago, or that more efficient tools are widely ignored.  It does bother me when the methodology is also 30 years out of date; exclusive emphasis on low-level technical issues can sideline business requirements and the business significance of intermediate findings.  Technical problems must be solved, but if we do so at the expense of business relevance, then we doom our solutions to fail.

I’ve seen good machine learning projects, and I’ve seen bad ones; the best are those where development is closely linked to business outcomes because this encourages a business-focused methodology and fosters success.  It’s good to be doing something useful!

 
