Data Science History and Overview

In this era of big data, which is only getting bigger, a huge amount of information from different fields is gathered and stored. Analyzing it and extracting value from it have become some of the most attractive tasks for companies and society in general, and this work falls to a new professional role: the Data Scientist.



By Giuliano Liguori, Global CIO & Digital Transformation Manager.

 

What is Data Science in simple words?

 

The term “Data Science” has emerged only recently to specifically designate a new profession that is expected to make sense of the vast stores of big data. But making sense of data has a long history and has been discussed by scientists, statisticians, librarians, computer scientists, and others for years.

Today, Data Science as a business field is complex and, owing to its remarkable popularity, has been described in many ways. For example:

Data Science is concerned with analyzing data and extracting useful knowledge from it. Building predictive models is usually the most important activity for a Data Scientist (Gregory Piatetsky, KDnuggets, https://www.kdnuggets.com/tag/data-science).

Data Science is concerned with analyzing Big Data to extract correlations with estimates of likelihood and error (Brodie, 2015a).

Data science is an emerging discipline that draws upon knowledge in statistical methodology and computer science to create impactful predictions and insights for a wide range of traditional scholarly fields (Harvard Data Science Initiative https://datascience.harvard.edu).

However, in simple words, data scientists just try to get insights from massive amounts of data that can help companies make smarter business decisions. We also define Data Science as a methodology by which actionable insights can be inferred from data.

Data science uses a wide array of data-oriented technologies, including SQL, Python, R, and Hadoop. It also makes extensive use of statistical analysis, data visualization, distributed architecture, and more to extract meaning from sets of data. The information extracted through data science applications is used to guide business processes and reach organizational goals.

To complete this section, we will also provide a simple definition of the concepts of data mining, artificial intelligence, machine learning, and deep learning, as these are related to data science and each other.

  • Data mining aims to understand and discover new, previously unseen knowledge in the data.
  • Artificial intelligence (AI) is concerned with making machines smart, aiming to create a system that behaves like a human.
  • Machine learning is a subset of Artificial Intelligence. Machine learning aims to develop algorithms that can learn from historical data and improve the system with experience.
  • Deep learning is a subset of ML in which data is passed through multiple non-linear transformations to calculate an output.

Fig. 1 Relationship between Artificial Intelligence, Machine Learning, Deep Learning, and Data Science.
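
To make the machine learning and deep learning definitions above a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any production system works: the function names `fit_line` and `forward`, the "historical" data, and the layer weights are all made up for this example. The first function learns a straight line from past observations, and the second passes a value through a stack of non-linear transformations.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to historical data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forward(x, layers):
    """Pass x through a stack of (weight, bias) layers, each followed by tanh."""
    for weight, bias in layers:
        x = math.tanh(weight * x + bias)
    return x


# Hypothetical historical data that happens to follow y = 2x + 1.
months = [1, 2, 3, 4, 5]
tasks = [3, 5, 7, 9, 11]

# "Learning": estimate the parameters from the past observations.
slope, intercept = fit_line(months, tasks)

# "Predicting": apply the learned parameters to an unseen input.
prediction = slope * 6 + intercept

# Hypothetical, hand-picked weights; a real network would learn these too.
layers = [(0.5, 0.1), (-1.2, 0.3), (0.8, 0.0)]
output = forward(1.0, layers)
```

In practice, libraries such as scikit-learn, TensorFlow, or PyTorch handle both the fitting and the layered transformations, on far larger data and models than this toy example.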

Data science makes use of data mining, machine learning, and Artificial Intelligence techniques.

For example, deep learning often requires running Jupyter in more powerful environments. Fortunately, platforms like Saturn Cloud make the Jupyter development environment easier to manage. By managing the environment's resources, the user can add CPU, GPU, and memory just when they are needed. A platform designed for cloud computing therefore keeps the cost of the environment low, letting the Data Scientist pay only for the resources they use.

 

A brief history of Data Science

 

Data Science has revolutionized several different aspects of our world. Let's take a look at when and where it came from.

  • In 1962, John W. Tukey wrote “The Future of Data Analysis.” The work of this brilliant American mathematician is widely recognized as the first milestone in the history of data science. Tukey's influence on statistics is enormous, but the most famous coinage attributed to him comes from computer science: he was the first to introduce the term "bit" as a contraction of "binary digit."
  • In 1974, Peter Naur published the Concise Survey of Computer Methods, which surveyed data processing methods across a wide variety of applications. The term “data science” became clearer as he gave his own definition of it: “The science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.”
  • In 1977, the International Association for Statistical Computing (IASC) was founded.
  • In 1989, Gregory Piatetsky-Shapiro organized and chaired the first Knowledge Discovery in Databases (KDD) workshop.
  • In 1994, BusinessWeek published a cover story on “Database Marketing.”
  • In 1996, at the conference of the International Federation of Classification Societies (IFCS), the term “data science” was included in the title of the conference for the first time (“Data science, classification, and related methods”). In the same year, Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth published “From Data Mining to Knowledge Discovery in Databases.”
  • In 1997, during his inaugural lecture as the H. C. Carver Chair in Statistics at the University of Michigan, Jeff Wu called for statistics to be renamed “data science” and statisticians to be renamed “data scientists.”

Fig. 2 History of Data Science.

Since the beginning of the 21st century, data stockpiles have expanded exponentially, largely thanks to advances in processing and storage that are both efficient and cost-effective at scale. The capability to collect, process, analyze, and display data and information in “real-time” gives us an unprecedented opportunity to conduct a new form of knowledge discovery. To process this huge amount of data, Data Scientists need high performance as well as a large portfolio of technologies that speed up tasks and cut data processing to a matter of seconds.

Disruptive technologies like artificial intelligence, machine learning, and deep learning are now within reach of Data Scientists thanks to the powerful platforms available.

 

Challenges to practicing Data Science

 

While the adoption of analytics has increased, it comes with its own set of challenges. A study conducted in 2017 by Kaggle on a sample of 16,000 data professionals revealed the 10 biggest challenges they face in their profession:

  1. Dirty data (36% reported)
  2. Lack of data science talent (30%)
  3. Company politics (27%)
  4. Lack of a clear question (22%)
  5. Data inaccessible (22%)
  6. Results not used by decision-makers (18%)
  7. Explaining data science to others (16%)
  8. Privacy issues (14%)
  9. Lack of domain expertise (14%)
  10. Organization is small and cannot afford a data science team (13%)

These are serious challenges. However, we need to realize that every step forward in a new discipline brings new challenges to address. We must embrace transformative change and trust that it helps us improve continuously by acquiring new skills, expanding our knowledge, and exploring new approaches.

 

Who is a Data Scientist?

 

As pointed out above, with operating data constantly growing and new technologies emerging, we increasingly need professionals with the analytical acumen to extract valuable information and insights from massive amounts of data and to make precise decisions. We call these experts “Data Science teams” or simply “Data Scientists.”

The Data Scientist is an analytical data expert who should have mastered the technical skills needed to solve complex problems in the modern world. Today's emerging technologies, such as AI, IoT, 5G, robotics, blockchain, and so on, rely heavily on data, and only those who are able to work with data and translate it into profitable products will guide the digital business of the near future.

Therefore, Data Scientists play an essential role in the business development strategy of every company and organization. As Thomas H. Davenport and D.J. Patil put it, Data Scientist is the sexiest job of the 21st century.

 

Data Science tools for Data Scientists

 

An extensive collection of software tools is available to help the Data Scientist dive into the world of Data Science. Today's platforms enable Data Scientists to work at scale using the tools they know best: Python, Jupyter, and Dask. Usually, the services are provided through secure and scalable infrastructures for running data science and machine learning workloads within cloud environments. Data teams can develop and deploy data science models in Python at scale with automated DevOps and ML infrastructure engineering.

The platforms support a lot of useful Python libraries.

“Python libraries are collections of functions and methods that allow a Data Scientist to perform many actions without writing the code themselves.”

  • NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices.
  • Seaborn is a Python data visualization library based on matplotlib.
  • TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks.
  • PyTorch is an open-source machine learning library based on the Torch library.
  • Numba is an open-source JIT compiler that translates a subset of Python and NumPy into fast machine code using LLVM.
  • SciPy is a free and open-source Python library used for scientific computing and technical computing.
  • Pandas is a software library written for the Python programming language for data manipulation and analysis.
  • Scikit-learn is a free software machine learning library for the Python programming language.
  • Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
  • Bokeh is a data visualization library in Python that provides high-performance interactive charts and plots.

Fig. 3 Python libraries.
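
As a small illustration of what "perform many actions without writing code" means in practice, the sketch below uses NumPy, the first library in the list above. It assumes NumPy is installed, and the measurements are invented for the example. A single call computes a mean per column, and array arithmetic subtracts those means from every row without an explicit loop.

```python
import numpy as np

# Hypothetical measurements: 3 samples (rows) by 4 features (columns).
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 6.0, 9.0, 12.0],
])

# One mean per feature, computed along the row axis.
column_means = data.mean(axis=0)

# Broadcasting subtracts the per-column means from every row at once.
centered = data - column_means
```

The other libraries in the list follow the same idea at a higher level: Pandas adds labeled tables on top of arrays like this one, and Matplotlib, Seaborn, and Bokeh turn them into charts.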

For example, the well-recognized Data Science platform Saturn Cloud offers an end-to-end analytics platform, all in Python, on AWS. This includes:

  • Dask, which allows organizations to scale out Python and dramatically reduce runtime.
  • A suite of collaboration tools, model deployment capabilities, and tools for the machine learning lifecycle.
  • Prefect, which provides a workflow orchestration framework that eliminates manual effort on the part of developers and data scientists.
  • Integration with services like Docker and Kubernetes, so that data scientists can build a custom image that meets their development needs.
  • Jupyter Notebooks to deploy, manage, and scale the PyData stack.

 

Where are we going? Perspectives.

 

As John Tukey predicted: “the future of data analysis can involve great progress, the overcoming of real difficulties and the provision of great service to all fields of science and technology.” In recent years, we have witnessed many data-driven technological innovations: lightning-fast 5G Internet, machine learning, cloud computing, and the blockchain concept, a noteworthy list that is far from exhaustive. The explosion of data, along with growing technological abilities, is just the beginning, and our lives are becoming “smarter” with technological innovations that might soon be integrated into all aspects of human life.

Original. Reposted with permission.
