What is data science?

I think everyone who tries to put together a blog about data science feels the pressure to come up with a post explaining what data science is. So this is my take on the subject.

But before starting on it, I think it is worth talking about myself a little, since my career and my experiences strongly influence my view of the subject.

I majored in Biological Sciences and earned a Ph.D. researching the evolution of complex systems. My scientific work focused on simulating the evolution of complex animal features, including both morphological and genetic traits. To explore the data, I mostly used clustering algorithms, coupled with frequentist or Bayesian statistical methods. So, as you can see, it involved a lot of statistics and computation, where I would change the initial conditions to see if I could get convergent results. Later, and unrelated to my research, I taught myself how to code. Later still, I changed careers to pursue a data scientist position. This previous experience with experimental science and academia strongly influences my view of data science and how I work.

Data Science Today

Data science lies at the intersection of computer science, statistics, and business domains. Together, those three components form the framework of data science, with a focus on getting value from data. The “value” can obviously be monetary, but it can also mean insights or intelligence. To get value from data, you need to study your data, so data science mainly uses the scientific method (the stats part) applied to a large amount of data (you will need the computer science knowledge here to move this data around) with the goal of making money (the business part, of course).

I’m going to talk a bit about each of the components. But it’s important to note that those three components are not evenly distributed in the life of data scientists, and the ratio between them is not the same for everybody in the profession. Some data scientists have never worked outside an experimental bubble of R/Python workstations. Those data scientists may never put a model in production or work with production-ready code. Other data scientists may never touch anything closely related to a significance test, but are very strong software developers. And yet others work with a strong team of product people, or work mainly on experimental projects, so they do not need to worry about the business side of the profession.

Statistics

You can’t do much without statistics. Statistical methods are the base of science and make it possible to explore the data, measure significance, and visualize the results. This used to be called data mining, and luckily for us, that name is not used anymore. It is just sad and undermines the importance of analytics (or descriptive statistics) and statistical inference.

In my opinion, statistics is the most difficult to grasp and the most important of the three areas. This is obviously a biased opinion, but I already told you that in the introduction. My work in experimental science and academia is quite important to me and definitely influences my opinion.

So why do I think this is so important? This is the only part of data science that will not change. Everything else can change, especially technology and business goals. But statistics will stay, because you can always rely on the method to help you make decisions. So, it’s worth being good at it. And of course, it’s the part that I feel I’m good at, so another bias here =).

Take your time to understand the advanced parts of statistics, especially how to design experiments and analyse data in depth. The effort will pay off in the long run.
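
To make "measuring significance" a bit more concrete, here is a minimal sketch of a permutation test for the difference between two group means, using only the Python standard library. The function name, the parameters, and the numbers are all made up for illustration; it is one simple way to do this under stated assumptions, not the only or the best one.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Hypothetical helper for illustration: it estimates how often a
    random relabelling of the pooled data gives a difference in means
    at least as large as the one actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements, purely for illustration.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]

p_value = permutation_test(control, treatment)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value here only says that a difference this large would be rare if the group labels did not matter. Deciding what to measure, and how to collect it, is still the experimental design part.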

Computer Science

New technology and computing power make it possible to capture and store huge amounts of data. So, you need to be comfortable with algorithms, computers, and tools for dealing with the ever-increasing scale of data. High-performance computing technologies and cloud computing allow us to access databases, wrangle data, and apply complex machine learning and artificial intelligence algorithms. Machine learning and AI models can make predictions that are worth a lot of money for companies. To make these predictions useful, the models need to move from the data scientist’s computer out to the world, in the form of applications. This is what is called putting your models into production. For that to work, you need to have the skills of a software programmer. You will need to be well versed in programming languages (mostly Python nowadays), algorithms and data structures, version control systems, debugging and testing, and databases.

Business Knowledge

You need to understand the business goal to identify the battles worth fighting. Working with data is expensive and takes time. It’s part of the job to communicate to the business people what is possible and what is a waste of time. Additionally, it’s easy to get distracted testing all the new and shiny technologies out there. But it is part of the job to bring value to the company you are working for. So, it’s important to think of yourself as a problem solver focused on delivering value, not on developing with the coolest new algorithm.

Anything else?

I’m glad you asked! As if it wasn’t enough to have knowledge of three completely unrelated fields, I would add some other things that help data scientists.

Math knowledge can be very helpful. You don’t need to know everything, but it is worth grasping a little of linear algebra, calculus, probability, discrete mathematics, and optimization theory.

You can go really far in your career by being an excellent storyteller or communicator, especially when you are more advanced in your career. Being able to extract the nuggets of gold hidden under mountains of data is a valuable skill in the field. Data scientists at all stages can benefit from some knowledge of infovis and graphic design (and it can be a good starting point for aspiring data scientists).

It is clear that the profession is very demanding, and the subjects are very broad. Some people argue that it is the data scientist’s job to understand the entire process. The data scientist should do it all: data gathering, data wrangling, model training, writing production-ready code, putting the model into production, and monitoring the results. Like “real scientists” do. But the reality is that most scientists don’t work in all phases of a research project. Large research projects can be very compartmentalized, and the researchers can divide the work to be more productive. I think the same holds true for data scientists. Data scientists will be more productive if they specialize and divide the workflow of large projects. But this is a subject for another post.

My Favorite Definitions

There are so many great definitions out there of what data science is and of the work data scientists do. They were written by people with much more knowledge than I, and I’m always happy to read them again and to relearn each time. So, to end this post with some great information, I’m going to share some of my favorite quotes and their sources here.

On training new data scientists and the public to deal with data:

“Training the next generation in the fine art of deriving intelligent understanding from data is needed for the success of sciences, communities, projects, agencies, businesses, and economies. This is true for both specialists (scientists) and non-specialists (everyone else: the public, educators and students, workforce). Specialists must learn and apply new data science research techniques in order to advance our understanding of the Universe. Non-specialists require information literacy skills as productive members of the 21st century workforce, integrating foundational skills for lifelong learning in a world increasingly dominated by data.”

Kirk D. Borne and other astrophysicists submit to the Astro2010 Decadal Survey a paper titled “The Revolution in Astronomy Education: Data Science for the Masses” (PDF).

On the data scientist zen:

“Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: ‘here’s a lot of data, what can you make from it?’”

In 2010, Mike Loukides writes in “What is Data Science?”

On the various disciplines that make up the scope of data science:

“…we thought it would be useful to propose one possible taxonomy… of what a data scientist does, in roughly chronological order: Obtain, Scrub, Explore, Model, and iNterpret…. Data science is clearly a blend of the hackers’ arts… statistics and machine learning… and the expertise in mathematics and the domain of the data for the analysis to be interpretable… It requires creative decisions and open-mindedness in a scientific context.”

In 2010, Hilary Mason and Chris Wiggins write in “A Taxonomy of Data Science”.

On defining data science by what data scientists do:

“‘Data Science’ is defined as what ‘Data Scientists’ do. What Data Scientists do has been very well covered, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. Who Data Scientists are may be the more fundamental question… I tend to like the idea that Data Science is defined by its practitioners, that it’s a career path rather than a category of activities. In my conversations with people, it seems that people who consider themselves Data Scientists typically have eclectic career paths, that might in some ways seem not to make much sense.”

In 2011, Harlan Harris writes in “Data Science, Moore’s Law, and Moneyball”.

Just for fun, a Twitter definition:

[Image: Twitter definitions of data science (circa 2014)]