The need to use data to solve problems or optimize business processes is greater than ever. Businesses use data as a source of knowledge, drawing on one or several tools and technologies from a broad array of options to extract insights or build machine learning models that predict future events, and in a subsequent step create business value.
Creating value for your business from data demands many different components. Like many of my Data Science colleagues, the most fun I get out of my job as a Data Scientist comes from working hands-on with technology and using advanced algorithms to finish tasks and reach a checkpoint or target. That is of course an essential part of any data workflow, yet it is not where the journey begins, and it is not all there is to getting the most out of data.
A big part of working in Data Science is about addressing uncertainty. Although uncertainty comes from multiple sources, the primary source is by far the data we work with, which holds unknown, complex patterns to discover and use to our advantage. How we address the uncertainty is not always something we’ll be able to plan for, yet setting up structured ways of working will help your Data Science team address it more efficiently than it otherwise could.
I’ll use this post to describe uncertainty when working with data and to motivate defining the parts of a workflow with the handling of uncertainty in mind.
To start out, let’s define the Data Science work related to uncertainty before suggesting ways of working within the field. Data Science has been, and still is, maturing, and a lot of good things have come out of that in how we apply knowledge, technology and other tools.
Your Data Science team works towards creating an outcome (we don’t know exactly what the outcome is yet or what it will contain, and we’ll get back to that later in this post), and that outcome is what they’ll deliver. The outcome can be an insights report, a machine learning model or something else, depending on demands and requirements from the business side. Delivery can, for example, mean that the team hands over a report or puts a machine learning model in production. From data, insights and code, your team will build the outcome and deliver it, and then build a new iteration of that outcome.
In this context there is an expectation of delivery, and often a demand that it should happen within a certain time frame. A lot of that is about managing expectations and letting the people behind the requirement, whether a client or another party, know what to expect from data science compared with what can be expected from fields seen as relatively “close” in nature, such as software development.
To deliver something great using Data Science, the team needs to invest time in learning. A big part of learning in Data Science comes from experimentation, as the field is heavily research oriented. If you are building machine learning models, each model you train is a new experiment, and whether it succeeds or fails when evaluated against machine learning metrics and/or business requirements, you’ll learn something either way. Similarly, when beginning the analytics work required before delivering an insights report, you might have some idea of what the output will look like, but you don’t really know much about the content since you don’t know what you’ll find in the data. You might have ideas or hypotheses about things you’ll find in the data, but those need to be tested and confirmed through experimentation.
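As a minimal sketch of the idea that every experiment teaches something, the record below stores a hypothesis next to the metric it was evaluated on and the target agreed with the business side. The class name, the fields and the assumption that a higher metric value is better are all hypothetical choices made for illustration, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """One experiment: a hypothesis plus the evidence gathered for it."""

    hypothesis: str
    metric_name: str
    metric_value: float   # measured on held-out data
    metric_target: float  # hypothetical target agreed with the business side

    def outcome(self) -> str:
        # Assumption: higher metric values are better.
        if self.metric_value >= self.metric_target:
            return "confirmed"
        return "rejected"

    def learning(self) -> str:
        # A rejected hypothesis is still recorded, since it is still knowledge.
        return (
            f"{self.hypothesis}: {self.outcome()} "
            f"({self.metric_name}={self.metric_value:.2f}, "
            f"target {self.metric_target:.2f})"
        )


if __name__ == "__main__":
    run = Experiment("Recent purchases predict churn", "recall", 0.62, 0.70)
    print(run.learning())
```

Keeping the failed runs in the same format as the successful ones makes it easier to see later which ideas have already been tried.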
It is the explorative nature of digging into data and finding the best outcome, instead of building towards a preset outcome, that distinguishes data science from other fields. The need to explore data and experiment with potential outcomes has its source in the uncertainty found in the data and in the outcome: uncertainty about patterns, such as patterns in behavior or in the occurrence of events, and uncertainty about how well a built solution will capture or reflect those patterns as they evolve over time.
Learning happens in many of the tasks involved in creating a data solution. For a data science team, learning about the business side is just as important as learning from digging into data, finding patterns and building models. The business objectives set the direction, which is used to shape the expected data solution and break it down into value-creating tasks.
Because of the uncertainty of building solutions based on data, and the resulting need for research into often complex patterns, the use of hypotheses and work through trial and error, the data solution is best built in an iterative process. Agile methods are essential here to be able to adapt to new findings and requirements, and to let the solution be defined and evolve over the course of the project. The data solution is the outcome of the project, and it is during the project that we’ll learn what the solution is and what it can accomplish.
There are methodologies for how the workflow can be structured, the most common being CRISP-DM (Cross-Industry Standard Process for Data Mining), either in its original form or altered, and TDSP (Team Data Science Process), which like most Data Science workflows is derived from CRISP-DM. I tend to favor TDSP because of its up-to-date components, such as version control, because it is in tune with agile methodologies, and because of its focus on team and collaboration and its inclusion of the modern technical practicalities used during deployment. Yet CRISP-DM still serves a purpose as a tool for introducing new colleagues to how to get the most out of working with data, as the older workflow is a bit easier to understand. How the workflow is structured will probably need adjusting to the needs of your team, such as which competences it includes, and to the needs and requirements specified during the project.
Delivery is about deploying the data solution to production and reaping its benefits, or handing over an insights report. It is also about making it possible to integrate changes into current solutions efficiently. The level of benefit depends on your Data Science team having been able to learn enough from the data. And you don’t want to see time wasted when the team could be productive and working towards the next iteration of the data solution. What you want to encourage is rapid experimentation, where your team can learn and build from data in faster iterations while retaining at least the same level of learning. You’ll be able to deliver each iteration faster while potentially learning more, since the team may have time to run more experiments.
The way to achieve this comes down to how the workflow is set up. The combination of things needed to make this work is not a one-size-fits-all methodology, so using a methodology like TDSP and adjusting it to the needs and setup of your team is a great start. Besides the lifecycle, there are other components of that workflow that are directly tied to handling uncertainty, and I’ll describe these below.
As mentioned earlier, there are multiple formalizations of what a Data Science workflow should look like, yet the common thread is that the data-centered work moves in some manner between stages such as business understanding, data understanding, planning, experimentation, building a data solution, and deployment or delivery of that solution, for example an insights report or a machine learning model that can be called through an API. These steps are part of what is called the lifecycle of the project.
Understanding why your team needs to go through these steps, and which parts within these steps relate to your project, is essential. Also, lining up these steps as a linear or strictly one-way process prohibits the application of agile methodologies. New findings and issues will show up because of the inherent uncertainty of patterns found in data. To be able to respond to new findings and issues during the development of the data solution, the lifecycle should be flexible, so that team members can jump back and forth between steps based on project needs.
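One way to picture a flexible lifecycle is as a set of stages with allowed moves between them, including moves backwards. The sketch below is only an illustration: the stage names follow the ones listed above, while the `ALLOWED_MOVES` map and the `can_move` helper are hypothetical and would look different for every team.

```python
from enum import Enum


class Stage(Enum):
    BUSINESS_UNDERSTANDING = "business understanding"
    DATA_UNDERSTANDING = "data understanding"
    PLANNING = "planning"
    EXPERIMENTATION = "experimentation"
    BUILDING = "building"
    DELIVERY = "delivery"


# Hypothetical transition map. Backward moves are allowed on purpose so the
# team can react to new findings instead of following a one-way process.
ALLOWED_MOVES = {
    Stage.BUSINESS_UNDERSTANDING: {Stage.DATA_UNDERSTANDING},
    Stage.DATA_UNDERSTANDING: {Stage.BUSINESS_UNDERSTANDING, Stage.PLANNING},
    Stage.PLANNING: {Stage.DATA_UNDERSTANDING, Stage.EXPERIMENTATION},
    Stage.EXPERIMENTATION: {
        Stage.BUSINESS_UNDERSTANDING,
        Stage.DATA_UNDERSTANDING,
        Stage.PLANNING,
        Stage.BUILDING,
    },
    Stage.BUILDING: {Stage.DATA_UNDERSTANDING, Stage.EXPERIMENTATION, Stage.DELIVERY},
    # After delivery, the next iteration starts again from the business side.
    Stage.DELIVERY: {Stage.BUSINESS_UNDERSTANDING},
}


def can_move(current: Stage, target: Stage) -> bool:
    """Return True if the team may move from `current` to `target`."""
    return target in ALLOWED_MOVES[current]


if __name__ == "__main__":
    # A surprising experiment result sends the team back to the data.
    print(can_move(Stage.EXPERIMENTATION, Stage.DATA_UNDERSTANDING))
```

A strictly linear process would have exactly one allowed move per stage, and that restriction is what makes it hard to respond to new findings.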
Directly connected to the lifecycle is the composition of the team and how the team works together. Data Science is best as a team sport. “Who does what?” is often an easy question to answer, and sometimes not. Divide the work. For example, a Data Engineer and a Data Scientist each have their own tasks within the project, since they have separate skill sets and contribute to the project in different ways. Some flexibility is healthy, so that if something unforeseen happens, a colleague who is tasked with something else can help with another colleague’s task. That is an important part of being an adaptable team.
The team structure and division of tasks are also an important part of letting the people brought in to run the experiments focus on handling uncertainty in data, by finding and exploiting the patterns tied to the demands of the business. If there are other issues with the data, such as irregularities found when data quality is not at the required level, team members work together to solve the issue in their respective parts of the value chain.
Setting up infrastructure to be able to realize value involves a lot of Data Engineering. Think about data stores, databases, pipelines that take data from one place and transport and transform it before loading it into another place, automated reporting tools, machine learning modules that let team members train models and serve them to services that need predictions to fulfill their functionality, and more. Within that infrastructure you’ll need tools that let you perform rapid experimentation using data and algorithms for the purpose of delivering a data solution.
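To make the pipeline part concrete, here is a small sketch of the extract, transform and load steps described above, using only the Python standard library. The sample data, the `purchases` table and the function names are made up for illustration; a real pipeline would read from and write to your own systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: one row has a missing amount.
RAW_CSV = """customer_id,amount
1,10.50
2,
3,7.25
"""


def extract(text):
    """Read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows without an amount and convert the fields to numbers."""
    return [
        (int(row["customer_id"]), float(row["amount"]))
        for row in rows
        if row["amount"]
    ]


def load(rows, conn):
    """Write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    load(transform(extract(text)), conn)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_pipeline(RAW_CSV, connection)
    print(connection.execute("SELECT COUNT(*) FROM purchases").fetchone()[0])
```

Keeping the three steps as separate functions means each one can be tested and replaced on its own as the data sources change.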
Set up infrastructure and tools as early as possible, to avoid members of the team repeatedly having to set things up themselves. Let the team define what works best in terms of learning through experiments and delivering the data solution, and then build the infrastructure for it. That way, more and more time can be spent learning from data instead of dealing with things that are important but relatively easy to solve once and then move on from, like setting up the toolbox.
For experiments, having environments where experiments can be run, together with an efficient way to deploy the outcome of an experiment into the whole solution, is one of the major examples of how rapid experimentation and iterative Data Science are best performed. Also, best practices for sharing knowledge from experiments have a lot to do with having a clear directory structure and making sure that outcomes, evaluation results, scripts and other inputs or artefacts from the project are kept in the same place. In that way the knowledge gained from experiments is accessible to everyone in the project.
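As a rough sketch of keeping experiment artefacts in one place, the helpers below write each experiment's parameters and evaluation results into its own folder under a shared `experiments` directory and read them all back. The folder layout, the `record.json` file name and the function names are assumptions made for this example, not part of any particular tool.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_experiment(root, name, params, metrics, notes=""):
    """Write one experiment's inputs and results into its own folder."""
    run_dir = Path(root) / "experiments" / name
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "name": name,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "metrics": metrics,
        "notes": notes,
    }
    (run_dir / "record.json").write_text(json.dumps(record, indent=2))
    return run_dir


def load_experiments(root):
    """Read every saved record so the whole team can compare experiments."""
    paths = sorted((Path(root) / "experiments").glob("*/record.json"))
    return [json.loads(path.read_text()) for path in paths]


if __name__ == "__main__":
    # A temporary directory stands in for the shared project folder.
    with tempfile.TemporaryDirectory() as project_root:
        save_experiment(project_root, "baseline", {"depth": 3}, {"auc": 0.71})
        save_experiment(project_root, "deeper_trees", {"depth": 6}, {"auc": 0.74})
        for record in load_experiments(project_root):
            print(record["name"], record["metrics"])
```

In practice a team might use a dedicated experiment tracking tool for this, but the principle is the same: one agreed location, one agreed format.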
A big thank you for reading this post. Sharing knowledge is extremely important when working in Data Science, and I hope that this gave you some ideas and/or insights on how to improve your Data Science team’s work.
Data scientist at Digitalent