Should you run your Data Science projects in Agile?

Laks Vajjhala
8 min read · Apr 1, 2019

TL;DR — Yes!

It is called Science for a reason!

Introduction

An increasing number of Technology teams are adopting the Agile methodology to build and maintain their Software Products. This adoption has been driven primarily by the benefits realized by the Technology community. Management teams, too, increasingly recognize Agile, which breaks away from the top-down waterfall method, as the ‘go-to’ for Software development.

In this article, I would like to outline the current state of Data Science initiatives within Legacy Organizations, followed by a discussion on how Agile can help solve some of the challenges faced by such Organizations. I have tried to make the concepts relatable to real life with practical examples where applicable.

The Data Science Silo

Before we go any further, I should define what I mean by Legacy Organizations. These are established companies which typically belong to non-technology industries such as Banking, Insurance, Retail, Manufacturing, Hospitality, Healthcare, Pharma etc. They are generally over 15 years old. The general Management thinking runs along the lines of technology being a cost-center, an after-thought rather than a source or even a driver of innovation and growth.

The above criteria are not strict; there are and will be exceptions, but the vast majority fall within this category.

These Legacy Organizations face existential threats on a daily basis. Their strongly held bastions are being shaken by lean-and-mean, venture-capital-fueled startups. These startups have declared war on the archaic Organizations, becoming the disruptive force that has raised the rate of change in the business landscape to a whole new level (Amazon vs Walmart, Netflix vs Hollywood, AirBnB vs Hotel Chains, to name a few). For some of these Legacy Organizations, their very survival is at stake.

One of the most powerful weapons in the arsenal of these startups is Data Science. These companies deploy it to influence every aspect of the products or services they offer.

As a response, Management teams within Legacy Organizations have scrambled to emulate these startups and have cobbled together ‘Data Science’ teams. However, these Legacy Organizations often don’t know how to obtain value from their teams leading to mistrust and misplaced expectations between the Data Science and Management teams.

The Data Science teams within these legacy organizations are some of the most diverse groups within their mono-culture. They are a mish-mash of people with advanced degrees from some of the top schools in the world. Based on my personal experience, I find that they bring uncorrupted academic integrity, diligence and transparency to the table. They usually start off with the academic mindset of unconstrained problem formulation and solving, only to crash into rigid corporate stone walls.

These Data Science teams are, as the name suggests, organized as a separate team and shared across the Organization. It is the application of the shared-services concept often applied to technology teams within these legacy firms, and a symptom of management's perception of Data Science as an extension, or even a type, of technology.

What does Silo Life look like?

Data Scientists within these Silos spend their time designing multitudes of POCs (Proofs of Concept). These POCs are by design completely alienated from the realities of the business world, mainly because of the Data Sets they use: mostly built-for-purpose fake Data that reflects neither the Volume nor the Veracity of real-world data.

The POCs serve two purposes within legacy organizations. First, they keep the Data Scientists busy. This is very important, as enormous brain power kept idle over a long period can lead to attrition. Second, they help executives score brownie points by showing off a savvy, in-touch-with-the-times image. This is often a lousy spot for Data Scientists to find themselves in, as they are expected to consistently produce ‘wow’ and ‘aha!’ moments.

In any case, once the POCs are demonstrated, they are often forgotten. Hardly any effort is made to take them to production, i.e. the real world.

Even in the optimistic scenario that a few POCs have strong management sponsorship and are selected for deployment to production, they face death by SDLC (Software Development Life-Cycle) in these Legacy Organizations. These Organizations often have long SDLCs of 1–3 years and follow a strict schedule of release cycles. A quarterly release is considered break-neck speed!

These long time-lines mean that the Models that the Data Science teams built and validated in test environments lose their relevance by the time they reach production. The Models may become outdated as the underlying Data might have structurally changed. This leads to low prediction accuracy, poor performance and low business value overall.

This creates a negative feedback cycle: management fails to see business value in data science initiatives, so it takes fewer risks (funding fewer POCs to production), which in turn leads to fewer opportunities to realize value from these initiatives.

Unfortunately, due to structural problems and a lack of management vision, the immense potential of these data science teams across legacy organizations goes untapped, leading to a general feeling of mutual mistrust between the teams and Management.

Why the Agile methodology makes sense for Data Science Initiatives

Since Data is generated through interaction with the real world, the loop from the generation of new data to influencing decisions should be kept as short as possible.

Three reasons why I think Agile should be the method of choice to ensure success are:

  1. Agile methodology typically has short iteration cycles. This means there are multiple opportunities to iteratively tune, improve and deploy models
  2. Being close to the production environment means increased opportunities to closely monitor Model relevance and pull models from production if they lose their predictive power, leading to better quality control
  3. The iterative ‘fail first’ mindset allows learning from less expensive mistakes to improve continuously, creating a positive feedback loop: a virtuous cycle in which management can see evidence of continuous improvement over time
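The second point can be made concrete with a lightweight monitor. The sketch below is illustrative only: the class name, the rolling window size and the minimum acceptable conversion rate are all assumptions, not prescriptions from any particular library.

```python
from collections import deque

class ModelMonitor:
    """Track the live conversion rate over a rolling window of offer
    outcomes; flag the model for withdrawal if the rate drops below a
    minimum threshold. Illustrative sketch, not a production design."""

    def __init__(self, window: int = 1000, min_rate: float = 0.65):
        self.outcomes = deque(maxlen=window)  # True = offer accepted
        self.min_rate = min_rate

    def record(self, accepted: bool) -> None:
        self.outcomes.append(accepted)

    def should_pull_model(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.min_rate
```

A short sprint cycle means such a monitor can actually be acted on: a model that loses its predictive power is pulled within weeks, not years.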

The How?

In this section, I would like to use the SCRUM framework to illustrate how the Agile methodology could be applied to Data Science Initiatives. For this purpose, let's use a sample business problem called ‘Next Best Offer’.

A ‘Next Best Offer’ is a common feature in digital channels where a user is made a highly customized and targeted offer based on interactions with the platform, other platforms (Social Media, for example) and customer-specific profile information.

Let’s assume the user is a banking customer: when the user is on the bank's website or mobile application, specific offers have to be made that are customized to his or her needs.

For example, a credit card with features that are relevant to the specific customer’s needs. A metric used to measure efficacy is Conversion Rate. This is defined as the number of customers who accepted the offer out of the total number of customers who were made an offer.
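As a minimal illustration, the metric can be computed like this (the function name and the sample figures are hypothetical):

```python
def conversion_rate(accepted: int, offered: int) -> float:
    """Fraction of customers who accepted an offer out of all
    customers who were made one."""
    if offered == 0:
        raise ValueError("no offers were made")
    return accepted / offered

# e.g. 130 acceptances out of 200 offers made
rate = conversion_rate(130, 200)
print(f"{rate:.0%}")  # 65%
```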

So, the Data Science initiative is to improve the Conversion Rate of the Next Best Offer system. The current rule-based system has a conversion rate of 65%. A system that can consistently beat it can add significant business value.

A Naive Approach

A Sprint Plan based on a naive approach to applying Agile to this problem looks like this:

  1. Sprint 1 — Achieve 65% Conversion Rate
  2. Sprint 2 — Achieve 70% Conversion Rate
  3. Sprint 3 — Achieve 90% Conversion Rate

In reality the results would often look like this:

  1. Sprint 1 — Unable to meet Goal of Sprint 1
  2. Sprint 2 — Unable to meet Goals of Sprints 1 and 2
  3. Sprint 3 — Unable to meet Goals of Sprints 2 and 3; met Goal of Sprint 1

User stories within each Sprint are carried forward into subsequent sprints. This breaks the concept of Agile, and we end up where we started, probably worse off, since it creates a state of utmost nastiness called the ‘hybrid methodology’.

Legacy Organizations are often susceptible to these hybrid traps: instead of realizing the benefits of Agile while avoiding the inefficiencies of waterfall, they get a mix of the worst of both methodologies.

The Scientific Agile SCRUM approach

This leads to the crux of the message of this article. We need to have the right understanding of what constitutes a Data Science initiative.

It is called Data Science for a reason!

The scientific methodology is experimental in nature. Experiments have a defined goal: to either prove or disprove a hypothesis. There is no way for the experimenter to know a priori whether the outcome will prove or disprove it. This paradigm does not fit the traditional business-value frameworks of legacy organizations, at least on the surface.

Side Note: based on the cohorts of young professionals I have had the chance to interact with in recent years, I find that awareness of the Scientific method is dismally low among those who are educated in ‘business’ but try to use Science (Data Science) to solve business problems. This could be the root of the discordance between Data Scientists and Management.

Let’s see how we can structurally re-organize our approach to fit this paradigm.

Treat all Data Science initiatives as experiments. This implies that Sprint goals should consist of testing a number of hypotheses. The SCRUM rules of story size still apply: each hypothesis should be testable within a Sprint. This may look something like this:

  1. Sprint 1 — Given Data Set A1 using Algorithm B1, validate whether it has a conversion rate >65%
  2. Sprint 2 — Given Data Set A1 using Algorithm B2, validate whether it has a conversion rate >65%
  3. Sprint 3 — Given Data Set A1 using Algorithm B3, validate whether it has a conversion rate >65%
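A sprint-level hypothesis test of this kind can be sketched as follows. Here `run_sprint` and `evaluate` are hypothetical names: `evaluate` stands in for training the named algorithm on Data Set A1 and scoring it on a holdout sample, returning acceptance counts.

```python
def run_sprint(algorithm_name, evaluate, baseline=0.65):
    """Test one sprint's hypothesis: does this algorithm beat the
    rule-based baseline conversion rate on holdout data?"""
    accepted, offered = evaluate(algorithm_name)
    rate = accepted / offered if offered else 0.0
    return {
        "algorithm": algorithm_name,
        "conversion_rate": rate,
        # The hypothesis is *tested* either way; it is proved only
        # if the observed rate beats the baseline.
        "hypothesis_proved": rate > baseline,
    }
```

Note that the sprint's definition of done is the test itself, not a proved hypothesis: a disproving result still closes the story.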

In comparison to the naive approach the outcomes may look like this:

  1. Sprint 1 — The model generated by Algorithm B1 has no predictive power
  2. Sprint 2 — The model generated by Algorithm B2 has no predictive power
  3. Sprint 3 — The model generated by Algorithm B3 has predictive power with conversion rate >65%

As we can see, in both approaches no useful model was created until the end of Sprint 3. The fundamental difference is that in the second approach the Sprint goals of the first 2 sprints were met, i.e. the so-called ‘definition of done’ was satisfied.

A disproved hypothesis is as useful as a proved one

Structurally, this approach sets the cadence at which the Data Science team responds to Business Reality: a matter of weeks instead of years. It becomes easier to pivot and course-correct as dictated by changes in customer behavior.

This also brings accountability and a fairer way to measure performance. Sprint-level Code and Data Snapshots can be easily retrieved and reused. This traceability prevents rework by painting a comprehensive picture of what worked and what did not, so that the learnings can be applied to other Data Science initiatives within Legacy Organizations.
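One lightweight way to make such snapshots traceable is to record, per sprint, a manifest tying the run to an exact data snapshot by content hash. This is an illustrative sketch with invented names; real teams might reach for tools like DVC or git tags instead.

```python
import hashlib
import json
import pathlib

def snapshot_manifest(sprint: int, data_path: str, out_dir: str = ".") -> dict:
    """Write a per-sprint manifest recording which exact data file
    (identified by SHA-256 content hash) the sprint's models used."""
    digest = hashlib.sha256(pathlib.Path(data_path).read_bytes()).hexdigest()
    manifest = {"sprint": sprint, "data_file": data_path, "sha256": digest}
    out = pathlib.Path(out_dir) / f"sprint_{sprint}_manifest.json"
    out.write_text(json.dumps(manifest, indent=2))
    return manifest
```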

As one can see, the Sprint goals are testable hypotheses. This means there is all-round transparency, which can start a virtuous cycle of trust and collaboration between Data Science teams and Management. This could enable Legacy Organizations to successfully defend their businesses and thrive in the ever-changing modern business landscape.
