By Tim Roberts
Senior Data & Analytics Consultant at Stellar
Hiring a data scientist is a long-term commitment. Without the right data, technology, processes, leadership, and vision, it might take much longer than expected to get results. Worse still, without the right preparation, your best-laid data science plans might flounder, and your talent could walk out the door. This raises the question: what does success for a “data scientist” look like? To answer that, you need to have a clear idea of what you want them to do within, and for, your organisation.
Who is called a data scientist?
What even is a data scientist? What do they normally do? Is there such a thing as a “normal” data scientist? Looking back at the original Harvard Business Review article from October 2012 that thrust the role of “data scientist” into the limelight, data science covered the need to find insights from the huge amounts of data that were piling up within companies such as Yahoo, Facebook and LinkedIn.
These data scientists used distributed processing platforms such as Hadoop. They did “data wrangling”: they hunted for sources of data, joined them together and performed cleaning tasks over large data sets. Then they used their subject matter expertise to analyse the data, get insights and share those with decision-makers, who could then decide to act on them or create new features for operations or customers.
So, what should you require?
Over the last few years, the term ‘data scientist’ has become ubiquitous across a range of sectors and situations. The sections below describe in detail what we consider to be the most important attributes to look for and the conditions required to enable the delivery of successful data science, irrespective of the industry you are in.
Start with the Why
Wherever you direct the attention of your data scientist, you need to focus on the justification and (generally speaking) the motivation for the effort – i.e. the why. Firstly, the work needs the potential to provide value. There’s no point in building advanced models with fancy visuals if the outputs are not particularly useful. Make sure that the actions taken as a result will improve a metric, an outcome or a business decision.
There may be major pain points already high on the organizational agenda – this is usually a good place to start. And the adage “low hanging fruit” does apply. If there are opportunities that can be modelled and implemented that deliver value relatively quickly, then go for it.
When weighing up the options on what initiatives to pursue, first ask yourself some basic questions:
- Will it support departmental or organisational goals?
- Is there a sponsor in the organisation yearning for new insights that they can act on?
- Does the organisation want to change the way a process is performed, improve internal efficiency or maybe provide a new feature or supporting evidence for a new product to customers?
- If you discover a new insight, what can you do with it and can you implement and measure its benefit?
- How could it be incorporated into a business process?
- Who will you have to engage with to make that happen? – are they interested in the idea?
Don’t build something and think about these points after the fact – else you risk ending up with a white elephant. Just as bad is developing something that shows great promise but lacks support from the executive. There’s nothing more demoralizing than seeing hard work go to waste.
Sell the idea
Success also means selling the idea to others. How explainable is it? This is a good test of the idea. If people immediately start nodding their heads in enthusiastic agreement, this is a good sign. If you are met with blank faces and quizzical looks, then maybe go back to the drawing board or at least polish the pitch.
Science requires experimentation, and with data science it can be a chicken-and-egg situation: you won’t know in advance if a data product or model will be accurate enough to be useful. To hedge this risk, have a range of ideas. Without losing sight of the why, let your people adapt and pivot. Having the latitude to make course adjustments will help create the right culture. Consult experts in the field to see what might be useful to work on. They will be able to give insight into what’s possible and what isn’t. Do some research on what’s happening in other countries, environments and sectors, as you might be able to rework the approach in your local context.
Finding a data scientist with a perfect balance of expertise isn’t easy. However, you should be looking for someone who has a reasonable mix of skills across the following 3 areas:
1. Analytical Knowledge and Mindset
An analytical background: ‘analyst’ is a very broad role description, as there are so many different forms of analysis. However, most are grounded in maths and statistics. Even basic business statistics such as simple percentages or indexes can be useful for insights. So can more advanced techniques such as probability and decision science, simulations, time series, risk analysis and hypothesis testing, or geospatial analysis and natural language processing – I could go on! While machine learning (I knew it once as “data mining”) is important, as being able to predict the future is obviously super useful, don’t let ML overshadow other types of analysis that can provide value.
An important part of being an analyst is to be able to communicate both approach and insights. Data scientists need to make the numbers relatable and understandable to others in the organisation. Being able to create ways to present the information and communicate it to people is a key skill that is often overlooked. They also need to be able to explain certain concepts around their methods and reasoning, which can be challenging. The term “data storytelling” has become commonly recognized, as people are realizing this is an important part of the job. Without communication of what has been done, management doubts about feasibility and viability will start to creep in.
2. Technical Expertise
The ability to move seamlessly between multiple languages and technologies to piece together a full data processing, analysis and visualisation solution is a good sign in a potential hire. This could come from different work experience, be it software engineering, data engineering or perhaps data warehousing. I believe that data scientists don’t have to be working on “big data” all the time (there may be none available), but it certainly helps to have this experience.
An ability to manipulate data at will and architect full solutions using technologies and platforms is also a useful skill to have. Sometimes known as “data wrangling” or characterized as “hacking skills”, these abilities can accelerate prototype design or quickly prove value. Just be careful that a quick prototype doesn’t get turned into a permanent but unstable solution. A rapid prototyping process is a key enabler here. If you can cut down the time it takes to find out whether a model will be accurate enough to be useful, this is a big advantage. Speeding up the time to get data into a form ready for analysis or modelling will be a significant chunk of that.
3. Subject Matter Expertise
Sometimes this is hard to find, due to the variety of systems, architecture, processes and organizational context. Most organisations have a range of systems. Even the way two organisations use the same system can differ. Understanding the systems, and what data is stored where, is one part of the puzzle. The other part is knowing about the industry or environment, and the internal processes and roles that give meaning to the data. It is so much easier to do analysis when you can put yourself in the shoes of the decision-maker, both to understand why they are asking what they are asking and to be able to “follow your nose” to find insights yourself. Curiosity and a dose of intuition can also help to spur an investigation into seemingly opaque datasets. Don’t write things off before they have been investigated, or you might miss a good find!
In summary, from the skills and capability perspective, it’s difficult to deliver on the promise of data science without skills in these areas. Without the technical skills to manipulate the data into a form that can be analysed, analysis can’t be done. Without a good understanding of analytical methods and statistics, incorrect results or conclusions can be drawn. The same goes for subject matter expertise: knowing the subject well lowers the risk of overlooking something important.
Returning to the question: what do you want your data scientist(s) to do? They have a wide variety of skills and can do amazing things with data. You want to make sure you point them at the right “target”.
Seasoned data scientists can help with this task by leveraging their previous experience. They will say ‘I’ve done a very similar thing before’. However, as processes, systems and business objectives differ, what worked in one organisation may not work in another. So again: before you go looking for a data scientist, you should have a clear idea of what you want them to do. In turn, this will help you ask the right questions and find the right data scientist, with strengths and experience in the areas that matter for the tasks and outcomes required of them.
Depending on the data and analytical maturity of your organisation, it’s also worth thinking ahead about what role you want them to play. Do you want them to:
- Be an advance scout over unstructured or large datasets, who can quickly see whether valuable insights can be gleaned?
- Build machine learning models or something more statistical or risk related?
- Analyse specifics, such as language or image data?
- Build a team around them, educate, act as a disruptive change agent or a combination?
In principle, how do you want to use them, and what is your plan to help them capitalise on the data stored within your organisation?
As scientists exist to discover, a place to store IP and document solutions is critical. As things tend to get forgotten over time, it’s essential to keep a record of what has been done previously. Without history, the same piece of work can be re-done many times over as people come and go. Also, if there is no record to back up the approach, design and method, there’s no institutional memory and great ideas can be lost. A nice wiki is a good start. A place to publish any research done also helps with overall documentation of solutions. So, if they do eventually move on, you aren’t stuck with a gaggle of undocumented solutions.
And don’t just leave them on their own. Leadership must champion what they are doing, not only to other leaders in the organisation but also to other stakeholders who may need to lend a hand, and to other data experts who will need to help anyone new learn the lay of the land when it comes to where potentially useful data lies. It could be a good idea to buddy someone up with the incoming data scientist to help them, to share knowledge and to share the responsibility of delivery with them. Know that this transfer of knowledge could go both ways.
A place to store data, a place to create data – this could be in many forms and will depend on the size of the data being processed and what systems currently exist or could be available to use. It could be a normal database or a data lake such as Azure Data Lake or AWS S3 or Hadoop.
Ideally, you need a place for standards and processes about how data is stored, and in what structure to keep things orderly. Again, a wiki is handy here. And a sandpit where data can be plonked temporarily for testing ideas.
Some compute resource to interrogate, transform and clean data. Again, depending on what’s available, some of this could be done in a database, or some distributed data processing systems might be available such as Azure Databricks or AWS EMR. Maybe you have a Spark server internally. Basically, something to run code on.
Source control has always been important – it is linked to keeping track of IP by providing a copy of the source code, keeping track of changes to it and making it easy for multiple people to develop the solution.
A platform to present data and show insights and visualizations to others. And then there are other services required, such as a way to schedule and orchestrate data processing runs, and a way to deploy any models to run and integrate with other software in the organisation.
Supporting and maintaining data products in operations will take up more resource as more are produced. The more that deployment, testing and optimisation of models are automated, the easier this will be.
You may want to start inquiring about which of these technologies exist in the organisation. Data scientists will have their own opinions as well.
You might have the best modeller in the world, but if you don’t have data that is sufficiently correlated with the variable you are trying to predict, the model won’t be very useful. Data is king. Even the addition of one variable can make a big difference, bringing a model from “not good enough” to “usefully accurate”.
Structure and Quality
Data quality will affect your analysis in different ways depending on the type of analysis you are doing. If it is for decisions using aggregated data, then a small number of records that are incorrect, or that cannot be included in the analysis, may not have a material impact on the overall outcome of the decision. But you need to be confident that they don’t, and be able to communicate that.
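One way to gain that confidence is to compute the aggregate both with and without the suspect records and compare the two. The sketch below does this for a simple average. The function name `aggregate_sensitivity`, the 5% tolerance and the order values are all assumptions made for illustration; what counts as material is a business judgement.

```python
def aggregate_sensitivity(values, is_valid):
    """Compare a mean over all records with a mean over valid records only.

    Assumes the mean of the valid records is non-zero.
    """
    valid = [v for v, ok in zip(values, is_valid) if ok]
    if not valid:
        raise ValueError("no valid records to aggregate")
    mean_all = sum(values) / len(values)
    mean_valid = sum(valid) / len(valid)
    return {
        "excluded_share": 1 - len(valid) / len(values),
        "mean_all": mean_all,
        "mean_valid": mean_valid,
        "relative_shift": abs(mean_all - mean_valid) / abs(mean_valid),
    }


# Made-up order values; the last record is flagged as a suspected entry error.
order_values = [100.0, 110.0, 90.0, 105.0, 95.0, 100.0, 98.0, 102.0, 100.0, 130.0]
valid_flags = [True] * 9 + [False]

report = aggregate_sensitivity(order_values, valid_flags)
TOLERANCE = 0.05  # hypothetical threshold: shifts under 5% are treated as immaterial
material = report["relative_shift"] > TOLERANCE
print(report, "material impact:", material)
```

A report like this also gives you something concrete to communicate: the share of records excluded and how far the headline number moves as a result.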
The lower the level of decision, say if you are making decisions at an individual customer level, the higher the data impacts can be in some cases. If any of the data is out of date or incorrect, you may not be able to use certain records, or you may end up making the wrong decisions for a customer or potential customer. Some impacts of wrong decisions will be worse than others. Sometimes the problem can be coded around or imputed, but it’s often better to work with the owners of the data to get it corrected and, hopefully, fixed at the source to prevent errors from happening in the future.
While data scientists are great at transforming and cleaning the data, this is just a means to an end. Ideally, you want them to spend more time discovering insights and presenting them. The easier it is to piece together the required data, the better. This could be in the form of having a “single customer view” ready, or at least a framework that presents a curated view of this data and encapsulates business logic. It doesn’t need to be in a formal model like a Star Schema or Data Vault if it’s flexible, easy to join between tables and easily understood. Perhaps this exists already as part of your organisation’s data warehouse. The key is the reuse of logic and minimal redundancy.
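At the individual record level, a simple triage step can separate records that are safe to use from those that should go back to the data owner. The sketch below is one possible shape for such a check. The field names, the one-year freshness rule and the sample records are hypothetical, and real rules would come from the people who own and understand the data.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("customer_id", "email", "last_updated")  # hypothetical schema
MAX_AGE = timedelta(days=365)                                # hypothetical freshness rule


def record_issues(record, today):
    """Return a list of data quality issues found in a single customer record."""
    issues = [f"missing {field}" for field in REQUIRED_FIELDS
              if record.get(field) in (None, "")]
    last_updated = record.get("last_updated")
    if isinstance(last_updated, date) and today - last_updated > MAX_AGE:
        issues.append("stale")
    return issues


def triage(records, today):
    """Split records into those safe to use and those to refer to the data owner."""
    usable, refer_to_owner = [], []
    for record in records:
        issues = record_issues(record, today)
        if issues:
            refer_to_owner.append((record.get("customer_id"), issues))
        else:
            usable.append(record)
    return usable, refer_to_owner


# Made-up records for illustration.
today = date(2024, 1, 1)
records = [
    {"customer_id": 1, "email": "a@example.com", "last_updated": date(2023, 11, 1)},
    {"customer_id": 2, "email": "", "last_updated": date(2023, 6, 1)},
    {"customer_id": 3, "email": "c@example.com", "last_updated": date(2021, 3, 15)},
]
usable, refer_to_owner = triage(records, today)
print(len(usable), "usable;", "to fix:", refer_to_owner)
```

Keeping the list of issues per record, rather than silently dropping bad rows, gives the data owners something specific to correct.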
Ethics, Privacy and Governance
Finally, data scientists need a firm steer on what they can and can’t do with data. This is a critical consideration, as a misstep here can cause costly reputational damage. For instance, clear policy and governance guidelines on data usage should always be linked back to the terms under which the data was provided in the first place. New privacy regulations are upon us, which means media and public interest groups will quickly target transgressions. A simple example of where misuse can occur is in the entitlements and constraints around using customer data in marketing campaigns.
Ethics is another concern: just because you have certain data doesn’t mean you should be using it. Your organisation should have a well-developed governance function that has visibility of advanced analytical activity and data use. A data scientist should be aware of ethical concerns, but there may be rules or industry regulations of which they are not aware. For this reason, Stellar always recommend executive ownership and oversight across Data Science initiatives.
In summary, to be ready for a commitment to data science, be sure to have assessed how you can enable and provide:
- Defined projects to work on, along with an understanding of what different actions can be taken from new insight
- A clear and unambiguous view of the benefits statement and potential value
- Stakeholders in the organisation who have been consulted and engaged for support
- A way to manage work and engage with internal customers
- A place to store IP and to publish any research or work that is visible
- The necessary support from other data and subject matter experts
- Current or planned technology options and investment that have been thought through
- Access to relevant and useful data sets, and awareness of any known data quality issues
- A proactive approach to ethical decisions, compliance with relevant regulations, and a culture of good governance woven into advanced data projects.
There is a significant amount of preparation work to get this right. Just be mindful that the more pre-emptive action taken, the better the odds that your investment in data science is successful.