We found this great study, posted to arXiv in June 2021 by university researchers in Spain, on the state of holistic methodologies for implementing data science. See this link for the original: https://arxiv.org/pdf/2106.07287.pdf
In their words, "the aim of this paper is to conduct a critical review of methodologies that help in managing data science projects,
classifying them according to their focus and evaluating their competences dealing with the existing challenges. As a result of this study we propose a conceptual framework containing
features that a methodology for managing data science projects with a holistic point of view could have. This scheme can be used by other researchers as a roadmap to expand currently used methodologies or to design new ones."
They cover the challenges of using data science to deliver business results in a holistic way.
Data science projects have high failure rates, and many never make it into production. Common challenges include lack of coordination, unclear objectives, unrealistic timelines, technical biases, scope creep, poorly formed teams, data quality issues, lack of reproducibility, and ineffective knowledge management.
The authors reviewed 19 methodologies and classified them as project, team, data/info focused, or integral (covering all areas).
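To make the classification idea concrete, here is a minimal sketch of how a methodology could be labeled by focus. It is not the paper's scoring procedure: the 0-to-1 scores, the 0.6 threshold, the `MethodologyScores` class, and the `classify_focus` function are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff: a dimension counts as "covered" at or above this score.
COVERAGE_THRESHOLD = 0.6


@dataclass(frozen=True)
class MethodologyScores:
    """Illustrative coverage scores in [0, 1]; not taken from the paper."""
    name: str
    project: float
    team: float
    data_info: float


def classify_focus(m: MethodologyScores, threshold: float = COVERAGE_THRESHOLD) -> str:
    """Return 'integral' if every dimension is covered, else the strongest one."""
    scores = {"project": m.project, "team": m.team, "data/info": m.data_info}
    if all(value >= threshold for value in scores.values()):
        return "integral"
    # max() returns the first key with the highest score, so ties follow dict order.
    return max(scores, key=scores.get)


# Made-up inputs, named generically to avoid implying real ratings.
examples = [
    MethodologyScores("Methodology A", project=0.9, team=0.2, data_info=0.4),
    MethodologyScores("Methodology B", project=0.7, team=0.8, data_info=0.7),
]
labels = {m.name: classify_focus(m) for m in examples}
print(labels)
```

With these made-up inputs, "Methodology A" is labeled project focused and "Methodology B" integral. Any real use would need scores grounded in an actual review of each methodology.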
CRISP-DM is the most used methodology but is outdated. Microsoft TDSP is comprehensive but relies heavily on Microsoft tools. Domino Data Lab's methodology integrates software engineering and agile approaches well. RAMSYS enables distributed team collaboration.
The authors propose that an integral methodology should cover project, team, and data/info management. They suggest principles like standardizing workflow, enabling reproducibility, defining roles, and focusing on creating knowledge and value.
Pros and cons of some key methodologies:
CRISP-DM
- Pros: Coherent, iterative process; well documented
- Cons: No team management guidance; outdated
Microsoft TDSP
- Pros: Covers project, team, data/info; provides useful templates and utilities
- Cons: Overly dependent on Microsoft tools
Domino Data Lab
- Pros: Integrates software engineering, agile, and team approaches
- Cons: More informative than prescriptive
RAMSYS
- Pros: Enables distributed team collaboration and knowledge sharing
- Cons: Limited data/model sharing capabilities
Agile Data Science
- Pros: Rapid value delivery; realistic feedback cycles
- Cons: Less structured approach; uncertain requirements
Here are some of our takeaways from the paper:
1. What are the main challenges with current data science project methodologies?
The main challenges are that most projects never reach production and suffer from unclear objectives, unrealistic timelines, an overly technical focus, poorly coordinated teams, data quality issues, a lack of reproducibility, and ineffective knowledge management. Few methodologies are comprehensive enough to address all of these areas.
2. Why is using a methodology important for data science projects?
Using an effective, structured methodology improves coordination, sets clear objectives, manages expectations, enables reproducibility, retains knowledge, ensures data quality, validates solutions, aids collaboration, and ultimately leads to more successful project outcomes. Without a methodology, data science projects are more likely to fail.
3. What are the key components of an integral data science methodology?
An integral methodology should incorporate project management principles to define the workflow, team management practices to coordinate roles and collaboration, and data/information management to ensure reproducibility, traceability, knowledge retention, and focus on creating value from data. See their table below.
4. What are some pros and cons of established methodologies like CRISP-DM?
CRISP-DM provides a coherent, documented process but lacks team management guidance and is outdated. It needs more integration with software development and agile approaches.
5. How can methodologies help improve coordination and transparency in data science projects?
Methodologies can define roles and responsibilities, standardize team workflows, establish communication practices, create knowledge repositories, and provide templates and utilities to aid information sharing. This improves transparency.
6. Why is it important to focus on creating knowledge and value with data?
Too often projects emphasize technical performance over business value. Methodologies need to prioritize creating knowledge and value, so that decisions and actions are guided by insights from quality data.
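A small worked example can show why technical performance and business value can diverge. The sketch below compares two hypothetical churn models on accuracy and on a simple value estimate. Every number in it (the confusion-matrix counts, the 200 gained per correctly flagged customer, the 20 spent per unnecessary offer) is invented for illustration and does not come from the paper.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def business_value(tp: int, fp: int, value_per_tp: float, cost_per_fp: float) -> float:
    """Value gained from true positives minus money wasted on false positives."""
    return tp * value_per_tp - fp * cost_per_fp


# Hypothetical results on 1,000 customers, 100 of whom actually churn.
model_a = {"tp": 40, "fp": 10, "fn": 60, "tn": 890}   # cautious model
model_b = {"tp": 80, "fp": 100, "fn": 20, "tn": 800}  # aggressive model

# Assumed economics: keeping a churner is worth 200, a wasted offer costs 20.
VALUE_PER_TP = 200.0
COST_PER_FP = 20.0

for name, m in (("A", model_a), ("B", model_b)):
    acc = accuracy(**m)
    value = business_value(m["tp"], m["fp"], VALUE_PER_TP, COST_PER_FP)
    print(f"model {name}: accuracy={acc:.2f}, estimated value={value:.0f}")
```

Under these assumptions, model A is the more accurate of the two but model B produces more value, because it catches twice as many churners and a wasted offer is cheap. Different cost assumptions could reverse the ranking, which is why the value definition has to come from the business side.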
7. How can methodologies incorporate agile and software development best practices?
Iterative delivery, continuous integration, version control, reproducible environments, and cross-functional collaboration are some of the agile and software engineering practices that can strengthen data science methodologies.
8. What are some best practices for effective team management?
Clearly defined roles and responsibilities, collaborative workflows, coding standards, knowledge sharing practices, training, and communication protocols help coordinate data science teams effectively.
9. How can methodologies enable reproducibility and retain knowledge?
Traceability, automated testing, model versioning, comprehensive documentation, knowledge repositories, and an emphasis on capturing process insights improve reproducibility and knowledge retention.
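As one possible way to put traceability into practice, the sketch below records what a run depended on: a hash of the input data, the parameters, the random seed, and the Python version. It uses only the standard library. The record layout and the helper names (`fingerprint`, `build_run_record`, `run_experiment`) are assumptions for illustration, not something the paper prescribes.

```python
import hashlib
import json
import platform
import random


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest, used to detect any change in the input bytes."""
    return hashlib.sha256(data).hexdigest()


def build_run_record(dataset: bytes, params: dict, seed: int) -> dict:
    """Collect the facts needed to trace a result back to its inputs."""
    # Canonical JSON (sorted keys) so equal params always hash identically.
    canonical_params = json.dumps(params, sort_keys=True).encode("utf-8")
    return {
        "dataset_sha256": fingerprint(dataset),
        "params": params,
        "params_sha256": fingerprint(canonical_params),
        "seed": seed,
        "python_version": platform.python_version(),
    }


def run_experiment(seed: int, n: int = 3) -> list:
    """Stand-in for a training run: a seeded generator gives repeatable output."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Hypothetical inputs: a tiny CSV payload and made-up hyperparameters.
record = build_run_record(b"id,label\n1,0\n2,1\n", {"model": "logreg", "C": 1.0}, seed=42)
print(json.dumps(record, indent=2))
```

Storing a record like this next to each model artifact, under version control, lets a team check later whether two results came from the same data and settings. A real project would likely also pin library versions and record the code commit.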
10. What future research is needed on data science methodologies?
More real-world testing and validation of methodologies, developing integral approaches, updating existing practices, and tooling to support methodology adoption are areas needing further research.