Data Scientists Scale Back Use of Open Source Code Due to Security Concerns


Vulnerabilities in open-source components — such as the widespread flaws disclosed 10 months ago in Log4j 2.0 — have forced data scientists to re-evaluate open-source code frequently used in machine learning analysis and model building.

According to a report by Anaconda, a data science platform company, 40% of the data scientists, business analysts and students surveyed reduced their use of open source components over the past year, while a third held steady and only 7% incorporated more open source code into their projects. Most respondents do not report to the IT department (only 18% do); the largest share (47%) work within their own data science or research and development group, according to Anaconda’s “2022 State of Data Science” report, published last week.

While software developers and IT teams have already begun verifying that the code they use is secure, concerns about the security of open source software are relatively new to the world of data science, says Peter Wang, co-founder and CEO of Anaconda.

“We see a huge proportion of people working in organizations where IT has created a very strict posture around open source and Python,” he says. “They’re not expert developers. … They’re data scientists and machine learning people who may not be very experienced developers at all, using whatever they could download to do their analysis, and then they turned it over to IT.”

The security of open source components, and of the software supply chain in general, has become a primary consideration for software developers, enterprises and national governments over the past two years. In May, for example, the US National Institute of Standards and Technology (NIST) published guidance on addressing software supply chain risks. In addition, a growing number of software vendors have joined the Linux Foundation’s Open Source Security Foundation (OpenSSF).

While many data science teams scan open source components for vulnerabilities, many build their own software instead. Source: Anaconda’s “2022 State of Data Science” report.

Overall, the maturity of organizations’ security efforts has improved. About half of companies have an open source security policy in place, which leads to better performance on measures of security readiness, according to a June survey. Additionally, efforts to control open source risk have jumped 51% in the past 12 months, according to a security maturity study published September 21.

“[W]ith the focus on software supply chains, most organizations are taking a risk-based approach to application security,” Jason Schmitt, general manager of Synopsys Software Integrity Group, said in a statement announcing the study. “Such an approach recognizes that security is not limited to the codebase; this includes the software development process where security reviews and testing ‘go all over the place’ to continually improve security outcomes.”

Developers expand use of open source

Software vendors aren’t seeing any sort of decline in open source usage, according to other data. Instead, development organizations focus on improving the security of open source software and use security as the primary guide in component selection.

In the “2021 State of the Software Supply Chain” report, for example, Sonatype found that the four major open source ecosystems – the Maven Central Repository (Java), Node.js (JavaScript), the Python Package Index (Python), and the NuGet Gallery (.NET) – housed 37 million open source projects and components, a 20% year-over-year increase. The demand for these components is also growing: more than 2.2 trillion components have been downloaded, an annual increase of 73%.

According to Tracy Miranda, head of open source at Chainguard, the data science community’s self-reported pullback from open source packages likely indicates greater awareness of security issues rather than an actual abandonment of open source components in development.

While data science teams and development teams may have reacted differently to major security issues such as Log4j 2.0, companies that move away from an open source package have little recourse other than to adopt a different package whose makers have placed more emphasis on security, she says.

“Companies are leveraging open source as a way to increase their speed, so if they’re cutting back, where are they headed? Writing code in-house? Using third-party packaged versions?” Miranda says, adding that instead, “I think we can expect to see companies be more demanding about the quality of the open source they use, especially when it comes to security features.”

Data scientists are playing catch-up

The disconnect between the two sides is likely due to the different audiences for the different surveys. Anaconda’s survey focused on data science professionals, as seen in its respondents’ choice of programming languages: 58% used Python and 42% used SQL, while only 26% used JavaScript.

A better measure of software developer sentiment is Stack Overflow’s “2022 Developer Survey,” which revealed that while 58% of “people learning to code” use Python, only 44% of professional developers code in the language. By contrast, 68% of professional developers use JavaScript, according to the survey.

Additionally, while data science professionals work at companies that overwhelmingly (87%) allow open source software, about a quarter (26%) have minimal IT oversight of their open source choices, according to the Anaconda report. In 18% of companies, the IT department specifies only about half of the open source components available.

Maintainers of the most critical projects, of which there are hundreds if not thousands, must use secure dependencies, test their own code, and validate the reliability of contributors. Maintainers should also publish a security scorecard, an initiative created by Google and now managed by the Open Source Security Foundation (OpenSSF), which assigns a security rating to a project based on almost 20 different criteria.

While awareness is likely growing, there’s no quick fix, Miranda says.

“The reality is that the safest options didn’t exist before,” she says. “It makes sense to reduce unnecessary dependencies to reduce the attack surface, but it is difficult to do so once the dependency tree has grown.”

