scikit-learn Blog

Announcing the launch of the scikit-learn user survey

2024-09-02T00:00:00+00:00

Author: Inessa Pawson, François Goupil

We are excited to announce the launch of the scikit-learn user survey! Scikit-learn continues to evolve thanks to contributions from its diverse user community. As we plan for future releases, we want to ensure we are focusing on what matters most to you — our users.

The goal of this survey is to better understand how users interact with the library, identify any pain points, and learn which features you find most useful and what’s missing. This is your chance to have a say in how the library grows and adapts to meet the evolving needs of the machine learning community.

The survey will take about 15 minutes of your time. It is available in Arabic, French, English, Japanese, Mandarin, Spanish, and Portuguese. You have the option to remain completely anonymous, and the data collected will be used solely for the purpose of improving scikit-learn.

This user survey is a truly collaborative effort. We would like to thank the teams from probabl, University of Oxford (UK), and POSSEE OpenTeams, as well as many scikit-learn contributors, for their time and effort in designing and translating it.

Once the survey closes, we’ll analyze the responses and publish the findings in a follow-up blog post.

To take the survey, visit: https://forms.gle/p5P7AweCJCbFMzfo6. The survey will remain open until October 14th, 2024, and we encourage you to share it with your colleagues and extended network.

We value every contribution in our community, and we’re committed to making scikit-learn even better. Your feedback is the foundation upon which scikit-learn will continue to grow and evolve. We look forward to hearing from you!

Chan Zuckerberg Initiative considers scikit-learn an Essential Open Source Software

2024-08-06T00:00:00+00:00

Author: Guillaume Lemaitre, Lucy Liu

We are delighted to announce that scikit-learn has been awarded a grant from the Chan Zuckerberg Initiative (CZI)’s Essential Open Source Software for Science (EOSS) program. This grant is funded by Wellcome Trust. As in previous rounds, this cycle supports open-source software projects that are essential to biomedical research. This is the third time that CZI EOSS has supported scikit-learn.

In this new grant, we will focus on improving the evaluation and inspection of predictive models.

Predictive models evaluation & inspection

When building a machine learning pipeline for a specific research problem, two key aspects are closely connected: (i) design of the pipeline and (ii) assessment, analysis, and inspection of it. Researchers strive to identify the optimal pipeline, maximizing specific evaluation metrics, while also seeking to explain the validity and rationale behind the pipeline’s predictions. This is the cornerstone of answering research questions. With this proposal we aim to improve and extend the available scikit-learn tools.

scikit-learn provides building blocks for model evaluation and statistical analysis of results. Originally, this information was presented in a raw format and required expertise from scientists to create intuitive reports for outreach to peers and outsiders. Recently, the scikit-learn community developed displays to easily generate visual figures for communicating such results. However, these displays are still in their early development stages and do not leverage all available statistical analysis tools (e.g., cross-validation) from scikit-learn. Thus, we aim to expand these displays so that they use the right statistical tools and thereby promote the adoption of best practices when reporting results. We also intend to create new displays to support common analysis tasks that are not yet covered in scikit-learn.
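As a rough illustration of the cross-validated reporting mentioned above, the sketch below summarizes a metric across several folds instead of a single train/test split. The synthetic dataset, the logistic regression model, and ROC AUC as the metric are assumptions chosen for the example; this is not the design of the planned displays.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Evaluate the same model on 5 different train/test splits.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc"
)

# Report the spread across folds, not just a single number.
scores = cv_results["test_score"]
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

Reporting the mean together with the spread across folds gives readers a sense of how much the score depends on the particular split.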

In the domain of model inspection, we aim to address several areas: (i) model inspection during training, (ii) enhancing user experience through interactive inspection, and (iii) model explainability. First, during the training of a pipeline, researchers are interested in monitoring the internal characteristics of the model, a long-standing issue that scikit-learn has not yet addressed. We want to build upon some initial work by implementing a “callback” framework that allows users to track these internal parameters. Next, researchers commonly use interactive tools such as Jupyter Notebook to develop pipelines. scikit-learn started some efforts to visually and interactively display pipelines in these environments. However, there is room for improvement in terms of user interaction and accessibility. Finally, as scikit-learn is widely used as a reference package, it is crucial to improve the section of the library dedicated to model explainability. We aim to improve the documentation and user experience with the existing explainability tools, making sure that users pick the appropriate tool for their use case. In addition, we propose to work on a scikit-learn enhancement proposal (SLEP) to define a common API for model explainability within scikit-learn. Ultimately, the goal is to reach a consensus that provides scikit-learn end-users with a consistent experience when using model explainability tools.
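For readers unfamiliar with the existing explainability tools, here is a minimal sketch using permutation importance from `sklearn.inspection`, one of the inspection utilities already in the library. The synthetic data and the random forest model are assumptions for the example, and it does not represent the common API that the SLEP would define.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with a few informative features (an assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and measure the drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

# List features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

Computing the importances on held-out data, as done here, measures what the model relies on for generalization, not what it memorized during training.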

On top of all these items, we intend to continue working on the general maintenance of the project, addressing bug reports and performance regressions. As a community-driven project, we also want to dedicate time to reviewing external contributions.

Involved people

To execute this project, we plan the following hires:

Lucy Liu (Quansight Labs) will work about half-time on the project, on topics related to displays and feature importance.
We will hire full-time interns to work on the other part of the project. The initial plan is to hire two interns for a period of 6 months each and repeat this process for the next 2 years. We want to provide opportunities to underrepresented groups in the field of machine learning and data science, similarly to previous initiatives (cf. NumFOCUS Small Development Grant).

Past CZI EOSS grants

In the past scikit-learn has been awarded two grants from the CZI EOSS program:

CZI EOSS Cycle 1 helped create the HistGradientBoostingClassifier and HistGradientBoostingRegressor estimators. These estimators are the equivalent of gradient boosting models implemented in LightGBM and XGBoost.
CZI EOSS Cycle 4 extended scikit-learn to work better with missing values and categorical data in several estimators.

Both grants allowed us to maintain and enhance scikit-learn to better serve the community.

Interview with Adam Li, scikit-learn Team Member

2024-07-24T00:00:00+00:00

Author: Reshama Shaikh, Adam Li

BIO: Adam is currently a Postdoctoral Research Scientist at Columbia University in the Causal Artificial Intelligence Lab, directed by Dr. Elias Bareinboim. He is an NSF-funded Computing Innovation Research Fellow. He did his PhD in biomedical engineering, specializing in computational neuroscience and machine learning at Johns Hopkins University working with Dr. Sridevi V. Sarma in the Neuromedical Control Systems group. He also jointly obtained an MS in Applied Mathematics and Statistics with a focus on statistical learning theory, optimization and matrix analysis. He was fortunate to be an NSF-GRFP fellow, Whitaker International Fellow, Chateaubriand Fellow and ARCS Chapter Scholar during his time at JHU. Adam officially joined the scikit-learn team as a maintainer in July 2024.

GitHub: @adam2392
LinkedIn: @adam2392
Website: https://adam2392.github.io

Link to scikit-learn contributions (issues, pull requests):

Tell us about yourself.

I currently live in New York City, where I work on theoretical and applied AI research through the lens of causal inference, statistical modeling, dynamical systems and signal processing. My current research is focused on telling a causal story, specifically in the case where one has multiple distributions of data from the same causal system. For example, one may have access to brain recordings from monkeys and humans. Given these heterogeneous datasets, I am interested in answering: what causal relationships can we learn? This is known as the causal discovery problem, where given data, one attempts to learn what causes what. Another problem that I work on that is highly relevant to generative AI is the problem of causal representation learning. Here, I develop theory and train deep neural networks to understand causality among latent factors. Specifically, we demonstrate how to leverage multiple datasets and a causal neural network to generate data that is causally realistic. This can enable more robust data generation from general latent variable models.
How did you first become involved in open source and scikit-learn?

I first got involved in open source as a user. I was making the switch from Matlab to Python and started using packages like numpy and scipy pretty regularly. In my PhD research, I dealt with a lot of electrophysiological data (i.e. EEG brain recordings). I was writing hundreds of lines of code to load and preprocess data, and it was always changing based on different constraints. That was when I discovered MNE-BIDS, a Python package within the MNE framework for reading and writing brain recording data in a structured format. This changed my life because now my preprocessing and data loading code was a few lines of code that adhered to an open standard tested by thousands of researchers. I realized the value of open source, and began contributing in my spare time.
We would love to learn of your open source journey.

I first started contributing to open-source in the MNE organization. This package implements data structures for the processing and analysis of neural recording data (e.g. MEG, EEG, iEEG data). I contributed over 70 pull requests in the MNE-BIDS package, and subsequently was invited to be a maintainer for MNE-BIDS and MNE-Python. Later on, I participated in a Google Summer of Code to port the connectivity submodule within MNE-Python to a new package, known as MNE-Connectivity. I added new data structures and algorithms to support further development of connectivity algorithms for neural recording data. I also worked with a team on porting a neural network architecture from Matlab to the MNE framework to automatically classify ICA-derived components. This became known as MNE-ICALabel. These projects gave me the experience necessary to work in a large asynchronous team environment that is common in OSS. They also taught me how to begin contributing to an OSS project. This led me to scikit-learn.

I first got involved in scikit-learn as a user who was heavily interested in the decision tree models in scikit-learn (random forest, randomized trees). Here, I was interested in contributing a new oblique decision tree model that was a generalization of the existing random forest model. However, the code was not easily added to scikit-learn, and currently the decision to include it is inconclusive. Throughout this process, I learned about the challenges and intricacies of maintaining such a large OSS project as scikit-learn. It is not trivial to simply add new features to a large OSS project because code comes with a maintenance cost, and should fit with the current internal design. At that point in time, there were very few maintainers who were able to maintain the tree submodule, and as such new features were included conservatively.

I was eager to improve the project to enable more exciting features for the community, so I began contributing to scikit-learn starting with smaller issues such as documentation improvements, or minor bug fixes to get acquainted with the codebase. I also refactored various Cython code to begin upgrading the codebase, especially in the tree submodule. Throughout this process, I identified other projects the maintainer team was working on, and also contributed there. For example, I added metadata routing to a variety of different functions and estimators in scikit-learn. I also began reviewing PRs for the tree submodule and metadata routing where I had knowledge. I also added missing-value support for extremely randomized tree models (called ExtraTrees in scikit-learn). This allows users to pass in data that contains missing values (encoded as np.nan) to ExtraTrees. Around this time, I was invited to join the maintainer team of scikit-learn. More recently, I have taken on the project to add categorical data support to the decision tree models, which will make random forests and extremely randomized tree models more performant and able to handle real-world settings where categorical data is common.
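
The missing-value support described above can be tried with a sketch like the following. Whether it succeeds depends on the installed scikit-learn release, so the example guards the call; the synthetic data and the 5% missing rate are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Encode a handful of missing entries as np.nan.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

model = ExtraTreesClassifier(n_estimators=20, random_state=0)
try:
    # Works on releases that include missing-value support for ExtraTrees.
    model.fit(X, y)
    nan_supported = True
except ValueError:
    # Releases without this feature reject NaN during input validation.
    nan_supported = False

print(f"ExtraTrees accepted NaN input: {nan_supported}")
```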
To which OSS projects and communities do you contribute?

I currently primarily contribute to scikit-learn, PyWhy (a community for causal inference in Python), and also develop my own OSS project: treeple. Treeple is an exciting package that implements different decision tree models beyond those offered in scikit-learn with an efficient Cython implementation stemming from the scikit-learn tree internals.
What do you find alluring about OSS?

OSS is so exciting because of the impact it has. Everyone from private projects to other OSS projects will use OSS. Any fixes to documentation, performance improvements, or new features will potentially impact the workflows of potentially millions of people. This is what makes contributing to OSS so exciting. Moreover, this impact ensures that best practices are usually carried out in these projects, and it’s a great playground to learn from the best, while giving back to the larger community.
What pain points do you observe in community-led OSS?

Right now, community-led OSS moves very slowly in most places. This is for a number of very good reasons: i) not releasing buggy features that may impact millions of people, and ii) backwards compatibility. One of the challenges of maintaining a high-quality OSS project is that you would like to satisfy your users, who may all utilize different components of the project from different versions. As such, many community-led OSS projects take a conservative approach when implementing new features and new ideas. However, there may be many exciting, better features that are already known by the community, but still lack an OSS implementation.

I think this can be partially solved by increased funding for OSS, so OSS maintainers and developers are able to dedicate more time to maintaining and improving the projects. In addition, I think this can be improved if more developers in the community contribute to said OSS projects. I hope that I have convinced you though that contributing to OSS is impactful and highly educational.
If we discuss how far OS has evolved in 10 years, what would you like to see happen?

I think more interoperability and integrated workflows for projects will make projects that utilize OSS more streamlined and efficient. For example, right now there are different array libraries (e.g. numpy, cupy, xarray, pytorch, etc.), which all support some manner of an n-dimensional array, but with a slightly different API. This makes it very painful to transition across different libraries that use different arrays. In addition, there are multiple dataframe libraries, such as pandas and polars, and this problem of API consistency also arises there.

Some work has been done on the Array-API front to allow different array libraries to serve as backends given a common API. This will enable GPU acceleration for free without a single code change, which is great! This will be exciting because users will eventually only have to write code in a single way, and can then leverage any array/dataframe library that has different advantages and disadvantages based on the user’s use case.
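
To make the idea concrete, here is a minimal sketch of code written against a common array namespace instead of a specific library. The helpers `get_namespace` and `standardize` are hypothetical names introduced for this example, and only NumPy is exercised here; any library exposing `__array_namespace__` with `mean` and `std` could in principle be passed in.

```python
import numpy as np


def get_namespace(x):
    """Return the array namespace of x, falling back to NumPy."""
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np


def standardize(x):
    """Center and scale x using only functions from its own namespace."""
    xp = get_namespace(x)
    return (x - xp.mean(x)) / xp.std(x)


z = standardize(np.asarray([1.0, 2.0, 3.0, 4.0]))
print(z)
```

Because `standardize` never names a specific array library, the same function body can run on whichever backend produced its input.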
What are your hobbies, outside of work and open source?

I enjoy running, trying new restaurants and bars, cooking and reading. I’m currently training for a half-marathon, where my goal is to run under 8 minutes per mile. I’m also trying to perfect a salad with an Asian-themed dressing. In a past life, I was a bboy (breakdancer) for ten years until I stopped in graduate school because I got busy (and old).

Interview with Yao Xiao, scikit-learn Team Member

2024-07-18T00:00:00+00:00

Author: Reshama Shaikh, Yao Xiao

Yao Xiao recently earned his undergraduate degree in mathematics and computer science. He will be pursuing a Master’s degree in Computational Science and Engineering at Harvard SEAS. Yao joined the scikit-learn team in February 2024.

Tell us about yourself.

My name is Yao Xiao and I live in Shanghai, China. At the time of interview I have just got my Bachelor’s degree in Honors Mathematics and Computer Science at NYU Shanghai, and I’m going to pursue a Master’s degree in Computational Science and Engineering at Harvard SEAS. My current research interests are in networks and systems (e.g. sys4ml and ml4sys), but this may change in the future.
- GitHub: @Charlie
- LinkedIn: @yao-xiao
- Website: https://charlie-xiao.github.io
How did you first become involved in open source and scikit-learn?

In my junior year I took a course at NYU Courant called Open Source Software Development where we needed to make contributions to an open source software as our final project - and I chose scikit-learn.
We would love to learn of your open source journey.

I was lucky to get involved in a pretty easy meta-issue when I first started contributing to scikit-learn. I made quite a few PRs towards that issue, familiarizing myself with the coding standards, contributing workflow etc., during which I gradually explored the codebase and learned a lot from maintainers about how to write better code. After that meta-issue was completed, I decided to continue contributing since I enjoyed the experience, and I started looking through the open issues, tried reproducing and investigating them, then opened PRs for those that I was able to solve. It is a process of becoming familiar with more parts of the codebase, being able to make more PRs, and so on. While contributing to scikit-learn, sometimes there are also issues to solve upstream, so I also had opportunities to contribute to projects like pandas and pydata-sphinx-theme. To this day I’m still far from familiar with the entire scikit-learn project, but I will definitely continue the amazing open-source journey.
To which OSS projects and communities do you contribute?

I have contributed to scikit-learn, pandas, pydata-sphinx-theme, and sphinx-gallery. I’m also writing some small pieces of software that I have decided to make open source.
What do you find alluring about OSS?

It is amazing to feel that my code is being used by so many people all around the world through contributing to open source projects. Well, it might be inappropriate to say “my code”, but I do feel like I’m making some actual contributions to the community instead of just writing code for myself. Also, OSS makes me care about code quality and so on instead of merely making things “work”, which is very important for programmers but not really taught in school.
What pain points do you observe in community-led OSS?

Collaboration can lead to better code but also slows down the development process. Especially when there are not enough reviewers around, issues and PRs can easily get stale or forgotten. But I would say it’s more like a tradeoff rather than a pain point.
If we discuss how far OS has evolved in 10 years, what would you like to see happen?

I couldn’t say about the past 10 years since I’ve only been involved for about one and a half years, but regarding the scientific Python ecosystem I would like to see better coordination across projects (which is already happening). For instance a common interface for array libraries and dataframe libraries would allow downstream dependents to easily provide more flexible support for different input/output types, etc. And as a Chinese I would also hope that open source can thrive in my country some day as well.
What are your favorite resources, books, courses, conferences, etc?

As for physical books I would recommend The Pragmatic Programmer by Andy Hunt and Dave Thomas, and Refactoring: Improving the Design of Existing Code by Martin Fowler and Kent Beck. As for courses I like MIT’s The Missing Semester of Your CS Education. For learning Python in particular, The Python Tutorial in the official Python documentation is good enough for me. By the way, I want to mention that the documentation of most languages and popular packages is very nice and is the best place to learn the most up-to-date information.
What are your hobbies, outside of work and open source?

I would say my largest hobby is programming (not for school, not for work, just for fun). I’ve recently been fascinated with Tauri and wrote a lot of small desktop applications for myself in my spare time. Apart from this I also love playing the piano and I’m an anime lover, so I often listen to or play piano versions of anime theme songs (mostly arranged by Animenz).

Note on Inline Authorship Information in scikit-learn

2024-05-04T00:00:00+00:00

Author: Adrin Jalali

Historically, scikit-learn’s files have included authorship information similar to the following format:

# Authors: Author1, Author2, ...
# License: BSD 3 clause

However, after a series of discussions which you can see in detail in this issue, we could list the following caveats to the status quo:

Authorship information was not up-to-date and in most cases, but not always, reflected the original authors of the file;
It was unfair to all other contributors who have been contributing to the code-base;
One can check the real authors and the history of the authors of any part of the code-base using git blame and other git tools.

Therefore we came to the conclusion to standardize all authorship information to mention “The scikit-learn developers”, and have the license notice as:

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

The change is to happen gradually in the coming months after April 2024.

My mentored internship at scikit-learn

2023-11-27T00:00:00+00:00

Author: Stefanie Senger, François Goupil

How it is to be an Intern at scikit-learn

My name is Stefanie Senger, and I recently concluded a five-month mentored internship at scikit-learn, which was funded by NumFOCUS as a Small Development Grant with a clear focus on fostering diversity in open-source projects. The idea to couple a grant with mentorship traces back to Maren Westermann’s initiative. She envisioned a pathway to integrate more female coders into scikit-learn through internships and support. Scikit-learn would profit from fresh perspectives and some disruption. I was the guinea pig for an initial experiment, as Maren later told me.

Starting the Internship

As someone transitioning from a non-technical background to coding, working on scikit-learn was a big thing for me. I had participated in and taught at a data science boot camp, searching diligently for a first role in the field. I never doubted I could tackle more difficult tech challenges over time, but I knew there was much to learn. Scikit-learn had a heavy-tech aura to me, and when I discovered the internship ad, I just thought: this. I was genuinely taken aback when accepted for the role, though. There are many more experienced people looking for such an opportunity, after all.

When I got to know both my mentors, Adrin Jalali and Guillaume Lemaitre, better, it quickly became clear that only effort was required, and I could ask them any question along the way. I felt very welcome in the community, also by the other people I interacted with on GitHub.

What I Worked on

I began by working on documentation and examples such as “Multi-class AdaBoosted Decision Trees,” to make those more comprehensive and helpful for users. Then I took on some repetitive maintenance tasks on the code, where I could find out what to do from other contributors’ pull requests. Guillaume discovered that one AdaBoost algorithm required deprecation, and it fell on me to execute this. I had never looked at such a huge code base with so many layers of abstraction, and I had to learn quite a bit more Python to be able to go ahead. I even got the opportunity to present an “Intro to scikit-learn” workshop at EuroSciPy, the European conference on the scientific use of Python in Basel, where I also got to know many other contributors and people from the scikit-learn team at Inria.

Adrin introduced me to the challenging task of implementing a new feature for metadata routing, developed over many years by the scikit-learn community. It allows users to set metadata, such as sample weights, in meta-estimators, which can then be routed to sub-estimators and other algorithms that are able to consume it. This was partly uncharted territory and meant finding solutions where there was no predefined path and adapting tests to match the expected behavior. In the last two months of my internship, I implemented metadata routing into some meta-estimators, which was tremendously difficult but, once accomplished, has nourished my professional confidence since.

Mentorship in Action

Let me describe how the mentoring worked because Guillaume’s and Adrin’s support was invaluable. They would both literally drop their tasks when I had questions and right away point me in the right direction. I met Adrin twice a week, and we would co-work while I would throw questions at him. Guillaume was available remotely, and I knew he would jump into a video call with me when I needed help. They both made reviewing my PRs a priority, and I got feedback on my work regularly.

It was essential to have mentors signaling that it’s okay to be learning and to propose tasks to me. If I had come into the project individually, I might have hesitated to take on most of the issues I ended up working on, fearing that my skills were insufficient and that I would hinder the progress of the project rather than help it. The mentoring setting gave me a justification to try things that I wasn’t sure if I could do.

Becoming a Community Member

Looking ahead, I will continue contributing to scikit-learn. As I’ve gotten to know quite a few of the other contributors in person, I now feel part of the community. I know they care about values like openness and diversity, which I share, and while acknowledging the complexity of the code base, I know what I can learn from taking on issues and the sense of accomplishment when merging my solution into the main branch. And I love contributing to something meaningful, which is something I had always sought.

NVIDIA Is A New Sponsor Of The Scikit-Learn consortium at the Inria Foundation

2023-11-14T00:00:00+00:00

Author: NVIDIA, François Goupil

Sponsored blog post

We are thrilled to announce that NVIDIA has joined the scikit-learn consortium as a corporate partner. As a leading provider of GPU-accelerated computing solutions, we at NVIDIA recognize the importance of machine learning and the role it plays in the growth of many industries and areas of science. Our partnership with the scikit-learn consortium demonstrates our commitment to supporting the development and advancement of open-source software in the machine learning community.

Scikit-learn is a popular open-source Python library for machine learning. One of the strengths of scikit-learn is its ease of use and well-defined API. This makes it a favorite tool among data scientists and machine learning practitioners. Thanks to its active community and continuous development, scikit-learn is constantly evolving and improving.

At NVIDIA, we believe that investing in open-source projects like scikit-learn is important. After all, it is a central component of the modern data stack in both science and industry. By financially supporting the scikit-learn consortium, we are contributing to the long-term sustainability of scikit-learn and helping to ensure that it remains an easy-to-use, reliable and valuable tool for years to come. Furthermore, we hope to help advance the project’s development, improve its performance, and enhance its capabilities for machine learning on GPUs.

Our partnership with the scikit-learn consortium will also enable us to collaborate more closely with the scikit-learn community, and provide us with insights into how we can improve NVIDIA’s RAPIDS open-source libraries to better serve their needs. We are committed to working with the foundation to ensure that scikit-learn remains a powerful and easy to use machine learning library that meets the needs of data science practitioners in science and industry.

NVIDIA’s commitment to scikit-learn goes beyond financial support. We have hired Tim Head, an experienced open-source maintainer, to work full-time on the project. This is not Tim’s first open-source rodeo. He has previously contributed to several high-profile open-source projects, including Project Jupyter. His focus will be reviewing pull requests and coordinating the development of large features. Tim was recently elected as a core maintainer of scikit-learn. His expertise and experience will be invaluable in ensuring the continued growth and success of the project.

In summary, NVIDIA’s partnership with the scikit-learn consortium is an important step in our ongoing commitment to support the development and growth of open-source software in the machine learning community. We are excited to work with the foundation and the community of contributors to help advance the capabilities of scikit-learn and accelerate the development of machine learning applications.

AI helped write this blog post!

scikit-learn 2023 In-person Developer Sprint in Paris, France

2023-09-10T00:00:00+00:00

Author: Reshama Shaikh, François Goupil

During the week of June 19 to 23, 2023, the scikit-learn team held its first developers sprint since 2019! The sprint took place in Paris, France at the Dataiku office. The sprint event was an in-person event and had 32 participants.

The following scikit-learn team members joined the sprint:

Adrin Jalali
Arturo Amor Quiroz
François Goupil (@francoisgoupil)
Franck Charras (@fcharras)
Gaël Varoquaux (@GaelVaroquaux)
Guillaume Lemaitre (@glemaitre)
Jérémie du Boisberranger (@jeremiedbb)
Joris Van den Bossche
Julien Jerphanion (@jjerphan)
Loïc Estève
Maren Westermann
Olivier Grisel (@ogrisel)
Roman Yurchak
Thomas Fan
Tim Head (@betatim)

The following community members joined the sprint:

Alexandre Landeau
Alexandre Vigny
Chaine San Buenaventura
Camille Troillard
Denis Engemann
Franck Charras
Harizo Rajaona
Ines (intern at Dataiku)
Jovan Stojanovic
Leo Dreyfus-Schmidt
Léo Grinsztajn
Lilian Boulard
Louis Fouquet
Riccardo Cappuzzo
Samuel Ronsin
Vincent Maladière
Yann Lechelle

scikit-learn Developer Sprint, Paris, June 2023; Photo credit: Copyright Inria / Photo B. Fourrier, June 2023; (from left to right, back to front): Last Row: Denis Engemann, Riccardo Cappuzzo, François Goupil, Tim Head, Guillaume Lemaitre, Louis Fouquet, Jérémie du Boisberranger, Franck Charras, Léo Grinsztajn, Arturo Amor Quiroz. Middle Row: Thomas Fan, Lilian Boulard, Gaël Varoquaux, Ines, Jovan Stojanovic, Chaine San Buenaventura. First Row: Olivier Grisel, Harizo Rajaona, Vincent Maladière.

Topics covered at the sprint

PR #13649: Monotonic constraints for Tree-based models
Discussed the vision/future directions for the project. What is important to keep the project relevant in the future?
Should we share some points beyond the vision statement?
Thomas F will try and create a vision statement
Discussed what people are keeping an eye on with a two year time scale in mind in terms of technology and developments that are relevant.
Tim: keep improving our documentation (not just expanding it but also “gardening” to keep it readable)
Tim: increase active outreach and communication about new features/improvements and other changes. A lot of cool things in scikit-learn are virtually unknown to the wider public (e.g. Hist grad boosting being on par with lightgbm in terms of performance, …)

What is next?

We are discussing co-locating with OpenML in 2024 in Berlin, Germany to organize another developers’ sprint.

scikit-learn Developer Sprint, Paris, June 2023; Photo credit: Copyright Inria / Photo B. Fourrier, June 2023; (from left to right): Thomas Fan, Olivier Grisel

Interview with Meekail Zain, scikit-learn Team Member

2022-11-30T00:00:00+00:00

Author: Reshama Shaikh , Meekail Zain

Posted by Sangam SwadiK

Meekail Zain is a computer science PhD student at the University of Georgia (USA), a member of the Quinn Research Group, and a software engineer at Quansight. Meekail officially joined the scikit-learn team as a maintainer in October 2022.

Tell us about yourself.

I’m currently attending the University of Georgia, pursuing a PhD in computer science. My area of research predominantly focuses on deep learning, generative modeling, and statistical approaches to clustering. I’m in my third year and, at the time of writing, about to begin my comprehensive exams.
- GitHub: @Micky774
- LinkedIn: @meekail-zain

How did you first become involved in open source and scikit-learn?

I first got involved as a user, as most people do. NumPy was a recurring day-to-day library for me, and scikit-learn was a de facto necessity for several graduate courses. Originally I never really imagined being able to get to a point where I could effect change in these libraries since they seemed so well-established!

We would love to learn of your open source journey.

My journey really kicked off when I went to work at Quansight and received funding through the NASA ROSES grant to be able to dedicate time to contributing to scikit-learn. It was a huge jump from what I had known up until that point. I learned Python very informally in order to be able to use PyTorch to develop/deploy models for my research, and had little-to-no experience with things like continuous integration or strong API. At first I felt incredibly intimidated and unqualified, but at the same time absolutely thrilled that I was in a position to learn so many new things! I started working on really simple changes to get used to the contribution workflow — things like removing excess whitespace and fixing typos — and then graduated to slightly more complex tasks. Eventually I got to the point where I started to “understand” small corners of the codebase and could actually offer help on new issues because of that familiarity. After that, I started reviewing others’ pull requests (PRs) and offering feedback in an unofficial capacity, as well as taking on more challenging tasks across the codebase. That process of growth and escalation is still ongoing, and truly I hope it never ends.

To which OSS projects and communities do you contribute?

NumPy, scikit-learn, and SciPy. Right now it is heavily skewed towards scikit-learn with NumPy being second most, but I’m hoping to take some more time to work on SciPy in the near future!

What advice or tips do you have for people starting out in your field of work?

Find a way to enjoy the feeling of being surrounded by things that you haven’t yet mastered. If you aim for growth — and indeed I think we all should — then you’ll find that you spend the majority of your time surrounded by things that you don’t quite understand, and the natural reaction to that is frustration and intimidation. If you can somehow convince yourself to also be excited by such an environment, you’ll find yourself growing every single day. Nobody starts off knowing everything :)

What do you find alluring about OSS?

This is a tough one, there are many amazing points. If I had to select just a few, it would be (in no particular order):
- The growth potential
- The community
- The impact

I’ve already discussed the growth potential so I’ll leave it at that.

The community is fantastic as well! On every project the community base has its own unique personality of sorts, and they are all wonderful! It’s amazing being able to see recurring users that post interesting issues, or take a stab at opening more complex PRs (pull requests). There’s a strong sense of companionship with the people that are also trying to improve the same project as you! It’s akin to a very niche club in high school. It’s a wonderful experience finding people obsessed with the same cool project as you are.

Finally, the impact. At the end of the day, the work we do has some serious consequences. Each project is essential to so many different workflows and enables brilliant researchers and software engineers to build complex systems and solutions to cutting-edge problems. It’s sometimes surreal to think about how essential some of these projects really are.

What pain points do you observe in community-led OSS?

Consensus is difficult. This is a double-edged sword, since it carries some benefits too. With community-led OSS, changes at every scale need to meet some kind of consensus. This ensures that the changes are well thought out and provides a layer of safety since the chance of uncaught mistakes propagating goes down with the number of people carefully reviewing changes (for the most part).

For example, in scikit-learn a PR with changes to code needs to meet a lazy consensus where two official reviewers (currently just core developers) explicitly approve, and no other official reviewer officially disapproves. Going a bit further up, a new feature request in a project could require the consensus of several core developers that are well-versed in the topic area. Large systemic changes manifest in the form of SLEPs (scikit-learn enhancement proposals) which require a ⅔ consensus across all core developers. Above even that, there are cross-community discussions where the idea of a “consensus” itself isn’t always really clear.

This system is a critical one, but there are important issues intrinsic to it that need to be addressed. For example, who gets to contribute to a consensus at each scale? What qualifications does one need, and how do we codify that? There’s also the intrinsic tradeoff where the stronger the consensus required, the less likely it is that changes will be adopted. This is by design since wide-reaching changes need to be held to high standards, but it does also mean that occasionally even for narrow-scoped problems no solution will be reached despite options being raised that are better than the status quo.

If we discuss how far OSS has evolved in 10 years, what would you like to see happen?

I can’t speak to its evolution in the past 10 years, since I am still fairly new to OSS overall, but I would like to see systematic data-driven analysis on contributors’ needs. Different OSS projects have issued contributor surveys in the past, but in general I think a lot of emphasis is placed on the feedback given from users in meta issues or over community calls. While that is definitely helpful, there’s a lot of extrapolation that takes place when projects try to determine the needs of their contributor base like this.

Some questions I would love to see studied include:
- What distribution does the expertise of the contributor base follow?
- What are the greatest bottlenecks at each level of expertise?
- Aside from expertise, are there other socio-economic or general demographics that exhibit consistent bottlenecks? (e.g. access to hardware)
- How do we create informed and effective DEI policies from this information?

OSS projects thrive and prosper based on their community, so I would love to see more systematic research on community needs and pain points.

What are your favorite resources, books, courses, conferences, etc?

I absolutely adore “Probability and Statistics” by Evans and Rosenthal. It does a fantastic job of constructing a lot of otherwise daunting statistical concepts from very elementary ideas. It is my favorite book to recommend to eager students that do not have a rigorous foundation in probability and statistics, since this book does a great job of building up the reader’s intuition and making everything feel natural and derived, rather than arbitrarily defined.

Regarding conferences, I have to go with SciPy! I was definitely scared going into the conference thinking that I would be the least-qualified person in every room and that I’d have nothing to talk about. I realized very quickly that there is always something to talk about, and qualifications don’t matter. It’s a gathering of super passionate people that are each eager to talk about the things that interest them, so regardless of whether you’re an expert or a beginner, they will happily explain things to you. Every single attendee has some area, no matter how specific, that they can talk about for hours. That genuine interest and excitement felt rejuvenating and reminded me why I love OSS so much.

What are your hobbies, outside of work and open source?

I really enjoy hiking, camping and playing DnD (Dungeons & Dragons)! Camping especially is an important hobby for me since whenever I have a computer in reach I feel inclined to check my GitHub notifications, so the occasional total disconnect for a weekend is a fantastic tool for me to give myself a break with no pressure of “I could work on that new feature right now…”

If you have ever had difficulty with relaxing because of that little voice in your head that says “How dare you relax? You could be doing this and that right now!” then I highly recommend going camping, even just for one night! When that voice strikes during camping, I retort “Ah but you see, I don’t have my laptop, so I can’t work on that right now. All I can do right now is relax.” and suddenly the anxiety washes away :)

Pandas DataFrame Output for sklearn Transformers

2022-11-08T00:00:00+00:00

Author: Sangam SwadiK

Video

Upcoming feature in release 1.2

Starting with the next release of scikit-learn (v1.2), pandas DataFrame output will be available for all sklearn transformers! This will make running pipelines on DataFrames much easier and provide better ways to track feature names. Previously, mapping a transformed output back into columns was cumbersome, since the mapping is not always one-to-one in cases of complex preprocessing (e.g., polynomial features).

The pandas DataFrame output feature for transformers solves this by tracking features generated from pipelines automatically. The transformer output format can be configured explicitly for either NumPy or pandas output formats as shown in sklearn.set_config and the sample code below.

from sklearn import set_config
# Globally request pandas DataFrame output from transform and fit_transform
set_config(transform_output="pandas")

See the sample notebook, pandas-dataframe-output-for-sklearn-transformer.ipynb, and the documentation for a more detailed example and usage.

Links to documentation and example notebook

Reporting bugs

We’d love your feedback on this. If you have suggestions or find bugs, please report them at scikit-learn issues.

Thanks 🙏🏾 to maintainers: Thomas J. Fan, Guillaume Lemaitre, Christian Lorentzen!!