<p>scikit-learn Blog: The official blog of scikit-learn, an open source library for machine learning in Python.</p>
<p>Feed: https://blog.scikit-learn.org/feed.xml | Generated by Jekyll | Updated: 2023-12-09T20:30:29+00:00</p>
{"bio"=>"Open source library for machine learning in Python.", "links"=>[{"label"=>"GitHub", "icon"=>"fab fa-fw fa-github-square", "url"=>"https://github.com/scikit-learn"}, {"label"=>"Twitter", "icon"=>"fab fa-fw fa-twitter-square", "url"=>"https://twitter.com/scikit_learn"}, {"label"=>"YouTube", "icon"=>"fab fa-fw fa-youtube", "url"=>"https://youtube.com/channel/UCJosFjYm0ZYVUARxuOZqnnw"}, {"label"=>"LinkedIn", "icon"=>"fab fa-fw fa-linkedin", "url"=>"https://linkedin.com/company/scikit-learn/"}, {"label"=>"Facebook", "icon"=>"fab fa-fw fa-facebook-square", "url"=>"https://facebook.com/scikitlearnofficial/"}, {"label"=>"Instagram", "icon"=>"fab fa-fw fa-instagram", "url"=>"https://instagram.com/scikitlearnofficial/"}]}
<h1>My mentored internship at scikit-learn</h1>
<p>2023-11-27T00:00:00+00:00 | https://blog.scikit-learn.org/diversity/mentoring</p>
<div>
Author:
<a itemprop="sameAs" content="https://github.com/StefanieSenger" href="https://github.com/StefanieSenger" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/stefanie-senger.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Stefanie Senger</a>
<a href="mailto:stefanie.senger@posteo.de" title="stefanie.senger@posteo.de"><span><i class="elastic-fai fas fa-envelope"></i></span></a>
,
<a itemprop="sameAs" content="https://github.com/francoisgoupil" href="https://github.com/francoisgoupil" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/francois_goupil.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />François Goupil</a>
<a href="mailto:francois.goupil@inria.fr" title="francois.goupil@inria.fr"><span><i class="elastic-fai fas fa-envelope"></i></span></a>
<br /><br />
</div>
<h2 id="how-it-is-to-be-an-intern-at-scikit-learn">What It Is Like to Be an Intern at scikit-learn</h2>
<p>My name is Stefanie Senger, and I recently concluded a five-month mentored internship at scikit-learn. It was funded by a NumFOCUS Small Development Grant with a clear focus on fostering diversity in open-source projects. The idea to couple a grant with mentorship traces back to Maren Westermann’s initiative. She envisioned a pathway to integrate more female coders into scikit-learn through internships and support. Scikit-learn would profit from fresh perspectives and some disruption. I was the guinea pig for an initial experiment, as Maren later told me.</p>
<h2 id="starting-the-internship">Starting the Internship</h2>
<p>As someone transitioning from a non-technical background to coding, working on scikit-learn was a big thing for me. I had participated in and taught at a data science boot camp, searching diligently for a first role in the field. I never doubted I could tackle more difficult tech challenges over time, but I knew there was much to learn. Scikit-learn had a heavy-tech aura to me, and when I discovered the internship ad, I just thought: this. I was genuinely taken aback when accepted for the role, though. There are many more experienced people looking for such an opportunity, after all.</p>
<p>As I got to know my two mentors, Adrin Jalali and Guillaume Lemaitre, it quickly became clear that only effort was required and that I could ask them any question along the way. I felt very welcome in the community, including by the other people I interacted with on GitHub.</p>
<h2 id="what-i-worked-on">What I Worked on</h2>
<p>I began by working on documentation and examples such as “Multi-class AdaBoosted Decision Trees,” to make them more comprehensive and helpful for users. Then I took on some maintenance tasks on the code; these were repetitive, so I could find out what to do from other contributors’ pull requests. Guillaume discovered that one AdaBoost algorithm required deprecation, and it fell to me to carry this out. I had never looked at such a huge code base with so many layers of abstraction, and I had to learn quite a bit more Python to be able to go ahead. I even got the opportunity to present an “Intro to scikit-learn” workshop at EuroSciPy, the European conference on the scientific use of Python, in Basel, where I also got to know many other contributors and people from the scikit-learn team at Inria.</p>
<p>Adrin introduced me to the challenging task of working on metadata routing, a new feature developed over many years by the scikit-learn community. It allows users to pass metadata, such as sample weights, to meta-estimators, which route it to the sub-estimators and other objects that are able to consume it. This was partly uncharted territory and meant finding solutions where there was no predefined path and adapting tests to match the expected behavior. In the last two months of my internship, I implemented metadata routing in some meta-estimators, which was tremendously difficult but, once accomplished, has nourished my professional confidence ever since.</p>
<h2 id="mentorship-in-action">Mentorship in Action</h2>
<p>Let me describe how the mentoring worked because Guillaume’s and Adrin’s support was invaluable. They would both literally drop their tasks when I had questions and point me in the right direction right away. I met Adrin twice a week, and we would co-work while I threw questions at him. Guillaume was available remotely, and I knew he would jump into a video call with me when I needed help. They both made reviewing my PRs a priority, and I got feedback on my work regularly.</p>
<p>It was essential to have mentors who signaled that it’s okay to be learning and who proposed tasks to me. If I had come into the project on my own, I might have hesitated to take on most of the issues I ended up working on, fearing that my skills were insufficient and that I would hinder the progress of the project rather than help it. The mentoring setting gave me a justification to try things that I wasn’t sure I could do.</p>
<h2 id="becoming-a-community-member">Becoming a Community Member</h2>
<p>Looking ahead, I will continue contributing to scikit-learn. As I’ve gotten to know quite a few of the other contributors in person, I now feel part of the community. I know they care about values like openness and diversity, which I share. While I acknowledge the complexity of the code base, I know what I can learn from taking on issues, and I know the sense of accomplishment that comes with merging my solution into the main branch. And I love contributing to something meaningful, which is something I had always sought.</p>
<h1>NVIDIA Is A New Sponsor Of The Scikit-Learn consortium at the Inria Foundation</h1>
<p>2023-11-14T00:00:00+00:00 | https://blog.scikit-learn.org/funding/nvidia-is-a-new-sponsor</p>
<div>
<img src="/assets/images/posts_images/NVIDIAxsklearn.jpg" alt="" />
Author:
<a itemprop="sameAs" content="https://developer.nvidia.com/gpu-accelerated-libraries" href="https://developer.nvidia.com/gpu-accelerated-libraries" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/nvidia-logo.png" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />NVIDIA</a>
,
<a itemprop="sameAs" content="https://github.com/francoisgoupil" href="https://github.com/francoisgoupil" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/francois_goupil.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />François Goupil</a>
<a href="mailto:francois.goupil@inria.fr" title="francois.goupil@inria.fr"><span><i class="elastic-fai fas fa-envelope"></i></span></a>
<br /><br />
</div>
<p><span style="color:red"><em>Sponsored blog post</em> </span></p>
<p>We are thrilled to announce that <a href="https://www.nvidia.com">NVIDIA</a> has joined the <a href="https://scikit-learn.fondation-inria.fr/">scikit-learn consortium</a> as a corporate partner. As a leading provider of GPU-accelerated computing solutions, we at NVIDIA recognize the importance of machine learning and the role it plays in the growth of many industries and areas of science. Our partnership with the scikit-learn consortium demonstrates our commitment to supporting the development and advancement of open-source software in the machine learning community.</p>
<div>
<video preload="auto" autoplay="" loop="" muted="muted" volume="0">
<source src="/assets/videos/NVIDIAxsklearn.mp4" type="video/mp4" />
</video>
</div>
<p><a href="https://scikit-learn.org/stable/">Scikit-learn</a> is a popular open-source Python library for machine learning. One of the strengths of scikit-learn is its ease of use and well-defined API. This makes it a favorite tool among data scientists and machine learning practitioners. Thanks to its active community and continuous development, scikit-learn is constantly evolving and improving.</p>
<p>At NVIDIA, we believe that investing in open-source projects like scikit-learn is important. After all, it is a central component of the modern data stack in both science and industry. By financially supporting the scikit-learn consortium, we are contributing to the long-term sustainability of scikit-learn and helping to ensure that it remains an easy-to-use, reliable, and valuable tool for years to come. Furthermore, we hope to help advance the project’s development, improve its performance, and enhance its capabilities for machine learning on GPUs.</p>
<p>Our partnership with the scikit-learn consortium will also enable us to collaborate more closely with the scikit-learn community, and provide us with insights into how we can improve NVIDIA’s <a href="https://developer.nvidia.com/rapids">RAPIDS open-source libraries</a> to better serve their needs. We are committed to working with the foundation to ensure that scikit-learn remains a powerful and easy-to-use machine learning library that meets the needs of data science practitioners in science and industry.</p>
<p>NVIDIA’s commitment to scikit-learn goes beyond financial support. We have hired <a href="https://betatim.github.io">Tim Head</a>, an experienced open-source maintainer, to work full-time on the project. This is not Tim’s first open-source rodeo. He has previously contributed to several high-profile open-source projects, including Project Jupyter. His focus will be reviewing pull requests and coordinating the development of large features. Tim was recently elected as a core maintainer of scikit-learn. His expertise and experience will be invaluable in ensuring the continued growth and success of the project.</p>
<p>In summary, NVIDIA’s partnership with the scikit-learn consortium is an important step in our ongoing commitment to support the development and growth of open-source software in the machine learning community. We are excited to work with the foundation and the community of contributors to help advance the capabilities of scikit-learn and accelerate the development of machine learning applications.</p>
<p>AI helped write this blog post!</p>
<h1>scikit-learn 2023 In-person Developer Sprint in Paris, France</h1>
<p>2023-09-10T00:00:00+00:00 | https://blog.scikit-learn.org/events/paris-dev-sprint</p>
<div>
Author:
<a itemprop="sameAs" content="https://reshamas.github.io" href="https://reshamas.github.io" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/reshama_shaikh.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Reshama Shaikh</a>
,
<a itemprop="sameAs" content="https://github.com/francoisgoupil" href="https://github.com/francoisgoupil" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/francois_goupil.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />François Goupil</a>
<a href="mailto:francois.goupil@inria.fr" title="francois.goupil@inria.fr"><span><i class="elastic-fai fas fa-envelope"></i></span></a>
<br /><br />
</div>
<p>During the week of June 19 to 23, 2023, the scikit-learn team held its first developer sprint since 2019! The in-person sprint took place at the Dataiku office in Paris, France, and had 32 participants.</p>
<p>The following <a href="https://scikit-learn.org/stable/about.html">scikit-learn team members</a> joined the sprint:</p>
<ol>
<li>Adrin Jalali</li>
<li>Arturo Amor Quiroz</li>
<li>François Goupil (@francoisgoupil)</li>
<li>Franck Charras (@fcharras)</li>
<li>Gaël Varoquaux (@GaelVaroquaux)</li>
<li>Guillaume Lemaitre (@glemaitre)</li>
<li>Jérémie du Boisberranger (@jeremiedbb)</li>
<li>Joris Van den Bossche</li>
<li>Julien Jerphanion (@jjerphan)</li>
<li>Loïc Estève</li>
<li>Maren Westermann</li>
<li>Olivier Grisel (@ogrisel)</li>
<li>Roman Yurchak</li>
<li>Thomas Fan</li>
<li>Tim Head (@betatim)</li>
</ol>
<p>The following community members joined the sprint:</p>
<ol>
<li>Alexandre Landeau</li>
<li>Alexandre Vigny</li>
<li>Chaine San Buenaventura</li>
<li>Camille Troillard</li>
<li>Denis Engemann</li>
<li>Franck Charras</li>
<li>Harizo Rajaona</li>
<li>Ines (intern at Dataiku)</li>
<li>Jovan Stojanovic</li>
<li>Leo Dreyfus-Schmidt</li>
<li>Léo Grinsztajn</li>
<li>Lilian Boulard</li>
<li>Louis Fouquet</li>
<li>Riccardo Cappuzzo</li>
<li>Samuel Ronsin</li>
<li>Vincent Maladière</li>
<li>Yann Lechelle</li>
</ol>
<figure>
<img src="/assets/images/posts_images/2023-paris-sprint/paris_2023.jpg" alt="group of people who participated in the sprint" max-width="20%" max-height="20%" />
<figcaption>
scikit-learn Developer Sprint, Paris, June 2023; Photo credit: <a href=" "> Copyright: Inria / Photo B. Fourrier, June 2023</a>; (from left to right, back to front):
Last Row: Denis Engemann, Riccardo Cappuzzo, François Goupil, Tim Head, Guillaume Lemaitre, Louis Fouquet, Jérémie du Boisberranger, Franck Charras, Léo Grinsztajn, Arturo Amor Quiroz.
Middle Row: Thomas Fan, Lilian Boulard, Gaël Varoquaux, Ines, Jovan Stojanovic, Chaine San Buenaventura.
First Row: Olivier Grisel, Harizo Rajaona, Vincent Maladière.
</figcaption>
</figure>
<h2 id="sponsors">Sponsors</h2>
<ul>
<li>Dataiku provided the space and some of the food, as well as all of the coffee.</li>
<li>The scikit-learn consortium organized the sprint and paid for lunch as well as travel and accommodation expenses.</li>
</ul>
<h2 id="topics-covered-at-the-sprint">Topics covered at the sprint</h2>
<ul>
<li>PR #13649: <a href="https://github.com/scikit-learn/scikit-learn/pull/13649">Monotonic constraints for Tree-based models</a></li>
<li>Discussed the vision and future directions for the project, in particular what is important to keep the project relevant in the future.</li>
<li>Should we share some points beyond the vision statement?</li>
<li>Thomas Fan will try to draft a vision statement.</li>
<li>Discussed which technologies and developments people are keeping an eye on, with a two-year time scale in mind.</li>
<li>Tim: keep improving our documentation (not just expanding it but also “gardening” to keep it readable).</li>
<li>Tim: increase active outreach and communication about new features, improvements, and other changes. A lot of cool things in scikit-learn are virtually unknown to the wider public (e.g. histogram-based gradient boosting being on par with LightGBM in terms of performance).</li>
</ul>
<h3 id="what-is-next">What is next?</h3>
<p>We are discussing co-locating with OpenML in 2024 in Berlin, Germany to organize another developers’ sprint.</p>
<figure>
<img src="/assets/images/posts_images/2023-paris-sprint/thomas_olivier.jpg" alt="group of people who participated in the sprint" max-width="20%" max-height="20%" />
<figcaption>
scikit-learn Developer Sprint, Paris, June 2023; Photo credit: <a href=" "> Copyright Inria / Photo B. Fourrier, June 2023</a>; (from left to right): Thomas Fan, Olivier Grisel
</figcaption>
</figure>
<h1>Interview with Meekail Zain, scikit-learn Team Member</h1>
<p>2022-11-30T00:00:00+00:00 | https://blog.scikit-learn.org/team/meekail-zain-interview</p>
<div>
<img src="/assets/images/posts_images/meekail-zain-interview.png" alt="" />
Author:
<a itemprop="sameAs" content="https://reshamas.github.io" href="https://reshamas.github.io" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/reshama_shaikh.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Reshama Shaikh</a>
,
<a itemprop="sameAs" content="https://www.linkedin.com/in/meekail-zain-02a412a2/" href="https://www.linkedin.com/in/meekail-zain-02a412a2/" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/meekail-zain.jpg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Meekail Zain</a>
<br /><br />
</div>
<p>Posted by <a href="https://www.linkedin.com/in/sangam-swadi-k/">Sangam SwadiK</a></p>
<p>Meekail Zain is a computer science PhD student at the University of Georgia (USA), a member of the Quinn Research Group, and a software engineer at Quansight. Meekail officially joined the scikit-learn team as a maintainer in October 2022.</p>
<ol>
<li>
<p><strong>Tell us about yourself.</strong></p>
<p>I’m currently attending the University of Georgia, pursuing a PhD in computer science. My area of research predominantly focuses on deep learning, generative modeling, and statistical approaches to clustering. I’m in my third year, and at the time of writing about to begin my comprehensive exams.</p>
<ul>
<li>GitHub: <a href="https://github.com/Micky774">@Micky774</a></li>
<li>LinkedIn: <a href="https://www.linkedin.com/in/meekail-zain-02a412a2/">@meekail-zain</a></li>
</ul>
</li>
<li>
<p><strong>How did you first become involved in open source and scikit-learn?</strong></p>
<p>I first got involved as a user, as most people do. NumPy was a recurring day-to-day library for me, and scikit-learn was a de facto necessity for several graduate courses. Originally I never really imagined being able to get to a point where I could effect change in these libraries since they seemed so well-established!</p>
</li>
<li>
<p><strong>We would love to learn of your open source journey.</strong></p>
<p>My journey really kicked off when I went to work at Quansight and received funding through the <a href="https://numfocus.medium.com/numfocus-projects-receive-nasa-grants-deee374e7a57">NASA ROSES grant</a> to be able to dedicate time to contributing to scikit-learn. It was a huge jump from what I had known up until that point. I learned Python very informally in order to be able to use PyTorch to develop and deploy models for my research, and had little-to-no experience with things like continuous integration or strong API design. At first I felt incredibly intimidated and unqualified, but at the same time absolutely thrilled that I was in a position to learn so many new things!
<em><span style="background-color: #CAE9F5;">
I started working on really simple changes to get used to the contribution workflow — things like removing excess whitespace and fixing typos
</span></em>
— and then graduated to slightly more complex tasks. Eventually I got to the point where I started to “understand” small corners of the codebase and could actually offer help on new issues because of that familiarity. After that,<em><span style="background-color: #CAE9F5;"> I started reviewing others’ pull requests (PRs) and offering feedback in an unofficial capacity</span></em>, as well as taking on more challenging tasks across the codebase. That process of growth and escalation is still ongoing, and truly I hope it never ends.</p>
</li>
<li>
<p><strong>To which OSS projects and communities do you contribute?</strong></p>
<p>NumPy, scikit-learn, and SciPy. Right now it is heavily skewed towards scikit-learn, with NumPy in second place, but I’m hoping to take some more time to work on SciPy in the near future!</p>
</li>
<li>
<p><strong>What advice or tips do you have for people starting out in your field of work?</strong></p>
<p><em><span style="background-color: #CAE9F5;">Find a way to enjoy the feeling of being surrounded by things that you haven’t yet mastered</span></em>. If you aim for growth — and indeed I think we all should — then you’ll find that you spend the majority of your time surrounded by things that you don’t quite understand, and the natural reaction to that is frustration and intimidation. If you can somehow convince yourself to also be excited by such an environment, you’ll find yourself growing every single day. Nobody starts off knowing everything :)</p>
</li>
<li>
<p><strong>What do you find alluring about OSS?</strong></p>
<p>This is a tough one, there are many amazing points. If I had to select just a few, it would be (in no particular order):</p>
<ul>
<li>The growth potential</li>
<li>The community</li>
<li>The impact</li>
</ul>
<p>I’ve already discussed the growth potential so I’ll leave it at that.</p>
<p>The <strong>community</strong> is fantastic as well! On every project the community base has its own unique personality of sorts, and they are all wonderful! It’s amazing being able to see recurring users that post interesting issues, or take a stab at opening more complex PRs (pull requests). There’s a strong sense of companionship with the people that are also trying to improve the same project as you! It’s akin to a very niche club in high school. It’s a wonderful experience finding people obsessed with the same cool project as you are.</p>
<p>Finally, the <strong>impact</strong>. At the end of the day, the work we do has some serious consequences. Each project is essential to so many different workflows and enables brilliant researchers and software engineers to build complex systems and solutions to cutting edge problems. It’s sometimes surreal to think about how essential some of these projects really are.</p>
</li>
<li>
<p><strong>What pain points do you observe in community-led OSS?</strong></p>
<p><em><span style="background-color: #CAE9F5;">Consensus is difficult</span></em>. This is a double-edged sword, since it carries some benefits too. With community-led OSS, changes at every scale need to meet <em>some</em> kind of consensus.<em><span style="background-color: #CAE9F5;"> This ensures that the changes are well thought out and provides a layer of safety since the chance of uncaught mistakes propagating goes down with the number of people carefully reviewing changes</span></em> (for the most part).</p>
<p>For example, in scikit-learn a PR with changes to code needs to meet a lazy consensus where two official reviewers (currently just core developers) explicitly approve, and no other official reviewer officially disapproves. Going a bit further up, a new feature request in a project could require the consensus of several core developers that are well-versed in the topic area. Large systemic changes manifest in the form of <a href="https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep_template.html">SLEPs</a> (scikit-learn enhancement proposals) which require a ⅔ consensus across all core developers. Above even that, there are cross-community discussions where the idea of a “consensus” itself isn’t always really clear.</p>
<p>This system is a critical one, but there are important issues intrinsic to it that need to be addressed. For example, who gets to contribute to a consensus at each scale? What qualifications does one need, and how do we codify that? There’s also the intrinsic tradeoff where the stronger the consensus required, the less likely it is that changes will be adopted. This is by design since wide-reaching changes need to be held to high standards, but it does also mean that occasionally even for narrow-scoped problems no solution will be reached despite options being raised that are better than the status quo.</p>
</li>
<li>
<p><strong>If we discuss how far OSS has evolved in 10 years, what would you like to see happen?</strong></p>
<p>I can’t speak to its evolution in the past 10 years, since I am still fairly new to OSS overall, but <em><span style="background-color: #CAE9F5;">I would like to see systematic data-driven analysis of contributors’ needs</span></em>. Different OSS projects have issued contributor surveys in the past, but in general I think a lot of emphasis is placed on the feedback given by users in meta issues or over community calls. While that is definitely helpful, there’s a lot of extrapolation that takes place when projects try to determine the needs of their contributor base like this.</p>
<p>Some questions I would love to see studied include:</p>
<ul>
<li>What distribution does the expertise of the contributor base follow?</li>
<li>What are the greatest bottlenecks at each level of expertise?</li>
<li>Aside from expertise, are there other socio-economic or general demographics that exhibit consistent bottlenecks? (e.g. access to hardware)</li>
<li>How do we create informed and effective DEI policies from this information?</li>
</ul>
<p><em><span style="background-color: #CAE9F5;">
OSS projects thrive and prosper based on their community, so I would love to see more systematic research on community needs and pain points.</span></em></p>
</li>
<li>
<p><strong>What are your favorite resources, books, courses, conferences, etc?</strong></p>
<p>I absolutely adore <a href="https://www.utstat.toronto.edu/mikevans/jeffrosenthal/">“Probability and Statistics” by Evans and Rosenthal</a>. It does a fantastic job of constructing a lot of otherwise daunting statistical concepts from very elementary ideas. It is my favorite book to recommend to eager students that do not have a rigorous foundation in probability and statistics, since this book does a great job of building up the reader’s intuition and making everything feel natural and derived, rather than arbitrarily defined.</p>
<p>Regarding conferences, I have to go with <a href="https://conference.scipy.org/">SciPy</a>! I was definitely scared going into the conference thinking that I would be the least-qualified person in every room and that I’d have nothing to talk about. I realized very quickly that there is <em>always</em> something to talk about, and qualifications don’t matter. It’s a gathering of super passionate people that are each eager to talk about the things that interest them, so regardless of whether you’re an expert or a beginner, they will <em>happily</em> explain things to you. Every single attendee has some area, no matter how specific, that they can talk about for hours. That genuine interest and excitement felt rejuvenating and reminded me why I love OSS so much.</p>
</li>
<li>
<p><strong>What are your hobbies, outside of work and open source?</strong></p>
<p>I really enjoy hiking, camping and playing DnD (Dungeons & Dragons)! Camping especially is an important hobby for me since whenever I have a computer in reach I feel inclined to check my GitHub notifications, so the occasional total disconnect for a weekend is a fantastic tool for me to give myself a break with no pressure of “I <em>could</em> work on that new feature right now…”</p>
<p>If you have ever had difficulty with relaxing because of that little voice in your head that says “How dare you relax? You could be doing <em>this</em> and <em>that</em> right now!” then I highly recommend going camping, even just for one night! When that voice strikes during camping, I retort “Ah but you see, I don’t have my laptop, so I <em>can’t</em> work on that right now. All I can do right now is relax.” and suddenly the anxiety washes away :)</p>
</li>
</ol>
<h1>Pandas DataFrame Output for sklearn Transformers</h1>
<p>2022-11-08T00:00:00+00:00 | https://blog.scikit-learn.org/technical/pandas-dataframe-output-for-sklearn-transformer</p>
<div>
<img src="/assets/images/posts_images/pandas_output_sklearn_transformers.PNG" alt="" />
Author:
<a itemprop="sameAs" content="https://www.linkedin.com/in/sangam-swadi-k/" href="https://www.linkedin.com/in/sangam-swadi-k/" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/sangam_swadik.jpg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Sangam SwadiK</a>
<br /><br />
</div>
<h2 id="video">Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/5bCg8VfX2x8" title="Pandas DataFrame Output for sklearn Transformers" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="upcoming-feature-in-release-12">Upcoming feature in release 1.2</h2>
<p>Starting with the next release of <a href="https://github.com/scikit-learn/scikit-learn">scikit-learn</a> (v1.2), pandas dataframe output will be available for all sklearn transformers! This will make running pipelines on dataframes much easier and provide better ways to track feature names. Previously, mapping a transformed output back into columns would be cumbersome as it might not be a one-to-one mapping in cases of complex preprocessing (e.g., polynomial features).</p>
<p>The pandas dataframe output feature for transformers solves this by tracking features generated from pipelines automatically. The transformer output format can be configured explicitly for either <strong>numpy</strong> or <strong>pandas</strong> output formats as shown in <a href="https://scikit-learn.org/dev/modules/generated/sklearn.set_config.html#sklearn.set_config">sklearn.set_config</a> and the sample code below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">set_config</span>
<span class="n">set_config</span><span class="p">(</span><span class="n">transform_output</span><span class="o">=</span><span class="s">"pandas"</span><span class="p">)</span>
</code></pre></div></div>
<p>See the sample notebook, <a href="https://github.com/scikit-learn/blog/blob/main/assets/notebooks/sklearn-pandas-df-output.ipynb">pandas-dataframe-output-for-sklearn-transformer.ipynb</a> and documentation for a more detailed example and usage.</p>
<h2 id="links-to-documentation-and-example-notebook">Links to documentation and example notebook</h2>
<ul>
<li><a href="https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_set_output.html#sphx-glr-auto-examples-miscellaneous-plot-set-output-py">Pandas output for transformers documentation</a></li>
<li><a href="https://github.com/scikit-learn/blog/blob/main/assets/notebooks/sklearn-pandas-df-output.ipynb">pandas-dataframe-output-for-sklearn-transformer.ipynb</a></li>
</ul>
<h2 id="reporting-bugs">Reporting bugs</h2>
<p>We’d love your feedback on this. If you have suggestions or find bugs, please report them at
<a href="https://github.com/scikit-learn/scikit-learn/issues">scikit-learn issues</a>.</p>
<p>Thanks 🙏🏾 to maintainers: <a href="https://github.com/thomasjpfan"><strong>Thomas J. Fan</strong></a>, <a href="https://github.com/glemaitre"><strong>Guillaume Lemaitre</strong></a>, <a href="https://github.com/lorentzenchr"><strong>Christian Lorentzen</strong></a>!!</p>
Author: Sangam SwadiK
scikit-learn and Hugging Face join forces
2022-10-13T00:00:00+00:00
https://blog.scikit-learn.org/updates/community/joining-forces-hugging-face
<div>
<img src="/assets/images/posts_images/HFxsklearn.png" alt="" />
Author:
<a itemprop="sameAs" content="https://github.com/LysandreJik" href="https://github.com/LysandreJik" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/lysandre_debut.jpg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Lysandre Debut</a>
<a href="mailto:lysandre@huggingface.co" title="lysandre@huggingface.co"><span><i class="elastic-fai fas fa-envelope"></i></span></a>
,
<a itemprop="sameAs" content="https://github.com/francoisgoupil" href="https://github.com/francoisgoupil" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/francois_goupil.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />François Goupil</a>
<a href="mailto:francois.goupil@inria.fr" title="francois.goupil@inria.fr"><span><i class="elastic-fai fas fa-envelope"></i></span></a>
<br /><br />
</div>
<p><a href="https://hf.co">Hugging Face</a> is happy to announce that we’re partnering with <a href="https://scikit-learn.org/stable/index.html">scikit-learn</a> to further our support of the machine learning tools and ecosystem.</p>
<p>At Hugging Face, we’ve been putting a lot of effort into supporting deep learning, but we believe that machine learning as a whole can benefit from the tools we release. With statistical machine learning being essential in this field and scikit-learn dominating statistical ML, we’re excited to partner and move forward together.</p>
<p>As of September 2022, the Hugging Face Hub already hosts nearly 4,000 tabular classification and tabular regression model checkpoints, and we strive for this trend to continue.</p>
<div>
<video preload="auto" autoplay="" loop="" muted="muted" volume="0">
<source src="/assets/videos/HFxsklearn.mp4" type="video/mp4" />
</video>
</div>
<h2 id="support-to-the-scikit-learn-consortium">Support to the scikit-learn consortium</h2>
<p>Since June 2022, Hugging Face has been an official sponsor of the scikit-learn consortium. Through this support, Hugging Face actively promotes the development and sustainability of sklearn. As a sponsor of the scikit-learn consortium hosted at the Inria foundation, we’ll now participate in the scikit-learn consortium technical committee.</p>
<h2 id="development-support">Development support</h2>
<p>To help sustain the development of the library, we’re happy to welcome Adrin Jalali and Benjamin Bossan to the Hugging Face team. Adrin is a core developer of scikit-learn as well as <a href="https://fairlearn.org">fairlearn</a>, while Benjamin is the author of the <a href="https://github.com/skorch-dev/skorch">skorch</a> library and is now a contributor to scikit-learn.</p>
<p>Hugging Face is happy to support the development of scikit-learn through code contributions, issues, pull requests, reviews, and discussions.</p>
<h2 id="integration-to-and-from-the-hugging-face-hub">Integration to and from the Hugging Face Hub</h2>
<p><a href="https://github.com/skops-dev/skops">“Skops”</a> is the name of the framework being actively developed as the link between the scikit-learn and the Hugging Face ecosystems. With Skops, we hope to facilitate essential workflows:</p>
<ul>
<li>The ability to push scikit-learn models on the Hugging Face Hub</li>
<li>The possibility to try out models directly in the browser</li>
<li>The automatic creation of model cards, to improve model documentation and understanding</li>
<li>The ability to collaborate with others on machine learning projects</li>
</ul>
<h3 id="snapshot-of-your-work">Snapshot of your work</h3>
<p>Working at the intersection of scikit-learn and the Hub offers challenges linked to the two platforms. One of these challenges is secure persistence: the ability to serialize models in a secure, safe manner.</p>
<p>scikit-learn models (estimators, predictors, …) are usually saved using pickle, which is notorious for not being a secure format. Sharing scikit-learn models in this format exposes receivers to potentially malicious data which could execute arbitrary code when run.</p>
<p>That’s where secure persistence comes in: as the Hugging Face Hub aims to provide a platform for models, the ability to share safe, secure objects is essential. We’ve been working on adding secure persistence for scikit-learn models in <a href="https://github.com/skops-dev/skops/pull/128">skops#128</a> and <a href="https://github.com/skops-dev/skops/pull/145">skops#145</a> (<a href="https://skops--145.org.readthedocs.build/en/145/persistence.html">doc preview</a>). Instead of serializing using pickle, the object’s contents are put into a zip file with an accompanying schema JSON file.</p>
<p>Read about the Skops library in the following blog post: <a href="https://huggingface.co/blog/skops">Introducing Skops</a>.</p>
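<p>To make the pickle risk and the zip-plus-JSON idea described above more concrete, here is a minimal sketch using only the Python standard library. It is not the Skops implementation or its file format: the <code class="language-plaintext highlighter-rouge">Payload</code> class, the <code class="language-plaintext highlighter-rouge">save_params</code> and <code class="language-plaintext highlighter-rouge">load_params</code> helpers, and the file names inside the archive are hypothetical and chosen only for illustration. The expression evaluated on load is deliberately harmless.</p>

```python
import io
import json
import pickle
import zipfile


class Payload:
    """Stand-in for a malicious object; the expression used here is harmless."""

    def __reduce__(self):
        # pickle records "call eval with this string" instead of object state.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
# Loading runs the callable chosen by whoever produced the bytes.
restored = pickle.loads(blob)
print(type(restored).__name__, restored)


def save_params(params):
    """Hypothetical helper: store plain parameters and a schema in a zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("schema.json", json.dumps({"type": "toy-linear-model"}))
        archive.writestr("params.json", json.dumps(params))
    return buffer.getvalue()


def load_params(data):
    """Hypothetical helper: json.loads only builds data, it never calls code."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        schema = json.loads(archive.read("schema.json"))
        if schema.get("type") != "toy-linear-model":
            raise ValueError("unexpected schema")
        return json.loads(archive.read("params.json"))


archive_bytes = save_params({"coef": [0.5, -1.25], "intercept": 2.0})
print(load_params(archive_bytes))
```

<p>The first half shows that unpickling returns whatever the embedded callable produces rather than the original object. The second half shows the general shape of a data-only format, where the loader checks a schema and reconstructs values without executing anything stored in the file.</p>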
<h2 id="improving-interoperability">Improving interoperability</h2>
<p>Skops is an example of an integration of scikit-learn within our tools, but it is not the only example! We will strive to integrate with the rest of our ecosystem so that Hugging Face users may benefit from using scikit-learn tools and vice-versa.</p>
<p>An example is the <code class="language-plaintext highlighter-rouge">evaluate</code> library, dedicated to efficiently evaluating machine learning models and datasets. We aim for this tool to natively support <a href="https://github.com/huggingface/evaluate/issues/297">scikit-learn metrics</a> in its API.</p>
<hr />
<p>Through these efforts, we hope to kickstart a lasting relationship between the two ecosystems and provide simple, efficient bridges to lower the barrier to entry. We believe that educating and sharing models is the best way to foster inclusive machine learning from which all can benefit. We’re excited to partner with scikit-learn for this endeavor.</p>
Author: Lysandre Debut, François Goupil
scikit-learn Sprint in Salta, Argentina
2022-09-29T00:00:00+00:00
https://blog.scikit-learn.org/events/salta-sprint
<div>
Author:
<a itemprop="sameAs" content="https://jmloyola.github.io/" href="https://jmloyola.github.io/" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/juan-martin-loyola.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Juan Martín Loyola</a>
<br /><br />
</div>
<p>In September of 2022, the <a href="https://pythoncientifico.ar/">SciPy Latin America</a> conference took place in Salta, Argentina.
As part of the event, we organized a <a href="https://pythoncientifico.ar/events/sprints/">scikit-learn sprint</a>.
The main idea was to introduce the participants to the open source world and help them make their first contribution.
The sprint was held in person.</p>
<p><img src="https://pythoncientifico.ar/static/assets/images/scipy-la-2022_logo.png" alt="SciPy logo" width="50%" height="50%" /></p>
<h2 id="schedule">Schedule</h2>
<ul>
<li>September 27, 2022 - <strong>Pre-sprint</strong> - 10:00 to 12:00 hs (UTC -3)</li>
<li>September 28, 2022 - <strong>Sprint</strong> - 10:00 to 17:00 hs (UTC -3)</li>
</ul>
<h2 id="repository">Repository</h2>
<p>For more information in Spanish, <a href="https://github.com/jmloyola/sklearn-sprint-argentina-2022">check this repository</a>.
You will find details about the event, instructions to set up the development environment, links with further information and tutorials, and an example git workflow to make a pull request for the project.</p>
<h2 id="photos">Photos</h2>
<figure>
<img src="/assets/images/posts_images/sprint-salta-2022-1.jpg" alt="11 people standing behind some computers and 2 people projected in the screen" max-width="20%" max-height="20%" />
<figcaption>
Group photo of the SciPy Latin America sprint, Salta, Argentina, 2022. Sandra Meneses and Juan Martín Loyola are projected on the screen from a Zoom call. Photo credit: Lucía Torres.
</figcaption>
</figure>
<figure>
<img src="/assets/images/posts_images/sprint-salta-2022-2.jpeg" alt="11 people coding in their computers" max-width="20%" max-height="20%" />
<figcaption>
Participants of the SciPy Latin America sprint working on their computers. Photo credit: Ariel Silvio Norberto Ramos.
</figcaption>
</figure>
<h2 id="acknowledgment">Acknowledgment</h2>
<p>These people made this sprint possible:</p>
<ul>
<li>Ariel Silvio Norberto Ramos, one of the organizers of the SciPy Latin America,</li>
<li><a href="https://www.dataumbrella.org/">Data Umbrella</a>, <a href="https://twitter.com/ScipyLA/status/1573710649963724802">one of the community partners of the event</a>, especially Sandra Meneses and Reshama Shaikh,</li>
<li>The mentors that helped run the sprint.</li>
</ul>
Author: Juan Martín Loyola
The Value of Open Source Sprints, the scikit-learn Experience
2022-07-13T00:00:00+00:00
https://blog.scikit-learn.org/events/sprints-value
<div>
<img src="/assets/images/posts_images/sprints-value2.png" alt="" />
Author:
<a itemprop="sameAs" content="https://reshamas.github.io" href="https://reshamas.github.io" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/reshama_shaikh.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Reshama Shaikh</a>
<br /><br />
</div>
<p>With contributions from: Gaël Varoquaux, Andreas Mueller, Olivier Grisel, Julien Jerphanion, Guillaume LeMaitre</p>
<h2 id="top-line-summary">Top Line Summary</h2>
<p>Sprints are <strong>working sessions to contribute to an open source library</strong>. The goals and achievements differ between Developer and Community sprints. The long-term impact of open source sprints, particularly community events, is not easily quantifiable or measurable. Positive outcomes of sprints have been emerging slowly, and for that reason, realizing the value of open source sprints requires playing the “long game”.</p>
<h2 id="introduction">Introduction</h2>
<p>The <a href="https://scikit-learn.org/dev/index.html">scikit-learn</a> project has a long and extraordinary legacy of open source sprints. Since 2010, when its <a href="https://en.wikipedia.org/wiki/Scikit-learn">first public version</a> was released, there have been at least <a href="https://blog.scikit-learn.org/sprints/">45 sprints organized</a>. The number 45 is a lower bound, since there are likely more sprints that have not been listed.</p>
<p>To date, more than 2400 people have contributed to <a href="https://github.com/scikit-learn/scikit-learn">scikit-learn</a>. The number of contributors to scikit-learn exceeds that of other related libraries such as numpy, scipy and matplotlib, with the exception of <a href="https://github.com/pandas-dev/pandas">pandas</a>, which has a greater number of contributors (see Appendix A).</p>
<p>The public discourse on open source has expanded to explore topics of sustainability, funding models, and diversity and inclusion, to name a few. A <em>reasonable</em>, yet <em>difficult to answer</em> question that has been posed is:</p>
<blockquote>
<p><em><span style="background-color: #CAE9F5;">
What is the effectiveness of sprint models and what is the long-term engagement as a result of these sprints?
</span></em></p>
</blockquote>
<p>Due to technological limitations of GitHub and privacy concerns, we do not hold precise data on how many scikit-learn contributors connected to the project via a sprint. We have no formal data collection process which records statistics on how many sprint participants are recurring or information on their contributions to other open source projects or other long term positive ripple effects. A scientific look at the correlation between the number of sprints and contributors is beyond the scope of this article. What we <em>will examine</em> in this article are the <strong>objectives, results and aspirations</strong> of running the scikit-learn sprints.</p>
<p><span style="background-color: #CAE9F5;">The queries from other open-source projects requesting guidance on sprints and diversity and inclusion have been increasing.</span> We share these experiences and lessons learned with the community, potential funders and open source project maintainers, particularly those projects which are nascent in their quest to build community, sustainability and diversity and inclusion.</p>
<h2 id="outline">Outline</h2>
<p>In this article we examine the following:</p>
<ul>
<li>What is a “sprint”?</li>
<li>What are the differences between “Developer” and “Community” sprints?</li>
<li>What are the goals of the open source sprints?</li>
<li>What value do open source sprints bring to the project and community?</li>
<li>What are the aspirations of the scikit-learn project, in terms of connecting with the community?</li>
</ul>
<h2 id="definition-of-a-scikit-learn-sprint">Definition of a scikit-learn Sprint</h2>
<p>A scikit-learn sprint has traditionally been an event where contributors come together to work on issues in the scikit-learn repository. A sprint can be as short as a few hours, or last over several days, even a week or longer. They may be in-person, online, hybrid or partially asynchronous. Sprints may be organized by the developers of the library, community groups (such as Meetups), scheduled alongside scientific or Python conferences, or even at home with a few friends. They can more simply and less dauntingly be described as
<span style="background-color: #CAE9F5;">
working sessions to contribute to the open source library. <br />
</span></p>
<h2 id="developer-vs-community-sprint">Developer vs Community Sprint</h2>
<p>We distinguish between a Developer (Dev) and Community sprint because the goals and results differ significantly between the two.</p>
<p><strong>Developer (Dev) Sprint</strong></p>
<p>A Developer, or “Dev”, sprint is one that is typically organized by the maintainers of the library. A Dev sprint is one where the developers or maintainers of the library gather to work on issues and to discuss the resolution of ongoing complex issues. This also provides the team an opportunity to focus on tasks related to the long-term roadmap of the project.</p>
<p>The earliest Dev sprints were organized at Inria. The first <a href="https://github.com/scikit-learn/scikit-learn/wiki/Past-sprints#granada-19th-21th-dec-2011">major Dev sprint</a> was held in Granada after the NIPS 2011 conference (now renamed NeurIPS). It was the first time that most of the team had met in real life after months or years of online collaboration, and over a dozen developers participated. Later, Dev sprints were often hosted in the offices of partnering tech companies, typically lasting 3 to 7 days, once a year, in pre-COVID times.</p>
<p><strong>Community Sprint</strong></p>
<p>A Community sprint can be organized by individuals, by affinity communities such as Meetup groups (Data Umbrella, PyLadies, etc.), or by conferences (SciPy, PyCon, PyData Global, JupyterCon, etc.). It is open to the general public, and participants may be beginners, experts, or a combination of both.</p>
<p>For scikit-learn, the early Community sprints were alongside the <a href="https://conference.scipy.org">SciPy conferences</a> and the practice has continued for over a decade.</p>
<p>At a Developer sprint, a contributor may work on a PR that has been ongoing for three months. Conversely, Community sprints require curated issues which newcomers can complete in a shorter period of time (such as 1 day, or 1 day with 1-2 months follow-up).</p>
<p>The landscape of Dev and Community sprints in other <a href="https://scientific-python.org/calendars/">scientific Python</a> libraries is unknown.</p>
<h2 id="goals-of-the-sprints">Goals of the Sprints</h2>
<h3 id="goals-of-dev-sprints">Goals of Dev Sprints</h3>
<ul>
<li>To get maintainers in one room to efficiently discuss open issues and pull requests</li>
<li>To move along contributions in a synchronous fashion</li>
<li>To foster existing collaborations with external developers synchronously</li>
<li>To build rapport: Maintainers reside in various continents and the in-person sprints build rapport within the team. Social interactions are critical in having a productive team.</li>
<li>To foster collaborations with the project’s corporate sponsors (members of the <a href="https://scikit-learn.org/stable/about.html#funding">scikit-learn Consortium</a>)</li>
</ul>
<h3 id="goals-of-community--beginner-sprints">Goals of Community & Beginner Sprints</h3>
<ul>
<li>To broaden the project’s contributor base</li>
<li>To build community and connect the project maintainers with its users</li>
<li>To obtain interactive feedback from new scikit-learn users and contributors</li>
<li>To onboard new contributors to scikit-learn and PyData generally</li>
<li>To onboard new contributors who would become recurring contributors</li>
<li>To collaborate with community groups to increase diversity of contributor base with intentional outreach</li>
<li>To strengthen and support existing contributors in order to maintain recurring community contributors</li>
</ul>
<h2 id="scikit-learn-team-members-who-connected-to-the-project-via-a-sprint">scikit-learn Team Members Who Connected to the Project Via a Sprint</h2>
<p>It is notable that a number of the current maintainers of the library found their way to the project via a sprint. Additionally, some members of the Contributor Experience Team connected to the scikit-learn project via the sprints.</p>
<h3 id="olivier-grisel">Olivier Grisel</h3>
<p><a href="https://github.com/ogrisel">Olivier Grisel</a> has been a contributor and maintainer for more than 12 years. Olivier met <a href="https://github.com/GaelVaroquaux">Gaël Varoquaux</a> at a local conference organized in Paris by the French-speaking Python users group <a href="https://www.afpy.org">AFPy.org</a>. After chatting for 5 minutes about toy ML experiments in Python, Gaël invited Olivier to join the <a href="https://web.archive.org/web/20101118052247/http://fseoane.net/blog/2010/scikitslearn-coding-spring-in-paris/">first sprint organized at Inria</a> in March 2010.</p>
<p>Olivier shares:</p>
<blockquote>
<p>At the time, scikit-learn coding sprints gathered only 6 people sitting around a table with some wifi and a coffee machine :)</p>
</blockquote>
<figure>
<img src="/assets/images/posts_images/2010sprint.jpg" alt="5 men sitting around a table coding on their computers" max-width="20%" max-height="20%" />
<figcaption>
First scikit-learn sprint, Paris, March 2010; Photo credit: <a href="https://fa.bianp.net/pages/about.html">Fabian Pedregosa</a>; (from left to right): Fabian Pedregosa, Gael Varoquaux, Olivier Grisel, Alexandre Gramfort
</figcaption>
</figure>
<h3 id="andreas-mueller">Andreas Mueller</h3>
<p><a href="https://github.com/amueller">Andreas Mueller</a> has been a maintainer of the project since 2011. He joined a sprint at a conference because he was a user and wanted to contribute. He <a href="https://mlconf.com/blog/interview-andreas-muller-lecturer-columbia-university-core-contributor-scikit-learn-reshama-shaikh/">shares in a 2017 interview</a>:</p>
<blockquote>
<p>While working on my Ph.D. in computer vision and learning, the scikit-learn library became an essential part of my toolkit. My initial participation in open source began in 2011 at the NeurIPS conference in Granada, Spain, where I had attended a <a href="https://github.com/scikit-learn/scikit-learn/wiki/Past-sprints#granada-19th-21th-dec-2011">scikit-learn sprint</a>. The scikit-learn release manager at the time had to leave, and the project leads asked me to become release manager; that’s how it all got started.</p>
</blockquote>
<figure>
<img src="/assets/images/posts_images/sprint-neurips-2011.jpeg" alt="men at a happy hour" max-width="20%" max-height="20%" />
<figcaption>
NeurIPS Granada, Spain scikit-learn sprint, December 2011; Photo credit: <a href="http://gael-varoquaux.info">Gael Varoquaux</a>; (from left to right): Vlad Niculae, Mathieu Blondel, Bertrand Thirion, James Bergstra, Jake VanderPlas, Andreas Mueller, Alexandre Gramfort
</figcaption>
</figure>
<h3 id="julien-jerphanion">Julien Jerphanion</h3>
<p><a href="https://github.com/jjerphan">Julien Jerphanion</a> participated in a <a href="https://scikit-learn.fondation-inria.fr/scikit-learn-sprint-in-paris/">sprint in February 2019 at AXA</a> as a first time contributor while interning at Dataiku. The sprint provided Julien an opportunity to experience scikit-learn and meet the maintainers. Prior to the sprint, he had only used the library in a few projects. He has contributed code, reviews, and documentation since March 2021, joined Inria in April 2021, and became a core developer in October 2021.</p>
<h3 id="vlad-niculae">Vlad Niculae</h3>
<p>Vlad Niculae’s path to becoming a maintainer started with a scikit-learn mailing list post, continued with a GSoC (Google Summer of Code) internship, and then led to the <a href="https://fa.bianp.net/blog/2011/scikit-learn-euroscipy-2011-coding-sprint-day-one/">EuroSciPy 2011</a> sprint. He shares:</p>
<blockquote>
<p>The team encouraged me to attend and helped me arrange it. The sprint was instrumental in broadening my focus from the small module I was contributing to the entire library. Co-locating the sprint and the conference this way was great – it gave me a chance to meet not just the scikit-learn team (small at the time, and incredibly welcoming to me!) but also the broader SciPy ecosystem, including NumPy, IPython [Jupyter Notebook] devs. It helped me understand the scientific Python world a lot better.</p>
</blockquote>
<h3 id="other-maintainers">Other Maintainers</h3>
<p>There are <a href="https://scikit-learn.org/dev/about.html#people">other maintainers</a> and emeritus contributors who participated in a Developer or Community sprint along their journey with the scikit-learn team, such as emeriti Gilles Louppe and Thouis (Ray) Jones.</p>
<h3 id="reshama-shaikh">Reshama Shaikh</h3>
<p><a href="https://github.com/reshamas">Reshama Shaikh</a> has organized nine scikit-learn <a href="https://www.dataumbrella.org/sprints">community sprints</a> from 2017 to 2021. She first contributed code and documentation fixes to scikit-learn in September 2018. In September 2020, she was invited to join the scikit-learn team.</p>
<p>In her PyConDE PyData Berlin keynote from April 2022, <a href="https://blog.dataumbrella.org/pyconde-keynote-reshama">5 Years, 10 Sprints, a scikit-learn Open Source Journey</a>, she shares a history and progression of the Community sprints.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ZUqJaCWPvmk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h3 id="juan-martín-loyola">Juan Martín Loyola</h3>
<p><a href="https://github.com/jmloyola">Juan Martín Loyola</a> started <a href="https://blog.scikit-learn.org/team/jml-interview/">contributing to scikit-learn</a> as preparation for the <a href="https://blog.dataumbrella.org/data-umbrella-afme2-2021-scikit-learn-sprint-report">Data Umbrella Latin America, June 2021</a> sprint. He continued to contribute prolifically after the sprint, and he was invited to join the team in December 2021. Given his location in Argentina, he will be providing support at the <a href="https://www.scipy.lat/es/scipycon.html">2022 SciPy Latin America</a> sprint.</p>
<h3 id="second-degree-impact">Second Degree Impact</h3>
<p><a href="https://github.com/laurburke">Lauren Burke</a> joined the scikit-learn Communications Team in November 2021 at the recommendation of Reshama Shaikh, and this can be considered a network effect. This demonstrates that sprints can result in valuable contributions other than code.</p>
<h2 id="sprints-observed-impact-and-lessons-learned">Sprints: Observed Impact and Lessons Learned</h2>
<p>There are a number of observed favorable outcomes from the sprints for both the project and contributors.</p>
<p><strong>Onboarding</strong></p>
<p>The sprints help the community discover the open source process and get started with contributing.</p>
<p><strong>Building community</strong></p>
<p>Sprint participants, whether one-time or recurring, become ambassadors for the project.</p>
<p><strong>Open source workflow knowledge</strong></p>
<p>Users learn a range of tools and practices, such as virtual environment setup, version control (Git), linting (flake8) and testing (pytest, unit tests, continuous integration). They also learn software development best practices. For many users of scikit-learn, the sprint is the first time they navigate through the codebase and structure of scikit-learn, dig into functions and learn about errors. They develop experience in the collaborative open source workflow. For employers, letting their team contribute to open source can be a plus, as team members learn how to collaborate effectively and learn about the internals of the library. The sprint experience helps contributors develop a <a href="https://academiccommons.columbia.edu/doi/10.7916/D89G70BS">wider set of technical skills</a> that carry over to other projects, networking, jobs and more.</p>
<p><strong>Overcoming barriers to entry</strong></p>
<p>The sprints, as a “hands-on working session”, provide an avenue for potential contributors to overcome common barriers to entry, particularly “getting started”, and moving from the <em>possibility</em> to an <em>actuality</em> stage.</p>
<p><strong>Providing an avenue for advanced contributions</strong></p>
<p>Just as sprints provide an on-ramp for new contributors, they also provide an opportunity for returning contributors to advance their contributing skills to the next level in a structured environment and with mentorship.</p>
<p><strong>Building confidence</strong></p>
<p>The sprints help to build <span style="background-color: #CAE9F5;">confidence</span> for both new and returning contributors.</p>
<p>Gaël shares:</p>
<blockquote>
<p>I believe those sprints helped resourceful people (like Juan Martín) to gain confidence and provide valuable contributions (especially reviews).</p>
</blockquote>
<p><strong>Increase open-source literacy</strong></p>
<p>The sprints are a forum for users to gain a greater understanding of how an open source project functions and to follow an actual contribution from start to finish.</p>
<p><strong>Value of synchronous interaction</strong></p>
<p>Typically, open source contributions to scikit-learn occur on the GitHub repository in asynchronous fashion, over several weeks or months. The sprints provide real-time synchronous interaction. This experience gives the contributor more direct access to technical assistance and feedback, which is more efficient and engaging.</p>
<p>Julien shares:</p>
<blockquote>
<p>I think having a setup like this [beginner/community sprint] is valuable for first time contributors because they can synchronously get specific information they would hardly have got otherwise. To me, this allow giving feedback which is immediate, specific and exact, making contributing to open-source enjoyable and preventing frustration: giving such feedback is what we should aim for and in this regard this setup is convenient.</p>
</blockquote>
<h3 id="online-sprints">Online Sprints</h3>
<p>Since the start of the pandemic, Data Umbrella has organized <a href="https://blog.dataumbrella.org/tags/#sprint-report">4 online sprints</a>. Additionally, there were 2 online sprints with <a href="https://www.scipy2020.scipy.org/sprints-schedule">SciPy</a> and <a href="https://wiki.python.org/moin/EuroPython2020/Sprints">EuroPython</a>.</p>
<p>These have been the observed benefits of the online sprints, which began in 2020 due to the global pandemic:</p>
<p><strong>Networking</strong></p>
<p>Online sprints make it easier to meet new people with different backgrounds.</p>
<p><strong>International collaboration</strong></p>
<p>Collaborating with affinity communities can attract more candidates from various backgrounds. In particular, online sprints help break geographical barriers.</p>
<p><strong>Pair programming</strong></p>
<p>The pairing of contributors seems to work well. Pair programming was consistently ranked as a positive experience by online sprint participants.</p>
<p><strong>Increases accessibility</strong></p>
<p>The use of online tools makes it possible to interact with people
who would not have joined community events traditionally organized in
North America or western Europe e.g. because of the travel costs and
complexity of obtaining a visa in time. Attending the online events is probably also less disruptive for people with young children.</p>
<p>For the scikit-learn project itself, the online sprints made it possible to “recruit” a couple of new recurring contributors who attend regular office hours after the original sprints.</p>
<p><strong>Office Hours</strong></p>
<p>The scikit-learn project has regular office hours which are hosted on Discord.</p>
<p>Olivier shares:</p>
<blockquote>
<p>Actually the fact that we now have community office hours on Discord is probably a consequence of us attending the Data Umbrella online sprints.</p>
</blockquote>
<blockquote>
<p>I think they [the sprints] were the most interesting online events I attended during
the COVID-19 crisis when all traditional on-site tech events were canceled. In particular the active planning by the Data Umbrella team for participants to work in pairs with audio rooms on Discord + a central help desk audio room worked really well.</p>
</blockquote>
<blockquote>
<p>The pre-sprint and post-sprint office hours also made it possible to limit the time spent on helping fix setup issues compared to what we experience in traditional sprints. They also forced us as maintainers to review and fix our documentation before the event.</p>
</blockquote>
<p><strong>Creation of supplementary resources in different media types</strong></p>
<p>Data Umbrella coordinated the creation of a series of videos and transcripts that provided learning materials for the community to prepare for the sprint. These resources are available to the public and have a wide reach.</p>
<p>This is the <a href="https://www.youtube.com/playlist?list=PLBKcU7Ik-ir-b1fwjNabO3b8ebs9ez5ga">Contributing to scikit-learn</a> list of videos that were created for the sprints:</p>
<ul>
<li>Andreas Mueller: <a href="https://youtu.be/5OL8XoMMOfA">Crash Course in Contributing to scikit-learn</a></li>
<li>Reshama Shaikh: <a href="https://youtu.be/PU1WyDPGePI">Example of scikit-learn Pull Request</a></li>
<li>Andreas Mueller: <a href="https://youtu.be/p_2Uw2BxdhA">Sprint FAQs</a></li>
<li>Thomas Fan: <a href="https://youtu.be/dyxS9KKCNzA">3 Components for Reviewing a Pull Request</a></li>
<li>Melissa Weber Mendonca: <a href="https://youtu.be/tXWscUSYdBs">Sphinx for Python Documentation</a></li>
</ul>
<figure>
<a href="https://www.youtube.com/playlist?list=PLBKcU7Ik-ir-b1fwjNabO3b8ebs9ez5ga"></a>
<img src="/assets/images/posts_images/sprint-videos.png" alt="list of videos" max-width="30%" max-height="30%" />
<figcaption>
Photo credit: <a href="https://reshamas.github.io">Reshama Shaikh</a>
</figcaption>
</figure>
<h2 id="aspirations-for-future-scikit-learn-sprints">Aspirations for Future scikit-learn Sprints</h2>
<p><span style="background-color: #CAE9F5;">
One of the primary goals of the Community sprints was to onboard new contributors who would become recurring contributors. This goal has generally not been realized. scikit-learn is a complex and advanced project, and a one-time sprint does not provide sufficient opportunity and support to sprint participants to become recurring contributors.</span> A few sprint participants have progressed to become returning contributors, but this is a very small number relative to the number of sprint participants.</p>
<p>Onboarding a first-time contributor takes time. People who are contributing for the first time need to go through a lot of information simultaneously regarding both technical and organizational aspects of contributions. People may run into unexpected issues at the start depending on their
setup and experience, might get frustrated or discouraged, and might not
report the problem they are having (thinking it is their fault). Pre-event office hours have been successful at alleviating some of these roadblocks for those sprint participants who have completed their pre-work.</p>
<p>Here are some adjustments that can be made in the future to reach the goal of recruiting recurring contributors:</p>
<ul>
<li>Provide mentoring</li>
<li>Improve onboarding process</li>
<li>Improve issues definitions</li>
<li>Have sprints alongside tutorials</li>
<li>Expand types of contributions that new contributors can make</li>
<li>Have smaller sprint events</li>
</ul>
<p><strong>Mentoring</strong><br />
Sprints may not be sufficient for onboarding people. Mentoring is needed to take contributors to the next level, and mentoring relationships can be established during sprint events.</p>
<p><strong>Improve the onboarding process</strong></p>
<p>While the scikit-learn project has improved significantly in the past few years as a result of feedback and learnings from the sprints, there is still room for improvement.</p>
<p>The scikit-learn project is complex, the contributor learning curve is steep, and it has been getting more difficult to contribute to scikit-learn.</p>
<p><strong>Improve issues definitions</strong></p>
<p>There are 1600+ <a href="https://github.com/scikit-learn/scikit-learn/issues">issues</a> in the GitHub repository. Issues can be better defined and it would be valuable to break the issues into smaller steps which would be more approachable.</p>
<p><strong>Sprints alongside tutorials</strong></p>
<p>Scheduling sprints alongside tutorial sessions would help users connect the use cases of the open source tool with the motivation and product vision of scikit-learn.</p>
<p><strong>Expand types of contributions</strong></p>
<p>While the sprints have typically focused on documentation and code contributions, the project needs support in other areas. There is a backlog of <a href="https://github.com/scikit-learn/scikit-learn/issues">open issues</a> (1600+ !) and <a href="https://github.com/scikit-learn/scikit-learn/pulls">open pull requests</a> (650+). The project needs support in triaging issues and reviewing pull requests. It would be beneficial to have sprint contributors work on increasingly complex issues.</p>
<p>Julien shares from personal experience:</p>
<blockquote>
<p>In particular and in my opinion, reviewing pull requests is as valuable as authoring them. I also find it a preferable way to learn about scikit-learn internals rather than solving issues.</p>
</blockquote>
<p><strong>Have smaller sprints</strong></p>
<p>Julien suggests:</p>
<blockquote>
<p>Would sprints with a really small number of people (e.g. 2 mentees per mentor) be
more valuable in the long term? Personally, I would prefer mentoring one or two
people closely instead (ideally in-person) as I think it is more achievable, enjoyable
and fruitful experience (this is something I am trying to do at the moment when I can
get some time but I currently have limited of it).</p>
</blockquote>
<blockquote>
<p>Finally, I would also really treasure having in-person sprints [in Paris] with external (recurring)
contributors (with a specific expertise) on advanced subjects when it is possible in the future.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<h3 id="connecting-and-supporting-scikit-learn">Connecting and Supporting scikit-learn</h3>
<p>To connect with the scikit-learn project, these are the most active social media platforms:</p>
<ul>
<li>Twitter: <a href="https://twitter.com/scikit_learn">@scikit_learn</a></li>
<li>LinkedIn: <a href="https://www.linkedin.com/company/scikit-learn/">@scikit-learn</a></li>
</ul>
<p>Users are welcome to “star” the code repository on GitHub: <a href="https://github.com/scikit-learn/scikit-learn">scikit-learn/scikit-learn</a></p>
<p>Our office hours, as well as public developer and triage meetings, are all posted on our <a href="https://blog.scikit-learn.org/calendar/">Community Calendar</a>.</p>
<p>The next Community sprint may be held at <a href="https://www.euroscipy.org/2022/index.html">EuroSciPy 2022</a> in Basel, Switzerland in early September. Information on past and <a href="https://blog.scikit-learn.org/sprints/">upcoming sprints</a> is shared on our community site.</p>
<h3 id="contributing-to-scikit-learn">Contributing to scikit-learn</h3>
<p>To contribute to scikit-learn, we have resources available here:</p>
<ul>
<li><a href="https://scikit-learn.org/dev/developers/contributing.html">English</a></li>
<li><a href="https://qu4nt.github.io/sklearn-doc-es/">Spanish</a></li>
</ul>
<p>There are additional resources for contributing:</p>
<ul>
<li><a href="https://www.youtube.com/playlist?list=PLM-1QqX7UksT6tREbR-n9Mhup0OoRBU34">Contributing Videos</a></li>
<li><a href="https://github.com/data-umbrella/data-umbrella-scikit-learn-sprint">English, Spanish and some Portuguese language transcripts</a></li>
</ul>
<h2 id="appendix-a-github-contributors-comparison-of-libraries">Appendix A: GitHub Contributors Comparison of Libraries</h2>
<p>A comparison of the scikit-learn contributor base with that of related libraries in the same space (updated July 2022):</p>
<ul>
<li><a href="https://github.com/pandas-dev/pandas">pandas</a>: ~2600 contributors</li>
<li><a href="https://github.com/scikit-learn/scikit-learn">scikit-learn</a>: ~2400 contributors</li>
<li><a href="https://github.com/numpy/numpy">numpy</a>: ~1300 contributors</li>
<li><a href="https://github.com/matplotlib/matplotlib">matplotlib</a>: ~1150 contributors</li>
<li><a href="https://github.com/scipy/scipy">scipy</a>: ~1170 contributors</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li><a href="https://eventfund.codeforscience.org/behind-the-scenes-what-it-takes-to-run-data-umbrellas-scikit-learn-open-source-sprints/">Behind the Scenes: What It Takes to Run Data Umbrella’s scikit-learn Open Source Sprints</a></li>
<li>Data Umbrella <a href="https://blog.dataumbrella.org/tags/#sprint-report">sprint reports</a></li>
<li>Data Umbrella community <a href="https://blog.dataumbrella.org/tags/#sprint-blog">sprint blogs</a></li>
<li><a href="https://blog.dataumbrella.org/mwestermann-sprints-experience">Interview with Maren Westermann: Extending the Impact of the scikit-learn Sprints to the Community</a></li>
<li><a href="https://blog.dataumbrella.org/jmloyola-opensource-experience">Interview with scikit-learn Triage Team Member: Juan Martín Loyola</a></li>
<li>Emily Thompson: <a href="https://medium.com/@ethompso28/planning-a-beginner-open-source-sprint-day-for-data-scientists-163b6aa7087f">Planning a beginner open source sprint day for data scientists</a></li>
<li>Adrin Jalali: <a href="https://adrin.info/scikit-learn-sprint-at-nairobi-kenya.html">scikit-learn Sprint at Nairobi, Kenya (2019)</a></li>
</ul>
Author: Reshama Shaikh
Interview with Norbert Preining, scikit-learn Team Member
2022-05-22T00:00:00+00:00 2022-05-22T00:00:00+00:00
https://blog.scikit-learn.org/team/norbert-interview
<div>
<img src="/assets/images/posts_images/norbert-interview.png" alt="" />
Author:
<a itemprop="sameAs" content="https://reshamas.github.io" href="https://reshamas.github.io" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/reshama_shaikh.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Reshama Shaikh</a>
,
<a itemprop="sameAs" content="https://www.preining.info" href="https://www.preining.info" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/norbert.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Norbert Preining</a>
<br /><br />
</div>
<p>Norbert Preining joined the scikit-learn Team in June 2021. In this interview, learn more about Norbert’s journey, immersion and passion in open source. His contributions to open source span a lifetime – see where scikit-learn fits into all this.</p>
<ol>
<li>
<p><strong>Tell us about yourself.</strong></p>
<p>I have lived in a few countries, but for about 13 years I have called Japan my home. Within Japan I am not in one of the big cities, but in the countryside. For most of my life I have worked in academia, doing research on mathematical logic. For 7 years now I have been in research and development (R&D) teams of companies here in Japan: first Accelia, then Fujitsu, now Mercari. My main research topics are mathematical logic, in particular proof theory and many-valued logics, computability, and software specification and verification. Within my current work areas I am mostly concerned with machine learning in a variety of facets, but mostly in unsupervised learning and most recently in search.</p>
<ul>
<li>GitHub: <a href="https://github.com/norbusan">@norbusan</a></li>
<li>Twitter: <a href="https://twitter.com/norbusan">@norbusan</a></li>
<li>LinkedIn: <a href="https://www.linkedin.com/in/norbertpreining/">@norbertpreining</a></li>
<li>Website: <a href="https://www.preining.info">preining.info</a></li>
</ul>
</li>
<li>
<p><strong>How did you first become involved in open source?</strong></p>
<p>I got my first computer when I was writing my master’s thesis, and back then a friend installed Linux on it for me. Since then I have been a near-exclusive Linux user and have learned to love the advantages of open source.</p>
</li>
<li>
<p><strong>We would love to learn of your open source journey.</strong></p>
<p>I started contributing to OSS projects first within <a href="https://www.tug.org/texlive/">TeX Live</a> (the biggest distribution of TeX & friends, available for all major and many minor operating systems) by providing builds for an arcane architecture (alpha-linux). Later on I departed on an adventure to bring TeX Live to Debian. For nearly 20 years I maintained TeX Live and many other packages related (and unrelated) to TeX in Debian (all the versions of TeX Live from 2005 to 2021 were packaged by me), until this year I passed the torch to Hilmar Preuße, who has helped me a lot over the last years. During all these years I have also contributed to and headed quite a few other OSS projects.</p>
</li>
<li>
<p><strong>How did you get involved in scikit-learn?</strong></p>
<p>I have been using scikit-learn on and off for my AI/ML projects. During my time at Fujitsu I was the representative of Fujitsu in the scikit-learn Consortium, and started to organize development sprints in Japan, as well as contributing code to scikit-learn myself.</p>
</li>
<li>
<p><strong>Can you share your experience with open source sprints that you have organized or participated in? Any lessons learned?</strong></p>
<p>I have organized two scikit-learn development sprints in Japan (<a href="https://www.fujitsu.com/jp/about/research/article/202104-devsprint.html">Spring 2021</a> and <a href="https://www.fujitsu.com/jp/about/research/article/202111-devsprint2021a.html">Autumn 2021</a>), and participated in similar events a few times. For me the biggest problem is the “advertising part” - where/how to motivate people to participate. Having organized scientific conferences with hundreds to thousands of participants, the actual sprint organization was always a rather relaxed job for me, though. What I liked a lot during development sprints are pair programming options - sitting together with someone else and working on a project together. There is always something to learn from someone else, and having access to different perspectives or opinions usually shapes up the coding considerably.</p>
</li>
<li>
<p><strong>To which OSS projects and communities do you contribute?</strong></p>
<p>The biggest contribution was to the TeX Live project, where I am the responsible author for the whole infrastructure, the TeX Live manager, and large parts of the server-side tooling. Another considerable part is for the Japanese TeX Developer Community, where I have contributed several tools to make life for Japanese users of LaTeX more convenient. Within <a href="https://fossasia.org">FOSSASIA</a>, a global organization dedicated to open source and open hardware based in Asia, I have worked on and led the SUSI.AI project (privacy aware smart assistant and smart speaker based on Raspi). For Debian I have packaged practically everything related to TeX, several other packages, and in the last years I renovated the complete KDE/Plasma stack, which was lagging behind. Other contributions of larger parts are to the Shotwell photo editor (the whole comment system), the Linux Onedrive client, and a few more things here and there.</p>
</li>
<li>
<p><strong>What advice or tips do you have for people starting out in your field of work?</strong></p>
<p>I am not sure what “my field of work” is, though ;-) If you want to start doing OSS, find a project you are using, and a pain point you want to fix, and start coding. Even without knowing the language in the beginning, one can soon contribute. I never heard of ObjC before, but contributed quite some code to <a href="https://wiki.gnome.org/Apps/Shotwell">Shotwell</a>. I never heard about D before until I started developing features for Onedrive. Just get started and learn on the way.</p>
</li>
<li>
<p><strong>What do you find alluring about OSS?</strong></p>
<p>What I find alluring about OSS is that I can fix things I don’t like. I also like the “give and take” attitude: I receive a lot of things for free, excellent programs often surpassing their commercial counterparts by far. But I can also give back to the community: there are many ways to do this, even as a non-programmer, giving back is possible: improvement of documentation, community work, resource management, good bug reports, …</p>
</li>
<li>
<p><strong>What pain points do you observe in community-led OSS?</strong></p>
<p>Politics has taken a far too great hold in many communities, where protecting stakeholders is more important than protecting the developers. This has the effect that the diversity of opinions is badly strangled in many places. But I guess that is a consequence of the growing importance of OSS, and also reflects the general tendencies in societies.</p>
<p>Another pain point is the well known discrepancy between “<em>take</em> and give” from big companies. Often core components are developed by small groups in their spare time and huge infrastructures rely on that, without sufficiently honoring this fact.</p>
</li>
<li>
<p><strong>If we discuss how far OSS has evolved in 10 years, what would you like to see happen?</strong></p>
<p>I would like to see a more robust development system: things like malware injection into Python library or Javascript library repositories need to be dealt with, otherwise trust in open source as a viable and stable alternative will not grow.</p>
<p>Another wish for the next few years - related to scikit-learn in the sense that it is a Python library - is a better development experience with Python. Tooling is still a pain: version incompatibilities between Python releases (even between point releases) and loads of tools that all try to do similar things, to name two main pain points. Juggling every day with 3 versions of Python via pyenv, several venvs for projects, and three different tools to install/maintain is what I would like to see disappear.</p>
</li>
<li>
<p><strong>What are your favorite resources, books, courses, conferences, etc?</strong></p>
<p>I love to learn from books, so I have accumulated a lot of technical books; most of them are from <a href="https://www.manning.com">Manning Publications</a> (I am not affiliated with them!). There are two books I come back to again and again: <a href="https://en.wikipedia.org/wiki/Structure_and_Interpretation_of_Computer_Programs#:~:text=Structure%20and%20Interpretation%20of%20Computer%20Programs%20(SICP)%20is%20a%20computer,Wizard%20Book%22%20in%20hacker%20culture.">Structure and Interpretation of Computer Programs</a> (Abelson/Sussman) and the <a href="https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/">Pragmatic Programmer</a> (Thomas/Hunt). I think the two are pearls to be read again and again. On the more practical side, Knuth’s <a href="https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">Art of Computer Programming</a> is a great treasure.</p>
<p>I have tried online video courses, but this is less of a format for me. Q&A sites on the net are of course a great resource, but be aware that simply copying and pasting will lead in most cases to bad bugs (or even worse, bed bugs). Only once I have understood the code from there will I reuse it in my own programs.</p>
<p>Comment from Reshama:</p>
<blockquote>
<p>When I was first learning Python, I made the painful and <em>very time-consuming</em> mistake of doing that, copying version 2 code into my version 3 script and not understanding why it did not work, initially. Unfortunately, StackOverflow answers do not include versions of libraries.</p>
</blockquote>
</li>
<li>
<p><strong>What are your hobbies, outside of work and open source?</strong></p>
<p>I love to go to the mountains, and that is serious mountaineering. I have worked as a professional mountain guide (UIAGM) for some years, mostly in France and Switzerland, and I also do some professional guiding work here in Japan. Besides that, going out with friends into the mountains (rock climbing, ice climbing, ski touring, traditional mountaineering, and the discipline special to Japan/Taiwan/Korea: shower climbing, …) gives my head breathing room. Of course, with a small kid at home the mountains have become a bit smaller, and less present, though. So now with my family there is a lot of camping, going to the seaside, skiing in winter, and traveling (hopefully again soon also outside of Japan!).</p>
</li>
</ol>
<figure>
<img src="/assets/images/posts_images/norbert-japan.png" alt="photo of a man hiking" max-width="50%" max-height="50%" />
<figcaption>
Photo credit: <a href="https://www.preining.info">Norbert Preining</a>
</figcaption>
</figure>
Author: Reshama Shaikh, Norbert Preining
5 Years, 10 Sprints, A scikit-learn Open Source Journey
2022-05-12T00:00:00+00:00 2022-05-12T00:00:00+00:00
https://blog.scikit-learn.org/events/pyconde-keynote-reshama
<div>
<img src="/assets/images/posts_images/reshama-pyconde.png" alt="" />
Author:
<a itemprop="sameAs" content="https://reshamas.github.io" href="https://reshamas.github.io" rel="me noopener noreferrer" style="vertical-align:top;"><img src="/assets/images/author_images/reshama_shaikh.jpeg" style="width:1em;margin-right:.5em;border-radius: 50%;" alt="Author Icon" class="orcid-icon" />Reshama Shaikh</a>
<br /><br />
</div>
<h2 id="video">Video</h2>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ZUqJaCWPvmk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="about">About</h2>
<p>We all use open source tools in various capacities, yet how to contribute to open source is not as well known or accessible. The limited knowledge and education surrounding contributing to open source could be one explanation of the low participation rates by underrepresented persons in open source. Open source sprints are hands-on “workshops” or “hackathons” where contributors collaborate to resolve coding and documentation issues posted on a GitHub repository.</p>
<p>Reshama shares how she organized her first open source sprint in 2017, which was in-person and held in New York City. Over the next 5 years, she organized in-person sprints from San Francisco, USA to Nairobi, Kenya, and pivoted to online sprints due to the global pandemic. In this keynote, Reshama shares highlights, challenges and lessons learned from the <a href="https://www.dataumbrella.org/sprints">sprints</a>.</p>
<h2 id="about-reshama-shaikh">About Reshama Shaikh</h2>
<p>Reshama is a statistician/data scientist based in New York City. She earned her M.S. in statistics from Rutgers University. She earned her M.B.A. from NYU Stern School of Business where she studied strategy, business analytics and technology management.</p>
<p>Reshama Shaikh is the Director of Data Umbrella. She is also on the Contributor Team for scikit-learn and <a href="https://docs.pymc.io/en/latest/">PyMC</a> and an organizer for <a href="https://www.meetup.com/NYC-PyLadies/">NYC PyLadies</a>.</p>
<h2 id="key-links">Key Links</h2>
<ul>
<li><a href="https://blog.dataumbrella.org/tags/#sprint-report">Sprint Reports</a></li>
<li><a href="https://blog.dataumbrella.org/tags/#sprint-blog">Sprint Blogs</a></li>
</ul>
<h2 id="connecting">Connecting</h2>
<ul>
<li>LinkedIn: <a href="https://www.linkedin.com/in/reshamas/">@reshamas</a></li>
<li>Twitter: <a href="https://twitter.com/reshamas">@reshamas</a></li>
<li>GitHub: <a href="https://github.com/reshamas">@reshamas</a></li>
<li>Medium: <a href="https://medium.com/@reshamas">@reshamas</a></li>
<li>Join the Data Umbrella <a href="https://www.meetup.com/data-umbrella/">Meetup Group</a></li>
<li>Subscribe to the Data Umbrella <a href="https://www.youtube.com/c/DataUmbrella/">YouTube</a></li>
</ul>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSc293BLAC9SUrlYV2famYyXREOdz2uHMmUF3KNXTLenVj1gDllxf5wRlVFwI-l8MIzttP6T9_GZu1f/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<h3 id="keynote-day">Keynote Day</h3>
<p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/PyConDE?src=hash&ref_src=twsrc%5Etfw">#PyConDE</a> <a href="https://twitter.com/hashtag/PyDataBerlin?src=hash&ref_src=twsrc%5Etfw">#PyDataBerlin</a> <br />I will be delivering my keynote "5 Years, 10 Sprints, a <a href="https://twitter.com/scikit_learn?ref_src=twsrc%5Etfw">@scikit_learn</a> Open Source Journey"<br />🗓️ Tuesday, Apr 12, 2022<br />🕙 10:30-11:15 am ET (16:30 Berlin)<a href="https://twitter.com/hashtag/opensource?src=hash&ref_src=twsrc%5Etfw">#opensource</a> <a href="https://twitter.com/hashtag/MachineLearning?src=hash&ref_src=twsrc%5Etfw">#MachineLearning</a><br />You can still purchase tickets for *online* here:<a href="https://t.co/dzqekTRc9o">https://t.co/dzqekTRc9o</a></p>— Reshama Shaikh (@reshamas) <a href="https://twitter.com/reshamas/status/1513501454606209028?ref_src=twsrc%5Etfw">April 11, 2022</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</p>
<h3 id="keynote-announcement">Keynote Announcement</h3>
<p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">We are proud to announce <a href="https://twitter.com/reshamas?ref_src=twsrc%5Etfw">@reshamas</a> as keynote speaker for the conference 🥳<a href="https://twitter.com/scikit_learn?ref_src=twsrc%5Etfw">@scikit_learn</a> <a href="https://twitter.com/DataUmbrella?ref_src=twsrc%5Etfw">@DataUmbrella</a> <a href="https://t.co/OnAKESqqX7">https://t.co/OnAKESqqX7</a></p>— PyConDE & PyData Berlin (@PyConDE) <a href="https://twitter.com/PyConDE/status/1508409170457944068?ref_src=twsrc%5Etfw">March 28, 2022</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</p>
Author: Reshama Shaikh