Marlene Mhangami

How I Made My First PR To Apache Arrow


Contributing code to the open source project PyArrow

Marlene Mhangami
·Mar 11, 2022·

Open source has been great for my career! Contributing is also an excellent way to get to know a library better. Since joining Voltron Data I've been learning about an open source project called Apache Arrow and wanted to contribute to it. If you've been nervous about getting started and don't know how, this blog post will take you through everything I did to make my first contribution.

Look for the docs📜

The first step in contributing to a library you're new to is to look at the documentation. Most open source projects have a contributor's guide. There's a new and improved Arrow guide that's just been published. You can keep following along with my post, but I'd definitely recommend checking out the guide if you run into any errors I don't address. You can check it out here.

Choose an issue to work on🛠️

If a project needs help they'll write an 'issue' for it. In most cases issues are on GitHub. For @ApacheArrow, issues are on Jira. Even if you aren't ready to contribute just yet, looking at a project's issue list lets you know what you can prepare for once you are ready.

This was my first issue.

issue.jpeg

Some things to note about it:

- It's a Python issue (there were also #r, #cpp, #js issues)
- The goodfirstissue label is visible
- It says closed because I took the screenshot afterwards, but you want to look for open issues

Set up your dev environment🧑🏿‍🔬

Even if you're not exactly sure how you'll solve an issue, the next step should be to set up your dev environment. I usually only think of a solution once I've experimented with the code. For PyArrow, setting up meant forking and cloning the Apache Arrow repository, setting up a Python virtual environment, and building the Arrow libraries.

Forking the Arrow GitHub repository

You can find the Arrow repo here. You can find a tutorial by GitHub on forking here.

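Cloning the Arrow GitHub repository 🐑

Cloning your fork creates a directory named arrow with a remote called origin that points at your fork. Adding the main Apache Arrow repository as a second remote, upstream, lets you pull in the latest changes later. In order to do this I used the commands seen in the image below.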

cloningrepo.jpeg

At this point you should be in a folder named 'arrow'. This would also be a good time to create a Python virtual environment.
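The screenshot with the exact commands isn't reproduced here, so here is a minimal sketch of what this step usually looks like, assuming you have already forked the repository. The helper name clone_fork is made up for illustration, and the two URLs are arguments you supply yourself.

```shell
# Sketch: clone your fork into ./arrow and register the main repo as "upstream".
# clone_fork is a hypothetical helper name, not part of git or Arrow.
#   $1 = URL of your fork (this becomes the "origin" remote)
#   $2 = URL of the main Apache Arrow repository (added as "upstream")
clone_fork() {
    git clone "$1" arrow &&
    cd arrow &&
    git remote add upstream "$2" &&
    git remote -v    # lists the remotes so you can check both are there
}
```

You would call it with the address of your own fork first and the address of the apache/arrow repository second. Because the function changes directory, your shell ends up inside the arrow folder afterwards.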

Building Arrow C++🧱:

In order to run Arrow on your local machine you need to build both its C++ and Python libraries. You can find documentation about the commands you need for different operating systems here.

I'm on a Mac so I used the commands specified for it. I also used the project's CMake presets to get started quickly, though if you're planning to contribute more frequently, building without presets might be best. I found that building the libraries tends to be the trickiest part. If you run into any issues feel free to reach out to me on Twitter or GitHub. There's also an Apache Arrow mailing list you can send questions to. Here's the list of commands I used:

buildcpp.jpeg
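Since the image isn't available, here is a rough sketch of a preset-based build, assuming a recent CMake with preset support and that you run it from the top of your arrow checkout. The helper name build_arrow_cpp is hypothetical, and the preset name and install folder are arguments you pick yourself. The exact commands for your OS are in the Arrow docs and may differ from this.

```shell
# Sketch: configure, build, and install Arrow C++ with one of the project's
# CMake presets. build_arrow_cpp is a hypothetical helper name.
#   $1 = preset name (run `cmake --list-presets` inside arrow/cpp to see them)
#   $2 = install prefix, an absolute path where the built libraries should go
# Run it from the top of your arrow checkout.
build_arrow_cpp() {
    mkdir -p cpp/build &&
    cd cpp/build &&
    cmake .. --preset "$1" -DCMAKE_INSTALL_PREFIX="$2" &&
    cmake --build . &&
    cmake --install . &&
    cd ../..
}
```

For PyArrow you'll want a preset that builds the pieces PyArrow depends on, so it's worth listing the presets first and checking the docs before choosing one.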

Building PyArrow🧱:

The code needed to build PyArrow will vary based on your OS. Here's a page that lists the options. I used the commands shown in the image below (as a reminder, I'm on a Mac).

buildpyarrow.jpeg
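As a stand-in for the image, here is a hedged sketch of an in-place PyArrow build as the Arrow docs described it around the time of this post. It assumes your virtual environment is active and that the C++ libraries were already installed into the folder you pass in. The helper name build_pyarrow is hypothetical, and newer Arrow versions may use different commands.

```shell
# Sketch: build PyArrow in place against the Arrow C++ libraries built earlier.
# build_pyarrow is a hypothetical helper name.
#   $1 = the install prefix you used for the C++ build (absolute path)
# Run it from the top of your arrow checkout, with your virtual environment active.
build_pyarrow() {
    # Tell the PyArrow build where the C++ libraries were installed.
    export ARROW_HOME="$1" &&
    export CMAKE_PREFIX_PATH="$ARROW_HOME${CMAKE_PREFIX_PATH:+:$CMAKE_PREFIX_PATH}" &&
    cd python &&
    python -m pip install -r requirements-build.txt &&
    python setup.py build_ext --inplace &&
    cd ..
}
```

If the build can't find the Arrow C++ libraries, the install prefix is the first thing to double check.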

Figuring out how to solve the issue🧐

When I'm trying to solve an issue, my first goal is to understand how the function in question works. To do this I'll

- open up IPython or a Jupyter notebook, which is where I'm most comfortable working.
- search GitHub or my editor to find where the function is defined.
- call the function and make sure I understand what it does. This also helps me understand the problem.
- try solutions in my branch, read errors, and ask questions about the issue if needed.

Feel free to comment on the issue itself to get answers about things you aren't sure about. Here's a picture of a discussion I had about the issue I was working on:

comments.jpeg
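If you'd rather search from the terminal than from GitHub or an editor, the "find where the function is defined" step can be sketched with plain grep. This is only an illustration: find_definition is a made-up helper name, and it assumes you are at the top of your arrow checkout.

```shell
# Sketch: list the places a function is defined under python/pyarrow.
# find_definition is a hypothetical helper name.
#   $1 = the name of the function you are looking for
# Prints matching lines as file:line-number:text.
find_definition() {
    grep -rn "def $1(" python/pyarrow
}
```

Once you know which file the function lives in, you can open it next to your IPython session and try calling the function with small inputs.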

Running Tests and Linters👩🏿‍🏫

PyArrow uses pytest for unit tests in Python. You'll need to cd into the python/pyarrow folder. The project also uses archery for linting! Different open source projects use different linters, so check the docs! Here's the code for all of this.

runningtests.jpeg
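In place of the image, here is a small sketch of running the tests and then the Python linters. It assumes pytest and archery are already installed in your virtual environment (the Arrow docs explain how to install archery from the repository). The helper name check_pyarrow is hypothetical.

```shell
# Sketch: run the PyArrow unit tests, then the Python lint checks via archery.
# check_pyarrow is a hypothetical helper name.
#   "$@" = whatever you would normally pass to pytest (a folder, a test file, flags)
# The linters only run if the tests pass.
check_pyarrow() {
    python -m pytest "$@" &&
    archery lint --python
}
```

Running only the test file that covers the function you changed is usually much faster than running the whole suite while you're still experimenting.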

Making a Pull Request✨

If you have a working solution, try not to overthink it! Sharing code means other people can help! I like to check which files have changed to make sure I didn't change any by mistake. I'll add and commit my code with a clear message and then push the changes. Here's the code I used for this.

pr.jpeg
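The image is missing here too, so this is a minimal sketch of the check, add, commit, and push flow described above. It assumes origin is your fork and that you're on the branch you want to push. The helper name push_changes is hypothetical.

```shell
# Sketch: show what changed, stage only the files you name, commit, and push.
# push_changes is a hypothetical helper name.
#   $1     = commit message
#   $2...  = the files you meant to change
push_changes() {
    message="$1"; shift
    git status --short &&        # quick look at everything that changed
    git add -- "$@" &&           # stage only the files listed
    git commit -m "$message" &&
    git push origin HEAD         # push the current branch to your fork
}
```

Naming the files explicitly instead of adding everything is one way to avoid committing a file you changed by mistake. After the push, GitHub will offer to open a pull request from your branch.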

Code Reviews✅

Once you've pushed your code, give your PR a descriptive name. Maintainers will then comment on your code. Don't be afraid to receive feedback, even though this can be scary. If you get feedback, make changes in your local branch and then push them to your PR. Here's what my PR looked like once I pushed it and started getting reviews.

reviews.jpeg

Here are links to the issue and PR, in case they're helpful:

Issue: issues.apache.org/jira/browse/ARROW-14242

PR: github.com/apache/arrow/pull/11873

That's it🎉✨ If you're thinking of contributing to open source (not just @ApacheArrow), hopefully you found this helpful! Let me know on Twitter if you end up creating a PR of your own.

 