State of the art workflows not just for Computer Scientists
The University of Auckland
July, 2024
This is a taster/talking-head, not a hands-on session 🗣️
We only have one hour 🕐 but you can learn a lot
Cameras on, please 📸
Please mute your microphone 🎤
Questions later unless unavoidable now: Zoom chat 💬
Be kind 😊
Recommendation
Goal: make your (research) lives easier
Recommended ResBaz sessions
Note
Because of the way we interact with the core coding task, we won’t see too much of the OS.
(All these ResBaz sessions run concurrently on Wed. 1-5pm)
We have two main approaches
Command-line interface (aka CLI/terminal/console/shell/Bash/Zsh/Fish/…):
Jupyter Notebooks, etc.:
Recommended ResBaz session
Introduction to the Command Line (Wed. 10am-3pm)
.py files
GUI = graphical user interface
.ipynb files: markdown syntax
Just 5 lines of code
We can do this in Google Colab (example here) if all stakeholders agree (Research Data Policy, Ethics Commission, IP Advisor, Funder, …)
Thesis final final really final 2 july.docx
Git, great for code: git status, git diff, git commit -am "present-tense active what I did", git push
A gift 👑 because:
A curse 🤬 because:
But! We are not the first people to run into such challenges
venv or not to venv?
(Though not crucial, mentioned for the sake of completeness)
venv?
A venv can have its own packages and versions, as opposed to a global environment (so you can run the old and the new version in parallel and test, etc.)
One venv per project. If your research and use of Python on one specific machine is just about one project, this might be ignorable; poetry (as mentioned on the previous slide) spins up a virtual environment for you anyway
(input data properly stored ✅ we use an IDE (VSC) ✅ use package management ✅)
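To make the idea of a per-project environment concrete, the sketch below creates a throwaway virtual environment with Python's standard-library `venv` module and checks whether the running interpreter is itself inside one. The directory name `demo-env` and the helpers `in_virtualenv` and `create_env` are hypothetical names chosen for illustration; on the command line, `python3 -m venv <name>` does the same job.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtualenv():
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_env(parent_dir, name="demo-env"):
    """Create a bare virtual environment (without pip) and return its path."""
    env_dir = Path(parent_dir) / name
    # with_pip=False keeps the sketch fast and avoids needing ensurepip.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_env(tmp)
        # pyvenv.cfg marks the directory as a virtual environment.
        print("created:", (env_dir / "pyvenv.cfg").exists())
    print("running inside a venv:", in_virtualenv())
```

Each environment created this way has its own `site-packages`, which is what allows two projects on the same machine to pin different versions of the same library.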
Kernel and VSC
Category | Details |
---|---|
Data Input | Download dataset 1 & 2 |
Computer | We use a VM (on Nectar) |
OS | We use Ubuntu 22.04 |
Language | Python |
Libraries | geopandas among others |
IDE | VSC to run a Jupyter Notebook (ssh to VM) |
Code | On GitHub |
Research Outputs | Map published to website/GitHub Action (bit out of scope) |
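The OS, Language, and Libraries rows above can also be captured from inside the notebook, which may help when reporting the setup alongside results. This is a minimal sketch using only the standard library; the helper name `describe_environment` and its default package list are illustrative assumptions, not part of the original session.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages=("geopandas", "pandas")):
    """Return OS, Python version, and installed versions of the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "libraries": versions,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```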
Rough workflow:
Details:
What I did in preparation (which exceeds the available time for this session):
Created a virtual environment (data):
sudo apt install python3.12-venv
python3 -m venv data
source data/bin/activate
Installed the Jupyter and GitHub Pull Requests extensions and signed in to GitHub
# We download the first dataset
!wget "https://github.com/UoA-eResearch/SA2_2022_population/raw/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip"
# !pip install geopandas
import geopandas as gpd
import pandas as pd
# Load dataset into geopandas df, have a look
sa2 = gpd.read_file("statistical-area-2-2023-generalised_simplified_22.3%.zip").dropna(subset="geometry")
sa2
# We remove the Chatham Islands and areas with no land area
sa2 = sa2[(sa2.SA22023__1 != "Chatham Islands") & (sa2.LAND_AREA_ > 0)].copy()
sa2
# We download the second dataset
!wget "https://raw.githubusercontent.com/UoA-eResearch/SA2_2022_population/main/population_by_SA2.csv"
# Load the population dataset downloaded above (wget saves it to the working directory)
population = pd.read_csv("population_by_SA2.csv")
population
# Extract ID from Area col by using RegEx
population['SA2'] = population['Area'].str.extract(r'(\d+)')
population
# Add a prefix to the right dataframe's columns (excluding the merge key)
prefix = 'population_in_year_'
population = population.rename(columns={col: prefix + col for col in population.columns[1:10]})
population
# Merge these
sa2 = sa2.merge(population, left_on='SA22023_V1', right_on='SA2')
sa2
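The extract, rename, and merge steps above can be tried on a small example before running them on the real datasets. The sketch below uses made-up toy tables (all IDs, names, and counts are hypothetical) and, as a variation on the inner merge used above, a left merge with `indicator=True` so that areas without a population match stay visible rather than being dropped silently.

```python
import pandas as pd

# Hypothetical stand-in for the SA2 table (geometry omitted for brevity).
areas = pd.DataFrame(
    {
        "SA22023_V1": ["100100", "100200", "100300"],
        "name": ["Area A", "Area B", "Area C"],
    }
)

# Hypothetical population table: "Area" mixes a numeric ID with a label.
population = pd.DataFrame(
    {
        "Area": ["100100 Area A", "100200 Area B", "999999 Elsewhere"],
        "2022": [1200, 800, 50],
        "2023": [1250, 790, 55],
    }
)

# Extract the first run of digits as a string key (expand=False returns a Series).
population["SA2"] = population["Area"].str.extract(r"(\d+)", expand=False)

# Prefix only the year columns so the merge key keeps its name.
year_cols = ["2022", "2023"]
population = population.rename(
    columns={col: "population_in_year_" + col for col in year_cols}
)

# A left merge with indicator=True keeps every area and flags unmatched ones.
merged = areas.merge(
    population, left_on="SA22023_V1", right_on="SA2", how="left", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

Both keys are strings here; if one side were numeric, the merge would raise or match nothing, so checking the key dtypes first is a cheap safeguard.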
Joining the dots for modern data science workflows