Goals for today

This session is mostly about showing efficient means of working with data
- for example, the Python session on Wednesday
The selection of a geospatial dataset is mostly because it
- can be easily visualised and (hopefully) understodd
- sophisticated libraries exist; geopandas and folium that can be used to work with geospatial data
  - similar libraries exist for various research disciplines, so go out and explore!
Overall, we will
- install Python
- install Microsoft Visual Studio Code
- (install Jupyter Notebook)
- pip install geopandas and other libraries
- use git for version control; GitHub as our provider for this
  - for more details see https://www.eventbrite.co.nz/e/introduction-to-version-control-with-git-tickets-908002348467
- download a dataset
- use folium to visualise this data

(ChatGPT’s take on leaflet vs. folium vs. geopandas)

Some background information on the dataset

we use a dataset from StatsNZ, which is the 2022 Census dataset
in plain English:
- this dataset contains information about the population of New Zealand
- for our use-case, we use the Statistical Area 2 (SA2) dataset. This implies that each geographical cluster contains around 1000-4000 people are assigned into one SA2
- SA2 shall represent a ‘community of place’ where people interact together socially and economically
- in other words (but dependent on each use-case), we can assume(!) that the people in one SA2 are similar in terms of their socio-economic status

More details: - we will refer to SA (Statistical Areas, as provided by StatsNZ) - There are 3 levels of SA: SA1, SA2, SA3 - SA1 is the smallest (about 100-500 residents), SA2 (1k-4k residents) SA3 is the largest (5k-50k residents) - We will use the medium one (SA2) for this analysis

Let’s have a look at the tool that we are about to use and why

Python because it is…

… relatively easy to learn and use
… general-purpose programming language, which means it can be used to build just about anything

We download the data

there are various ways of doing this
- we can go to the website and press the download button
- we can use command-line tools directly from here, such as wget or curl to download the data
- here, we will download the data wget from a GitHub repository that hosts an optimised derivative of the original data set

!wget "https://github.com/UoA-eResearch/SA2_2022_population/raw/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip"

--2024-06-20 15:23:11--  https://github.com/UoA-eResearch/SA2_2022_population/raw/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip
Resolving github.com (github.com)... 20.248.137.48
Connecting to github.com (github.com)|20.248.137.48|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/UoA-eResearch/SA2_2022_population/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip [following]
--2024-06-20 15:23:11--  https://raw.githubusercontent.com/UoA-eResearch/SA2_2022_population/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6090234 (5.8M) [application/zip]
Saving to: ‘statistical-area-2-2023-generalised_simplified_22.3%.zip.3’

statistical-area-2- 100%[===================>]   5.81M  5.63MB/s    in 1.0s    

2024-06-20 15:23:12 (5.63 MB/s) - ‘statistical-area-2-2023-generalised_simplified_22.3%.zip.3’ saved [6090234/6090234]

# wget -O output.zip "https://github.com/UoA-eResearch/SA2_2022_population/raw/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip"

We import these libraries

The following is very commonly done, we import libraries/packages/… To minimise typing effort, we give them short names (aliases)

import geopandas as gpd
import pandas as pd

So this failed. Why do you think that happened?

# !pip install geopandas

Requirement already satisfied: geopandas in /opt/homebrew/lib/python3.11/site-packages (0.14.4)
Requirement already satisfied: fiona>=1.8.21 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.9.6)
Requirement already satisfied: numpy>=1.22 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.24.3)
Requirement already satisfied: packaging in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from geopandas) (23.2)
Requirement already satisfied: pandas>=1.4.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.5.3)
Requirement already satisfied: pyproj>=3.3.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (3.6.1)
Requirement already satisfied: shapely>=1.8.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (2.0.4)
Requirement already satisfied: attrs>=19.2.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (23.1.0)
Requirement already satisfied: certifi in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (2022.12.7)
Requirement already satisfied: click~=8.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (8.1.7)
Requirement already satisfied: click-plugins>=1.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (1.1.1)
Requirement already satisfied: cligj>=0.5 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (0.7.2)
Requirement already satisfied: six in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from fiona>=1.8.21->geopandas) (1.16.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from pandas>=1.4.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/homebrew/lib/python3.11/site-packages (from pandas>=1.4.0->geopandas) (2023.3)

so let’s try again

import geopandas as gpd
import pandas as pd

We now use Geopandas

the gpd is a special kind of pandas DataFrame
the .dropna is a means of filtering data (here, these are rows) that contain a n/a — in other words: A row that has elements with no data, that would otherwise give us challenges
- on a bigger picture perspective: It is hard to calculate the average of a column if there are missing values in it; avg (2+4+NaN) = NaN
- this is only applied to the geometry column, which is the column that contains the geospatial data

sa2 = gpd.read_file("statistical-area-2-2023-generalised_simplified_22.3%.zip").dropna(subset="geometry")
sa2

	SA22023_V1	SA22023__1	SA22023__2	LAND_AREA_	AREA_SQ_KM	Shape_Leng	geometry
0	100100	North Cape	North Cape	829.188012	829.188012	4.160392e+05	MULTIPOLYGON (((1614055.808 6154067.023, 16141...
1	100200	Rangaunu Harbour	Rangaunu Harbour	274.004295	274.004295	1.490891e+05	MULTIPOLYGON (((1624166.328 6127258.658, 16243...
2	100301	Inlets Far North District	Inlets Far North District	0.000000	623.222037	1.349364e+06	MULTIPOLYGON (((1620023.410 6084653.062, 16197...
3	100400	Karikari Peninsula	Karikari Peninsula	174.440751	174.440751	1.401491e+05	MULTIPOLYGON (((1626838.499 6126965.133, 16270...
4	100500	Tangonge	Tangonge	177.182117	177.182117	1.014669e+05	POLYGON ((1623516.265 6119580.907, 1623492.213...
...	...	...	...	...	...	...	...
2381	363600	Oceanic Marlborough Region	Oceanic Marlborough Region	0.000000	5204.239555	6.030516e+05	POLYGON ((1735069.340 5471565.641, 1731082.409...
2382	363700	Oceanic Southland Region	Oceanic Southland Region	0.000000	22404.339718	2.702862e+06	POLYGON ((1201178.956 5079126.036, 1201206.109...
2383	363800	Oceanic Canterbury Region	Oceanic Canterbury Region	0.000000	11441.928693	1.176106e+06	POLYGON ((1686900.717 5353231.610, 1704528.061...
2384	363900	Oceanic Otago Region	Oceanic Otago Region	0.000000	6538.077281	7.359200e+05	POLYGON ((1453579.263 5021995.144, 1474465.101...
2385	364000	Motunau Island	Motunau Island	0.081474	0.081474	1.427156e+03	POLYGON ((1606233.915 5232188.150, 1606183.115...

2379 rows × 7 columns

let’s perform some sanity checks

# give me only the rows in sa2 that have a geometry of NaN
sa2[sa2["geometry"].isna()]

	SA22023_V1	SA22023__1	SA22023__2	LAND_AREA_	AREA_SQ_KM	Shape_Leng	geometry

# only show me rows where SA22023__1 and SA22023__2 differ
sa2[sa2["SA22023__1"] != sa2["SA22023__2"]]

	SA22023_V1	SA22023__1	SA22023__2	LAND_AREA_	AREA_SQ_KM	Shape_Leng	geometry
10	104301	ÅŒpua (Far North District)	Opua (Far North District)	5.602473	5.827312	19213.437576	POLYGON ((1699472.895 6094086.681, 1699528.640...
14	104800	Mangakahia-HÅ«kerenui	Mangakahia-Hukerenui	659.254330	659.254330	183365.297157	POLYGON ((1709183.206 6068629.847, 1709173.331...
19	105400	MaungatÄ?pere	Maungatapere	174.482190	174.482190	93344.845747	POLYGON ((1711760.225 6046448.026, 1711747.134...
20	105601	Matapouri-TutukÄ?kÄ?	Matapouri-Tutukaka	78.527938	78.587007	86800.477007	MULTIPOLYGON (((1740046.334 6057828.894, 17399...
43	108000	PÄ?taua	Pataua	128.707987	128.707987	129154.856928	MULTIPOLYGON (((1740486.853 6046295.860, 17405...
...	...	...	...	...	...	...	...
2265	347000	WÄ?naka Central	Wanaka Central	7.562054	7.562054	12758.648935	POLYGON ((1296155.366 5043593.460, 1296353.267...
2267	347101	Lake HÄ?wea	Lake Hawea	3.655589	3.655589	9016.577926	POLYGON ((1302855.368 5051689.464, 1302774.822...
2335	357200	Inland water Lake Te Ä€nau	Inland water Lake Te Anau	0.000000	344.402829	361810.765779	POLYGON ((1199680.502 5012394.655, 1199729.683...
2341	357501	Te Ä€nau	Te Anau	6.637305	6.657040	12967.743524	POLYGON ((1186662.007 4956066.127, 1186763.928...
2346	358000	ÅŒhai-Nightcaps	Ohai-Nightcaps	948.798919	948.798919	174165.196492	POLYGON ((1232653.066 4911665.243, 1229777.053...

175 rows × 7 columns

now it becomes apparent what the difference might be; let’s have a look at the original dataset’s column names: - SA22023_V1_00_NAME - SA22023_V1_00_NAME_ASCII So two ways how characters are encoded, this is a common issue in data processing and by giving both options, this facilites the process

We do not want to include the Chatam Islands, as they are not part of the main landmass of New Zealand and only under 800 people live there.

We want to make sure that we don’t have empty SA2 (that is unlikely but for other ‘resolutions’/approaches of classifying geospatial data, there might be a use for that, e.g. Marine Buoy Locations Dataset)

# only show me rows where LAND_AREA_ column is zero
sa2[sa2["LAND_AREA_"] == 0]

	SA22023_V1	SA22023__1	SA22023__2	LAND_AREA_	AREA_SQ_KM	Shape_Leng	geometry
2	100301	Inlets Far North District	Inlets Far North District	0.0	623.222037	1.349364e+06	MULTIPOLYGON (((1620023.410 6084653.062, 16197...
16	105001	Inlets other Whangarei District	Inlets other Whangarei District	0.0	38.949878	2.012673e+05	MULTIPOLYGON (((1737958.662 6046745.046, 17380...
24	111000	Oceanic Auckland Region West	Oceanic Auckland Region West	0.0	2384.034569	2.793983e+05	POLYGON ((1724020.922 5929713.851, 1724095.069...
30	112001	Inlets other Auckland	Inlets other Auckland	0.0	122.490416	5.958464e+05	MULTIPOLYGON (((1791662.172 5909541.528, 17916...
44	108400	Inlet WhangÄ?rei Harbour	Inlet Whangarei Harbour	0.0	103.520564	1.658690e+05	POLYGON ((1722383.512 6044397.233, 1722876.556...
...	...	...	...	...	...	...	...
2380	363500	Oceanic Nelson Region	Oceanic Nelson Region	0.0	787.574671	1.537628e+05	POLYGON ((1630452.435 5482785.080, 1638439.661...
2381	363600	Oceanic Marlborough Region	Oceanic Marlborough Region	0.0	5204.239555	6.030516e+05	POLYGON ((1735069.340 5471565.641, 1731082.409...
2382	363700	Oceanic Southland Region	Oceanic Southland Region	0.0	22404.339718	2.702862e+06	POLYGON ((1201178.956 5079126.036, 1201206.109...
2383	363800	Oceanic Canterbury Region	Oceanic Canterbury Region	0.0	11441.928693	1.176106e+06	POLYGON ((1686900.717 5353231.610, 1704528.061...
2384	363900	Oceanic Otago Region	Oceanic Otago Region	0.0	6538.077281	7.359200e+05	POLYGON ((1453579.263 5021995.144, 1474465.101...

83 rows × 7 columns

we can combine this into one prompt and see the result

sa2 = sa2[(sa2.SA22023__1 != "Chatham Islands") & (sa2.LAND_AREA_ > 0)].copy()
sa2

	SA22023_V1	SA22023__1	SA22023__2	LAND_AREA_	AREA_SQ_KM	Shape_Leng	geometry
0	100100	North Cape	North Cape	829.188012	829.188012	416039.151064	MULTIPOLYGON (((1614055.808 6154067.023, 16141...
1	100200	Rangaunu Harbour	Rangaunu Harbour	274.004295	274.004295	149089.110095	MULTIPOLYGON (((1624166.328 6127258.658, 16243...
3	100400	Karikari Peninsula	Karikari Peninsula	174.440751	174.440751	140149.099270	MULTIPOLYGON (((1626838.499 6126965.133, 16270...
4	100500	Tangonge	Tangonge	177.182117	177.182117	101466.900183	POLYGON ((1623516.265 6119580.907, 1623492.213...
5	100900	Rangitihi	Rangitihi	84.539982	84.539982	60636.302194	POLYGON ((1629314.862 6117970.968, 1629371.783...
...	...	...	...	...	...	...	...
2374	362601	Heidelberg	Heidelberg	1.308968	1.308968	5367.870180	POLYGON ((1245892.259 4848174.153, 1245922.973...
2375	363001	Clifton-Kew	Clifton-Kew	3.770832	3.770832	11890.253381	POLYGON ((1243666.689 4847254.124, 1243677.447...
2376	363100	Woodend-Greenhills	Woodend-Greenhills	173.600766	173.600766	166674.352079	MULTIPOLYGON (((1248612.765 4823630.510, 12486...
2378	363301	Bluff	Bluff	10.136949	10.290040	23162.004392	POLYGON ((1244729.237 4827495.819, 1244667.973...
2385	364000	Motunau Island	Motunau Island	0.081474	0.081474	1427.156234	POLYGON ((1606233.915 5232188.150, 1606183.115...

2295 rows × 7 columns

Section Break!

Well done, we have now our data in a format that we can work with. For each SA2, we have the geospatial extent (the geometry: the name, area, length, a polygon that surrounds it/boxes it in)

Next: we want to enhance the dataset with the population per SA2 While we downloaded the first dataset from a GitHub repository (but that might have been any other server) using the wget command-line tool, we will now use the pandas library to read in a CSV file from a local directory

population = pd.read_csv("Data2024Assets/population_by_SA2.csv")
population

	Area	1996	2001	2006	2013	2018	2019	2020	2021	2022
0	100100 North Cape	1710.0	1520.0	1380.0	1470.0	1660.0	1690.0	1750.0	1800.0	1820.0
1	100200 Rangaunu Harbour	2050.0	2100.0	2070.0	2200.0	2410.0	2470.0	2580.0	2620.0	2640.0
2	100300 Inlets Far North district	190.0	140.0	70.0	60.0	50.0	50.0	50.0	40.0	40.0
3	100400 Karikari Peninsula	860.0	1040.0	970.0	1280.0	1300.0	1300.0	1360.0	1400.0	1410.0
4	100500 Tangonge	940.0	1010.0	1090.0	1240.0	1180.0	1180.0	1230.0	1240.0	1260.0
...	...	...	...	...	...	...	...	...	...	...
2256	400013 Snares Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2257	400014 Oceanic Antipodes Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2258	400015 Antipodes Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2259	400016 Ross Dependency	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

2261 rows × 10 columns

this brings us to a common challenge with combining datasets from different sources

the SA2 dataset has relevant data split into several columns
- SA22023_V1 is a predefined ID; this is a common practice in data processing! It is unique
- SA22023__1 is easily understandable name
- SA22023__2 is the same name but in ASCII encoding (special characters matter, remmeber?)
the population dataset has similar information combined in the AREA column
it starts with the numeric code
then the human readable name

Do you have any idea how we can efficiently work with such datasets?

ZOOM POLL

for now, we assume that the unique idea is a lot easier to find the corresponding data in the two datasets
if someone has already defined such a ‘key’, we don’t have to worry about the encoding of the name (special characters and all)

# Extract ID from Area col
population['SA2'] = population['Area'].str.extract(r'(\d+)')
population

	Area	1996	2001	2006	2013	2018	2019	2020	2021	2022	SA2
0	100100 North Cape	1710.0	1520.0	1380.0	1470.0	1660.0	1690.0	1750.0	1800.0	1820.0	100100
1	100200 Rangaunu Harbour	2050.0	2100.0	2070.0	2200.0	2410.0	2470.0	2580.0	2620.0	2640.0	100200
2	100300 Inlets Far North district	190.0	140.0	70.0	60.0	50.0	50.0	50.0	40.0	40.0	100300
3	100400 Karikari Peninsula	860.0	1040.0	970.0	1280.0	1300.0	1300.0	1360.0	1400.0	1410.0	100400
4	100500 Tangonge	940.0	1010.0	1090.0	1240.0	1180.0	1180.0	1230.0	1240.0	1260.0	100500
...	...	...	...	...	...	...	...	...	...	...	...
2256	400013 Snares Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400013
2257	400014 Oceanic Antipodes Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400014
2258	400015 Antipodes Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400015
2259	400016 Ross Dependency	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400016
2260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

2261 rows × 11 columns

Wait! What?

let’s have a look what we did here
we created a new column SA22023_V1 in the population dataset
we used the Area column as an input
- for example 100100 North Cape
- but we didn’t copy the whole content from Area into SA2, we only took some of it
- NOW: This is a very important step and applies to so many other datasets and research questions! Please pay special attention! Always think of the outliers!
  - we could have simply assumed that each of these IDs is 6 digits long
  - but that might not be the case
  - in a different context: University of Auckland has an ID of often (!) 3 characters followed by 3 numbers
    - often! is the key to the key (pun intended) here
    - if some of these names, for example, are taken frequently, 3 characters might not be enough
    - etc
- so, what do we do to minimise the risk of errors?
  - we use something called Regex or Regular Expression
  - keeping the scope of this session in mind, think of it as a defined set of rules that we can apply to a string to find/match things
```
r'(\d+)': This is a raw string containing a regular expression.
\d: Matches any digit (0-9).
+: Matches one or more of the preceding element (in this case, one or more digits).
(): Capturing group, indicating that we want to extract the digits.
```
- by using this, we can extract digits (any number of them greater 0) from a string
- and we put them together in one capture group

Let’s compare the file types in the columns - Area that came with the dataset - and SA2 that we just created - we will use the first row for the sake of simplicity; important side-note: Python is zero-indexed, more in the Python session on Wednesday

# Selecting the first row
first_row = population.iloc[0]

# Inspecting datatypes of the first row elements
first_row_dtypes = first_row.apply(type)
print(first_row_dtypes)

Area              <class 'str'>
1996    <class 'numpy.float64'>
2001    <class 'numpy.float64'>
2006    <class 'numpy.float64'>
2013    <class 'numpy.float64'>
2018    <class 'numpy.float64'>
2019    <class 'numpy.float64'>
2020    <class 'numpy.float64'>
2021    <class 'numpy.float64'>
2022    <class 'numpy.float64'>
SA2               <class 'str'>
Name: 0, dtype: object

this means that both are ‘strings’ that contain a number; this is often another common challenge!

Now let’s merge the two datasets; but wait, before we do that, let’s have a look at the column names and what this implies

ZOOM Poll: Do you see any challange if we had these

SA22023_V1 ,SA22023_1 ,SA220232 ,LAND_AREA_ ,AREA_SQ_KM ,Shape_Leng ,geometry ,Area ,1996 ,2001 ,2006 ,2013 ,2018 ,2019 ,2020 ,2021 ,2022 ,SA2

we would just have some numbers as column names while this might be totally acceptable for us, now, what if we look at the data 5 years from now, or someone else does? there are two ways to address this: - we can rename the columns - we can accompany our dataset and code with a readme file (you might have participates in the session Keeping Your Spreadsheets Tidy) - which one to pick depends - for Mechanical engineering, it might be unnecessarily hard to keep all the units (that might in turn have special characters or fractions, etc.) in the column names, here a readme.txt file is far superior - for our given use-case, we will rename the columns by adding some additional text (to the front, i.e. a prefix) - let’s find the relevant column numbers (remember: Python is zero-indexed!)

# Display the index number of each column
for index, column in enumerate(population.columns):
    print(f'Index: {index}, Column: {column}')

Index: 0, Column: Area
Index: 1, Column: 1996
Index: 2, Column: 2001
Index: 3, Column: 2006
Index: 4, Column: 2013
Index: 5, Column: 2018
Index: 6, Column: 2019
Index: 7, Column: 2020
Index: 8, Column: 2021
Index: 9, Column: 2022
Index: 10, Column: SA2

# Add a prefix to the right dataframe's columns (excluding the merge key)
prefix = 'population_in_year_'
population= population.rename(columns={col: prefix + col for col in population.columns[1:10]})
population

	Area	population_in_year_1996	population_in_year_2001	population_in_year_2006	population_in_year_2013	population_in_year_2018	population_in_year_2019	population_in_year_2020	population_in_year_2021	population_in_year_2022	SA2
0	100100 North Cape	1710.0	1520.0	1380.0	1470.0	1660.0	1690.0	1750.0	1800.0	1820.0	100100
1	100200 Rangaunu Harbour	2050.0	2100.0	2070.0	2200.0	2410.0	2470.0	2580.0	2620.0	2640.0	100200
2	100300 Inlets Far North district	190.0	140.0	70.0	60.0	50.0	50.0	50.0	40.0	40.0	100300
3	100400 Karikari Peninsula	860.0	1040.0	970.0	1280.0	1300.0	1300.0	1360.0	1400.0	1410.0	100400
4	100500 Tangonge	940.0	1010.0	1090.0	1240.0	1180.0	1180.0	1230.0	1240.0	1260.0	100500
...	...	...	...	...	...	...	...	...	...	...	...
2256	400013 Snares Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400013
2257	400014 Oceanic Antipodes Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400014
2258	400015 Antipodes Islands	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400015
2259	400016 Ross Dependency	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	400016
2260	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

2261 rows × 11 columns

Ready, steady: Merge… Wait…

OK, almost there
Let’s wait again
How to merge the two datasets and how to make sure that the right data is in the right place?
In other words: How can we explicitly state that we want to look for the column SA22023_V1 in the SA2 dataset and the column we named SA2 in the population] dataset?
By specifying it!

sa2 = sa2.merge(population, left_on='SA22023_V1', right_on='SA2')
sa2

	SA22023_V1	SA22023__1	SA22023__2	LAND_AREA_	AREA_SQ_KM	Shape_Leng	geometry	Area	population_in_year_1996	population_in_year_2001	population_in_year_2006	population_in_year_2013	population_in_year_2018	population_in_year_2019	population_in_year_2020	population_in_year_2021	population_in_year_2022	SA2
0	100100	North Cape	North Cape	829.188012	829.188012	416039.151064	MULTIPOLYGON (((1614055.808 6154067.023, 16141...	100100 North Cape	1710.0	1520.0	1380.0	1470.0	1660.0	1690.0	1750.0	1800.0	1820.0	100100
1	100200	Rangaunu Harbour	Rangaunu Harbour	274.004295	274.004295	149089.110095	MULTIPOLYGON (((1624166.328 6127258.658, 16243...	100200 Rangaunu Harbour	2050.0	2100.0	2070.0	2200.0	2410.0	2470.0	2580.0	2620.0	2640.0	100200
2	100400	Karikari Peninsula	Karikari Peninsula	174.440751	174.440751	140149.099270	MULTIPOLYGON (((1626838.499 6126965.133, 16270...	100400 Karikari Peninsula	860.0	1040.0	970.0	1280.0	1300.0	1300.0	1360.0	1400.0	1410.0	100400
3	100500	Tangonge	Tangonge	177.182117	177.182117	101466.900183	POLYGON ((1623516.265 6119580.907, 1623492.213...	100500 Tangonge	940.0	1010.0	1090.0	1240.0	1180.0	1180.0	1230.0	1240.0	1260.0	100500
4	100900	Rangitihi	Rangitihi	84.539982	84.539982	60636.302194	POLYGON ((1629314.862 6117970.968, 1629371.783...	100900 Rangitihi	800.0	850.0	830.0	890.0	970.0	970.0	980.0	1000.0	1030.0	100900
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1589	361000	Invercargill Central	Invercargill Central	9.182431	9.182431	19256.716523	POLYGON ((1243482.779 4850331.790, 1243525.689...	361000 Invercargill Central	1300.0	1020.0	1190.0	1160.0	1440.0	1460.0	1480.0	1470.0	1460.0	361000
1590	361100	Gladstone (Invercargill City)	Gladstone (Invercargill City)	1.269458	1.269458	5366.924897	POLYGON ((1243388.753 4852028.683, 1243411.913...	361100 Gladstone (Invercargill city)	2630.0	2500.0	2530.0	2460.0	2450.0	2470.0	2480.0	2470.0	2490.0	361100
1591	361300	Avenal	Avenal	1.427733	1.427733	5326.010603	POLYGON ((1243418.578 4851429.187, 1243466.565...	361300 Avenal	1280.0	1250.0	1180.0	1170.0	1310.0	1340.0	1370.0	1340.0	1350.0	361300
1592	363100	Woodend-Greenhills	Woodend-Greenhills	173.600766	173.600766	166674.352079	MULTIPOLYGON (((1248612.765 4823630.510, 12486...	363100 Woodend-Greenhills	710.0	600.0	630.0	620.0	610.0	630.0	650.0	660.0	660.0	363100
1593	364000	Motunau Island	Motunau Island	0.081474	0.081474	1427.156234	POLYGON ((1606233.915 5232188.150, 1606183.115...	364000 Motunau Island	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	364000

1594 rows × 18 columns

so, from now on…

we work with our sa2 dataframe remember, this is special a special kind of dataframe (or common shorthand df) ZOOM POLL: What is the difference between a gpd and a df?

print(type(sa2))

<class 'geopandas.geodataframe.GeoDataFrame'>

Because this is this special kind of dataframe, we can take a lot (I really mean it: A LOT) of shortcuts; such as plotting the data as a map with just these 3 lines

m = sa2.explore("population_in_year_2022", legend=True)
m.save("index_folium.html")
m

Make this Notebook Trusted to load map: File -> Trust Notebook

you might get an error ImportError: The 'folium', 'matplotlib' and 'mapclassify' packages are required for 'explore()'. You can install them using 'conda install -c conda-forge folium matplotlib mapclassify' or 'pip install folium matplotlib mapclassify'. so do what is suggested pip install folium matplotlib mapclassify

Next steps

list what we want to do