Goals for today

(ChatGPT’s take on leaflet vs. folium vs. geopandas)

Some background information on the dataset

  • we use a dataset from StatsNZ that contains population figures per area for a range of years up to 2022
  • in plain English:
    • this dataset contains information about the population of New Zealand
    • for our use-case, we use the Statistical Area 2 (SA2) level. This means the country is divided into geographical clusters of roughly 1,000-4,000 people, and each such cluster is one SA2
    • an SA2 is intended to represent a ‘community of place’ where people interact socially and economically
    • in other words (but dependent on each use-case), we can assume(!) that the people in one SA2 are broadly similar in terms of their socio-economic status

More details:

  • we will refer to SA (Statistical Areas, as provided by StatsNZ)
  • there are 3 levels of SA: SA1, SA2, SA3
  • SA1 is the smallest (about 100-500 residents), SA2 sits in between (1k-4k residents), and SA3 is the largest (5k-50k residents)
  • we will use the medium one (SA2) for this analysis

Let’s have a look at the tool that we are about to use and why

Python because it is…

  • … relatively easy to learn and use
  • … a general-purpose programming language, which means it can be used to build just about anything

We download the data

  • there are various ways of doing this
    • we can go to the website and press the download button
    • we can use command-line tools such as wget or curl to download the data directly from here (i.e. from within the notebook)
    • here, we will use wget to download the data from a GitHub repository that hosts an optimised derivative of the original dataset
!wget "https://github.com/UoA-eResearch/SA2_2022_population/raw/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip"
--2024-06-20 15:23:11--  https://github.com/UoA-eResearch/SA2_2022_population/raw/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip
Resolving github.com (github.com)... 20.248.137.48
Connecting to github.com (github.com)|20.248.137.48|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/UoA-eResearch/SA2_2022_population/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip [following]
--2024-06-20 15:23:11--  https://raw.githubusercontent.com/UoA-eResearch/SA2_2022_population/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6090234 (5.8M) [application/zip]
Saving to: ‘statistical-area-2-2023-generalised_simplified_22.3%.zip.3’

statistical-area-2- 100%[===================>]   5.81M  5.63MB/s    in 1.0s    

2024-06-20 15:23:12 (5.63 MB/s) - ‘statistical-area-2-2023-generalised_simplified_22.3%.zip.3’ saved [6090234/6090234]
# wget -O output.zip "https://github.com/UoA-eResearch/SA2_2022_population/raw/main/statistical-area-2-2023-generalised_simplified_22.3%25.zip"
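
The commented line above shows an alternative wget call that saves the file under an explicit name (output.zip). If wget is not available, the same download can be done in Python itself; the following is a minimal sketch using only the standard library (the URL is the one from above, the local filename is our own choice):

import urllib.request

url = ("https://github.com/UoA-eResearch/SA2_2022_population/raw/main/"
       "statistical-area-2-2023-generalised_simplified_22.3%25.zip")

# download the zip archive and save it next to the notebook
urllib.request.urlretrieve(url, "statistical-area-2-2023-generalised_simplified_22.3%.zip")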

We import these libraries

The following is very commonly done: we import libraries/packages/… To minimise typing effort, we give them short names (aliases).

import geopandas as gpd
import pandas as pd

So this failed. Why do you think that happened?

# !pip install geopandas
Requirement already satisfied: geopandas in /opt/homebrew/lib/python3.11/site-packages (0.14.4)
Requirement already satisfied: fiona>=1.8.21 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.9.6)
Requirement already satisfied: numpy>=1.22 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.24.3)
Requirement already satisfied: packaging in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from geopandas) (23.2)
Requirement already satisfied: pandas>=1.4.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.5.3)
Requirement already satisfied: pyproj>=3.3.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (3.6.1)
Requirement already satisfied: shapely>=1.8.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (2.0.4)
Requirement already satisfied: attrs>=19.2.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (23.1.0)
Requirement already satisfied: certifi in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (2022.12.7)
Requirement already satisfied: click~=8.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (8.1.7)
Requirement already satisfied: click-plugins>=1.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (1.1.1)
Requirement already satisfied: cligj>=0.5 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (0.7.2)
Requirement already satisfied: six in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from fiona>=1.8.21->geopandas) (1.16.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from pandas>=1.4.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/homebrew/lib/python3.11/site-packages (from pandas>=1.4.0->geopandas) (2023.3)

so let’s try again

import geopandas as gpd
import pandas as pd

We now use Geopandas

  • sa2 will be a GeoDataFrame (created via gpd.read_file), which is a special kind of pandas DataFrame
  • .dropna is a means of filtering out data (here: rows) that contain an n/a — in other words, rows with missing elements that would otherwise give us challenges
    • from a bigger-picture perspective: it is hard to calculate the average of a column if there are missing values in it; avg(2 + 4 + NaN) = NaN
    • via subset="geometry", this is only applied to the geometry column, which is the column that contains the geospatial data
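
As a tiny aside on the missing-values point above (a sketch for illustration only, not part of the workflow):

import numpy as np
import pandas as pd

values = [2, 4, np.nan]

# with plain/naive arithmetic, the NaN propagates and the result is NaN
print(np.mean(values))           # nan

# pandas skips NaN by default when averaging, but a row with a missing
# geometry still cannot be drawn on a map, hence the dropna below
print(pd.Series(values).mean())  # 3.0

Now back to reading in our data: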
sa2 = gpd.read_file("statistical-area-2-2023-generalised_simplified_22.3%.zip").dropna(subset="geometry")
sa2
SA22023_V1 SA22023__1 SA22023__2 LAND_AREA_ AREA_SQ_KM Shape_Leng geometry
0 100100 North Cape North Cape 829.188012 829.188012 4.160392e+05 MULTIPOLYGON (((1614055.808 6154067.023, 16141...
1 100200 Rangaunu Harbour Rangaunu Harbour 274.004295 274.004295 1.490891e+05 MULTIPOLYGON (((1624166.328 6127258.658, 16243...
2 100301 Inlets Far North District Inlets Far North District 0.000000 623.222037 1.349364e+06 MULTIPOLYGON (((1620023.410 6084653.062, 16197...
3 100400 Karikari Peninsula Karikari Peninsula 174.440751 174.440751 1.401491e+05 MULTIPOLYGON (((1626838.499 6126965.133, 16270...
4 100500 Tangonge Tangonge 177.182117 177.182117 1.014669e+05 POLYGON ((1623516.265 6119580.907, 1623492.213...
... ... ... ... ... ... ... ...
2381 363600 Oceanic Marlborough Region Oceanic Marlborough Region 0.000000 5204.239555 6.030516e+05 POLYGON ((1735069.340 5471565.641, 1731082.409...
2382 363700 Oceanic Southland Region Oceanic Southland Region 0.000000 22404.339718 2.702862e+06 POLYGON ((1201178.956 5079126.036, 1201206.109...
2383 363800 Oceanic Canterbury Region Oceanic Canterbury Region 0.000000 11441.928693 1.176106e+06 POLYGON ((1686900.717 5353231.610, 1704528.061...
2384 363900 Oceanic Otago Region Oceanic Otago Region 0.000000 6538.077281 7.359200e+05 POLYGON ((1453579.263 5021995.144, 1474465.101...
2385 364000 Motunau Island Motunau Island 0.081474 0.081474 1.427156e+03 POLYGON ((1606233.915 5232188.150, 1606183.115...

2379 rows × 7 columns

let’s perform some sanity checks

# give me only the rows in sa2 that have a geometry of NaN
sa2[sa2["geometry"].isna()]
SA22023_V1 SA22023__1 SA22023__2 LAND_AREA_ AREA_SQ_KM Shape_Leng geometry
# only show me rows where SA22023__1 and SA22023__2 differ
sa2[sa2["SA22023__1"] != sa2["SA22023__2"]]
SA22023_V1 SA22023__1 SA22023__2 LAND_AREA_ AREA_SQ_KM Shape_Leng geometry
10 104301 Ōpua (Far North District) Opua (Far North District) 5.602473 5.827312 19213.437576 POLYGON ((1699472.895 6094086.681, 1699528.640...
14 104800 Mangakahia-Hūkerenui Mangakahia-Hukerenui 659.254330 659.254330 183365.297157 POLYGON ((1709183.206 6068629.847, 1709173.331...
19 105400 Maungatāpere Maungatapere 174.482190 174.482190 93344.845747 POLYGON ((1711760.225 6046448.026, 1711747.134...
20 105601 Matapouri-Tutukākā Matapouri-Tutukaka 78.527938 78.587007 86800.477007 MULTIPOLYGON (((1740046.334 6057828.894, 17399...
43 108000 Pātaua Pataua 128.707987 128.707987 129154.856928 MULTIPOLYGON (((1740486.853 6046295.860, 17405...
... ... ... ... ... ... ... ...
2265 347000 Wānaka Central Wanaka Central 7.562054 7.562054 12758.648935 POLYGON ((1296155.366 5043593.460, 1296353.267...
2267 347101 Lake Hāwea Lake Hawea 3.655589 3.655589 9016.577926 POLYGON ((1302855.368 5051689.464, 1302774.822...
2335 357200 Inland water Lake Te Ānau Inland water Lake Te Anau 0.000000 344.402829 361810.765779 POLYGON ((1199680.502 5012394.655, 1199729.683...
2341 357501 Te Ānau Te Anau 6.637305 6.657040 12967.743524 POLYGON ((1186662.007 4956066.127, 1186763.928...
2346 358000 Ōhai-Nightcaps Ohai-Nightcaps 948.798919 948.798919 174165.196492 POLYGON ((1232653.066 4911665.243, 1229777.053...

175 rows × 7 columns

now it becomes apparent what the difference is; let’s have a look at the original dataset’s column names:

  • SA22023_V1_00_NAME
  • SA22023_V1_00_NAME_ASCII

So the names are provided in two character encodings (with and without the special characters). This is a common issue in data processing, and by giving both options, StatsNZ facilitates the process.
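
As a small illustration of how such an ASCII variant can be produced (a sketch using only the standard library; we don’t need this here, because the dataset already ships both columns):

import unicodedata

def to_ascii(text):
    # decompose macron characters (e.g. "Ō" -> "O" + combining macron),
    # then drop everything that is not plain ASCII
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

print(to_ascii("Ōpua (Far North District)"))  # Opua (Far North District)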

We do not want to include the Chatham Islands, as they are not part of the main landmass of New Zealand and fewer than 800 people live there.


We also want to make sure that we don’t keep ‘empty’ SA2 areas, i.e. those without any land area (for our use-case they are not useful, but for other ‘resolutions’/approaches to classifying geospatial data there might be a use for them, e.g. a Marine Buoy Locations dataset).

# only show me rows where LAND_AREA_ column is zero
sa2[sa2["LAND_AREA_"] == 0]
SA22023_V1 SA22023__1 SA22023__2 LAND_AREA_ AREA_SQ_KM Shape_Leng geometry
2 100301 Inlets Far North District Inlets Far North District 0.0 623.222037 1.349364e+06 MULTIPOLYGON (((1620023.410 6084653.062, 16197...
16 105001 Inlets other Whangarei District Inlets other Whangarei District 0.0 38.949878 2.012673e+05 MULTIPOLYGON (((1737958.662 6046745.046, 17380...
24 111000 Oceanic Auckland Region West Oceanic Auckland Region West 0.0 2384.034569 2.793983e+05 POLYGON ((1724020.922 5929713.851, 1724095.069...
30 112001 Inlets other Auckland Inlets other Auckland 0.0 122.490416 5.958464e+05 MULTIPOLYGON (((1791662.172 5909541.528, 17916...
44 108400 Inlet Whangārei Harbour Inlet Whangarei Harbour 0.0 103.520564 1.658690e+05 POLYGON ((1722383.512 6044397.233, 1722876.556...
... ... ... ... ... ... ... ...
2380 363500 Oceanic Nelson Region Oceanic Nelson Region 0.0 787.574671 1.537628e+05 POLYGON ((1630452.435 5482785.080, 1638439.661...
2381 363600 Oceanic Marlborough Region Oceanic Marlborough Region 0.0 5204.239555 6.030516e+05 POLYGON ((1735069.340 5471565.641, 1731082.409...
2382 363700 Oceanic Southland Region Oceanic Southland Region 0.0 22404.339718 2.702862e+06 POLYGON ((1201178.956 5079126.036, 1201206.109...
2383 363800 Oceanic Canterbury Region Oceanic Canterbury Region 0.0 11441.928693 1.176106e+06 POLYGON ((1686900.717 5353231.610, 1704528.061...
2384 363900 Oceanic Otago Region Oceanic Otago Region 0.0 6538.077281 7.359200e+05 POLYGON ((1453579.263 5021995.144, 1474465.101...

83 rows × 7 columns

we can combine this into one prompt and see the result

sa2 = sa2[(sa2.SA22023__1 != "Chatham Islands") & (sa2.LAND_AREA_ > 0)].copy()
sa2
SA22023_V1 SA22023__1 SA22023__2 LAND_AREA_ AREA_SQ_KM Shape_Leng geometry
0 100100 North Cape North Cape 829.188012 829.188012 416039.151064 MULTIPOLYGON (((1614055.808 6154067.023, 16141...
1 100200 Rangaunu Harbour Rangaunu Harbour 274.004295 274.004295 149089.110095 MULTIPOLYGON (((1624166.328 6127258.658, 16243...
3 100400 Karikari Peninsula Karikari Peninsula 174.440751 174.440751 140149.099270 MULTIPOLYGON (((1626838.499 6126965.133, 16270...
4 100500 Tangonge Tangonge 177.182117 177.182117 101466.900183 POLYGON ((1623516.265 6119580.907, 1623492.213...
5 100900 Rangitihi Rangitihi 84.539982 84.539982 60636.302194 POLYGON ((1629314.862 6117970.968, 1629371.783...
... ... ... ... ... ... ... ...
2374 362601 Heidelberg Heidelberg 1.308968 1.308968 5367.870180 POLYGON ((1245892.259 4848174.153, 1245922.973...
2375 363001 Clifton-Kew Clifton-Kew 3.770832 3.770832 11890.253381 POLYGON ((1243666.689 4847254.124, 1243677.447...
2376 363100 Woodend-Greenhills Woodend-Greenhills 173.600766 173.600766 166674.352079 MULTIPOLYGON (((1248612.765 4823630.510, 12486...
2378 363301 Bluff Bluff 10.136949 10.290040 23162.004392 POLYGON ((1244729.237 4827495.819, 1244667.973...
2385 364000 Motunau Island Motunau Island 0.081474 0.081474 1427.156234 POLYGON ((1606233.915 5232188.150, 1606183.115...

2295 rows × 7 columns

Section Break!

Well done, we now have our data in a format that we can work with. For each SA2, we have the name, the area and shape length, and the geospatial extent (the geometry: a polygon, or set of polygons, that surrounds it/boxes it in).

Next: we want to enhance the dataset with the population per SA2. While we downloaded the first dataset from a GitHub repository (but that might have been any other server) using the wget command-line tool, we will now use the pandas library to read in a CSV file from a local directory.

population = pd.read_csv("Data2024Assets/population_by_SA2.csv")
population
Area 1996 2001 2006 2013 2018 2019 2020 2021 2022
0 100100 North Cape 1710.0 1520.0 1380.0 1470.0 1660.0 1690.0 1750.0 1800.0 1820.0
1 100200 Rangaunu Harbour 2050.0 2100.0 2070.0 2200.0 2410.0 2470.0 2580.0 2620.0 2640.0
2 100300 Inlets Far North district 190.0 140.0 70.0 60.0 50.0 50.0 50.0 40.0 40.0
3 100400 Karikari Peninsula 860.0 1040.0 970.0 1280.0 1300.0 1300.0 1360.0 1400.0 1410.0
4 100500 Tangonge 940.0 1010.0 1090.0 1240.0 1180.0 1180.0 1230.0 1240.0 1260.0
... ... ... ... ... ... ... ... ... ... ...
2256 400013 Snares Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2257 400014 Oceanic Antipodes Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2258 400015 Antipodes Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2259 400016 Ross Dependency 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2260 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2261 rows × 10 columns

this brings us to a common challenge with combining datasets from different sources

  • the SA2 dataset has the relevant data split into several columns
    • SA22023_V1 is a predefined ID; this is a common practice in data processing! It is unique
    • SA22023__1 is an easily understandable name
    • SA22023__2 is the same name but in ASCII encoding (special characters matter, remember?)
  • the population dataset has similar information combined in the Area column
    • it starts with the numeric code
    • then comes the human-readable name

Do you have any idea how we can efficiently work with such datasets?

ZOOM POLL

  • for now, we assume that the unique ID makes it a lot easier to find the corresponding data in the two datasets
  • if someone has already defined such a ‘key’, we don’t have to worry about the encoding of the name (special characters and all)
# Extract ID from Area col
population['SA2'] = population['Area'].str.extract(r'(\d+)')
population
Area 1996 2001 2006 2013 2018 2019 2020 2021 2022 SA2
0 100100 North Cape 1710.0 1520.0 1380.0 1470.0 1660.0 1690.0 1750.0 1800.0 1820.0 100100
1 100200 Rangaunu Harbour 2050.0 2100.0 2070.0 2200.0 2410.0 2470.0 2580.0 2620.0 2640.0 100200
2 100300 Inlets Far North district 190.0 140.0 70.0 60.0 50.0 50.0 50.0 40.0 40.0 100300
3 100400 Karikari Peninsula 860.0 1040.0 970.0 1280.0 1300.0 1300.0 1360.0 1400.0 1410.0 100400
4 100500 Tangonge 940.0 1010.0 1090.0 1240.0 1180.0 1180.0 1230.0 1240.0 1260.0 100500
... ... ... ... ... ... ... ... ... ... ... ...
2256 400013 Snares Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400013
2257 400014 Oceanic Antipodes Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400014
2258 400015 Antipodes Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400015
2259 400016 Ross Dependency 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400016
2260 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2261 rows × 11 columns

Wait! What?

  • let’s have a look at what we did here
  • we created a new column SA2 in the population dataset
  • we used the Area column as an input
    • for example 100100 North Cape
    • but we didn’t copy the whole content from Area into SA2, we only took some of it
    • NOW: This is a very important step and applies to so many other datasets and research questions! Please pay special attention! Always think of the outliers!
      • we could have simply assumed that each of these IDs is 6 digits long
      • but that might not be the case
      • in a different context: a University of Auckland ID often (!) consists of 3 characters followed by 3 numbers
        • often! is the key to the key (pun intended) here
        • if some of these names, for example, are taken frequently, 3 characters might not be enough
        • etc
    • so, what do we do to minimise the risk of errors?
      • we use something called Regex or Regular Expression
      • keeping the scope of this session in mind, think of it as a defined set of rules that we can apply to a string to find/match things
      • r'(\d+)': this is a raw string containing a regular expression
        • \d: matches any digit (0-9)
        • +: matches one or more of the preceding element (in this case, one or more digits)
        • (): a capturing group, indicating that we want to extract the digits
    • by using this, we can extract a sequence of one or more digits from a string
    • and we collect them together in one capture group
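
To make the pattern concrete, here is a small sketch using Python’s built-in re module (the example strings are made up for illustration, they are not rows from our dataset):

import re

examples = [
    "100100 North Cape",      # the usual 6-digit ID
    "7001 Hypothetical Area", # a (made-up) shorter ID would still be caught
    "No digits at all",       # no match
]

for text in examples:
    match = re.search(r"(\d+)", text)
    # match is None when the string contains no digits;
    # pandas' .str.extract returns NaN in that case
    print(text, "->", match.group(1) if match else None)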

Let’s compare the data types in two columns: Area, which came with the dataset, and SA2, which we just created. We will use the first row for the sake of simplicity. Important side-note: Python is zero-indexed; more on that in the Python session on Wednesday.

# Selecting the first row
first_row = population.iloc[0]

# Inspecting datatypes of the first row elements
first_row_dtypes = first_row.apply(type)
print(first_row_dtypes)
Area              <class 'str'>
1996    <class 'numpy.float64'>
2001    <class 'numpy.float64'>
2006    <class 'numpy.float64'>
2013    <class 'numpy.float64'>
2018    <class 'numpy.float64'>
2019    <class 'numpy.float64'>
2020    <class 'numpy.float64'>
2021    <class 'numpy.float64'>
2022    <class 'numpy.float64'>
SA2               <class 'str'>
Name: 0, dtype: object

this means that both are ‘strings’ that contain a number; numbers stored as text are another common challenge!
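
If the two merge keys had different types (say, integers on one side and strings on the other), the merge further down would fail or silently find no matches. A defensive sketch of how one could align the key types beforehand (using our column names; here both keys already are strings, so this step is optional):

# cast both merge keys to strings so they are guaranteed to be comparable
sa2["SA22023_V1"] = sa2["SA22023_V1"].astype(str)
population["SA2"] = population["SA2"].astype(str)

# quick check that the dtypes now match
print(sa2["SA22023_V1"].dtype, population["SA2"].dtype)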

Now let’s merge the two datasets; but wait, before we do that, let’s have a look at the column names and what this implies

ZOOM Poll: Do you see any challenge if we had these column names after the merge?

  • SA22023_V1, SA22023__1, SA22023__2, LAND_AREA_, AREA_SQ_KM, Shape_Leng, geometry, Area, 1996, 2001, 2006, 2013, 2018, 2019, 2020, 2021, 2022, SA2

we would just have some numbers as column names. While this might be totally acceptable for us now, what if we look at the data 5 years from now, or someone else does? There are two ways to address this:

  • we can rename the columns
  • we can accompany our dataset and code with a readme file (you might have participated in the session Keeping Your Spreadsheets Tidy)

Which one to pick depends:

  • for mechanical engineering, it might be unnecessarily hard to keep all the units (that might in turn have special characters or fractions, etc.) in the column names; here a readme.txt file is far superior
  • for our given use-case, we will rename the columns by adding some additional text to the front, i.e. a prefix

Let’s find the relevant column numbers (remember: Python is zero-indexed!)

# Display the index number of each column
for index, column in enumerate(population.columns):
    print(f'Index: {index}, Column: {column}')
Index: 0, Column: Area
Index: 1, Column: 1996
Index: 2, Column: 2001
Index: 3, Column: 2006
Index: 4, Column: 2013
Index: 5, Column: 2018
Index: 6, Column: 2019
Index: 7, Column: 2020
Index: 8, Column: 2021
Index: 9, Column: 2022
Index: 10, Column: SA2
# Add a prefix to the right dataframe's columns (excluding the merge key)
prefix = 'population_in_year_'
population= population.rename(columns={col: prefix + col for col in population.columns[1:10]})
population
Area population_in_year_1996 population_in_year_2001 population_in_year_2006 population_in_year_2013 population_in_year_2018 population_in_year_2019 population_in_year_2020 population_in_year_2021 population_in_year_2022 SA2
0 100100 North Cape 1710.0 1520.0 1380.0 1470.0 1660.0 1690.0 1750.0 1800.0 1820.0 100100
1 100200 Rangaunu Harbour 2050.0 2100.0 2070.0 2200.0 2410.0 2470.0 2580.0 2620.0 2640.0 100200
2 100300 Inlets Far North district 190.0 140.0 70.0 60.0 50.0 50.0 50.0 40.0 40.0 100300
3 100400 Karikari Peninsula 860.0 1040.0 970.0 1280.0 1300.0 1300.0 1360.0 1400.0 1410.0 100400
4 100500 Tangonge 940.0 1010.0 1090.0 1240.0 1180.0 1180.0 1230.0 1240.0 1260.0 100500
... ... ... ... ... ... ... ... ... ... ... ...
2256 400013 Snares Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400013
2257 400014 Oceanic Antipodes Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400014
2258 400015 Antipodes Islands 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400015
2259 400016 Ross Dependency 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 400016
2260 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2261 rows × 11 columns

Ready, steady: Merge… Wait…

  • OK, almost there
  • Let’s wait again
  • How to merge the two datasets and how to make sure that the right data is in the right place?
  • In other words: How can we explicitly state that we want to look for the column SA22023_V1 in the SA2 dataset and the column we named SA2 in the population dataset?
  • By specifying it!
sa2 = sa2.merge(population, left_on='SA22023_V1', right_on='SA2')
sa2
SA22023_V1 SA22023__1 SA22023__2 LAND_AREA_ AREA_SQ_KM Shape_Leng geometry Area population_in_year_1996 population_in_year_2001 population_in_year_2006 population_in_year_2013 population_in_year_2018 population_in_year_2019 population_in_year_2020 population_in_year_2021 population_in_year_2022 SA2
0 100100 North Cape North Cape 829.188012 829.188012 416039.151064 MULTIPOLYGON (((1614055.808 6154067.023, 16141... 100100 North Cape 1710.0 1520.0 1380.0 1470.0 1660.0 1690.0 1750.0 1800.0 1820.0 100100
1 100200 Rangaunu Harbour Rangaunu Harbour 274.004295 274.004295 149089.110095 MULTIPOLYGON (((1624166.328 6127258.658, 16243... 100200 Rangaunu Harbour 2050.0 2100.0 2070.0 2200.0 2410.0 2470.0 2580.0 2620.0 2640.0 100200
2 100400 Karikari Peninsula Karikari Peninsula 174.440751 174.440751 140149.099270 MULTIPOLYGON (((1626838.499 6126965.133, 16270... 100400 Karikari Peninsula 860.0 1040.0 970.0 1280.0 1300.0 1300.0 1360.0 1400.0 1410.0 100400
3 100500 Tangonge Tangonge 177.182117 177.182117 101466.900183 POLYGON ((1623516.265 6119580.907, 1623492.213... 100500 Tangonge 940.0 1010.0 1090.0 1240.0 1180.0 1180.0 1230.0 1240.0 1260.0 100500
4 100900 Rangitihi Rangitihi 84.539982 84.539982 60636.302194 POLYGON ((1629314.862 6117970.968, 1629371.783... 100900 Rangitihi 800.0 850.0 830.0 890.0 970.0 970.0 980.0 1000.0 1030.0 100900
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1589 361000 Invercargill Central Invercargill Central 9.182431 9.182431 19256.716523 POLYGON ((1243482.779 4850331.790, 1243525.689... 361000 Invercargill Central 1300.0 1020.0 1190.0 1160.0 1440.0 1460.0 1480.0 1470.0 1460.0 361000
1590 361100 Gladstone (Invercargill City) Gladstone (Invercargill City) 1.269458 1.269458 5366.924897 POLYGON ((1243388.753 4852028.683, 1243411.913... 361100 Gladstone (Invercargill city) 2630.0 2500.0 2530.0 2460.0 2450.0 2470.0 2480.0 2470.0 2490.0 361100
1591 361300 Avenal Avenal 1.427733 1.427733 5326.010603 POLYGON ((1243418.578 4851429.187, 1243466.565... 361300 Avenal 1280.0 1250.0 1180.0 1170.0 1310.0 1340.0 1370.0 1340.0 1350.0 361300
1592 363100 Woodend-Greenhills Woodend-Greenhills 173.600766 173.600766 166674.352079 MULTIPOLYGON (((1248612.765 4823630.510, 12486... 363100 Woodend-Greenhills 710.0 600.0 630.0 620.0 610.0 630.0 650.0 660.0 660.0 363100
1593 364000 Motunau Island Motunau Island 0.081474 0.081474 1427.156234 POLYGON ((1606233.915 5232188.150, 1606183.115... 364000 Motunau Island 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 364000

1594 rows × 18 columns
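
A quick observation: the merged dataframe has 1594 rows, while sa2 had 2295 rows before the merge; by default, merge performs an inner join, so SA2 areas without a matching population row are dropped. If we wanted to check which areas went unmatched before overwriting sa2, one option (a sketch; sa2_geo stands here for the pre-merge GeoDataFrame) is a left join with pandas’ indicator flag:

# keep every SA2 area and flag whether it found a population match
check = sa2_geo.merge(population, left_on="SA22023_V1", right_on="SA2",
                      how="left", indicator=True)

# 'both' = matched, 'left_only' = SA2 area without population data
print(check["_merge"].value_counts())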

so, from now on…

we work with our sa2 dataframe. Remember, this is a special kind of dataframe (or, in common shorthand, df). ZOOM POLL: What is the difference between a gpd (GeoDataFrame) and a df (DataFrame)?

print(type(sa2))
<class 'geopandas.geodataframe.GeoDataFrame'>
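
To make the difference tangible, a small sketch (standard GeoPandas attributes, shown purely for illustration): a GeoDataFrame behaves like a normal DataFrame and additionally carries an active geometry column and a coordinate reference system (CRS).

# everything a plain DataFrame offers still works ...
print(sa2["population_in_year_2022"].describe())

# ... plus the geospatial extras
print(sa2.geometry.head())
print(sa2.crs)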

Because this is a special kind of dataframe, we can take a lot (I really mean it: A LOT) of shortcuts; such as plotting the data as a map with just these 3 lines

m = sa2.explore("population_in_year_2022", legend=True)
m.save("index_folium.html")
m
[interactive folium map: SA2 areas coloured by population_in_year_2022]

You might get an error: ImportError: The 'folium', 'matplotlib' and 'mapclassify' packages are required for 'explore()'. You can install them using 'conda install -c conda-forge folium matplotlib mapclassify' or 'pip install folium matplotlib mapclassify'. So do what is suggested: pip install folium matplotlib mapclassify

Next steps

  • list what we want to do