we use a dataset from StatsNZ, which is the 2022 Census dataset
in plain English:
this dataset contains information about the population of New Zealand
for our use-case, we use the Statistical Area 2 (SA2) dataset. This means that each geographical cluster of around 1000-4000 people is assigned to one SA2
an SA2 is meant to represent a ‘community of place’ where people interact together socially and economically
in other words (but dependent on each use-case), we can assume(!) that the people within one SA2 are similar in terms of their socio-economic status
More details:
- we will refer to SA (Statistical Areas, as provided by StatsNZ)
- there are 3 levels of SA: SA1, SA2, SA3
- SA1 is the smallest (about 100-500 residents), SA2 is in between (1k-4k residents), SA3 is the largest (5k-50k residents)
- we will use the medium one (SA2) for this analysis
Let’s have a look at the tool that we are about to use and why
Python because it is…
… relatively easy to learn and use
… a general-purpose programming language, which means it can be used to build just about anything
We download the data
there are various ways of doing this
we can go to the website and press the download button
we can use command-line tools directly from here, such as wget or curl to download the data
here, we will download the data with wget from a GitHub repository that hosts an optimised derivative of the original dataset
The following is very commonly done: we import libraries/packages. To minimise typing effort, we give them short names (aliases)
import geopandas as gpd
import pandas as pd
So this failed. Why do you think that happened?
# !pip install geopandas
Requirement already satisfied: geopandas in /opt/homebrew/lib/python3.11/site-packages (0.14.4)
Requirement already satisfied: fiona>=1.8.21 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.9.6)
Requirement already satisfied: numpy>=1.22 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.24.3)
Requirement already satisfied: packaging in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from geopandas) (23.2)
Requirement already satisfied: pandas>=1.4.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (1.5.3)
Requirement already satisfied: pyproj>=3.3.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (3.6.1)
Requirement already satisfied: shapely>=1.8.0 in /opt/homebrew/lib/python3.11/site-packages (from geopandas) (2.0.4)
Requirement already satisfied: attrs>=19.2.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (23.1.0)
Requirement already satisfied: certifi in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (2022.12.7)
Requirement already satisfied: click~=8.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (8.1.7)
Requirement already satisfied: click-plugins>=1.0 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (1.1.1)
Requirement already satisfied: cligj>=0.5 in /opt/homebrew/lib/python3.11/site-packages (from fiona>=1.8.21->geopandas) (0.7.2)
Requirement already satisfied: six in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from fiona>=1.8.21->geopandas) (1.16.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /Users/jbri364/Library/Python/3.11/lib/python/site-packages (from pandas>=1.4.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/homebrew/lib/python3.11/site-packages (from pandas>=1.4.0->geopandas) (2023.3)
so let’s try again
import geopandas as gpd
import pandas as pd
We now use Geopandas
sa2 is a GeoDataFrame: a special kind of pandas DataFrame provided by geopandas (imported as gpd)
.dropna is a means of filtering out data (here, rows) that contain an n/a value, in other words: rows with missing elements that would otherwise give us challenges
on a bigger-picture perspective: it is hard to calculate the average of a column if there are missing values in it; avg(2, 4, NaN) = NaN
here, this is only applied to the geometry column, which is the column that contains the geospatial data
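As a minimal sketch of what this does (using a tiny stand-in DataFrame with the dataset’s column names, since the real sa2 GeoDataFrame is loaded from the shapefile):

```python
import pandas as pd

# the motivating problem: arithmetic with NaN propagates NaN
print(2 + 4 + float("nan"))  # nan

# tiny stand-in for the sa2 GeoDataFrame (values are illustrative only)
sa2 = pd.DataFrame({
    "SA22023_V1": ["100100", "100301", "100400"],
    "geometry": ["POLYGON(...)", None, "POLYGON(...)"],
})

# keep only the rows whose geometry is present
sa2 = sa2.dropna(subset=["geometry"])
print(len(sa2))  # 2
```

The subset argument restricts the check to the geometry column, so missing values elsewhere would not remove a row.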
# give me only the rows in sa2 that have a geometry of NaN
sa2[sa2["geometry"].isna()]
Empty GeoDataFrame
Columns: [SA22023_V1, SA22023__1, SA22023__2, LAND_AREA_, AREA_SQ_KM, Shape_Leng, geometry]
Index: []
# only show me rows where SA22023__1 and SA22023__2 differ
sa2[sa2["SA22023__1"] != sa2["SA22023__2"]]
      SA22023_V1                 SA22023__1                 SA22023__2  LAND_AREA_  AREA_SQ_KM     Shape_Leng  geometry
10        104301  Ōpua (Far North District)  Opua (Far North District)    5.602473    5.827312   19213.437576  POLYGON ((1699472.895 6094086.681, 1699528.640...
14        104800       Mangakahia-Hūkerenui       Mangakahia-Hukerenui  659.254330  659.254330  183365.297157  POLYGON ((1709183.206 6068629.847, 1709173.331...
19        105400               Maungatāpere               Maungatapere  174.482190  174.482190   93344.845747  POLYGON ((1711760.225 6046448.026, 1711747.134...
20        105601         Matapouri-Tutukākā         Matapouri-Tutukaka   78.527938   78.587007   86800.477007  MULTIPOLYGON (((1740046.334 6057828.894, 17399...
43        108000                     Pātaua                     Pataua  128.707987  128.707987  129154.856928  MULTIPOLYGON (((1740486.853 6046295.860, 17405...
...          ...                        ...                        ...         ...         ...            ...  ...
2265      347000             Wānaka Central             Wanaka Central    7.562054    7.562054   12758.648935  POLYGON ((1296155.366 5043593.460, 1296353.267...
2267      347101                 Lake Hāwea                 Lake Hawea    3.655589    3.655589    9016.577926  POLYGON ((1302855.368 5051689.464, 1302774.822...
2335      357200  Inland water Lake Te Ānau  Inland water Lake Te Anau    0.000000  344.402829  361810.765779  POLYGON ((1199680.502 5012394.655, 1199729.683...
2341      357501                    Te Ānau                    Te Anau    6.637305    6.657040   12967.743524  POLYGON ((1186662.007 4956066.127, 1186763.928...
2346      358000             Ōhai-Nightcaps             Ohai-Nightcaps  948.798919  948.798919  174165.196492  POLYGON ((1232653.066 4911665.243, 1229777.053...

175 rows × 7 columns
now it becomes apparent what the difference might be; let’s have a look at the original dataset’s column names:
- SA22023_V1_00_NAME
- SA22023_V1_00_NAME_ASCII
So these are two ways of encoding characters; this is a common issue in data processing, and providing both options facilitates working with the data
We do not want to include the Chatham Islands, as they are not part of the main landmass of New Zealand and fewer than 800 people live there.
We want to make sure that we don’t have empty SA2s (that is unlikely here, but for other ‘resolutions’/approaches to classifying geospatial data there might be a use for them, e.g. a marine buoy locations dataset)
# only show me rows where LAND_AREA_ column is zero
sa2[sa2["LAND_AREA_"] == 0]
      SA22023_V1                       SA22023__1                       SA22023__2  LAND_AREA_    AREA_SQ_KM    Shape_Leng  geometry
2         100301        Inlets Far North District        Inlets Far North District         0.0    623.222037  1.349364e+06  MULTIPOLYGON (((1620023.410 6084653.062, 16197...
16        105001  Inlets other Whangarei District  Inlets other Whangarei District         0.0     38.949878  2.012673e+05  MULTIPOLYGON (((1737958.662 6046745.046, 17380...
24        111000     Oceanic Auckland Region West     Oceanic Auckland Region West         0.0   2384.034569  2.793983e+05  POLYGON ((1724020.922 5929713.851, 1724095.069...
30        112001            Inlets other Auckland            Inlets other Auckland         0.0    122.490416  5.958464e+05  MULTIPOLYGON (((1791662.172 5909541.528, 17916...
44        108400          Inlet Whangārei Harbour           Inlet Whangarei Harbour        0.0    103.520564  1.658690e+05  POLYGON ((1722383.512 6044397.233, 1722876.556...
...          ...                              ...                              ...         ...           ...           ...  ...
2380      363500            Oceanic Nelson Region            Oceanic Nelson Region         0.0    787.574671  1.537628e+05  POLYGON ((1630452.435 5482785.080, 1638439.661...
2381      363600       Oceanic Marlborough Region       Oceanic Marlborough Region         0.0   5204.239555  6.030516e+05  POLYGON ((1735069.340 5471565.641, 1731082.409...
2382      363700         Oceanic Southland Region         Oceanic Southland Region         0.0  22404.339718  2.702862e+06  POLYGON ((1201178.956 5079126.036, 1201206.109...
2383      363800        Oceanic Canterbury Region        Oceanic Canterbury Region         0.0  11441.928693  1.176106e+06  POLYGON ((1686900.717 5353231.610, 1704528.061...
2384      363900             Oceanic Otago Region             Oceanic Otago Region         0.0   6538.077281  7.359200e+05  POLYGON ((1453579.263 5021995.144, 1474465.101...

83 rows × 7 columns
we can combine these filtering steps into one prompt and see the result
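The combined filter might look something like this. It is a sketch on a tiny stand-in DataFrame (the real sa2 GeoDataFrame is loaded from the shapefile, and the row values here, including the "Hypothetical area" row, are made up), but the column names follow the dataset above:

```python
import pandas as pd

# tiny stand-in for the sa2 (Geo)DataFrame, using the dataset's column names
sa2 = pd.DataFrame({
    "SA22023_V1": ["100100", "100301", "400013", "999999"],
    "SA22023__2": ["North Cape", "Inlets Far North District",
                   "Chatham Islands", "Hypothetical area"],
    "LAND_AREA_": [1004.9, 0.0, 963.0, 12.3],
    "geometry":   ["POLYGON(...)", "MULTIPOLYGON(...)", "POLYGON(...)", None],
})

# combine all three conditions into one boolean mask
mask = (
    sa2["geometry"].notna()                                 # no missing geometry
    & (sa2["LAND_AREA_"] > 0)                               # no zero-land-area SA2s
    & ~sa2["SA22023__2"].str.contains("Chatham", na=False)  # no Chatham Islands
)
sa2_clean = sa2[mask]
print(sa2_clean["SA22023__2"].tolist())  # ['North Cape']
```

Each condition produces a True/False column; the & operator combines them so a row survives only if all three are True.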
Well done, we now have our data in a format that we can work with. For each SA2, we have the name and the geospatial extent (the geometry: the area, the shape length, and a polygon that bounds it)
Next: we want to enhance the dataset with the population per SA2. While we downloaded the first dataset from a GitHub repository (it might as well have been any other server) using the wget command-line tool, we will now use the pandas library to read in a CSV file from a local directory
population = pd.read_csv("Data2024Assets/population_by_SA2.csv")
population
                                  Area    1996    2001    2006    2013    2018    2019    2020    2021    2022
0                    100100 North Cape  1710.0  1520.0  1380.0  1470.0  1660.0  1690.0  1750.0  1800.0  1820.0
1              100200 Rangaunu Harbour  2050.0  2100.0  2070.0  2200.0  2410.0  2470.0  2580.0  2620.0  2640.0
2     100300 Inlets Far North district   190.0   140.0    70.0    60.0    50.0    50.0    50.0    40.0    40.0
3            100400 Karikari Peninsula   860.0  1040.0   970.0  1280.0  1300.0  1300.0  1360.0  1400.0  1410.0
4                      100500 Tangonge   940.0  1010.0  1090.0  1240.0  1180.0  1180.0  1230.0  1240.0  1260.0
...                                ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2256             400013 Snares Islands     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2257   400014 Oceanic Antipodes Islands    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2258          400015 Antipodes Islands     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2259            400016 Ross Dependency     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
2260                               NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

2261 rows × 10 columns
this brings us to a common challenge with combining datasets from different sources
the SA2 dataset has relevant data split into several columns
SA22023_V1 is a predefined, unique ID; defining such IDs is a common practice in data processing!
SA22023__1 is an easily understandable name
SA22023__2 is the same name but in ASCII encoding (special characters matter, remember?)
the population dataset has similar information combined in the Area column
it starts with the numeric code
then the human readable name
Do you have any idea how we can efficiently work with such datasets?
ZOOM POLL
for now, we assume that the unique ID makes it a lot easier to find the corresponding data in the two datasets
if someone has already defined such a ‘key’, we don’t have to worry about the encoding of the name (special characters and all)
# Extract ID from Area col
population['SA2'] = population['Area'].str.extract(r'(\d+)')
population
                                  Area    1996    2001    2006    2013    2018    2019    2020    2021    2022     SA2
0                    100100 North Cape  1710.0  1520.0  1380.0  1470.0  1660.0  1690.0  1750.0  1800.0  1820.0  100100
1              100200 Rangaunu Harbour  2050.0  2100.0  2070.0  2200.0  2410.0  2470.0  2580.0  2620.0  2640.0  100200
2     100300 Inlets Far North district   190.0   140.0    70.0    60.0    50.0    50.0    50.0    40.0    40.0  100300
3            100400 Karikari Peninsula   860.0  1040.0   970.0  1280.0  1300.0  1300.0  1360.0  1400.0  1410.0  100400
4                      100500 Tangonge   940.0  1010.0  1090.0  1240.0  1180.0  1180.0  1230.0  1240.0  1260.0  100500
...                                ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2256             400013 Snares Islands     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0  400013
2257   400014 Oceanic Antipodes Islands    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0  400014
2258          400015 Antipodes Islands     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0  400015
2259            400016 Ross Dependency     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0  400016
2260                               NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

2261 rows × 11 columns
Wait! What?
let’s have a look what we did here
we created a new column SA2 in the population dataset
we used the Area column as an input
for example 100100 North Cape
but we didn’t copy the whole content from Area into SA2, we only took some of it
NOW: This is a very important step and applies to so many other datasets and research questions! Please pay special attention: always think of the outliers!
we could have simply assumed that each of these IDs is 6 digits long
but that might not be the case
in a different context: a University of Auckland ID often (!) consists of 3 characters followed by 3 numbers
often! is the key to the key (pun intended) here
if some of these character combinations are taken frequently, 3 characters might not be enough
etc
so, what do we do to minimise the risk of errors?
we use something called Regex or Regular Expression
keeping the scope of this session in mind, think of it as a defined set of rules that we can apply to a string to find/match things
r'(\d+)': This is a raw string containing a regular expression.
\d: Matches any digit (0-9).
+: Matches one or more of the preceding element (in this case, one or more digits).
(): Capturing group, indicating that we want to extract the digits.
by using this, we can extract digits (one or more of them) from a string
and we put them together in one capture group
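To see the regex behave on an outlier, here is a minimal sketch (the area strings, including the one without a leading code, are hypothetical):

```python
import pandas as pd

# hypothetical area strings, including one without a leading numeric code
areas = pd.Series(["100100 North Cape", "400016 Ross Dependency", "no code here"])

# extract the first run of one or more digits into a capture group;
# [0] selects the first (and only) capture group as a Series
codes = areas.str.extract(r'(\d+)')[0]
print(codes.tolist())
```

A string that contains no digits simply yields NaN rather than raising an error, which is exactly the outlier behaviour worth checking for before merging.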
Let’s compare the data types in the columns:
- Area, which came with the dataset
- and SA2, which we just created
- we will use the first row for the sake of simplicity; important side-note: Python is zero-indexed, more in the Python session on Wednesday
# Selecting the first row
first_row = population.iloc[0]
# Inspecting datatypes of the first row elements
first_row_dtypes = first_row.apply(type)
print(first_row_dtypes)
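Why the data types matter for the upcoming merge: str.extract always produces strings, even when the match looks numeric. A sketch on one hypothetical row:

```python
import pandas as pd

# minimal reconstruction of the extraction step on one hypothetical row
population = pd.DataFrame({"Area": ["100100 North Cape"]})
population["SA2"] = population["Area"].str.extract(r'(\d+)')

# both the source text and the extracted code are strings
print(type(population["Area"].iloc[0]), type(population["SA2"].iloc[0]))

# if the other dataset's key column were numeric, we would convert first
population["SA2"] = population["SA2"].astype(int)
print(population["SA2"].iloc[0] == 100100)  # True
```

Merging a string key against a numeric key silently matches nothing, so checking (and, if needed, converting) the types first saves a lot of debugging.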
we would just have some numbers as column names; while this might be totally acceptable for us now, what if we look at the data 5 years from now, or someone else does? there are two ways to address this:
- we can rename the columns
- we can accompany our dataset and code with a readme file (you might have participated in the session Keeping Your Spreadsheets Tidy)
- which one to pick depends
- for mechanical engineering, it might be unnecessarily hard to keep all the units (which might in turn have special characters or fractions, etc.) in the column names; here a readme.txt file is far superior
- for our given use-case, we will rename the columns by adding some additional text to the front, i.e. a prefix
- let’s find the relevant column numbers (remember: Python is zero-indexed!)
# Display the index number of each column
for index, column in enumerate(population.columns):
    print(f'Index: {index}, Column: {column}')
# Add a prefix to the right dataframe's columns (excluding the merge key)
prefix = 'population_in_year_'
population = population.rename(columns={col: prefix + col for col in population.columns[1:10]})
population
                                  Area  population_in_year_1996  population_in_year_2001  population_in_year_2006  population_in_year_2013  population_in_year_2018  population_in_year_2019  population_in_year_2020  population_in_year_2021  population_in_year_2022     SA2
0                    100100 North Cape                   1710.0                   1520.0                   1380.0                   1470.0                   1660.0                   1690.0                   1750.0                   1800.0                   1820.0  100100
1              100200 Rangaunu Harbour                   2050.0                   2100.0                   2070.0                   2200.0                   2410.0                   2470.0                   2580.0                   2620.0                   2640.0  100200
2     100300 Inlets Far North district                    190.0                    140.0                     70.0                     60.0                     50.0                     50.0                     50.0                     40.0                     40.0  100300
3            100400 Karikari Peninsula                    860.0                   1040.0                    970.0                   1280.0                   1300.0                   1300.0                   1360.0                   1400.0                   1410.0  100400
4                      100500 Tangonge                    940.0                   1010.0                   1090.0                   1240.0                   1180.0                   1180.0                   1230.0                   1240.0                   1260.0  100500
...                                ...                      ...                      ...                      ...                      ...                      ...                      ...                      ...                      ...                      ...     ...
2256             400013 Snares Islands                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0  400013
2257   400014 Oceanic Antipodes Islands                     0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0  400014
2258          400015 Antipodes Islands                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0  400015
2259            400016 Ross Dependency                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0                      0.0  400016
2260                               NaN                      NaN                      NaN                      NaN                      NaN                      NaN                      NaN                      NaN                      NaN                      NaN     NaN

2261 rows × 11 columns
Ready, steady: Merge… Wait…
OK, almost there
Let’s wait again
How to merge the two datasets and how to make sure that the right data is in the right place?
In other words: How can we explicitly state that we want to look for the column SA22023_V1 in the SA2 dataset and the column we named SA2 in the population dataset?
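A sketch of the merge itself, on tiny stand-in frames (the real sa2 GeoDataFrame and population DataFrame are built above from the downloaded files, and the values here are illustrative only):

```python
import pandas as pd

# tiny stand-ins for the two datasets, with illustrative values
sa2 = pd.DataFrame({"SA22023_V1": ["100100", "100200"],
                    "LAND_AREA_": [1004.9, 364.8]})
population = pd.DataFrame({"SA2": ["100100", "100200"],
                           "population_in_year_2022": [1820.0, 2640.0]})

# merge on the two differently named key columns;
# how="left" keeps every row of sa2, even if a key has no population match
merged = sa2.merge(population, left_on="SA22023_V1", right_on="SA2", how="left")
print(merged["population_in_year_2022"].tolist())  # [1820.0, 2640.0]
```

left_on/right_on is exactly how we state which column to look for in which dataset.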
we work with our sa2 dataframe; remember, this is a special kind of dataframe (or, common shorthand, df). ZOOM POLL: What is the difference between a GeoDataFrame (gpd) and a DataFrame (df)?
print(type(sa2))
<class 'geopandas.geodataframe.GeoDataFrame'>
Because this is a special kind of dataframe, we can take a lot (I really mean it: A LOT) of shortcuts, such as plotting the data as a map with just these 3 lines
m = sa2.explore("population_in_year_2022", legend=True)
m.save("index_folium.html")
m
(interactive folium map output; the notebook must be trusted for the map to display)