Making selections of features#

One key element in geographic analysis using vector data is to do some kind of analysis with individual features. However, in many instances we don’t want to do an analysis for all features of a layer, but only for a subset, for which a condition is true. We thus a entering the world of feature selections of features. In Geography, the two most common types of feature selections are (a) selections based on attributes of the features, and (b) selections based on the geographic context of a given feature. I will provide examples for both examples. Before that, let’s load again the layer containing boundaries of eight countries of central Europe.

import os
from osgeo import ogr, osr
folder = "D:/OneDrive - Conservation Biogeography Lab/_TEACHING/__Classes-Modules_HUB/M8_Geoprocessing-with-python/data/"
# Open the shapefile
ds = ogr.Open(folder + "EU_vector/gadm36.shp")
# Get the first layer in the file
lyr = ds.GetLayer()
# Get the name of the attributes
[field.name for field in lyr.schema]
['fid', 'iso_a2', 'NAME', 'FIPS_10_', 'ISO_A3', 'WB_A2', 'WB_A3']

Selections by attribute#

The first type of selection that we are going to look at is the selection based on an attribute. You probably know this feature from numerous applications in ArcGIS or QGIS - here, we are doing the same thing, yet using scripting syntax. The layer we opened has only one attribute ('NAME_0'), for which we will do a couple of selections using the functiion .SetAttributeFilter(), which carries one argument: the rule by which we want to make the selection. For this, we need to keep in mind two important aspects:

  1. The query by which we select features is very similar to SQL syntax,

  2. The query itself has to be prepared in form of a string.

Below, we will show a couple of examples. After each selection, we make a test of whether the selection worked. We can do that by assessing the number of features before and after the selection.

Important

Be aware that this test only gives an answer to the question whether the syntax you were using to make the selection. Whether this makes sense semantically, you need to make sure by thinking carefully about selection query. Often it helps in the beginning to do the same selection manually in QGIS to see whether your query works correctly. Over time and with some experience, you probably won’t need those checks anymore

Let’s make a selection based on a country name:

print(lyr.GetFeatureCount())
lyr.SetAttributeFilter("NAME = 'Germany'")
print(lyr.GetFeatureCount())
9
1

You probably notice that the entire selection query is a string that, indicated by "". Likewise, since we are selecting by an attribute defined as string (i.e., text), that text itself needs to be formatted as such. Since the entire query already is indicated by "", we are using '' as the indication for the string inside the attribute. This is an important rule, and you will need to pay attention to that. It becomes more clear, when we select based on a more complext search query, here for example by the name of two countries. We will need to apply some SQL syntax to achieve that:

lyr.SetAttributeFilter("NAME IN ('Germany', 'Netherlands')")
print(lyr.GetFeatureCount())
2

This looks correct. Here, we make use of the SQL rule IN, but we could also run this using a slightly different rule, here by using the rule OR:

lyr.SetAttributeFilter("NAME  = 'Germany' OR NAME = 'Netherlands'")
print(lyr.GetFeatureCount())
2

For the present shapefile, these are probably the only meaningfull selections by attributes.

Practice!

Open a different shapefile, possibly one that you have used in a different context. Play around with different selection queries, e.g. by varying between attributes where the values are stored as string vs. integers. It is all about practice - without repeatedly doing it you won’t be able to master these tasks!

if we want to remove an attribute filter, we can pass the argument None to the selector. Though, when your goal is to make another selection, there is no need to remove a previous filter:

lyr.SetAttributeFilter(None)
print(lyr.GetFeatureCount())
9

Selections by geometries (i.e., spatial filter, select by location)#

The second way we can make selections of feature is via spatial locations. This is the more geographic spatial filter is a useful tool to reduce the number of features prior to doing any analysis with them. It is comparable to a select by location function in ArcGIS / QGIS. The genereal question that is asked in this type of analysis is: return me the number of features in a layer that intersect with the feature of another layer. This means, that we have now two elements to think of: (1) we need a second layer from which we select features based on their location, (2) we need the geometry of a feature of the first layer.

In the example below, we take the country-borders as above and in addition the World Database on Protected Areas (WDPA). For simplicity reasons, both shapefiles come with the same spatial reference. Let’s load the shapefile and get the count of polygons stored in the layer:

# We get a second layer
ds2 = ogr.Open(folder + "EU_vector/WDPA_sub.shp")
lyr2 = ds2.GetLayer()
# Count the number of features in lyr2
PAcount = lyr2.GetFeatureCount()
print(PAcount)
41158

As described before, we now need a geometry to apply a spatial filter. From the previous chapter we know how to do that (we spare here the visualization of the coordinates):

feat1 = lyr[0]
ctr = feat1.GetField("NAME")
geom1 = feat1.geometry()
print(ctr)
Germany

So, now we do is we apply the function .SetSpatialFilter() and pass as the only argument the geometry. We can check whether the selection worked by counting again the number of features in lyr2 after the selection

# Make a spatial selection based on the geometry from lyr1
lyr2.SetSpatialFilter(geom1)
print(lyr2.GetFeatureCount())
23196

We can see that inside the geometry of Germany, we have 23,196 protected areas. Now, you can expect that one could do this also for all countries by iterating over the all features of the gadm layer. We do this here by running a for-loop:

for feat in lyr:
    name = feat.GetField('NAME')
    geom = feat.geometry()
    lyr2.SetSpatialFilter(geom)
    cnt = lyr2.GetFeatureCount()
    print("Country:", name, ", # PAs:", str(cnt))
Country: Germany , # PAs: 23196
Country: Netherlands , # PAs: 592
Country: Austria , # PAs: 1762
Country: Liechtenstein , # PAs: 52
Country: Belgium , # PAs: 2029
Country: Switzerland , # PAs: 6894
Country: Czechia , # PAs: 3941
Country: Luxembourg , # PAs: 215
Country: Poland , # PAs: 3260

Combining attribute and spatial filter#

You probably can guess that we can combine both spatial and attribute filter. How you combine these two depends solely on your target summary that you would want to have. In the example below we want to only return the protected areas of IUCN category Ia. This information is stored in the attribute IUCN_CAT. We can do this in at least two ways:

  1. by iterating over each country, applying first a spatial filter followed by a attribute filter (or vice versa)

  2. by applying the attribute filter first, before iterating over each country.

Both operations will yield the same results, but we can see which of the two is faster. I use the magic tool %%timeit that is part of ipynb. I apply this here doing only one run/repetition, assuming that there is no stochastic element that may influence the calculation (plus: I don’t want to have the print() statements repeated.)

%%timeit -n1 -r1
# Version 1:
for feat in lyr:
    # Get name and geometry
    name = feat.GetField('NAME')
    geom = feat.geometry()
    # Set Spatial filter
    lyr2.SetSpatialFilter(geom)
    # Set attribute filter
    lyr2.SetAttributeFilter("IUCN_CAT = 'Ia'")
    # Produce output
    cnt = lyr2.GetFeatureCount()
    print("Country:", name, ", # PAs:", str(cnt))
Country: Germany , # PAs: 2
Country: Netherlands , # PAs: 0
Country: Austria , # PAs: 1
Country: Liechtenstein , # PAs: 0
Country: Belgium , # PAs: 0
Country: Switzerland , # PAs: 551
Country: Czechia , # PAs: 18
Country: Luxembourg , # PAs: 0
Country: Poland , # PAs: 14
8.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n1 -r1
# Version 2:
# Set attribute filter
lyr2.SetAttributeFilter("IUCN_CAT = 'Ia'")
for feat in lyr:
    # Get name and geometry
    name = feat.GetField('NAME')
    geom = feat.geometry()
    # Set Spatial filter
    lyr2.SetSpatialFilter(geom)
    # Produce output
    cnt = lyr2.GetFeatureCount()
    print("Country:", name, ", # PAs:", str(cnt))
Country: Germany , # PAs: 2
Country: Netherlands , # PAs: 0
Country: Austria , # PAs: 1
Country: Liechtenstein , # PAs: 0
Country: Belgium , # PAs: 0
Country: Switzerland , # PAs: 551
Country: Czechia , # PAs: 18
Country: Luxembourg , # PAs: 0
Country: Poland , # PAs: 14
11.1 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Version two was roughly 2.5s faster, which is for a total execution time of ~10s a lot. Imagine that you have to run this over ~200 countries and a PA shapefile of more than 3.000.000 elements. Then, it can make a difference of a couple of hours, particularly if we think of additional selection.

Important!

Always think about the order when you make selections. Doing an operation repeatedly on a smaller number of elements inside a loop may be more costly than doing it once for a much larger number of elements outside / prior to doing a loop.