Example - Categorical Data

[1]:

import geopandas
import pandas

from geocube.api.core import make_geocube

%matplotlib inline

Load in soil data

[2]:

ssurgo_data = geopandas.read_file("../../test/test_data/input/soil_data_group.geojson")

[3]:

# original data
ssurgo_data[ssurgo_data.hzdept_r==15].plot(column='sandtotal_r')

[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9e210d3be0>

Generate categories for categorical data

If your data is only a subset of all of the data, the list of categories you get will likely not be complete.

NOTE: The categories will be made unique and sorted internally if they are not already.

[4]:

# this is only a subset of all of the classes
ssurgo_data.drclassdcd.drop_duplicates().values.tolist()

[4]:

['Somewhat poorly drained',
 'Poorly drained',
 'Well drained',
 'Excessively drained']

[5]:

# complete list of categories
drclasses_complete = [
    'Poorly drained',
    'Somewhat poorly drained',
    'Excessively drained',
    'Subaqueous',
    'Well drained',
    'Somewhat excessively drained',
    'Very poorly drained',
    'Moderately well drained'
]

[6]:

categorical_enums = {'drclassdcd': drclasses_complete}

Convert data to grid

See docs for make_geocube

[7]:

out_grid = make_geocube(
    vector_data=ssurgo_data,
    output_crs="epsg:32615",
    group_by='hzdept_r',
    resolution=(-100, 100),
    categorical_enums=categorical_enums
)

[8]:

out_grid

[8]:

<xarray.Dataset>
Dimensions:                (drclassdcd_categories: 9, hzdept_r: 11, x: 15, y: 19)
Coordinates:
  * y                      (y) float64 4.597e+06 4.597e+06 ... 4.595e+06
  * x                      (x) float64 7.004e+05 7.004e+05 ... 7.018e+05
  * hzdept_r               (hzdept_r) float64 0.0 5.0 15.0 ... 105.0 120.0 150.0
  * drclassdcd_categories  (drclassdcd_categories) object 'Excessively drained' ... 'nodata'
    spatial_ref            int64 0
Data variables:
    drclassdcd             (hzdept_r, y, x) float64 -1.0 -1.0 -1.0 ... 7.0 7.0
    hzdepb_r               (hzdept_r, y, x) float64 nan nan nan ... 180.0 180.0
    claytotal_r            (hzdept_r, y, x) float64 nan nan nan ... 21.0 21.0
    sandtotal_r            (hzdept_r, y, x) float64 nan nan nan ... 10.0 10.0
    silttotal_r            (hzdept_r, y, x) float64 nan nan nan ... 69.0 69.0
Attributes:
    grid_mapping:  spatial_ref

[9]:

# mask nodata and plot
clay_slice = out_grid.claytotal_r.sel(hzdept_r=15)
clay_slice.where(clay_slice!=out_grid.claytotal_r.rio.nodata).plot()

[9]:

<matplotlib.collections.QuadMesh at 0x7f9e1e823748>

../_images/examples_categorical_13_1.png

Dealing with categorical data

Because the data needs to be numerical for conversion from vector to raster, the code displays the categories as numbers. To convert back to strings, you will need to use the categories provided to convert back.

[10]:

drclassdcd_slice = out_grid.drclassdcd.sel(hzdept_r=15)
drclassdcd_slice.where(drclassdcd_slice!=out_grid.drclassdcd.rio.nodata).plot()

[10]:

<matplotlib.collections.QuadMesh at 0x7f9e1e76bc50>

../_images/examples_categorical_16_1.png

[11]:

drclassdcd_string = out_grid['drclassdcd_categories'][out_grid['drclassdcd'].astype(int)]\
    .drop('drclassdcd_categories')

[12]:

out_grid['drclassdcd'] = drclassdcd_string
out_grid

[12]:

<xarray.Dataset>
Dimensions:                (drclassdcd_categories: 9, hzdept_r: 11, x: 15, y: 19)
Coordinates:
  * y                      (y) float64 4.597e+06 4.597e+06 ... 4.595e+06
  * x                      (x) float64 7.004e+05 7.004e+05 ... 7.018e+05
  * hzdept_r               (hzdept_r) float64 0.0 5.0 15.0 ... 105.0 120.0 150.0
  * drclassdcd_categories  (drclassdcd_categories) object 'Excessively drained' ... 'nodata'
    spatial_ref            int64 0
Data variables:
    drclassdcd             (hzdept_r, y, x) object 'nodata' ... 'Well drained'
    hzdepb_r               (hzdept_r, y, x) float64 nan nan nan ... 180.0 180.0
    claytotal_r            (hzdept_r, y, x) float64 nan nan nan ... 21.0 21.0
    sandtotal_r            (hzdept_r, y, x) float64 nan nan nan ... 10.0 10.0
    silttotal_r            (hzdept_r, y, x) float64 nan nan nan ... 69.0 69.0
Attributes:
    grid_mapping:  spatial_ref

[13]:

pdf = out_grid.drop(['spatial_ref', 'drclassdcd_categories']).to_dataframe()
pdf.head()

[13]:

			drclassdcd	hzdepb_r	claytotal_r	sandtotal_r	silttotal_r
hzdept_r	x	y
0.0	700350.0	4597050.0	nodata	NaN	NaN	NaN	NaN
		4596950.0	Well drained	5.0	26.0	38.0	36.0
		4596850.0	Well drained	5.0	26.0	38.0	36.0
		4596750.0	Well drained	5.0	26.0	38.0	36.0
		4596650.0	Well drained	5.0	26.0	38.0	36.0

Make sure all categories are represented

To do this, convert the column type to categorical beforehand and make sure that you include all of the possible categories.

[14]:

cat_dtype = pandas.api.types.CategoricalDtype(out_grid.drclassdcd_categories.values)
pdf['drclassdcd'] = pdf['drclassdcd'].astype(cat_dtype)

[15]:

training_df = pandas.get_dummies(pdf, columns=['drclassdcd'])
training_df.head()

[15]:

			hzdepb_r	claytotal_r	sandtotal_r	silttotal_r	drclassdcd_Excessively drained	drclassdcd_Moderately well drained	drclassdcd_Poorly drained	drclassdcd_Somewhat excessively drained	drclassdcd_Somewhat poorly drained	drclassdcd_Subaqueous	drclassdcd_Very poorly drained	drclassdcd_Well drained	drclassdcd_nodata
hzdept_r	x	y
0.0	700350.0	4597050.0	NaN	NaN	NaN	NaN	0	0	0	0	0	0	0	0	1
		4596950.0	5.0	26.0	38.0	36.0	0	0	0	0	0	0	0	1	0
		4596850.0	5.0	26.0	38.0	36.0	0	0	0	0	0	0	0	1	0
		4596750.0	5.0	26.0	38.0	36.0	0	0	0	0	0	0	0	1	0
		4596650.0	5.0	26.0	38.0	36.0	0	0	0	0	0	0	0	1	0

[16]:

training_df.columns

[16]:

Index(['hzdepb_r', 'claytotal_r', 'sandtotal_r', 'silttotal_r',
       'drclassdcd_Excessively drained', 'drclassdcd_Moderately well drained',
       'drclassdcd_Poorly drained', 'drclassdcd_Somewhat excessively drained',
       'drclassdcd_Somewhat poorly drained', 'drclassdcd_Subaqueous',
       'drclassdcd_Very poorly drained', 'drclassdcd_Well drained',
       'drclassdcd_nodata'],
      dtype='object')