Comparing Classification Models with GLOBE Data

Test and compare different classification models using GLOBE data.

!pip install -q geemap geopandas

import ee
import geemap
import geopandas as gpd
import pandas as pd
from IPython.display import display

# Authenticate and initialize Google Earth Engine (replace the project ID with your own)
ee.Authenticate()
ee.Initialize(project='emerge-lessons')

# Define land cover classes
muc_level1_map = {
    '0': 'Closed Forest', '1': 'Woodland', '2': 'Shrubland or Thicket',
    '3': 'Dwarf-Shrubland or Dwarf-Thicket', '4': 'Herbaceous Vegetation',
    '5': 'Barren', '6': 'Wetland', '7': 'Open Water',
    '8': 'Cultivated Land', '9': 'Urban'
}

custom_palette = [
    '#004b23', '#2d6a4f', '#52b788', '#95d5b2', '#d8f3dc',
    '#d4a373', '#40E0D0', '#1E90FF', '#8B4513', '#808080'
]

def prepare_data():
    """Loads GLOBE data, prepares features, and fetches satellite embeddings."""
    print("Loading and preparing ground truth data...")

    # Load and clean tabular data
    url = 'https://github.com/geo-di-lab/emerge-lessons/raw/refs/heads/main/docs/data/globe_land_cover.zip'
    gdf = gpd.read_file(url).dropna(subset=['MucCode', 'geometry'])

    # Format labels: the second character of MucCode is used as the level-1 class digit
    gdf['Level1Code'] = gdf['MucCode'].astype(str).str[1]
    gdf['label'] = gdf['Level1Code'].astype(int)
    gdf['MucDescription'] = gdf['Level1Code'].map(muc_level1_map)
    gdf_subset = gdf[['label', 'Level1Code', 'MucDescription', 'geometry']]
    fc = geemap.geopandas_to_ee(gdf_subset)

    # Define region (Florida)
    fl_url = 'https://github.com/geo-di-lab/emerge-lessons/raw/refs/heads/main/docs/data/florida_boundary.geojson'
    region = geemap.gdf_to_ee(gpd.read_file(fl_url)[['geometry']]).geometry()
    local_fc = fc.filterBounds(region)

    # Load satellite embeddings
    print("Fetching V1 Annual AlphaEarth Satellite Embeddings...")
    embeddings = (ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL')
                  .filterDate('2024-01-01', '2025-01-01')
                  .mosaic()
                  .clip(region))

    return local_fc, embeddings, region

# Run the preparation
local_fc, embeddings, region = prepare_data()

Loading and preparing ground truth data...
Fetching V1 Annual AlphaEarth Satellite Embeddings...
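The label-formatting step in prepare_data() keeps only the second character of each MucCode and looks it up in muc_level1_map. A minimal sketch of that step in plain Python follows. The sample codes are hypothetical and only assume a leading letter followed by digits; they are not taken from the GLOBE file.

```python
# Level-1 class names, copied from muc_level1_map above.
muc_level1_map = {
    '0': 'Closed Forest', '1': 'Woodland', '2': 'Shrubland or Thicket',
    '3': 'Dwarf-Shrubland or Dwarf-Thicket', '4': 'Herbaceous Vegetation',
    '5': 'Barren', '6': 'Wetland', '7': 'Open Water',
    '8': 'Cultivated Land', '9': 'Urban'
}

def level1_label(muc_code):
    """Return (integer label, description) from the second character of a MUC code."""
    level1 = str(muc_code)[1]
    return int(level1), muc_level1_map[level1]

# Hypothetical sample codes, for illustration only.
for code in ['M0192', 'M71', 'M91']:
    print(code, level1_label(code))
```

If the real codes ever lack the leading letter, the second character would no longer be the level-1 digit, so it is worth checking a few rows of gdf['MucCode'] before trusting the labels.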

def extract_and_split(embeddings_img, point_features, split_ratio=0.8):
    """Samples image regions and splits into train/test sets."""
    print("Extracting features (this may take a moment)...")

    sampled_data = embeddings_img.sampleRegions(
        collection=point_features,
        properties=['label'],
        scale=10,
        tileScale=4
    )

    # Randomize and split
    with_random = sampled_data.randomColumn('random')
    train_set = with_random.filter(ee.Filter.lt('random', split_ratio))
    test_set = with_random.filter(ee.Filter.gte('random', split_ratio))

    return train_set, test_set, embeddings_img.bandNames()

# Create training and testing datasets
training, testing, band_names = extract_and_split(embeddings, local_fc)
print(f"Bands extracted: {band_names.length().getInfo()} features available per point.")

Extracting features (this may take a moment)...
Bands extracted: 64 features available per point.
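The split in extract_and_split() gives every sampled point a uniform random number and compares it with split_ratio. The sketch below mimics that idea with Python's standard library, assuming plain integers as stand-ins for sampled points. It does not reproduce the exact random values Earth Engine generates.

```python
import random

def random_split(rows, split_ratio=0.8, seed=0):
    """Attach a uniform random value to each row and split on split_ratio."""
    rng = random.Random(seed)
    tagged = [(row, rng.random()) for row in rows]
    train = [row for row, r in tagged if r < split_ratio]
    test = [row for row, r in tagged if r >= split_ratio]
    return train, test

points = list(range(100))  # stand-ins for sampled points
train, test = random_split(points)
print(len(train), len(test))
```

Because each point is assigned independently, the split is only approximately 80/20, and a small collection can leave very few points in the test set.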

Random Forest Classifier

Random Forest is an “ensemble” model. It builds multiple decision trees (in this case, 50), each trained on a different random sample of the data, and combines their predictions to get a more stable result than any single tree would give.

print("Training Random Forest Classifier...")

# Initialize and train the model
rf_classifier = ee.Classifier.smileRandomForest(50).train(
    features=training,
    classProperty='label',
    inputProperties=band_names
)

# Validate the model against our testing data
rf_validated = testing.classify(rf_classifier)

# Calculate and print the accuracy
rf_confusion_matrix = rf_validated.errorMatrix('label', 'classification')
rf_accuracy = rf_confusion_matrix.accuracy().getInfo()

print(f"Random Forest Overall Accuracy: {rf_accuracy:.4f}")

Training Random Forest Classifier...
Random Forest Overall Accuracy: 0.4737
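One common way an ensemble combines its trees is majority voting: each tree predicts a class and the most frequent class wins. The sketch below shows only that voting step, with hypothetical per-tree predictions. How smileRandomForest combines its trees internally is not stated in this lesson, so treat this as an illustration of the idea.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one point
votes = [0, 4, 0, 8, 0]
print(majority_vote(votes))  # prints 0 (Closed Forest gets 3 of 5 votes)
```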

CART (Classification and Regression Trees)

CART is a single decision tree. It looks at the embedding bands and finds the best places to “split” the data based on simple thresholds (e.g., “If Band 1 > 0.5, it’s a forest”). Single decision trees are fast to train and easy to interpret, but they are prone to “overfitting,” meaning they might memorize the training data rather than learn general patterns.

print("Training CART Classifier...")

# Initialize and train the model
cart_classifier = ee.Classifier.smileCart().train(
    features=training,
    classProperty='label',
    inputProperties=band_names
)

# Validate the model
cart_validated = testing.classify(cart_classifier)

# Calculate and print the accuracy
cart_confusion_matrix = cart_validated.errorMatrix('label', 'classification')
cart_accuracy = cart_confusion_matrix.accuracy().getInfo()

print(f"CART Overall Accuracy: {cart_accuracy:.4f}")

Training CART Classifier...
CART Overall Accuracy: 0.3158
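To make the idea of a threshold split concrete, the sketch below searches one band for the threshold that best separates two classes, scoring each candidate by weighted Gini impurity. Gini is a common criterion for decision trees, but it is an assumption here. The values and labels are made up, and smileCart's actual split rule may differ.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means all one class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    n = len(values)
    best = (None, float('inf'))
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical band values and labels (7 = Open Water, 0 = Closed Forest)
band_values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [7, 7, 7, 0, 0, 0]
threshold, impurity = best_threshold(band_values, labels)
print(threshold, impurity)  # prints 0.3 0.0
```

A full tree repeats this search over all 64 bands and then recurses into each side of the split. A tree that keeps splitting until every training point is isolated is the overfitting case described above.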

Support Vector Machine (SVM)

A Support Vector Machine works by finding the boundary (called a hyperplane) that best separates the classes in our multi-dimensional embedding space. It tries to maximize the gap, or margin, between the classes so that new points fall clearly on one side or the other. SVMs can model complex boundaries when paired with non-linear kernels, but they can be computationally expensive and slower than tree-based models to train on massive satellite datasets.

print("Training Support Vector Machine Classifier...")

# Initialize and train the model
svm_classifier = ee.Classifier.libsvm().train(
    features=training,
    classProperty='label',
    inputProperties=band_names
)

# Validate the model against our testing data
svm_validated = testing.classify(svm_classifier)

# Calculate and print the accuracy
svm_confusion_matrix = svm_validated.errorMatrix('label', 'classification')
svm_accuracy = svm_confusion_matrix.accuracy().getInfo()

print(f"SVM Overall Accuracy: {svm_accuracy:.4f}")

Training Support Vector Machine Classifier...
SVM Overall Accuracy: 0.5263
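For a two-class problem with a linear kernel, the hyperplane is described by a weight vector w and a bias b. A point x is classified by the sign of w·x + b, and the gap between the classes is 2/||w||. The sketch below uses hypothetical two-dimensional weights to show both quantities. To our knowledge the Earth Engine reference lists LINEAR as the default kernelType for ee.Classifier.libsvm, but verify that against the current documentation before relying on it.

```python
import math

def decision_value(weights, bias, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def margin_width(weights):
    """Width of the gap between the two classes, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))

# Hypothetical 2-D weights; the real embeddings have 64 dimensions.
weights, bias = [3.0, 4.0], -5.0
print(decision_value(weights, bias, [2.0, 1.0]))  # prints 5.0 (positive side)
print(decision_value(weights, bias, [0.0, 0.0]))  # prints -5.0 (negative side)
print(margin_width(weights))                      # prints 0.4
```

With ten land cover classes, the classifier has to combine several such two-class boundaries, so the single hyperplane here is a simplification.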

Gradient Boosted Trees

Like Random Forest, Gradient Boosting is an “ensemble” method that uses multiple decision trees. However, while Random Forest builds its trees independently of one another, Gradient Boosting builds them sequentially. Each new tree looks at the errors made by the previous trees and tries to correct them. This often leads to higher overall accuracy, but it takes longer to train and can be more prone to memorizing the training data (overfitting) if not carefully tuned.

print("Training Gradient Boosted Trees Classifier...")

# Initialize and train the model (using 50 trees to match our Random Forest)
gbt_classifier = ee.Classifier.smileGradientTreeBoost(50).train(
    features=training,
    classProperty='label',
    inputProperties=band_names
)

# Validate the model
gbt_validated = testing.classify(gbt_classifier)

# Calculate and print the accuracy
gbt_confusion_matrix = gbt_validated.errorMatrix('label', 'classification')
gbt_accuracy = gbt_confusion_matrix.accuracy().getInfo()

print(f"Gradient Boosted Trees Overall Accuracy: {gbt_accuracy:.4f}")

Training Gradient Boosted Trees Classifier...
Gradient Boosted Trees Overall Accuracy: 0.5789
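The sequential idea can be sketched with the simplest possible trees: one-split "stumps" fit to the current residuals (the errors left by the trees built so far). This toy uses squared error on made-up one-dimensional data, with a hypothetical learning rate of 0.5. It illustrates the structure of boosting only and is not what smileGradientTreeBoost does internally for classification.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Build stumps one at a time, each fit to the residuals of the ensemble so far."""
    stumps = []
    predictions = [0.0] * len(xs)
    history = []  # training squared error after each round
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return stumps, history

# Made-up data for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 2.0, 4.0, 4.0, 5.0]
stumps, history = boost(xs, ys)
print(history[0], history[-1])
```

The training error shrinks with every round, which is also why boosting can overfit: given enough rounds it will keep chasing noise in the training labels.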

How to Loop Through Multiple Models

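Rather than repeating the same train-and-validate code for each model, we can store the classifiers in a dictionary and loop over them, collecting each model's accuracy in a table for comparison.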

# Define a dictionary of models to test
models_to_test = {
    "Random Forest (50 Trees)": ee.Classifier.smileRandomForest(50),
    "Gradient Boosted Trees (50)": ee.Classifier.smileGradientTreeBoost(50),
    "Support Vector Machine": ee.Classifier.libsvm(),
    "CART (Decision Tree)": ee.Classifier.smileCart(),
}

# Create empty dictionaries to store our results
model_accuracies = {}
trained_models = {}

# Loop through each model
for model_name, classifier in models_to_test.items():
    print(f"Training {model_name}...")

    # Train
    trained_clf = classifier.train(
        features=training,
        classProperty='label',
        inputProperties=band_names
    )

    # Validate
    validated = testing.classify(trained_clf)
    acc = validated.errorMatrix('label', 'classification').accuracy().getInfo()

    # Store results
    model_accuracies[model_name] = acc
    trained_models[model_name] = trained_clf

# Create a comparison table
df_results = pd.DataFrame(
    list(model_accuracies.items()),
    columns=['Model', 'Overall Accuracy']
).sort_values(by='Overall Accuracy', ascending=False).reset_index(drop=True)

print("\nFinal Model Comparison")
display(df_results)

Training Random Forest (50 Trees)...
Training Gradient Boosted Trees (50)...
Training Support Vector Machine...
Training CART (Decision Tree)...

Final Model Comparison

                         Model  Overall Accuracy
0  Gradient Boosted Trees (50)          0.578947
1       Support Vector Machine          0.526316
2     Random Forest (50 Trees)          0.473684
3         CART (Decision Tree)          0.315789
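Overall accuracy is the number of correctly classified test points divided by the total number of test points, so the printed values can be turned back into fractions. The sketch below does that with Python's fractions module. The cap of 50 on the denominator is an assumption chosen for illustration, and the result is an inference from the printed numbers rather than something the notebook reports.

```python
from fractions import Fraction

# Accuracies copied from the comparison table above
reported = {
    "Gradient Boosted Trees (50)": 0.578947,
    "Support Vector Machine": 0.526316,
    "Random Forest (50 Trees)": 0.473684,
    "CART (Decision Tree)": 0.315789,
}

def as_fraction(accuracy, max_denominator=50):
    """Closest fraction with a small denominator, read as correct / total."""
    return Fraction(accuracy).limit_denominator(max_denominator)

for name, acc in reported.items():
    print(name, as_fraction(acc))
```

All four values come out with a denominator of 19 (11/19, 10/19, 9/19, and 6/19), which suggests the test set held about 19 points. Running testing.size().getInfo() would confirm the actual count. On a test set that small, the top three models are separated by only one or two points each, so the ranking should be read with caution.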