Welcome to this second hands-on exercise. You will build a complete machine learning pipeline to predict a diagnosis of Attention-Deficit/Hyperactivity Disorder (ADHD) from functional connectivity MRI data.
You will use the ADHD200 dataset, available via nilearn, which contains resting-state fMRI data for subjects with and without an ADHD diagnosis.
This exercise will have you apply the same steps covered in class:
Exploring phenotypic data
Extracting functional connectivity features
Training and evaluating an SVM classifier
Important instructions for automated grading:
Compliance: Do not rename the result variables (e.g., q1_n_subjects, challenge_accuracy) given in the commented lines. These names are essential for automated grading.
Execution: Make sure your code runs without errors from top to bottom.
Parameters: Follow exactly the parameters specified in each question (atlas, random_state, etc.). The autograder compares your results to reference values computed with those exact parameters.
Data: Use only the variables pheno, func, and confounds provided in the setup cell (Section 0), which are already aligned with each other.
Good luck!
Section 0: Setup and data loading¶
Run these setup cells. They download the data and prepare the variables you will use throughout the exercise. Do not modify them.
# Setup — do not modify
import warnings
warnings.filterwarnings('ignore')
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nilearn import datasets, plotting
from nilearn.maskers import NiftiLabelsMasker
from nilearn.connectome import ConnectivityMeasure
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Libraries imported successfully.")

# Load the ADHD200 dataset — do not modify
data_dir = './nilearn_data'
adhd_dataset = datasets.fetch_adhd(n_subjects=40, data_dir=data_dir)
# Raw phenotypic data (may not cover all subjects)
# Note: convert to DataFrame to ensure compatibility with pandas methods
pheno_raw = pd.DataFrame(adhd_dataset.phenotypic)
print(f"Dataset loaded: {len(adhd_dataset.func)} functional images")
print(f"Raw phenotypic data: {pheno_raw.shape[0]} subjects")

# Data alignment — do not modify
#
# Some subjects have fMRI images but no phenotypic data.
# We keep only subjects present in both sources.
func_ids = [int(os.path.basename(f).split('_')[0]) for f in adhd_dataset.func]
pheno_ids = set(pheno_raw['Subject'].values)
matched_idx = [i for i, fid in enumerate(func_ids) if fid in pheno_ids]
matched_fids = [func_ids[i] for i in matched_idx]
# Aligned data ready to use
pheno = pheno_raw.set_index('Subject').loc[matched_fids].reset_index()
func = [adhd_dataset.func[i] for i in matched_idx]
confounds = [adhd_dataset.confounds[i] for i in matched_idx]
print(f"Subjects with both imaging AND phenotypic data: {len(func)}")
print(f"Available phenotypic columns: {list(pheno.columns)}")

Part 1: Exploring phenotypic data¶
In this section, you will explore the clinical and demographic variables in the ADHD200 dataset.
Question 1: Number of subjects¶
Using the pheno variable provided in Section 0, determine the total number of subjects who have both imaging and phenotypic data.
Store this integer in q1_n_subjects.
# Answer 1
# Your code here
# q1_n_subjects = ...

Question 2: Diagnosis distribution¶
The adhd column in the pheno DataFrame contains the diagnosis: 1 for ADHD, 0 for control.
Count the number of ADHD subjects and the number of control subjects.
Store the results in q2_n_adhd (integer) and q2_n_controls (integer).
# Answer 2
# Your code here
# q2_n_adhd = ...
# q2_n_controls = ...

Question 3: Target variable¶
Create the target vector y for machine learning: a NumPy array of integers (dtype int) containing 1 for ADHD subjects and 0 for controls, in the order of the pheno variable.
Store the result in q3_y.
Hint: use the adhd column of pheno and convert it to a NumPy integer array.
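As a generic illustration of the conversion pattern (using a toy DataFrame with made-up values, not the actual pheno data):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for `pheno` (values here are made up)
toy = pd.DataFrame({"adhd": [1, 0, 1, 0]})

# Convert the column to a NumPy integer array, preserving row order
y_toy = toy["adhd"].to_numpy(dtype=int)
print(y_toy.dtype.kind)  # 'i' (signed integer)
```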
# Answer 3
# Your code here
# q3_y = ...

Question 4: Mean age by group¶
Calculate the mean age (column age) separately for ADHD subjects and for controls.
Store the mean age of ADHD subjects in q4_mean_age_adhd (float) and the mean age of controls in q4_mean_age_controls (float).
Also create a plot (barplot or boxplot) comparing the age distribution between the two groups.
# Answer 4
# Your code here
# q4_mean_age_adhd = ...
# q4_mean_age_controls = ...

Question 5: Research — Sex distribution¶
Check the docs: https://
The sex column contains 'M' for male and 'F' for female.
Using groupby, count the number of male subjects ('M') in each diagnostic group (ADHD vs control).
Store the number of males among ADHD subjects in q5_n_males_adhd (integer) and among controls in q5_n_males_controls (integer).
Hint: first filter on sex == 'M', then count by adhd group.
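The filter-then-groupby pattern can be sketched on a toy DataFrame (the values below are made up and unrelated to the real counts):

```python
import pandas as pd

# Toy data mimicking the relevant columns (values are made up)
toy = pd.DataFrame({
    "adhd": [1, 1, 0, 0, 1],
    "sex":  ["M", "F", "M", "M", "M"],
})

# Filter on males first, then count per diagnostic group
males_per_group = toy[toy["sex"] == "M"].groupby("adhd").size()
print(males_per_group.to_dict())  # {0: 2, 1: 2}
```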
# Answer 5
# Your code here
# q5_n_males_adhd = ...
# q5_n_males_controls = ...

Part 2: Extracting functional connectivity features¶
In this section, you will extract functional connectivity matrices for each subject using a brain atlas. These matrices will serve as the features (predictor variables) for your ML model.
You will use:
The BASC multiscale atlas at the 12-ROI resolution
nilearn’s NiftiLabelsMasker to extract time series
nilearn’s ConnectivityMeasure to compute correlations
Note: Extraction may take a few minutes. Once done, features will be saved to disk to avoid recomputing.
# Load the BASC 12-ROI atlas — do not modify
basc = datasets.fetch_atlas_basc_multiscale_2015(
data_dir=data_dir, resolution=12
)
atlas_img = basc.maps
print(f"Atlas loaded: {atlas_img}")

Question 6: Extracting time series¶
Initialize a NiftiLabelsMasker with the following parameters (required for automated grading):
labels_img=atlas_img
standardize='zscore_sample'
detrend=True
resampling_target='data'
verbose=0
Use this masker to extract the time series for the first subject (func[0], confounds[0]). Store the time series in q6_timeseries.
Then, store the number of regions of interest (ROIs) in q6_n_rois (integer). This is the second dimension of q6_timeseries.
# Answer 6
# Your code here
# q6_timeseries = ...
# q6_n_rois = ...

Question 7: Vectorized connectivity matrix¶
Using ConnectivityMeasure with the following parameters:
kind='correlation'
vectorize=True
discard_diagonal=True
Compute the connectivity vector for the first subject from q6_timeseries.
Store the vector in q7_corr_vector and the number of features (length of the vector) in q7_n_features (integer).
Hint: ConnectivityMeasure.fit_transform expects a list of time series arrays.
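To see why discarding the diagonal yields n·(n−1)/2 features, here is a plain-NumPy counting sketch on synthetic time series (nilearn’s exact element ordering and scaling may differ, but the feature count is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
ts = rng.standard_normal((100, 12))  # 100 time points, 12 ROIs

# Full correlation matrix between ROI time series: 12 x 12, symmetric
corr = np.corrcoef(ts, rowvar=False)

# Discarding the diagonal and keeping one triangle leaves
# n * (n - 1) / 2 unique values: 12 * 11 / 2 = 66
vec = corr[np.tril_indices(12, k=-1)]
print(vec.shape)  # (66,)
```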
# Answer 7
# Your code here
# q7_corr_vector = ...
# q7_n_features = ...

Question 8: Extraction for all subjects¶
Repeat the extraction for all subjects in the func list (using the corresponding confounds). Build the feature matrix X by stacking the connectivity vectors.
Store the resulting matrix in q8_X_features (2D NumPy array).
To save time on re-runs, save/load code is provided in the next cell.
Hint: use a for loop and the same approach as for subject 0. You can reuse the masker and ConnectivityMeasure objects from Q6 and Q7.
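The stacking pattern itself, independent of nilearn, can be sketched with a hypothetical per-subject feature function (extract_features is a made-up stand-in, not a real API):

```python
import numpy as np

def extract_features(subject_id):
    # Hypothetical stand-in for the masker + ConnectivityMeasure step:
    # returns one fixed-length connectivity vector per subject
    rng = np.random.default_rng(subject_id)
    return rng.standard_normal(66)

# Stack one row per subject into a 2D feature matrix
X = np.vstack([extract_features(s) for s in range(5)])
print(X.shape)  # (5, 66)
```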
# Answer 8
# Initialize a new ConnectivityMeasure with the same parameters as in Question 7
# Your code here — build q8_X_features with a loop over all subjects
# q8_X_features = ...

Question 9: Checking the matrix shape¶
Verify that the matrix q8_X_features has the correct shape: it should have as many rows as subjects, and as many columns as connectivity features.
Store the shape of the matrix in q9_X_shape (tuple).
Also visualize the matrix with plt.imshow to visually inspect the features (axes: subjects on the y-axis, features on the x-axis).
# Answer 9
# Your code here
# q9_X_shape = ...

Part 3: Machine learning pipeline¶
You will now train an SVM classifier on the extracted connectivity features to predict ADHD diagnosis.
Use q8_X_features as the feature matrix and q3_y as the target vector.
Question 10: Train/test split¶
Split the data into a training set and a test set using train_test_split with the following required parameters:
test_size=0.2
shuffle=True
stratify=q3_y (to preserve the ADHD/control ratio in both sets)
random_state=0
Store the number of training subjects in q10_n_train (integer) and the number of test subjects in q10_n_test (integer).
Visualize the ADHD/control distribution in each split to verify the balance.
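On toy data (made-up labels, not the ADHD cohort), the effect of stratify is easy to verify: both splits keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 20 controls (0) and 10 patients (1), dummy features
y = np.array([0] * 20 + [1] * 10)
X = np.arange(30).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0
)

# Stratification preserves the 2:1 class ratio in both splits
print(len(y_tr), int(y_tr.sum()))  # 24 train, 8 patients
print(len(y_te), int(y_te.sum()))  # 6 test, 2 patients
```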
# Answer 10
# Your code here
# X_train, X_test, y_train, y_test = train_test_split(...)
# q10_n_train = ...
# q10_n_test = ...

Question 11: Data standardization¶
Apply a StandardScaler to the data:
Fit the scaler only on X_train
Transform X_train → X_train_scl
Transform X_test → X_test_scl using the same scaler (without re-fitting)
Store the overall mean of X_train_scl in q11_mean_train_scl (float).
Reminder: after standardizing on the training data, the mean of X_train_scl should be very close to zero. Why shouldn’t the scaler be re-fitted on X_test?
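A minimal sketch of the fit-on-train-only pattern, on synthetic data (the array shapes here are arbitrary stand-ins):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(24, 10))
X_test = rng.normal(loc=5.0, scale=2.0, size=(6, 10))

scaler = StandardScaler()
X_train_scl = scaler.fit_transform(X_train)  # fit on the training set only
X_test_scl = scaler.transform(X_test)        # reuse the training statistics

# The training mean is ~0 by construction; the test mean is only close to 0,
# because the test set is scaled with the training set's mean and std
print(abs(X_train_scl.mean()) < 1e-9)  # True
```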
# Answer 11
# Your code here
# q11_mean_train_scl = ...

Question 12: Cross-validation on the training set¶
Initialize an SVM classifier with the following required parameters:
kernel='linear'
class_weight='balanced'
Obtain predictions via 3-fold cross-validation (cv=3) on the training set.
Store the cross-validation predictions in q12_y_pred_cv (NumPy array).
Hint: cross_val_predict(estimator, X, y, cv=3)
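On synthetic two-class data (made up for illustration), cross_val_predict returns exactly one out-of-fold prediction per sample:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy two-class data: 30 samples, 5 features, class-shifted means
y = np.array([0] * 15 + [1] * 15)
X = rng.standard_normal((30, 5)) + y[:, None] * 2.0

clf = SVC(kernel="linear", class_weight="balanced")
# One out-of-fold prediction per sample from 3-fold cross-validation
y_pred_cv = cross_val_predict(clf, X, y, cv=3)
print(y_pred_cv.shape)  # (30,)
```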
# Answer 12
# Your code here
# q12_y_pred_cv = ...

Question 13: Cross-validation accuracy¶
Calculate the overall accuracy of the model on the training set by comparing q12_y_pred_cv and y_train.
Store the accuracy in q13_cv_accuracy (float between 0 and 1).
Also display the full classification report with classification_report.
Reflection: does the model perform better than chance (50%)? What explains modest performance with only 24 training subjects and 66 features?
# Answer 13
# Your code here
# q13_cv_accuracy = ...

Question 14: Confusion matrix¶
Compute the confusion matrix from y_train and q12_y_pred_cv.
Store the matrix in q14_cm (2×2 NumPy array).
Visualize it with sns.heatmap (using labels ['Control', 'ADHD'] for the axes).
Reminder: the diagonal represents correct predictions. Which type of error (false positive or false negative) is most frequent?
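A tiny toy example (made-up labels) shows how to read confusion_matrix output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: rows of the matrix are true classes, columns predicted
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[1 1]
           #  [0 2]]

# cm[0, 1] counts false positives (true control predicted as ADHD),
# cm[1, 0] counts false negatives (true ADHD predicted as control)
```

The same matrix can then be passed to sns.heatmap with xticklabels and yticklabels set to ['Control', 'ADHD'].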
# Answer 14
# Your code here
# q14_cm = ...

Challenge: Final evaluation on held-out data¶
This is the moment of truth! You will train the model on all training data and evaluate it on the test set you have held out from the start.
This step simulates real clinical use of a model: the model sees the test data only once, after all design decisions have been made.
Follow these steps:
Train the SVM classifier (same parameters as Q12) on X_train_scl and y_train
Predict labels for X_test_scl
Compute accuracy on the test set
Store the final predictions in challenge_y_pred (NumPy array) and the final accuracy in challenge_accuracy (float).
Reflection: compare the test set accuracy to the cross-validation accuracy. Are they similar? What does this comparison tell you about the model’s generalization? With only 6 subjects in the test set, how reliable is this estimate?
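The three steps above follow the standard fit / predict / score pattern, sketched here on toy, well-separated data (all arrays below are made-up stand-ins for the scaled splits):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-ins for the scaled train/test splits
y_train = np.array([0] * 12 + [1] * 12)
X_train = rng.standard_normal((24, 5)) + y_train[:, None] * 3.0
y_test = np.array([0, 0, 0, 1, 1, 1])
X_test = rng.standard_normal((6, 5)) + y_test[:, None] * 3.0

clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)             # step 1: train on all training data
y_pred = clf.predict(X_test)          # step 2: a single look at held-out data
acc = accuracy_score(y_test, y_pred)  # step 3: test-set accuracy
print(y_pred.shape)  # (6,)
```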
# Challenge
# Your code here
# Step 1: train the final model on X_train_scl, y_train
# Step 2: predict on X_test_scl
# Step 3: compute accuracy
# challenge_y_pred = ...
# challenge_accuracy = ...

Going further (ungraded)¶
If you have finished, explore these directions:
Different atlas: redo the extraction with resolution=36 (630 features). Is the model better or worse? Why?
Permutation test: use permutation_test_score to check whether the model’s performance is statistically significant despite the small sample size.
Weight visualization: train the final model, retrieve svc.coef_, and use correlation_measure.inverse_transform to display the most important connections as a connectome.
Regression: replace the binary diagnosis with a continuous measure (e.g., symptom score dsm_iv_tot) and use an SVR (Support Vector Regressor) instead of an SVC.