Benchmarks

The bertblocks.benchmarks package provides evaluation suites for finetuned models.

Running Evaluations

bertblocks.benchmarks.run_eval(
task_modules: list[type[TaskModule]],
pretrained_model_name_or_path: str,
pretrained_tokenizer_name_or_path: str | None = None,
max_seq_length: int = 256,
max_epochs: int = 3,
learning_rate: float = 2e-05,
weight_decay: float = 0.01,
train_batch_size: int = 32,
eval_batch_size: int = 64,
task_config: dict[str, dict[str, Any]] | None = None,
) → DataFrame[source]

Run evaluation on a list of task modules.

Parameters:
  • task_modules – List of TaskModule subclasses to evaluate.

  • pretrained_model_name_or_path – HuggingFace model name or path.

  • pretrained_tokenizer_name_or_path – HuggingFace tokenizer name or path. If None, uses pretrained_model_name_or_path.

  • max_seq_length – Maximum sequence length for tokenization.

  • max_epochs – Number of training epochs per task.

  • learning_rate – Learning rate for AdamW optimizer.

  • weight_decay – Weight decay for AdamW optimizer.

  • train_batch_size – Batch size for training.

  • eval_batch_size – Batch size for evaluation.

  • task_config – Optional per-task hyperparameter overrides. Keys are task class names, values are dicts with optional keys: learning_rate, epochs, weight_decay.

Returns:

DataFrame with columns Name, Group, Type, Metric, Score.

Return type:

DataFrame
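The task_config parameter layers per-task hyperparameters on top of the global arguments. The library's actual merge logic is not shown here; the sketch below is a plausible reading of the documented behavior (task class names as keys, with learning_rate, epochs, and weight_decay as optional overrides), and the task names and values are illustrative:

```python
# Global defaults, mirroring run_eval's keyword arguments.
defaults = {"learning_rate": 2e-5, "epochs": 3, "weight_decay": 0.01}

# Hypothetical per-task overrides, keyed by task class name as task_config expects.
task_config = {
    "CoLA": {"learning_rate": 1e-5},
    "RTE": {"epochs": 10, "learning_rate": 3e-5},
}

def resolve(task_name: str) -> dict:
    """Start from the global defaults, then apply any per-task overrides."""
    merged = dict(defaults)
    merged.update(task_config.get(task_name, {}))
    return merged

print(resolve("RTE"))   # per-task values win
print(resolve("SST2"))  # no entry, so global defaults apply
```

Tasks without an entry in task_config simply run with the global settings.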

Task Modules

class bertblocks.benchmarks.base.TaskModule(
pretrained_model_name_or_path: str,
pretrained_tokenizer_name_or_path: str,
max_seq_length: int | None = 512,
learning_rate: float | None = 1e-05,
weight_decay: float | None = 1e-06,
train_batch_size: int | None = 128,
eval_batch_size: int | None = 128,
num_workers: int | None = 2,
max_epochs: int = 3,
)[source]

Bases: ABC, LightningModule

Base LightningModule for evaluation tasks.

configure_optimizers() → Optimizer[source]

Set up the optimizer.

on_test_epoch_end() → None[source]

Perform operations at end of test epoch.

on_test_epoch_start() → None[source]

Perform operations at start of test epoch.

on_test_start() → None[source]

Perform operations at the start of testing.

on_validation_epoch_end() → None[source]

Perform operations at end of validation epoch.

on_validation_epoch_start() → None[source]

Perform operations at start of validation epoch.

on_validation_start() → None[source]

Perform operations at the start of validation.

abstractmethod prepare_data() → None[source]

Prepare the dataset object needed for the task. To be implemented by task subclasses.

test_dataloader() → Any[source]

Create test set dataloader.

test_step(batch: dict[str, Tensor]) → None[source]

Perform test step.

train_dataloader() → Any[source]

Create train set dataloader.

training_step(batch: dict[str, Tensor]) → Tensor[source]

Perform train step.

val_dataloader() → Any[source]

Create validation set dataloader.

validation_step(batch: dict[str, Tensor | Any]) → None[source]

Perform validation step.
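A new task is defined by subclassing TaskModule and implementing prepare_data; the Lightning hooks above come from the base class. Since this page cannot assume the training stack is installed, the sketch below mimics the contract with a plain ABC stand-in rather than the real LightningModule base, and the ToyTask name and its dataset are hypothetical:

```python
from abc import ABC, abstractmethod

class TaskModuleSketch(ABC):
    """Illustrative stand-in for bertblocks.benchmarks.base.TaskModule."""

    def __init__(self, pretrained_model_name_or_path: str, max_epochs: int = 3):
        self.pretrained_model_name_or_path = pretrained_model_name_or_path
        self.max_epochs = max_epochs

    @abstractmethod
    def prepare_data(self) -> None:
        """Build the dataset object for the task (implemented by subclasses)."""

class ToyTask(TaskModuleSketch):
    def prepare_data(self) -> None:
        # A real TaskModule would download and tokenize the task's dataset here.
        self.dataset = [{"text": "hello", "label": 0}]

task = ToyTask("bert-base-uncased")
task.prepare_data()
print(len(task.dataset))
```

In the real base class, the dataloader and step methods then consume whatever prepare_data set up.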

GLUE

class bertblocks.benchmarks.glue.GLUETaskModule(
pretrained_model_name_or_path: str,
pretrained_tokenizer_name_or_path: str,
max_seq_length: int | None = 512,
learning_rate: float | None = 1e-05,
weight_decay: float | None = 1e-06,
train_batch_size: int | None = 128,
eval_batch_size: int | None = 128,
num_workers: int | None = 2,
max_epochs: int = 3,
)[source]

Bases: TaskModule

Base class for GLUE benchmark tasks.

prepare_data() → None[source]

Obtain the data corresponding to the task.

Individual GLUE tasks: CoLA, SST2, MRPC, QQP, STSB, MNLI, QNLI, RTE, WNLI.
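When the GLUE tasks are passed to run_eval, each contributes rows to the returned DataFrame with the documented Name, Group, Type, Metric, and Score columns. A minimal sketch of summarizing such results per benchmark group, using made-up scores (not real benchmark output) and plain dicts in place of a DataFrame:

```python
from collections import defaultdict

# Rows shaped like run_eval's documented output columns; scores are illustrative.
rows = [
    {"Name": "CoLA", "Group": "GLUE", "Type": "classification", "Metric": "mcc", "Score": 0.55},
    {"Name": "SST2", "Group": "GLUE", "Type": "classification", "Metric": "accuracy", "Score": 0.91},
    {"Name": "STSB", "Group": "GLUE", "Type": "regression", "Metric": "spearman", "Score": 0.87},
]

# Average score per benchmark group.
scores_by_group: dict[str, list[float]] = defaultdict(list)
for row in rows:
    scores_by_group[row["Group"]].append(row["Score"])
group_means = {group: sum(s) / len(s) for group, s in scores_by_group.items()}
print(group_means)
```

With the actual DataFrame, the equivalent would be a groupby on the Group column.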

SuperGLEBer

class bertblocks.benchmarks.supergleber.SuperGLEBerTaskModule(
pretrained_model_name_or_path: str,
pretrained_tokenizer_name_or_path: str,
max_seq_length: int | None = 512,
learning_rate: float | None = 1e-05,
weight_decay: float | None = 1e-06,
train_batch_size: int | None = 128,
eval_batch_size: int | None = 128,
num_workers: int | None = 2,
max_epochs: int = 3,
)[source]

Bases: TaskModule

Base class for SuperGLEBer tasks.

prepare_data() → None[source]

Obtain the data corresponding to the task.