# Configuration

All architectural choices in BertBlocks are controlled through a single BertBlocksConfig dataclass. This page explains the key parameters and how to use preset configurations.

## Creating a Configuration

```python
import bertblocks as bb

config = bb.BertBlocksConfig(
    vocab_size=30522,
    hidden_size=768,
    num_blocks=12,
    num_attention_heads=12,
)
```

All parameters have sensible defaults. See the BertBlocksConfig API reference for the full list.

## Key Parameters

### Model dimensions

| Parameter | Description | Default |
| --- | --- | --- |
| `vocab_size` | Vocabulary size | |
| `hidden_size` | Hidden dimension | 768 |
| `intermediate_size` | FFN intermediate dimension | 3072 |
| `num_blocks` | Number of transformer layers | 12 |
| `num_attention_heads` | Number of attention heads | 12 |
| `max_position_embeddings` | Maximum sequence length | 512 |

### Architectural choices

| Parameter | Description | Options |
| --- | --- | --- |
| `norm_fn` | Normalization function | `"layer"`, `"rms"`, `"group"`, `"deep"`, `"dynamic_tanh"` |
| `norm_pos` | Where to apply normalization | `"pre"`, `"post"`, `"pre_and_post"`, `"none"` |
| `actv_fn` | Activation function | `"silu"`, `"gelu"`, `"relu"`, `"gelu_new"`, … |
| `mlp_type` | Feed-forward type | `"mlp"`, `"glu"` |
| `embd_pos_enc_kind` | Embedding positional encoding | `"sinusoidal"`, `"learned"`, `"none"` |
| `block_pos_enc_kind` | Block-level positional encoding | `"alibi"`, `"rope"`, `"none"` |
| `attention_backend` | Attention implementation | `"flash"`, `"sdpa"`, `"eager"` |
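These options can be combined freely in a single config. As a sketch, a pre-norm RMSNorm variant with a gated feed-forward and rotary position encoding might look like the following (the parameter names come from the table above; the particular combination of values is illustrative, not a recommended recipe):

```python
import bertblocks as bb

# Illustrative mix of architectural choices; values are arbitrary examples.
config = bb.BertBlocksConfig(
    hidden_size=768,
    num_blocks=12,
    num_attention_heads=12,
    norm_fn="rms",                 # RMSNorm instead of LayerNorm
    norm_pos="pre",                # pre-normalization
    actv_fn="silu",
    mlp_type="glu",                # gated feed-forward
    embd_pos_enc_kind="none",      # no positional encoding at the embedding
    block_pos_enc_kind="rope",     # rotary encoding inside each block instead
    attention_backend="sdpa",
)
```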

### Advanced options

| Parameter | Description |
| --- | --- |
| `num_kv_heads` | Number of key-value heads for GQA (defaults to `num_attention_heads`) |
| `qk_norm` | Enable query-key normalization |
| `local_attention` | Enable local (sliding window) attention |
| `local_attention_window_size` | Window size for local attention |
| `attention_gate` | Gating mechanism for attention output |
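For example, grouped-query attention and a sliding window can be enabled together. This is a sketch using the parameters listed above; the chosen values are illustrative, not defaults:

```python
import bertblocks as bb

# Sketch: 12 query heads sharing 4 key-value heads (GQA),
# with a 128-token sliding attention window.
config = bb.BertBlocksConfig(
    num_attention_heads=12,
    num_kv_heads=4,  # for GQA this typically must evenly divide num_attention_heads
    local_attention=True,
    local_attention_window_size=128,
)
```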

## Preset Configurations

BertBlocks includes preset configurations that reproduce known architectures:

### BertConfig

Reproduces the original BERT architecture:

```python
from bertblocks.config import BertConfig

config = BertConfig(vocab_size=30522)
```

You can also create a BertConfig from a HuggingFace model:

```python
config = BertConfig.from_huggingface("bert-base-uncased")
```

### ModernBertConfig

Reproduces the ModernBERT architecture:

```python
from bertblocks.config import ModernBertConfig

config = ModernBertConfig(vocab_size=50368)
config = ModernBertConfig.from_huggingface("answerdotai/ModernBERT-base")
```

## Validation

BertBlocksConfig validates parameters on construction. For example, hidden_size must be divisible by num_attention_heads, and enum-typed parameters are checked against their allowed values. Invalid configurations raise descriptive errors.
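The divisibility rule is easy to check by hand before constructing a config. The standalone sketch below mirrors it (the helper name and error message are illustrative, not part of the BertBlocks API):

```python
def head_dim(hidden_size: int, num_attention_heads: int) -> int:
    """Per-head dimension, enforcing the same divisibility rule the config checks."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by "
            f"num_attention_heads ({num_attention_heads})"
        )
    return hidden_size // num_attention_heads

print(head_dim(768, 12))  # 64
```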

## YAML Configuration

For training via the CLI, configurations are specified in YAML files. See the configs/ directory for examples:

```bash
uv run -m bertblocks fit --config configs/pretraining.yaml
```
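A minimal YAML file might look like the sketch below. The top-level layout (fields nested under a `model` key) is an assumption for illustration; check the files in configs/ for the actual schema:

```yaml
# Hypothetical sketch: BertBlocksConfig fields nested under a model key.
model:
  vocab_size: 30522
  hidden_size: 768
  num_blocks: 12
  num_attention_heads: 12
  norm_fn: rms
  block_pos_enc_kind: rope
```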