Configuration

BertBlocksConfig

class bertblocks.config.BertBlocksConfig(
vocab_size: int = 30522,
max_sequence_length: int = 512,
pad_token_id: int = 0,
mask_token_id: int = 1,
num_blocks: int = 12,
attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] | None = None,
local_attention: tuple[int, int] = (-1, -1),
global_attention_every_n_layers: int = 0,
initializer_kind: Literal['trunc_normal', 'kaiming_normal', 'kaiming_uniform', 'xavier_normal', 'xavier_uniform'] = 'trunc_normal',
initializer_range: float = 0.02,
initializer_cutoff_factor: float = 3.0,
initializer_gain: float = 1.0,
add_timestep_emb: bool = False,
add_token_type_emb: bool = False,
type_vocab_size: int = 1,
head_type: Literal['proj', 'mlp', 'glu'] = 'mlp',
include_final_norm: bool = True,
residual_first_layer: bool = False,
emb_dropout_prob: float = 0.1,
actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'] = 'silu',
num_attention_heads: int = 12,
num_kv_heads: int | None = None,
attention_gate: Literal['elementwise', 'headwise'] | None = None,
hidden_size: int = 768,
intermediate_size: int = 3072,
emb_pos_enc_kind: Literal['sinusoidal', 'learned'] | None = None,
emb_pos_enc_kwargs: dict[str, Any] | None = None,
block_pos_enc_kind: Literal['alibi', 'rope', 'learned', 'learned_alibi'] | None = 'alibi',
block_pos_enc_kwargs: dict[str, Any] | None = None,
mlp_type: Literal['linear', 'mlp', 'glu'] = 'mlp',
mlp_in_bias: bool = True,
mlp_out_bias: bool = True,
attn_proj_bias: bool = True,
attn_out_bias: bool = True,
norm_kind: Literal['pre', 'post', 'both', 'none'] = 'pre',
norm_fn: Literal['group', 'layer', 'rms', 'deep', 'dynamictanh'] = 'rms',
norm_eps: float = 1e-12,
norm_bias: bool = True,
norm_scaling: bool = False,
norm_qk: bool = False,
norm_params: dict[str, Any] | None = None,
hidden_dropout_prob: float = 0.1,
attn_dropout_prob: float = 0.1,
classifier_dropout_prob: float = 0.1,
problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] = 'regression',
num_classes: int = 2,
**kwargs: Any,
)[source]

Configuration class for BertBlocks models.

Variables:

model_type (str) – Model type name for HuggingFace config resolution. Default: ‘bertblocks’

Parameters:
  • vocab_size – The size of the vocabulary. This determines the number of unique tokens the model can process. Common values: 30522 (BERT), 50257 (GPT-2), 32000 (T5). Must be greater than 0.

  • max_sequence_length – Maximum number of tokens the model can process in a single sequence. This affects memory usage and determines the size of positional encodings (if used). Common values: 512 (BERT), 1024, 2048. Longer sequences require more memory. Must be greater than 0.

  • pad_token_id – The token ID used for padding sequences to the same length. This token is ignored during attention computation. Common values: 0 (BERT), 1 (RoBERTa). Must be non-negative and within the vocabulary range.

  • mask_token_id – The token ID used for masking tokens. Must be non-negative and within the vocabulary range.

  • hidden_size – The dimensionality of the hidden layers. This is the primary dimension of the model and affects memory usage and computational requirements. Common values: 768 (BERT-base), 1024 (BERT-large). Must be divisible by num_attention_heads. Must be greater than 0.

  • intermediate_size – The dimensionality of the feed-forward layers. This is typically 4x the hidden_size (e.g., 3072 for hidden_size=768). Must be greater than 0.

  • num_blocks – The number of transformer layers in the model. More layers generally improve model capacity but increase computational cost. Common values: 12 (BERT-base), 24 (BERT-large). Must be at least 1.

  • num_attention_heads – The number of attention heads in the multi-head attention mechanism. Each head has dimension hidden_size // num_attention_heads. More heads can capture different types of relationships. Common values: 12 (BERT-base), 16 (BERT-large). Must be at least 2 and hidden_size must be divisible by this value.

  • num_kv_heads – The number of key-value heads for Grouped Query Attention (GQA). When None (the default), this is set to num_attention_heads and standard multi-head attention is used. When set to 1, multi-query attention (MQA) is used. Values between 1 and num_attention_heads enable GQA. Must divide num_attention_heads evenly.

  • emb_pos_enc_kind – The type of positional encoding to use at the embedding level. Available options: “sinusoidal” (Sinusoidal positional encoding), “learned” (Learned positional encoding).

  • emb_pos_enc_kwargs – Additional keyword arguments to pass to the positional encoding class. Valid values depend on the chosen emb_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically; these do not need to be specified.

  • block_pos_enc_kind – The type of positional encoding to use at the block level. Available options: “alibi” (ALiBi positional encoding), “rope” (Rotary positional encoding), “learned” (Learned positional encoding), “learned_alibi” (ALiBi positional encoding with linear layer).

  • block_pos_enc_kwargs – Additional keyword arguments to pass to the positional encoding class. Valid values depend on the chosen block_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically; these do not need to be specified.

  • attention_gate – Adds a query-dependent gating mechanism that modulates the hidden states after attention. Available options: None (no gating, default), “headwise” (gating per head), “elementwise” (gating per element).

  • add_token_type_emb – Whether to add token type embeddings to the model.

  • type_vocab_size – The size of the token_type vocabulary. Only used if add_token_type_emb is True.

  • mlp_type – The type of MLP (feed-forward) layer architecture. Available options: “linear” (Single linear projection), “mlp” (Standard two-layer feed-forward network), “glu” (Gated Linear Unit with learned gating mechanism, typically better performance).

  • head_type – The type of MLP (feed-forward) layer architecture for the final head. Available options: “proj” (Simple one-layer feed-forward network), “mlp” (Standard two-layer feed-forward network), “glu” (Gated Linear Unit with learned gating mechanism, typically better performance).

  • mlp_in_bias – Whether to include bias terms in the input projection of MLP layers.

  • mlp_out_bias – Whether to include bias terms in the output projection of MLP layers.

  • attn_proj_bias – Whether to include bias terms in the qkv projection of attention layers.

  • attn_out_bias – Whether to include bias terms in the output projection of attention layers.

  • local_attention – The (left, right) window sizes for the local attention mechanism. The default (-1, -1) disables local attention, so every layer uses global attention.

  • global_attention_every_n_layers – The layer step size for global attention when local attention is enabled; every n-th layer uses global attention while the remaining layers attend locally. 0 (default) disables this pattern.

  • initializer_kind – The initialization method for weights. Determines the type of distribution random weights are sampled from for initialization. Defaults to a truncated normal distribution.

  • initializer_range – Standard deviation for weight initialization. Smaller values lead to more conservative initialization. Common values: 0.02 (BERT). Must be greater than 0.0.

  • initializer_cutoff_factor – Cutoff factor for truncated normal initialization. Values beyond initializer_range * initializer_cutoff_factor are redrawn. This ensures no extremely large initial weights. Common values: 2.0-3.0. Must be greater than 0.0.

  • initializer_gain – Gain to scale initialized weights with, e.g., for DeepNorm. Must be greater than 0.0.

  • add_timestep_emb – Whether to add timestep embeddings to the model (only needed for some diffusion models).

  • actv_fn – The activation function used in feed-forward networks.

  • norm_kind – When to apply normalization in the transformer layers. Available options: “pre” (Pre-normalization, normalize before attention/FFN, default, more stable), “post” (Post-normalization, normalize after attention/FFN, as in original Transformer), “both” (Apply normalization both before and after), “none” (No normalization, not recommended).

  • norm_fn – The type of normalization to apply. Available options: “rms” (Root Mean Square Layer Normalization, default, more efficient), “layer” (Standard Layer Normalization as used in BERT), “group” (Group Normalization, useful for smaller batch sizes), “deep” (DeepNorm), “dynamictanh” (Dynamic Tanh Normalization).

  • norm_eps – Small constant added to variance for numerical stability in normalization. Prevents division by zero in layer normalization. Common values: 1e-12 (BERT).

  • norm_params – Additional parameters for custom normalization layers. This field allows passing custom parameters to normalization layers that require them. For example, for DeepNorm: {"alpha": 0.81} where alpha is the scaling factor.

  • norm_bias – Whether to include bias terms in normalization layers.

  • norm_scaling – Whether norm scaling should be enabled. Defaults to False.

  • norm_qk – Whether to apply query-key normalization.

  • include_final_norm – Whether to apply a final normalization of the last hidden state.

  • emb_dropout_prob – Dropout probability applied to the embedding layer output. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.

  • hidden_dropout_prob – Dropout probability applied to hidden layer outputs. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.

  • attn_dropout_prob – Dropout probability applied to attention weights. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.

  • classifier_dropout_prob – Dropout probability for the classification head. Applied to the pooled representation before the final classification layer. Helps prevent overfitting in downstream tasks. Must be between 0.0 and 1.0.

  • attn_implementation – Which backend implementation of attention to use; can be “flash_attention_2” for FlashAttention-2, “sdpa” for PyTorch’s scaled dot-product attention, or “eager” for the manual implementation. Defaults to None, which resolves to “sdpa”.

  • problem_type – The problem type for automatic loss selection (HuggingFace standard). Automatically selects appropriate loss functions: “regression” (MSE loss for continuous targets), “single_label_classification” (CrossEntropy loss for single-label problems), “multi_label_classification” (BCEWithLogits loss for multi-label problems).

  • num_classes – The number of output classes for classification tasks. For regression tasks, typically 1. For binary classification, 2. For multi-class classification, the number of classes. Must be at least 1.

  • **kwargs – Additional keyword arguments passed to the parent PretrainedConfig class.

model_type: str = 'bertblocks'
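The head-related constraints documented above can be sketched in plain Python (bertblocks itself is not imported here; num_kv_heads = 4 is a hypothetical GQA setting, since the default None falls back to standard multi-head attention):

```python
# BERT-base defaults from the signature above.
hidden_size = 768
num_attention_heads = 12
num_kv_heads = 4  # hypothetical GQA setting; default None means MHA

# hidden_size must be divisible by num_attention_heads;
# each head then operates on hidden_size // num_attention_heads dimensions.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads  # 64

# num_kv_heads must divide num_attention_heads evenly; each KV head is
# shared by a group of query heads (GQA). num_kv_heads == 1 would be MQA.
assert num_attention_heads % num_kv_heads == 0
queries_per_kv_head = num_attention_heads // num_kv_heads  # 3

print(head_dim, queries_per_kv_head)  # 64 3
```

Violating either divisibility constraint is expected to fail config validation rather than silently misconfigure the attention layers.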

Preset Configurations

BertConfig

class bertblocks.config.BertConfig(
vocab_size: int = 28996,
max_sequence_length: int = 512,
pad_token_id: int = 0,
mask_token_id: int = 103,
hidden_size: int = 768,
num_blocks: int = 12,
intermediate_size: int = 3072,
num_attention_heads: int = 12,
pos_enc_kind: Literal['learned', 'absolute'] = 'absolute',
type_vocab_size: int = 2,
initializer_range: float = 0.02,
actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'] = 'gelu',
norm_eps: float = 1e-12,
emb_dropout_prob: float = 0.1,
attn_dropout_prob: float = 0.1,
hidden_dropout_prob: float = 0.1,
classifier_dropout_prob: float = 0.1,
attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] = 'flash_attention_2',
)[source]

Bases: BertBlocksConfig

BertBlocksConfig with default arguments applied for the BERT architecture.

classmethod from_huggingface(
pretrained_model_name_or_path: str,
attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
) → BertConfig[source]

Instantiate an equivalent BertBlocks BertConfig from a pretrained HuggingFace config.
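A minimal usage sketch, assuming bertblocks is installed; the checkpoint name and overridden fields are illustrative, not prescribed by the API:

```python
from bertblocks.config import BertConfig

# Build a BERT-style config from the preset defaults, overriding a few fields.
config = BertConfig(num_blocks=6, attn_implementation="sdpa")

# Or derive an equivalent config from a pretrained HuggingFace checkpoint.
config = BertConfig.from_huggingface(
    "bert-base-cased",  # illustrative checkpoint name
    attn_implementation="sdpa",
)
```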

ModernBertConfig

class bertblocks.config.ModernBertConfig(
vocab_size: int,
max_sequence_length: int,
pad_token_id: int,
mask_token_id: int,
hidden_size: int,
num_blocks: int,
intermediate_size: int,
num_attention_heads: int,
block_pos_enc_kwargs: dict[str, Any],
mlp_in_bias: bool,
mlp_out_bias: bool,
attn_proj_bias: bool,
attn_out_bias: bool,
local_attention: tuple[int, int],
global_attention_every_n_layers: int,
initializer_range: float,
actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'],
norm_eps: float,
norm_bias: bool,
emb_dropout_prob: float,
attn_dropout_prob: float,
hidden_dropout_prob: float,
classifier_dropout_prob: float,
attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] = 'flash_attention_2',
)[source]

Bases: BertBlocksConfig

BertBlocksConfig with default arguments applied for the ModernBERT architecture.

classmethod from_huggingface(
pretrained_model_name_or_path: str,
attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'flash_attention_2',
) → ModernBertConfig[source]

Instantiate an equivalent BertBlocks ModernBertConfig from a pretrained HuggingFace config.
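The local/global attention parameters in this preset interleave full-attention layers with windowed ones. A plain-Python sketch of that schedule (the 0-indexed `i % n == 0` convention is an assumption for illustration, not taken from the source):

```python
def attention_kinds(num_blocks: int, global_every_n: int) -> list[str]:
    """Label each layer 'global' or 'local' under the documented pattern.

    Assumes 0-indexed layers where every n-th layer (i % n == 0) uses
    global attention; global_every_n == 0 means every layer is global.
    """
    if global_every_n <= 0:
        return ["global"] * num_blocks
    return [
        "global" if i % global_every_n == 0 else "local"
        for i in range(num_blocks)
    ]

# A 6-layer model with global attention every 3rd layer:
print(attention_kinds(6, 3))
# ['global', 'local', 'local', 'global', 'local', 'local']
```

With global_attention_every_n_layers set to 0 or local_attention left at (-1, -1), the schedule degenerates to global attention in every layer.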