Configuration¶
BertBlocksConfig¶
- class bertblocks.config.BertBlocksConfig(
- vocab_size: int = 30522,
- max_sequence_length: int = 512,
- pad_token_id: int = 0,
- mask_token_id: int = 1,
- num_blocks: int = 12,
- attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] | None = None,
- local_attention: tuple[int, int] = (-1, -1),
- global_attention_every_n_layers: int = 0,
- initializer_kind: Literal['trunc_normal', 'kaiming_normal', 'kaiming_uniform', 'xavier_normal', 'xavier_uniform'] = 'trunc_normal',
- initializer_range: float = 0.02,
- initializer_cutoff_factor: float = 3.0,
- initializer_gain: float = 1.0,
- add_timestep_emb: bool = False,
- add_token_type_emb: bool = False,
- type_vocab_size: int = 1,
- head_type: Literal['proj', 'mlp', 'glu'] = 'mlp',
- include_final_norm: bool = True,
- residual_first_layer: bool = False,
- emb_dropout_prob: float = 0.1,
- actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'] = 'silu',
- num_attention_heads: int = 12,
- num_kv_heads: int | None = None,
- attention_gate: Literal['elementwise', 'headwise'] | None = None,
- hidden_size: int = 768,
- intermediate_size: int = 3072,
- emb_pos_enc_kind: Literal['sinusoidal', 'learned'] | None = None,
- emb_pos_enc_kwargs: dict[str, Any] | None = None,
- block_pos_enc_kind: Literal['alibi', 'rope', 'learned', 'learned_alibi'] | None = 'alibi',
- block_pos_enc_kwargs: dict[str, Any] | None = None,
- mlp_type: Literal['linear', 'mlp', 'glu'] = 'mlp',
- mlp_in_bias: bool = True,
- mlp_out_bias: bool = True,
- attn_proj_bias: bool = True,
- attn_out_bias: bool = True,
- norm_kind: Literal['pre', 'post', 'both', 'none'] = 'pre',
- norm_fn: Literal['group', 'layer', 'rms', 'deep', 'dynamictanh'] = 'rms',
- norm_eps: float = 1e-12,
- norm_bias: bool = True,
- norm_scaling: bool = False,
- norm_qk: bool = False,
- norm_params: dict[str, Any] | None = None,
- hidden_dropout_prob: float = 0.1,
- attn_dropout_prob: float = 0.1,
- classifier_dropout_prob: float = 0.1,
- problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] = 'regression',
- num_classes: int = 2,
- **kwargs: Any,
Configuration class for BertBlocks models.
- Variables:
model_type (str) – model type name for HuggingFace config resolution. Default: ‘bertblocks’
- Parameters:
vocab_size – The size of the vocabulary. This determines the number of unique tokens the model can process. Common values: 30522 (BERT), 50257 (GPT-2), 32000 (T5). Must be greater than 0.
max_sequence_length – Maximum number of tokens the model can process in a single sequence. This affects memory usage and determines the size of positional encodings (if used). Common values: 512 (BERT), 1024, 2048. Longer sequences require more memory. Must be greater than 0.
pad_token_id – The token ID used for padding sequences to the same length. This token is ignored during attention computation. Common values: 0 (BERT), 1 (RoBERTa). Must be non-negative and within the vocabulary range.
mask_token_id – The token ID used for masking tokens. Must be non-negative and within the vocabulary range.
hidden_size – The dimensionality of the hidden layers. This is the primary dimension of the model and affects memory usage and computational requirements. Common values: 768 (BERT-base), 1024 (BERT-large). Must be divisible by num_attention_heads. Must be greater than 0.
intermediate_size – The dimensionality of the feed-forward layers. This is typically 4x the hidden_size (e.g., 3072 for hidden_size=768). Must be greater than 0.
num_blocks – The number of transformer layers in the model. More layers generally improve model capacity but increase computational cost. Common values: 12 (BERT-base), 24 (BERT-large). Must be at least 1.
num_attention_heads – The number of attention heads in the multi-head attention mechanism. Each head has dimension hidden_size // num_attention_heads. More heads can capture different types of relationships. Common values: 12 (BERT-base), 16 (BERT-large). Must be at least 2 and hidden_size must be divisible by this value.
num_kv_heads – The number of key-value heads for Grouped Query Attention (GQA). When None (the default), this equals num_attention_heads and standard multi-head attention is used. When set to 1, multi-query attention (MQA) is used. Values between 1 and num_attention_heads enable GQA. Must divide num_attention_heads evenly.
emb_pos_enc_kind – The type of positional encoding to use at the embedding level. Available options: “sinusoidal” (Sinusoidal positional encoding), “learned” (Learned positional encoding).
emb_pos_enc_kwargs – Additional keyword arguments to pass to the positional encoding class. Values depend on the chosen emb_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically; these do not need to be specified.
block_pos_enc_kind – The type of positional encoding to use at the block level. Available options: “alibi” (ALiBi positional encoding), “rope” (Rotary positional encoding), “learned” (Learned positional encoding), “learned_alibi” (ALiBi positional encoding with linear layer).
block_pos_enc_kwargs – Additional keyword arguments to pass to the positional encoding class. Values depend on the chosen block_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically; these do not need to be specified.
attention_gate – Adds a query-dependent gating mechanism that modulates the hidden states after attention. Available options: None (no gating, default), “headwise” (gating per head), “elementwise” (gating per element).
add_token_type_emb – Whether to add token type embeddings to the model.
type_vocab_size – The size of the token_type vocabulary. Only used if add_token_type_emb is True.
mlp_type – The type of MLP (feed-forward) layer architecture. Available options: “linear” (Single linear projection), “mlp” (Standard two-layer feed-forward network), “glu” (Gated Linear Unit with learned gating mechanism, typically better performance).
head_type – The type of MLP (feed-forward) layer architecture for the final head. Available options: “proj” (Simple one-layer feed-forward network), “mlp” (Standard two-layer feed-forward network), “glu” (Gated Linear Unit with learned gating mechanism, typically better performance).
mlp_in_bias – Whether to include bias terms in the input projection of MLP layers.
mlp_out_bias – Whether to include bias terms in the output projection of MLP layers.
attn_proj_bias – Whether to include bias terms in the qkv projection of attention layers.
attn_out_bias – Whether to include bias terms in the output projection of attention layers.
local_attention – The local attention window as a (left, right) tuple of token offsets each position may attend to. The default (-1, -1) disables local attention and uses global attention in every layer.
global_attention_every_n_layers – The layer interval at which global attention is applied when local attention is enabled; the remaining layers use local attention.
initializer_kind – The initialization method for weights. Determines the type of distribution random weights are sampled from for initialization. Defaults to a truncated normal distribution.
initializer_range – Standard deviation for weight initialization. Smaller values lead to more conservative initialization. Common values: 0.02 (BERT). Must be greater than 0.0.
initializer_cutoff_factor – Cutoff factor for truncated normal initialization. Values beyond initializer_range * initializer_cutoff_factor are redrawn. This ensures no extremely large initial weights. Common values: 2.0-3.0. Must be greater than 0.0.
initializer_gain – Gain to scale initialized weights with, e.g., for DeepNorm. Must be greater than 0.0.
add_timestep_emb – Whether to add timestep embeddings to the model (only needed for some diffusion models).
actv_fn – The activation function used in feed-forward networks.
norm_kind – When to apply normalization in the transformer layers. Available options: “pre” (Pre-normalization, normalize before attention/FFN, default, more stable), “post” (Post-normalization, normalize after attention/FFN, as in original Transformer), “both” (Apply normalization both before and after), “none” (No normalization, not recommended).
norm_fn – The type of normalization to apply. Available options: “rms” (Root Mean Square Layer Normalization, default, more efficient), “layer” (Standard Layer Normalization as used in BERT), “group” (Group Normalization, useful for smaller batch sizes), “deep” (DeepNorm), “dynamictanh” (Dynamic Tanh Normalization).
norm_eps – Small constant added to variance for numerical stability in normalization. Prevents division by zero in layer normalization. Common values: 1e-12 (BERT).
norm_params – Additional parameters for custom normalization layers. This field allows passing custom parameters to normalization layers that require them. For example, for DeepNorm: {“alpha”: 0.81} where alpha is the scaling factor.
norm_bias – Whether to include bias terms in the output projection of normalization layers.
norm_scaling – Whether norm scaling should be enabled. Defaults to False.
norm_qk – Whether to apply query-key normalization.
include_final_norm – Whether to apply a final normalization of the last hidden state.
emb_dropout_prob – Dropout probability applied to the embedding layer output. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.
hidden_dropout_prob – Dropout probability applied to hidden layer outputs. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.
attn_dropout_prob – Dropout probability applied to attention weights. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.
classifier_dropout_prob – Dropout probability for the classification head. Applied to the pooled representation before the final classification layer. Helps prevent overfitting in downstream tasks. Must be between 0.0 and 1.0.
attn_implementation – Which backend implementation of attention to use: “flash_attention_2” for FlashAttention 2, “sdpa” for PyTorch’s scaled dot-product attention, or “eager” for the manual implementation. Defaults to SDPA when left as None.
problem_type – The problem type for automatic loss selection (HuggingFace standard). Automatically selects appropriate loss functions: “regression” (MSE loss for continuous targets), “single_label_classification” (CrossEntropy loss for single-label problems), “multi_label_classification” (BCEWithLogits loss for multi-label problems).
num_classes – The number of output classes for classification tasks. For regression tasks, typically 1. For binary classification, 2. For multi-class classification, the number of classes. Must be at least 1.
**kwargs – Additional keyword arguments passed to the parent PretrainedConfig class.
Preset Configurations¶
BertConfig¶
- class bertblocks.config.BertConfig(
- vocab_size: int = 28996,
- max_sequence_length: int = 512,
- pad_token_id: int = 0,
- mask_token_id: int = 103,
- hidden_size: int = 768,
- num_blocks: int = 12,
- intermediate_size: int = 3072,
- num_attention_heads: int = 12,
- pos_enc_kind: Literal['learned', 'absolute'] = 'absolute',
- type_vocab_size: int = 2,
- initializer_range: float = 0.02,
- actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'] = 'gelu',
- norm_eps: float = 1e-12,
- emb_dropout_prob: float = 0.1,
- attn_dropout_prob: float = 0.1,
- hidden_dropout_prob: float = 0.1,
- classifier_dropout_prob: float = 0.1,
- attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] = 'flash_attention_2',
Bases: BertBlocksConfig
BertBlocksConfig with default arguments applied for the Bert architecture.
- classmethod from_huggingface(
- pretrained_model_name_or_path: str,
- attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
Instantiate an equivalent BertBlocks BertConfig from a pretrained HuggingFace config.
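Conceptually, from_huggingface translates the field names of a pretrained HuggingFace config onto their BertBlocks equivalents. A rough sketch of that kind of renaming follows; the specific field pairs are assumptions inferred from the two parameter lists, not bertblocks’ actual mapping code:

```python
# Hypothetical rename table from HuggingFace BertConfig fields to the
# BertBlocks names documented above. The pairs are illustrative
# assumptions, not taken from bertblocks itself.
HF_TO_BERTBLOCKS = {
    "vocab_size": "vocab_size",
    "max_position_embeddings": "max_sequence_length",
    "num_hidden_layers": "num_blocks",
    "layer_norm_eps": "norm_eps",
    "hidden_act": "actv_fn",
}

def translate(hf_config: dict) -> dict:
    """Rename known HuggingFace fields, dropping anything unrecognized."""
    return {HF_TO_BERTBLOCKS[k]: v for k, v in hf_config.items()
            if k in HF_TO_BERTBLOCKS}

translate({"num_hidden_layers": 12, "layer_norm_eps": 1e-12})
# -> {'num_blocks': 12, 'norm_eps': 1e-12}
```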
ModernBertConfig¶
- class bertblocks.config.ModernBertConfig(
- vocab_size: int,
- max_sequence_length: int,
- pad_token_id: int,
- mask_token_id: int,
- hidden_size: int,
- num_blocks: int,
- intermediate_size: int,
- num_attention_heads: int,
- block_pos_enc_kwargs: dict[str, Any],
- mlp_in_bias: bool,
- mlp_out_bias: bool,
- attn_proj_bias: bool,
- attn_out_bias: bool,
- local_attention: tuple[int, int],
- global_attention_every_n_layers: int,
- initializer_range: float,
- actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'],
- norm_eps: float,
- norm_bias: bool,
- emb_dropout_prob: float,
- attn_dropout_prob: float,
- hidden_dropout_prob: float,
- classifier_dropout_prob: float,
- attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] = 'flash_attention_2',
Bases: BertBlocksConfig
BertBlocksConfig with default arguments applied for the ModernBert architecture.
- classmethod from_huggingface(
- pretrained_model_name_or_path: str,
- attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'flash_attention_2',
Instantiate an equivalent BertBlocks ModernBertConfig from a pretrained HuggingFace config.
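ModernBert alternates local and global attention through local_attention and global_attention_every_n_layers. Assuming that layer 0 and every n-th layer thereafter are global (an indexing convention chosen here for illustration, not confirmed by the source), the per-layer schedule can be sketched as:

```python
def attention_pattern(num_blocks: int, every_n: int) -> list[str]:
    """Sketch of the layer schedule implied by global_attention_every_n_layers.

    Assumes layer 0 is global and every `every_n`-th layer thereafter;
    this convention is an assumption, not taken from bertblocks.
    """
    if every_n <= 0:  # no alternation configured: all layers attend globally
        return ["global"] * num_blocks
    return ["global" if i % every_n == 0 else "local"
            for i in range(num_blocks)]

attention_pattern(6, 3)
# -> ['global', 'local', 'local', 'global', 'local', 'local']
```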