Configuration

BertBlocksConfig

class bertblocks.config.BertBlocksConfig(
vocab_size: int = 30522,
max_sequence_length: int = 512,
pad_token_id: int = 0,
mask_token_id: int = 1,
num_blocks: int = 12,
attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] | None = None,
local_attention: tuple[int, int] = (-1, -1),
global_attention_every_n_layers: int = 1,
initializer_kind: Literal['trunc_normal', 'kaiming_normal', 'kaiming_uniform', 'xavier_normal', 'xavier_uniform'] = 'trunc_normal',
initializer_range: float = 0.02,
initializer_cutoff_factor: float = 3.0,
initializer_gain: float = 1.0,
add_timestep_emb: bool = False,
add_token_type_emb: bool = False,
type_vocab_size: int = 1,
head_type: Literal['proj', 'mlp', 'glu'] = 'mlp',
include_final_norm: bool = True,
residual_first_layer: bool = False,
emb_dropout_prob: float = 0.1,
actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'] = 'silu',
num_attention_heads: int = 12,
num_kv_heads: int | None = None,
attention_gate: Literal['elementwise', 'headwise'] | None = None,
hidden_size: int = 768,
intermediate_size: int = 3072,
emb_pos_enc_kind: Literal['sinusoidal', 'learned'] | None = None,
emb_pos_enc_kwargs: dict[str, Any] | None = None,
block_pos_enc_kind: Literal['alibi', 'rope', 'learned', 'learned_alibi'] | None = 'alibi',
block_pos_enc_kwargs: dict[str, Any] | None = None,
mlp_type: Literal['linear', 'mlp', 'glu'] = 'mlp',
mlp_in_bias: bool = True,
mlp_out_bias: bool = True,
attn_proj_bias: bool = True,
attn_out_bias: bool = True,
norm_kind: Literal['pre', 'post', 'both', 'none'] = 'pre',
norm_fn: Literal['group', 'layer', 'rms', 'deep', 'dynamictanh'] = 'rms',
norm_eps: float = 1e-12,
norm_bias: bool = True,
norm_scaling: bool = False,
norm_qk: bool = False,
norm_params: dict[str, Any] | None = None,
hidden_dropout_prob: float = 0.1,
attn_dropout_prob: float = 0.1,
classifier_dropout_prob: float = 0.1,
problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] = 'regression',
num_classes: int = 2,
**kwargs: Any,
)

Bases: PreTrainedConfig

Configuration class for BertBlocks models.

Variables:

model_type (str) – Model type name used for HuggingFace config resolution. Default: ‘bertblocks’

Parameters:
  • vocab_size – The size of the vocabulary. This determines the number of unique tokens the model can process. Common values: 30522 (BERT), 50257 (GPT-2), 32128 (T5). Must be greater than 0.

  • max_sequence_length – Maximum number of tokens the model can process in a single sequence. This affects memory usage and determines the size of positional encodings (if used). Common values: 512 (BERT), 1024, 2048. Longer sequences require more memory. Must be greater than 0.

  • pad_token_id – The token ID used for padding sequences to the same length. This token is ignored during attention computation. Common values: 0 (BERT), 1 (RoBERTa). Must be non-negative and within the vocabulary range.

  • mask_token_id – The token ID used for masking tokens. Must be non-negative and within the vocabulary range.

  • hidden_size – The dimensionality of the hidden layers. This is the primary dimension of the model and affects memory usage and computational requirements. Common values: 768 (BERT-base), 1024 (BERT-large). Must be divisible by num_attention_heads. Must be greater than 0.

  • intermediate_size – The dimensionality of the feed-forward layers. This is typically 4x the hidden_size (e.g., 3072 for hidden_size=768). Must be greater than 0.

  • num_blocks – The number of transformer layers in the model. More layers generally improve model capacity but increase computational cost. Common values: 12 (BERT-base), 24 (BERT-large). Must be at least 1.

  • num_attention_heads – The number of attention heads in the multi-head attention mechanism. Each head has dimension hidden_size // num_attention_heads. More heads can capture different types of relationships. Common values: 12 (BERT-base), 16 (BERT-large). Must be at least 2 and hidden_size must be divisible by this value.

  • num_kv_heads – The number of key-value heads for Grouped Query Attention (GQA). When None (the default) or equal to num_attention_heads, standard multi-head attention is used. When set to 1, multi-query attention (MQA) is used. Values between 1 and num_attention_heads enable GQA. Must divide num_attention_heads evenly.

  • emb_pos_enc_kind – The type of positional encoding to use at the embedding level. Available options: “sinusoidal” (Sinusoidal positional encoding), “learned” (Learned positional encoding).

  • emb_pos_enc_kwargs – Additional keyword arguments to pass to the embedding-level positional encoding class. Valid keys depend on the chosen emb_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically, so these do not need to be specified.

  • block_pos_enc_kind – The type of positional encoding to use at the block level. Available options: “alibi” (ALiBi positional encoding, default), “rope” (Rotary positional encoding), “learned” (Learned positional encoding), “learned_alibi” (ALiBi positional encoding with linear layer), or None for no block-level positional encoding.

  • block_pos_enc_kwargs – Additional keyword arguments to pass to the block-level positional encoding class. Valid keys depend on the chosen block_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically, so these do not need to be specified.

  • attention_gate – Adds a query-dependent gating mechanism that modulates the hidden states after attention. Available options: None (no gating, default), “headwise” (gating per head), “elementwise” (gating per element).

  • add_token_type_emb – Whether to add token type embeddings to the model.

  • type_vocab_size – The size of the token_type vocabulary. Only used if add_token_type_emb is True.

  • mlp_type – The type of MLP (feed-forward) layer architecture. Available options: “mlp” (Standard two-layer feed-forward network, default), “glu” (Gated Linear Unit with learned gating mechanism, typically better performance). The signature also accepts “linear”, which is not described here.

  • head_type – The type of MLP (feed-forward) layer architecture for the final head. Available options: “proj” (Simple one-layer feed-forward network), “mlp” (Standard two-layer feed-forward network), “glu” (Gated Linear Unit with learned gating mechanism, typically better performance).

  • mlp_in_bias – Whether to include bias terms in the input projection of MLP layers.

  • mlp_out_bias – Whether to include bias terms in the output projection of MLP layers.

  • attn_proj_bias – Whether to include bias terms in the qkv projection of attention layers.

  • attn_out_bias – Whether to include bias terms in the output projection of attention layers.

  • local_attention – The local attention window, given as a tuple of two integers. The default (-1, -1) disables local attention, so all layers use global attention.

  • global_attention_every_n_layers – The layer step size for global attention. Set to 0 to disable global attention. Set to 1 for global attention in every layer. Set to 2 for global attention in every other layer, etc.

  • initializer_kind – The initialization method for weights. Determines the type of distribution random weights are sampled from for initialization. Defaults to a truncated normal distribution.

  • initializer_range – Standard deviation for weight initialization. Smaller values lead to more conservative initialization. Common values: 0.02 (BERT). Must be greater than 0.0.

  • initializer_cutoff_factor – Cutoff factor for truncated normal initialization. Values beyond initializer_range * initializer_cutoff_factor are redrawn. This ensures no extremely large initial weights. Common values: 2.0-3.0. Must be greater than 0.0.

  • initializer_gain – Gain to scale initialized weights with, e.g., for DeepNorm. Must be greater than 0.0.

  • add_timestep_emb – Whether to add timestep embeddings to the model (only needed for some diffusion models).

  • actv_fn – The activation function used in feed-forward networks.

  • norm_kind – When to apply normalization in the transformer layers. Available options: “pre” (Pre-normalization, normalize before attention/FFN, default, more stable), “post” (Post-normalization, normalize after attention/FFN, as in original Transformer), “both” (Apply normalization both before and after), “none” (No normalization, not recommended).

  • norm_fn – The type of normalization to apply. Available options: “rms” (Root Mean Square Layer Normalization, default, more efficient), “layer” (Standard Layer Normalization as used in BERT), “group” (Group Normalization, useful for smaller batch sizes), “deep” (DeepNorm), “dynamictanh” (Dynamic Tanh Normalization).

  • norm_eps – Small constant added to variance for numerical stability in normalization. Prevents division by zero in layer normalization. Common values: 1e-12 (BERT).

  • norm_params – Additional parameters for custom normalization layers. This field allows passing custom parameters to normalization layers that require them. For example, for DeepNorm: {“alpha”: 0.81} where alpha is the scaling factor.

  • norm_bias – Whether to include bias terms in normalization layers.

  • norm_scaling – Whether norm scaling should be enabled. Defaults to False.

  • norm_qk – Whether to apply query-key normalization.

  • include_final_norm – Whether to apply a final normalization of the last hidden state.

  • emb_dropout_prob – Dropout probability applied to the embedding layer output. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.

  • hidden_dropout_prob – Dropout probability applied to hidden layer outputs. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.

  • attn_dropout_prob – Dropout probability applied to attention weights. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.

  • classifier_dropout_prob – Dropout probability for the classification head. Applied to the pooled representation before the final classification layer. Helps prevent overfitting in downstream tasks. Must be between 0.0 and 1.0.

  • attn_implementation – Which attention backend to use: “flash_attention_2” for FlashAttention-2, “sdpa” for PyTorch scaled dot-product attention, or “eager” for the manual implementation. If None (the signature default), SDPA is used.

  • problem_type – The problem type for automatic loss selection (HuggingFace standard). Automatically selects appropriate loss functions: “regression” (MSE loss for continuous targets), “single_label_classification” (CrossEntropy loss for single-label problems), “multi_label_classification” (BCEWithLogits loss for multi-label problems).

  • num_classes – The number of output classes for classification tasks. For regression tasks, typically 1. For binary classification, 2. For multi-class classification, the number of classes. Must be at least 1.

  • **kwargs – Additional keyword arguments passed to the parent PreTrainedConfig class.
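The divisibility constraints listed above for hidden_size, num_attention_heads and num_kv_heads can be illustrated with a small standalone sketch. ShapeSketch is a hypothetical stand-in, not the real BertBlocksConfig, and the fallback of num_kv_heads=None to num_attention_heads is an assumption based on the parameter description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ShapeSketch:
    """Hypothetical stand-in for a few BertBlocksConfig fields (not the real class)."""

    hidden_size: int = 768
    num_attention_heads: int = 12
    num_kv_heads: int | None = None  # assumption: None falls back to num_attention_heads

    def resolve(self) -> dict[str, int]:
        """Check the documented constraints and return the derived sizes."""
        if self.hidden_size <= 0:
            raise ValueError("hidden_size must be greater than 0")
        if self.num_attention_heads < 2:
            raise ValueError("num_attention_heads must be at least 2")
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")

        kv_heads = self.num_attention_heads if self.num_kv_heads is None else self.num_kv_heads
        if kv_heads < 1 or self.num_attention_heads % kv_heads != 0:
            raise ValueError("num_kv_heads must divide num_attention_heads evenly")

        return {
            "head_dim": self.hidden_size // self.num_attention_heads,
            "kv_heads": kv_heads,
            # 1 corresponds to standard multi-head attention,
            # num_attention_heads corresponds to multi-query attention.
            "queries_per_kv_head": self.num_attention_heads // kv_heads,
        }


print(ShapeSketch().resolve())                # defaults: head_dim 64, one query head per KV head
print(ShapeSketch(num_kv_heads=4).resolve())  # GQA: three query heads share each KV head
```

With the default sizes each head has dimension 64; setting num_kv_heads=1 in this sketch would give the multi-query case, where all 12 query heads share a single key-value head.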

classmethod from_bert_config(
orig_config: BertConfig,
) -> BertBlocksConfig

Instantiate a BertBlocksConfig from a HuggingFace BERT config object.

classmethod from_config(
orig_config: PreTrainedConfig,
attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
) -> BertBlocksConfig

Instantiate a BertBlocksConfig from any supported HuggingFace config object.

Dispatches to the appropriate from_*_config method based on the config type. Supported config types: BertConfig, ModernBertConfig.
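Type-based dispatch of this kind can be sketched as follows. This is only an illustration under stated assumptions: the stub classes stand in for the HuggingFace config classes, the converter functions and the CONVERTERS table are hypothetical, and the actual bertblocks implementation may be organised differently.

```python
class BertConfigStub:
    """Hypothetical stand-in for the HuggingFace BertConfig class."""


class ModernBertConfigStub:
    """Hypothetical stand-in for the HuggingFace ModernBertConfig class."""


def convert_bert(orig_config):
    # Placeholder for what from_bert_config would do.
    return "converted from BERT"


def convert_modernbert(orig_config):
    # Placeholder for what from_modernbert_config would do.
    return "converted from ModernBERT"


# Hypothetical lookup table: supported config type -> converter.
CONVERTERS = {
    BertConfigStub: convert_bert,
    ModernBertConfigStub: convert_modernbert,
}


def from_config(orig_config):
    """Pick the converter that matches the type of orig_config."""
    for config_type, converter in CONVERTERS.items():
        if isinstance(orig_config, config_type):
            return converter(orig_config)
    raise TypeError(f"Unsupported config type: {type(orig_config).__name__}")


print(from_config(BertConfigStub()))
print(from_config(ModernBertConfigStub()))
```

In this sketch an unsupported config type raises a TypeError; how the real method reports unsupported types is not stated here.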

classmethod from_huggingface(
pretrained_model_name_or_path: str,
attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
) -> BertBlocksConfig

Instantiate a BertBlocksConfig from any supported pretrained HuggingFace model.

Automatically detects the model type and dispatches to the appropriate method. Supported model types: BERT, ModernBERT.

classmethod from_huggingface_bert(
pretrained_model_name_or_path: str,
attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
) -> BertBlocksConfig

Instantiate a BertBlocksConfig from a pretrained HuggingFace BERT config.

classmethod from_huggingface_modernbert(
pretrained_model_name_or_path: str,
attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
) -> BertBlocksConfig

Instantiate a BertBlocksConfig from a pretrained HuggingFace ModernBERT config.

classmethod from_modernbert_config(
orig_config: ModernBertConfig,
) -> BertBlocksConfig

Instantiate a BertBlocksConfig from a HuggingFace ModernBERT config object.
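As a rough illustration of what such a conversion involves, the sketch below renames ModernBERT fields to their apparent BertBlocksConfig counterparts. The pairing is an assumption inferred only from the parameter names in the two signatures on this page; the FIELD_MAP table and translate helper are hypothetical, and the real from_modernbert_config may map more fields or map them differently.

```python
# Hypothetical field pairing, inferred from the parameter names only.
FIELD_MAP = {
    "vocab_size": "vocab_size",
    "hidden_size": "hidden_size",
    "intermediate_size": "intermediate_size",
    "num_hidden_layers": "num_blocks",
    "num_attention_heads": "num_attention_heads",
    "max_position_embeddings": "max_sequence_length",
    "pad_token_id": "pad_token_id",
    "initializer_range": "initializer_range",
    "initializer_cutoff_factor": "initializer_cutoff_factor",
    "norm_eps": "norm_eps",
    "norm_bias": "norm_bias",
}


def translate(modernbert_fields: dict) -> dict:
    """Rename the fields present in the input; fields without a pairing are dropped."""
    return {
        target: modernbert_fields[source]
        for source, target in FIELD_MAP.items()
        if source in modernbert_fields
    }


# Values taken from the ModernBertConfig defaults documented on this page.
print(translate({"num_hidden_layers": 22, "max_position_embeddings": 8192, "norm_bias": False}))
```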

model_type: str = 'bertblocks'

Preset Configurations

ModernBertConfig

class bertblocks.config.ModernBertConfig(
transformers_version: str | None = None,
architectures: list[str] | None = None,
output_hidden_states: bool | None = False,
return_dict: bool | None = True,
dtype: str | torch.dtype | None = None,
chunk_size_feed_forward: int = 0,
is_encoder_decoder: bool = False,
id2label: dict[int, str] | dict[str, str] | None = None,
label2id: dict[str, int] | dict[str, str] | None = None,
problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] | None = None,
vocab_size: int = 50368,
hidden_size: int = 768,
intermediate_size: int = 1152,
num_hidden_layers: int = 22,
num_attention_heads: int = 12,
hidden_activation: str = 'gelu',
max_position_embeddings: int = 8192,
initializer_range: float = 0.02,
initializer_cutoff_factor: float = 2.0,
norm_eps: float = 1e-05,
norm_bias: bool = False,
pad_token_id: int | None = 50283,
eos_token_id: int | list[int] | None = 50282,
bos_token_id: int | None = 50281,
cls_token_id: int | None = 50281,
sep_token_id: int | None = 50282,
attention_bias: bool = False,
attention_dropout: float | int = 0.0,
layer_types: list[str] | None = None,
rope_parameters: dict[Literal['full_attention', 'sliding_attention'], dict] | None = None,
local_attention: int = 128,
embedding_dropout: float | int = 0.0,
mlp_bias: bool = False,
mlp_dropout: float | int = 0.0,
decoder_bias: bool = True,
classifier_pooling: Literal['cls', 'mean'] = 'cls',
classifier_dropout: float | int = 0.0,
classifier_bias: bool = False,
classifier_activation: str = 'gelu',
deterministic_flash_attn: bool = False,
sparse_prediction: bool = False,
sparse_pred_ignore_index: int = -100,
tie_word_embeddings: bool = True,
)

Bases: PreTrainedConfig

This is the configuration class to store the configuration of a ModernBertModel. It is used to instantiate a ModernBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation of PreTrainedConfig for more information.

Parameters:
  • vocab_size (int, optional, defaults to 50368) – Vocabulary size of the model. Defines the number of different tokens that can be represented by the input_ids.

  • hidden_size (int, optional, defaults to 768) – Dimension of the hidden representations.

  • intermediate_size (int, optional, defaults to 1152) – Dimension of the MLP representations.

  • num_hidden_layers (int, optional, defaults to 22) – Number of hidden layers in the Transformer encoder.

  • num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.

  • hidden_activation (str, optional, defaults to “gelu”) – The non-linear activation function (function or string) in the encoder. For example, “gelu”, “relu”, “silu”, etc.

  • max_position_embeddings (int, optional, defaults to 8192) – The maximum sequence length that this model might ever be used with.

  • initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • initializer_cutoff_factor (float, optional, defaults to 2.0) – The cutoff factor for the truncated_normal_initializer for initializing all weight matrices.

  • norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the rms normalization layers.

  • norm_bias (bool, optional, defaults to False) – Whether to use bias in the normalization layers.

  • pad_token_id (int, optional, defaults to 50283) – Token id used for padding in the vocabulary.

  • eos_token_id (Union[int, list[int]], optional, defaults to 50282) – Token id used for end-of-stream in the vocabulary.

  • bos_token_id (int, optional, defaults to 50281) – Token id used for beginning-of-stream in the vocabulary.

  • cls_token_id (int, optional, defaults to 50281) – Token id used for CLS in the vocabulary.

  • sep_token_id (int, optional, defaults to 50282) – Token id used for separator in the vocabulary.

  • attention_bias (bool, optional, defaults to False) – Whether to use a bias in the query, key, value and output projection layers during self-attention.

  • attention_dropout (Union[float, int], optional, defaults to 0.0) – The dropout ratio for the attention probabilities.

  • layer_types (list[str], optional) – A list that explicitly maps each layer index with its layer type. If not provided, it will be automatically generated based on config values.

  • rope_parameters (dict[Literal['full_attention', 'sliding_attention'], dict], optional) – Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and, optionally, parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.

  • local_attention (int, optional, defaults to 128) – The window size for local attention.

  • embedding_dropout (Union[float, int], optional, defaults to 0.0) – The dropout ratio for the embeddings.

  • mlp_bias (bool, optional, defaults to False) – Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.

  • mlp_dropout (float, optional, defaults to 0.0) – The dropout ratio for the MLP layers.

  • decoder_bias (bool, optional, defaults to True) – Whether to use bias in the decoder layers.

  • classifier_pooling (str, optional, defaults to “cls”) – The pooling method for the classifier. Should be either “cls” or “mean”. In local attention layers, the CLS token doesn’t attend to all tokens on long sequences.

  • classifier_dropout (Union[float, int], optional, defaults to 0.0) – The dropout ratio for classifier.

  • classifier_bias (bool, optional, defaults to False) – Whether to use bias in the classifier.

  • classifier_activation (str, optional, defaults to “gelu”) – The activation function for the classifier.

  • deterministic_flash_attn (bool, optional, defaults to False) – Whether to use deterministic flash attention. If False, inference will be faster but not deterministic.

  • sparse_prediction (bool, optional, defaults to False) – Whether to use sparse prediction for the masked language model instead of returning the full dense logits.

  • sparse_pred_ignore_index (int, optional, defaults to -100) – The index to ignore for the sparse prediction.

  • tie_word_embeddings (bool, optional, defaults to True) – Whether to tie weight embeddings according to model’s tied_weights_keys mapping.

Examples:

```python
>>> from transformers import ModernBertModel, ModernBertConfig

>>> # Initializing a ModernBert style configuration
>>> configuration = ModernBertConfig()
>>> # Initializing a model from the modernbert-base style configuration
>>> model = ModernBertModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```

property sliding_window

Half-window size: local_attention is the total window, so we divide by 2.
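The relationship between local_attention and the half-window can be sketched in plain Python. The function names are hypothetical, and the symmetric, inclusive window used in can_attend is an assumption about how the half-window is applied, not a statement about the library's masking code.

```python
def sliding_window(local_attention: int) -> int:
    """Half-window size: the total window divided by 2 (integer division)."""
    return local_attention // 2


def can_attend(query_pos: int, key_pos: int, local_attention: int) -> bool:
    """Assumed rule: a query sees keys within the half-window on either side."""
    return abs(query_pos - key_pos) <= sliding_window(local_attention)


half = sliding_window(128)        # default local_attention of 128
print(half)                       # 64
print(can_attend(100, 164, 128))  # True: distance 64 is inside the window
print(can_attend(100, 165, 128))  # False: distance 65 is outside the window
```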

to_dict()

Serializes this instance to a Python dictionary.

Returns:

Dictionary of all the attributes that make up this configuration instance.

Return type:

dict[str, Any]

validate() -> None

Run class validators on the instance.