Configuration
BertBlocksConfig
- class bertblocks.config.BertBlocksConfig(
- vocab_size: int = 30522,
- max_sequence_length: int = 512,
- pad_token_id: int = 0,
- mask_token_id: int = 1,
- num_blocks: int = 12,
- attn_implementation: Literal['flash_attention_2', 'eager', 'sdpa'] | None = None,
- local_attention: tuple[int, int] = (-1, -1),
- global_attention_every_n_layers: int = 1,
- initializer_kind: Literal['trunc_normal', 'kaiming_normal', 'kaiming_uniform', 'xavier_normal', 'xavier_uniform'] = 'trunc_normal',
- initializer_range: float = 0.02,
- initializer_cutoff_factor: float = 3.0,
- initializer_gain: float = 1.0,
- add_timestep_emb: bool = False,
- add_token_type_emb: bool = False,
- type_vocab_size: int = 1,
- head_type: Literal['proj', 'mlp', 'glu'] = 'mlp',
- include_final_norm: bool = True,
- residual_first_layer: bool = False,
- emb_dropout_prob: float = 0.1,
- actv_fn: Literal['relu', 'silu', 'gelu', 'leakyrelu', 'selu', 'logsigmoid', 'sigmoid', 'prelu'] = 'silu',
- num_attention_heads: int = 12,
- num_kv_heads: int | None = None,
- attention_gate: Literal['elementwise', 'headwise'] | None = None,
- hidden_size: int = 768,
- intermediate_size: int = 3072,
- emb_pos_enc_kind: Literal['sinusoidal', 'learned'] | None = None,
- emb_pos_enc_kwargs: dict[str, Any] | None = None,
- block_pos_enc_kind: Literal['alibi', 'rope', 'learned', 'learned_alibi'] | None = 'alibi',
- block_pos_enc_kwargs: dict[str, Any] | None = None,
- mlp_type: Literal['linear', 'mlp', 'glu'] = 'mlp',
- mlp_in_bias: bool = True,
- mlp_out_bias: bool = True,
- attn_proj_bias: bool = True,
- attn_out_bias: bool = True,
- norm_kind: Literal['pre', 'post', 'both', 'none'] = 'pre',
- norm_fn: Literal['group', 'layer', 'rms', 'deep', 'dynamictanh'] = 'rms',
- norm_eps: float = 1e-12,
- norm_bias: bool = True,
- norm_scaling: bool = False,
- norm_qk: bool = False,
- norm_params: dict[str, Any] | None = None,
- hidden_dropout_prob: float = 0.1,
- attn_dropout_prob: float = 0.1,
- classifier_dropout_prob: float = 0.1,
- problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] = 'regression',
- num_classes: int = 2,
- **kwargs: Any,
Configuration class for BertBlocks models.
- Variables:
model_type (str) – Model type name used for Hugging Face config resolution. Default: ‘bertblocks’
- Parameters:
vocab_size – The size of the vocabulary. This determines the number of unique tokens the model can process. Common values: 30522 (BERT), 50257 (GPT-2), 32128 (T5). Must be greater than 0.
max_sequence_length – Maximum number of tokens the model can process in a single sequence. This affects memory usage and determines the size of positional encodings (if used). Common values: 512 (BERT), 1024, 2048. Longer sequences require more memory. Must be greater than 0.
pad_token_id – The token ID used for padding sequences to the same length. This token is ignored during attention computation. Common values: 0 (BERT), 1 (RoBERTa). Must be non-negative and within the vocabulary range.
mask_token_id – The token ID used for masking tokens. Must be non-negative and within the vocabulary range.
hidden_size – The dimensionality of the hidden layers. This is the primary dimension of the model and affects memory usage and computational requirements. Common values: 768 (BERT-base), 1024 (BERT-large). Must be divisible by num_attention_heads. Must be greater than 0.
intermediate_size – The dimensionality of the feed-forward layers. This is typically 4x the hidden_size (e.g., 3072 for hidden_size=768). Must be greater than 0.
num_blocks – The number of transformer layers in the model. More layers generally improve model capacity but increase computational cost. Common values: 12 (BERT-base), 24 (BERT-large). Must be at least 1.
num_attention_heads – The number of attention heads in the multi-head attention mechanism. Each head has dimension hidden_size // num_attention_heads. More heads can capture different types of relationships. Common values: 12 (BERT-base), 16 (BERT-large). Must be at least 2 and hidden_size must be divisible by this value.
num_kv_heads – The number of key-value heads for Grouped Query Attention (GQA). When None (the signature default) or equal to num_attention_heads, standard multi-head attention is used. When set to 1, multi-query attention (MQA) is used. Values strictly between 1 and num_attention_heads enable GQA. Must divide num_attention_heads evenly.
emb_pos_enc_kind – The type of positional encoding to use at the embedding level. Available options: “sinusoidal” (Sinusoidal positional encoding), “learned” (Learned positional encoding).
emb_pos_enc_kwargs – Additional keyword arguments to pass to the embedding-level positional encoding class. The accepted values depend on the chosen emb_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically, so these do not need to be specified.
block_pos_enc_kind – The type of positional encoding to use at the block level. Available options (matching the signature): “alibi” (ALiBi positional encoding, default), “rope” (Rotary positional encoding), “learned” (Learned positional encoding), “learned_alibi” (ALiBi positional encoding with linear layer), or None.
block_pos_enc_kwargs – Additional keyword arguments to pass to the block-level positional encoding class. The accepted values depend on the chosen block_pos_enc_kind. All positional encodings receive dim and max_seq_len automatically, so these do not need to be specified.
attention_gate – Adds a query-dependent gating mechanism that modulates the hidden states after attention. Available options: None (no gating, default), “headwise” (gating per head), “elementwise” (gating per element).
add_token_type_emb – Whether to add token type embeddings to the model.
type_vocab_size – The size of the token_type vocabulary. Only used if add_token_type_emb is True.
mlp_type – The type of MLP (feed-forward) layer architecture. Available options: “linear”, “mlp” (Standard two-layer feed-forward network, default), “glu” (Gated Linear Unit with learned gating mechanism, which often performs better).
head_type – The type of MLP (feed-forward) layer architecture for the final head. Available options: “proj” (Simple one-layer feed-forward network), “mlp” (Standard two-layer feed-forward network), “glu” (Gated Linear Unit with learned gating mechanism, typically better performance).
mlp_in_bias – Whether to include bias terms in the input projection of MLP layers.
mlp_out_bias – Whether to include bias terms in the output projection of MLP layers.
attn_proj_bias – Whether to include bias terms in the qkv projection of attention layers.
attn_out_bias – Whether to include bias terms in the output projection of attention layers.
local_attention – The window specification for local attention, given as a pair of integers. The default (-1, -1) disables local attention, so every layer uses global attention.
global_attention_every_n_layers – The layer step size for global attention. Set to 0 to disable global attention. Set to 1 for global attention in every layer. Set to 2 for global attention in every other layer, etc.
initializer_kind – The initialization method for weights. Determines the type of distribution random weights are sampled from for initialization. Defaults to a truncated normal distribution.
initializer_range – Standard deviation for weight initialization. Smaller values lead to more conservative initialization. Common values: 0.02 (BERT). Must be greater than 0.0.
initializer_cutoff_factor – Cutoff factor for truncated normal initialization. Values beyond initializer_range * initializer_cutoff_factor are redrawn. This ensures no extremely large initial weights. Common values: 2.0-3.0. Must be greater than 0.0.
initializer_gain – Gain to scale initialized weights with, e.g., for DeepNorm. Must be greater than 0.0.
add_timestep_emb – Whether to add timestep embeddings to the model (only needed for some diffusion models).
actv_fn – The activation function used in feed-forward networks.
norm_kind – When to apply normalization in the transformer layers. Available options: “pre” (Pre-normalization, normalize before attention/FFN, default, more stable), “post” (Post-normalization, normalize after attention/FFN, as in original Transformer), “both” (Apply normalization both before and after), “none” (No normalization, not recommended).
norm_fn – The type of normalization to apply. Available options: “rms” (Root Mean Square Layer Normalization, default, more efficient), “layer” (Standard Layer Normalization as used in BERT), “group” (Group Normalization, useful for smaller batch sizes), “deep” (DeepNorm), “dynamictanh” (Dynamic Tanh Normalization).
norm_eps – Small constant added to variance for numerical stability in normalization. Prevents division by zero in layer normalization. Common values: 1e-12 (BERT).
norm_params – Additional parameters for custom normalization layers. This field allows passing custom parameters to normalization layers that require them. For example, for DeepNorm: {“alpha”: 0.81} where alpha is the scaling factor.
norm_bias – Whether to include bias terms in the normalization layers.
norm_scaling – Whether norm scaling should be enabled. Defaults to False.
norm_qk – Whether to apply query-key normalization.
include_final_norm – Whether to apply a final normalization of the last hidden state.
emb_dropout_prob – Dropout probability applied to the embedding layer output. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.
hidden_dropout_prob – Dropout probability applied to hidden layer outputs. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.
attn_dropout_prob – Dropout probability applied to attention weights. Common values: 0.1 (default), 0.0 (no dropout). Must be between 0.0 and 1.0.
classifier_dropout_prob – Dropout probability for the classification head. Applied to the pooled representation before the final classification layer. Helps prevent overfitting in downstream tasks. Must be between 0.0 and 1.0.
attn_implementation – Which backend implementation of attention to use: “flash_attention_2” for FlashAttention2, “sdpa” for PyTorch scaled dot-product attention, or “eager” for the manual implementation. When None (the signature default), SDPA is used.
problem_type – The problem type for automatic loss selection (HuggingFace standard). Automatically selects appropriate loss functions: “regression” (MSE loss for continuous targets), “single_label_classification” (CrossEntropy loss for single-label problems), “multi_label_classification” (BCEWithLogits loss for multi-label problems).
num_classes – The number of output classes for classification tasks. For regression tasks, typically 1. For binary classification, 2. For multi-class classification, the number of classes. Must be at least 1.
**kwargs – Additional keyword arguments passed to the parent PreTrainedConfig class.
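The divisibility rules stated for hidden_size, num_attention_heads and num_kv_heads can be checked with a few lines of arithmetic. The following is a minimal sketch, not the library's own validation code: the function name resolve_attention_shape is hypothetical, and it assumes that num_kv_heads=None is treated the same as num_kv_heads=num_attention_heads.

```python
from typing import Optional


def resolve_attention_shape(
    hidden_size: int = 768,
    num_attention_heads: int = 12,
    num_kv_heads: Optional[int] = None,
) -> dict:
    """Hypothetical helper (not part of bertblocks) mirroring the documented constraints."""
    if hidden_size <= 0 or hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be positive and divisible by num_attention_heads")

    # Assumption: None falls back to one key-value head per query head.
    kv_heads = num_attention_heads if num_kv_heads is None else num_kv_heads
    if kv_heads < 1 or num_attention_heads % kv_heads != 0:
        raise ValueError("num_kv_heads must divide num_attention_heads evenly")

    if kv_heads == num_attention_heads:
        mode = "MHA"  # standard multi-head attention
    elif kv_heads == 1:
        mode = "MQA"  # all query heads share one key-value head
    else:
        mode = "GQA"  # query heads are split into groups

    return {
        "head_dim": hidden_size // num_attention_heads,
        "kv_heads": kv_heads,
        "queries_per_kv_head": num_attention_heads // kv_heads,
        "mode": mode,
    }


if __name__ == "__main__":
    print(resolve_attention_shape())                # defaults from the signature
    print(resolve_attention_shape(num_kv_heads=4))  # grouped query attention
```

With the default values, each head has dimension 768 // 12 = 64. Setting num_kv_heads=4 gives three query heads per key-value head.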
- classmethod from_bert_config(
- orig_config: BertConfig,
Instantiate a BertBlocksConfig from a HuggingFace BERT config object.
- classmethod from_config(
- orig_config: PreTrainedConfig,
- attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
Instantiate a BertBlocksConfig from any supported HuggingFace config object.
Dispatches to the appropriate from_*_config method based on the config type. Supported config types: BertConfig, ModernBertConfig.
- classmethod from_huggingface(
- pretrained_model_name_or_path: str,
- attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
Instantiate a BertBlocksConfig from any supported pretrained HuggingFace model.
Automatically detects the model type and dispatches to the appropriate method. Supported model types: BERT, ModernBERT.
- classmethod from_huggingface_bert(
- pretrained_model_name_or_path: str,
- attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
Instantiate a BertBlocksConfig from a pretrained HuggingFace BERT config.
- classmethod from_huggingface_modernbert(
- pretrained_model_name_or_path: str,
- attn_implementation: Literal['flash_attention_2', 'sdpa', 'eager'] = 'sdpa',
Instantiate a BertBlocksConfig from a pretrained HuggingFace ModernBERT config.
- classmethod from_modernbert_config(
- orig_config: ModernBertConfig,
Instantiate a BertBlocksConfig from a HuggingFace ModernBERT config object.
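The type-based dispatch described for from_config can be pictured as a lookup from config class to converter method. This is only an illustrative sketch under stated assumptions: BertConfigStub and ModernBertConfigStub are stand-ins for the real Hugging Face config classes, pick_converter is a hypothetical helper, and the actual bertblocks implementation may be structured differently.

```python
class BertConfigStub:
    """Stand-in for the Hugging Face BertConfig (assumption for this sketch)."""


class ModernBertConfigStub:
    """Stand-in for the Hugging Face ModernBertConfig (assumption for this sketch)."""


# Maps each supported config type to the name of the classmethod that converts it.
_CONVERTERS = {
    BertConfigStub: "from_bert_config",
    ModernBertConfigStub: "from_modernbert_config",
}


def pick_converter(orig_config: object) -> str:
    """Return the name of the from_*_config method that would handle orig_config."""
    for config_type, method_name in _CONVERTERS.items():
        if isinstance(orig_config, config_type):
            return method_name
    raise TypeError(f"Unsupported config type: {type(orig_config).__name__}")


if __name__ == "__main__":
    print(pick_converter(BertConfigStub()))
    print(pick_converter(ModernBertConfigStub()))
```

An unsupported config type raises an error in this sketch rather than falling back to a default conversion. How the real classmethod reports unsupported types is not stated in the documentation above.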
Preset Configurations
ModernBertConfig
- class bertblocks.config.ModernBertConfig(
- transformers_version: str | None = None,
- architectures: list[str] | None = None,
- output_hidden_states: bool | None = False,
- return_dict: bool | None = True,
- dtype: str | torch.dtype | None = None,
- chunk_size_feed_forward: int = 0,
- is_encoder_decoder: bool = False,
- id2label: dict[int, str] | dict[str, str] | None = None,
- label2id: dict[str, int] | dict[str, str] | None = None,
- problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] | None = None,
- vocab_size: int = 50368,
- hidden_size: int = 768,
- intermediate_size: int = 1152,
- num_hidden_layers: int = 22,
- num_attention_heads: int = 12,
- hidden_activation: str = 'gelu',
- max_position_embeddings: int = 8192,
- initializer_range: float = 0.02,
- initializer_cutoff_factor: float = 2.0,
- norm_eps: float = 1e-05,
- norm_bias: bool = False,
- pad_token_id: int | None = 50283,
- eos_token_id: int | list[int] | None = 50282,
- bos_token_id: int | None = 50281,
- cls_token_id: int | None = 50281,
- sep_token_id: int | None = 50282,
- attention_bias: bool = False,
- attention_dropout: float | int = 0.0,
- layer_types: list[str] | None = None,
- rope_parameters: dict[Literal['full_attention', 'sliding_attention'], dict] | None = None,
- local_attention: int = 128,
- embedding_dropout: float | int = 0.0,
- mlp_bias: bool = False,
- mlp_dropout: float | int = 0.0,
- decoder_bias: bool = True,
- classifier_pooling: Literal['cls', 'mean'] = 'cls',
- classifier_dropout: float | int = 0.0,
- classifier_bias: bool = False,
- classifier_activation: str = 'gelu',
- deterministic_flash_attn: bool = False,
- sparse_prediction: bool = False,
- sparse_pred_ignore_index: int = -100,
- tie_word_embeddings: bool = True,
Bases: PreTrainedConfig
This is the configuration class to store the configuration of a ModernBertModel. It is used to instantiate a ModernBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation of PreTrainedConfig for more information.
- Parameters:
vocab_size (int, optional, defaults to 50368) – Vocabulary size of the model. Defines the number of different tokens that can be represented by the input_ids.
hidden_size (int, optional, defaults to 768) – Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 1152) – Dimension of the MLP representations.
num_hidden_layers (int, optional, defaults to 22) – Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) – Number of attention heads for each attention layer in the Transformer encoder.
hidden_activation (str, optional, defaults to gelu) – The non-linear activation function (function or string) in the encoder. For example, “gelu”, “relu”, “silu”, etc.
max_position_embeddings (int, optional, defaults to 8192) – The maximum sequence length that this model might ever be used with.
initializer_range (float, optional, defaults to 0.02) – The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_cutoff_factor (float, optional, defaults to 2.0) – The cutoff factor for the truncated_normal_initializer for initializing all weight matrices.
norm_eps (float, optional, defaults to 1e-05) – The epsilon used by the rms normalization layers.
norm_bias (bool, optional, defaults to False) – Whether to use bias in the normalization layers.
pad_token_id (int, optional, defaults to 50283) – Token id used for padding in the vocabulary.
eos_token_id (Union[int, list[int]], optional, defaults to 50282) – Token id used for end-of-stream in the vocabulary.
bos_token_id (int, optional, defaults to 50281) – Token id used for beginning-of-stream in the vocabulary.
cls_token_id (int, optional, defaults to 50281) – Token id used for CLS in the vocabulary.
sep_token_id (int, optional, defaults to 50282) – Token id used for separator in the vocabulary.
attention_bias (bool, optional, defaults to False) – Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (Union[float, int], optional, defaults to 0.0) – The dropout ratio for the attention probabilities.
layer_types (list[str], optional) – A list that explicitly maps each layer index with its layer type. If not provided, it will be automatically generated based on config values.
rope_parameters (dict[Literal[full_attention, sliding_attention], dict], optional) – Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
local_attention (int, optional, defaults to 128) – The window size for local attention.
embedding_dropout (Union[float, int], optional, defaults to 0.0) – The dropout ratio for the embeddings.
mlp_bias (bool, optional, defaults to False) – Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
mlp_dropout (float, optional, defaults to 0.0) – The dropout ratio for the MLP layers.
decoder_bias (bool, optional, defaults to True) – Whether to use bias in the decoder layers.
classifier_pooling (str, optional, defaults to “cls”) – The pooling method for the classifier. Should be either “cls” or “mean”. In local attention layers, the CLS token doesn’t attend to all tokens on long sequences.
classifier_dropout (Union[float, int], optional, defaults to 0.0) – The dropout ratio for classifier.
classifier_bias (bool, optional, defaults to False) – Whether to use bias in the classifier.
classifier_activation (str, optional, defaults to “gelu”) – The activation function for the classifier.
deterministic_flash_attn (bool, optional, defaults to False) – Whether to use deterministic flash attention. If False, inference will be faster but not deterministic.
sparse_prediction (bool, optional, defaults to False) – Whether to use sparse prediction for the masked language model instead of returning the full dense logits.
sparse_pred_ignore_index (int, optional, defaults to -100) – The index to ignore for the sparse prediction.
tie_word_embeddings (bool, optional, defaults to True) – Whether to tie weight embeddings according to model’s tied_weights_keys mapping.
Examples:
```python
>>> from transformers import ModernBertModel, ModernBertConfig

>>> # Initializing a ModernBert style configuration
>>> configuration = ModernBertConfig()

>>> # Initializing a model from the modernbert-base style configuration
>>> model = ModernBertModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
- property sliding_window
Half-window size: local_attention is the total window, so we divide by 2.
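The relation between local_attention and the half-window can be sketched as follows. The helper names are hypothetical. The symmetric, inclusive window (a query at position i sees positions j with |i - j| <= half-window) is an assumption made for illustration, not something the documentation above states.

```python
def sliding_window(local_attention: int = 128) -> int:
    """Half-window size: the total window divided by 2 (integer division)."""
    return local_attention // 2


def visible_positions(query_pos: int, seq_len: int, local_attention: int = 128) -> range:
    """Positions a query token may attend to in a local-attention layer.

    Assumption: the window is symmetric around the query and clipped
    to the sequence boundaries.
    """
    half = sliding_window(local_attention)
    start = max(0, query_pos - half)
    stop = min(seq_len, query_pos + half + 1)
    return range(start, stop)


if __name__ == "__main__":
    print(sliding_window(128))
    print(visible_positions(query_pos=100, seq_len=1000))
```

Under these assumptions, the default local_attention of 128 gives a half-window of 64, so a token in the middle of a long sequence sees 64 positions on each side plus itself.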