Modeling¶
The bertblocks.modeling package contains all neural network components.
Models¶
The top-level model classes that combine embedding, encoder, and task head.
- class bertblocks.modeling.model.BertBlocksPreTrainedModel(config: BertBlocksConfig, *args: Any, **kwargs: Any)[source]¶
Bases:
PreTrainedModel
Base class for all BertBlocks models.
This class provides the base configuration and weight initialization for all BertBlocks model variants. It inherits from HuggingFace’s PreTrainedModel to provide compatibility with the transformers library.
- config_class¶
alias of
BertBlocksConfig
- class bertblocks.modeling.model.BertBlocksModel(config: BertBlocksConfig, add_pooling_layer: bool = False)[source]¶
Bases:
BertBlocksPreTrainedModel
Core BertBlocks model for encoding sequences.
This is the base BertBlocks model that outputs hidden states without any task-specific head. It can be used as a feature extractor for downstream tasks.
- Variables:
embd (TokenEmbedding) – Embedding layer.
encd (Encoder) – Encoder stack.
norm (nn.Module) – Normalization layer. Falls back to nn.Identity if not configured.
pool (Pooler | None) – Pooler layer, optional.
pad_token_id (int) – Token ID to insert for padding.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters. Passed to other submodules.
add_pooling_layer (bool) – Whether to add a pooling layer after the encoder layers.
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- output_attentions: bool = False,
- output_hidden_states: bool = False,
Forward pass of the BertBlocks model.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
BaseModelOutput or BaseModelOutputWithPooling containing:
last_hidden_state: Hidden states from the last layer
pooler_output: Pooler output from the last layer (optional)
hidden_states: Hidden states from all layers (optional)
attentions: Attention weights from all layers (optional)
- Return type:
BaseModelOutput or BaseModelOutputWithPooling
- get_input_embeddings() Embedding[source]¶
Get the input token embeddings.
- Returns:
The input token embedding layer.
- Return type:
nn.Embedding
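The forward flow described above (token embedding → encoder stack → normalization → optional pooler) can be sketched with stock torch.nn stand-ins. The variable names mirror the documented attributes (embd, encd, norm), but the stand-in layers are illustrative, not the actual BertBlocks implementations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size, batch_size, seq_len = 100, 16, 2, 8

# Stand-ins for the documented attributes embd, encd, and norm.
embd = nn.Embedding(vocab_size, hidden_size)
layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
encd = nn.TransformerEncoder(layer, num_layers=2)
norm = nn.LayerNorm(hidden_size)

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
# last_hidden_state keeps one vector per token, suitable for feature extraction.
last_hidden_state = norm(encd(embd(input_ids)))
print(last_hidden_state.shape)  # torch.Size([2, 8, 16])
```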
Task Heads¶
- class bertblocks.modeling.model.BertBlocksForMaskedLM(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksPreTrainedModel
BertBlocks model for masked language modeling tasks.
This model extends the base BertBlocks model with a prediction head and decoder for masked language modeling. It can be used for pre-training or fine-tuning on masked language modeling tasks.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
vocab_size: Size of the vocabulary for token embeddings
hidden_size: Dimensionality of hidden layers
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for masked language modeling.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target token ids for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
MaskedLMOutput
loss: Masked language modeling loss if labels provided
logits: Prediction scores over vocabulary
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
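As a sketch of how the loss in MaskedLMOutput is typically computed from logits and labels: the common convention (used by HuggingFace models) sets non-masked label positions to -100, which cross_entropy ignores by default. Whether BertBlocks uses this exact convention is an assumption:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 6, 50
logits = torch.randn(batch_size, seq_len, vocab_size)  # prediction scores over vocabulary

# Targets only at masked positions; -100 elsewhere is skipped by cross_entropy.
labels = torch.full((batch_size, seq_len), -100)
labels[0, 2] = 7
labels[1, 4] = 31

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss.item() > 0)  # True
```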
- class bertblocks.modeling.model.BertBlocksForSequenceClassification(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForTasksBase
BertBlocks model for sequence classification tasks.
This model extends the base BertBlocks model with a classification head for sequence-level prediction tasks. It supports regression, single-label classification, and multi-label classification.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
num_classes: Number of output labels for classification tasks
problem_type: Problem type for automatic loss selection
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for sequence classification.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size,] or [batch_size, num_classes], optional) – Tensor of target labels for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
SequenceClassifierOutput
loss: Classification loss if labels provided
logits: Classification scores
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
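The three supported modes (regression, single-label, multi-label) map naturally onto different losses. The sketch below mirrors the HuggingFace problem_type convention; whether BertBlocksForSequenceClassification dispatches exactly this way is an assumption:

```python
import torch
import torch.nn as nn

def classification_loss(logits, labels, problem_type):
    # Loss selection by problem_type (HuggingFace-style convention, assumed here).
    if problem_type == "regression":
        return nn.MSELoss()(logits.squeeze(-1), labels.float())
    if problem_type == "single_label_classification":
        return nn.CrossEntropyLoss()(logits, labels)
    if problem_type == "multi_label_classification":
        return nn.BCEWithLogitsLoss()(logits, labels.float())
    raise ValueError(f"unknown problem_type: {problem_type}")

torch.manual_seed(0)
logits = torch.randn(4, 3)  # [batch_size, num_classes]
single = classification_loss(logits, torch.tensor([0, 2, 1, 1]), "single_label_classification")
multi = classification_loss(logits, torch.randint(0, 2, (4, 3)), "multi_label_classification")
print(single.item() > 0, multi.item() > 0)  # True True
```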
- class bertblocks.modeling.model.BertBlocksForTokenClassification(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForTasksBase
BertBlocks model for token classification tasks.
This model extends the base BertBlocks model with a classification head for token-level prediction tasks such as named entity recognition, part-of-speech tagging, and other sequence labeling tasks.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
num_classes: Number of output labels for classification tasks
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for token classification.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target labels for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
TokenClassifierOutput
loss: Token classification loss if labels provided
logits: Classification scores for each token
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
- class bertblocks.modeling.model.BertBlocksForQuestionAnswering(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForTasksBase
BertBlocks model for extractive question answering tasks.
This model extends the base BertBlocks model with a classification head that predicts start and end positions of answers in the input sequence. It is designed for tasks like SQuAD where the answer is a span of text within the provided context.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- start_positions: Tensor | None = None,
- end_positions: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for question answering.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
start_positions (torch.Tensor, shape [batch_size,], optional) – Tensor of start positions for computing loss. Values should be in [0, sequence_length-1]. Defaults to None.
end_positions (torch.Tensor, shape [batch_size,], optional) – Tensor of end positions for computing loss. Values should be in [0, sequence_length-1]. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
QuestionAnsweringModelOutput
loss: Span prediction loss if start_positions and end_positions provided
start_logits: Scores for start position of answer span
end_logits: Scores for end position of answer span
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
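A minimal way to decode an answer span from the returned start_logits and end_logits. This is a sketch: production decoders usually score all valid (start, end) pairs jointly rather than taking independent argmaxes:

```python
import torch

torch.manual_seed(0)
start_logits = torch.randn(2, 10)  # [batch_size, seq_len]
end_logits = torch.randn(2, 10)

start = start_logits.argmax(dim=-1)
end = end_logits.argmax(dim=-1)
# Naive guard against end < start; joint scoring over valid pairs is better.
end = torch.maximum(start, end)

spans = list(zip(start.tolist(), end.tolist()))
print(all(s <= e for s, e in spans))  # True
```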
- class bertblocks.modeling.model.BertBlocksForMaskedDiffusion(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForMaskedLM, GenerationMixin
Implementation of a masked diffusion model.
Closely follows https://github.com/kuleshov-group/mdlm
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for diffusion language modeling.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids. When training, should be timestep-corrupted token IDs.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None (all tokens are attended to).
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating uncorrupted token IDs.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- generate(
- input_ids: Tensor | None = None,
- attention_mask: Tensor | None = None,
- max_length: int | None = None,
- num_samples: int = 1,
- num_steps: int = 100,
- temperature: float = 1.0,
- eps: float = 1e-05,
- block_size: int | None = None,
Generate samples using iterative denoising from noise to data.
Supports both unconditional generation and prefix-conditioned generation. Compatible with HuggingFace tokenizer output.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Input token IDs to condition on. If provided, tokens where attention_mask=1 will be preserved during sampling. If None, generates unconditionally from scratch.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Mask indicating which input_ids positions to preserve (1) vs denoise (0). If None and input_ids is provided, all input positions are preserved.
max_length (int, optional) – Maximum sequence length to generate. If None, uses self.max_seq_len. If input_ids is shorter than max_length, extends with MASK tokens.
num_samples (int, optional) – Number of sequences to generate when input_ids is None. Defaults to 1.
num_steps (int, optional) – Number of denoising steps (more = higher quality, slower). Defaults to 100.
temperature (float, optional) – Temperature parameter. Defaults to 1.0.
eps (float, optional) – Final noise level. Defaults to 1e-5.
block_size (int, optional) – Size of blocks for block-wise denoising. If None, processes the whole sequence in parallel. Block denoising processes the sequence left-to-right in chunks, which can improve coherence for longer sequences.
- Returns:
Generated token sequences.
- Return type:
torch.Tensor (shape [batch_size, max_length] or [num_samples, max_length])
Examples
>>> # Unconditional generation
>>> sequences = model.generate(num_samples=4, max_length=128)

>>> # Prefix-conditioned generation
>>> inputs = tokenizer("The cat sat on", return_tensors="pt")
>>> sequences = model.generate(**inputs, max_length=128)

>>> # Block denoising for longer sequences
>>> sequences = model.generate(**inputs, max_length=256, block_size=64)
- infill(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- num_steps: int = 100,
- temperature: float = 1.0,
- eps: float = 1e-05,
- block_size: int | None = None,
Fill masked positions in the input using iterative diffusion denoising.
Unlike generate() which extends a prefix, this method fills in MASK tokens at arbitrary positions within the sequence. All non-MASK tokens are preserved.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Input sequences containing MASK tokens at positions to be filled. Non-MASK tokens will be preserved.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Mask indicating which positions are valid (1) vs padding (0). If None, all positions are valid.
num_steps (int, optional) – Number of denoising steps. Defaults to 100.
temperature (float, optional) – Sampling temperature. Defaults to 1.0.
eps (float, optional) – Final noise level. Defaults to 1e-5.
block_size (int, optional) – Size of blocks for block-wise denoising. If None, processes the whole sequence in parallel.
- Returns:
Sequences with MASK positions filled.
- Return type:
torch.Tensor (shape [batch_size, seq_len])
Examples
>>> # Fill middle of sequence
>>> text = "The cat [MASK] [MASK] [MASK] the mat."
>>> inputs = tokenizer(text, return_tensors="pt")
>>> filled = model.infill(inputs["input_ids"])

>>> # Block denoising for longer sequences
>>> filled = model.infill(inputs["input_ids"], block_size=64)
- class bertblocks.modeling.model.BertBlocksForEnhancedMaskedLM(
- config: BertBlocksConfig,
- masking_strategy: Literal['random'] = 'random',
- masking_probability: float = 0.5,
Bases:
BertBlocksForMaskedLM
BertBlocks model for enhanced masked language modeling tasks.
This model extends the base BertBlocks model with a prediction head and decoder for enhanced masked language modeling. It can be used for pre-training or fine-tuning on enhanced masked language modeling tasks. Enhanced masked language modeling uses one additional transformer layer to handle the masking, instead of masking input tokens.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
vocab_size: Size of the vocabulary for token embeddings
hidden_size: Dimensionality of hidden layers
masking_strategy (str) – Masking strategy to use. Available options: “random”.
masking_probability (float) – Probability of masking tokens. Defaults to 0.5.
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for masked language modeling.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target token ids for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
MaskedLMOutput
loss: Masked language modeling loss if labels provided
logits: Prediction scores over vocabulary
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
Output Types¶
- class bertblocks.modeling.model.MaybeUnpaddedBaseModelOutput(
- last_hidden_state: torch.FloatTensor | None = None,
- hidden_states: tuple[torch.FloatTensor, ...] | None = None,
- attentions: tuple[torch.FloatTensor, ...] | None = None,
- cu_seqlens: torch.FloatTensor | None = None,
- indices: torch.FloatTensor | None = None,
- seq_len: int | None = None,
- batch_size: int | None = None,
- class bertblocks.modeling.model.MaybeUnpaddedBaseModelOutputWithPooling(
- last_hidden_state: torch.FloatTensor | None = None,
- pooler_output: torch.FloatTensor | None = None,
- hidden_states: tuple[torch.FloatTensor, ...] | None = None,
- attentions: tuple[torch.FloatTensor, ...] | None = None,
- cu_seqlens: torch.FloatTensor | None = None,
- indices: torch.FloatTensor | None = None,
- seq_len: int | None = None,
- batch_size: int | None = None,
Transformer Block¶
- class bertblocks.modeling.block.Block(config: BertBlocksConfig, layer_id: int)[source]¶
Bases:
Module
A single transformer block.
Implements a standard transformer block with attention and feed-forward layers, supporting both pre-normalization and post-normalization schemes.
The block consists of:
Multi-head self-attention with residual connection
Feed-forward network with residual connection
Layer normalization (pre/post/both/none)
- Variables:
layer_id (int) – Index position of the layer in the model's encoder stack.
attn (Attention) – Attention module.
ffwd (nn.Module) – Feed-forward module.
pre_norm_attn (nn.Module) – Pre-normalization layer for attention module. Falls back to nn.Identity if not configured.
pre_norm_ffwd (nn.Module) – Pre-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
post_norm_attn (nn.Module) – Post-normalization layer for attention module. Falls back to nn.Identity if not configured.
post_norm_ffwd (nn.Module) – Post-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
attn_drop (nn.Dropout) – Post-attention dropout layer. Falls back to nn.Identity if not configured.
ffwd_drop (nn.Dropout) – Post-feed-forward dropout layer. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: Normalization layer type
attn_dropout_prob: Dropout probability for attention layer
hidden_dropout_prob: Dropout probability for feed-forward layers
layer_id (int) – Zero-indexed layer id indicating position in the encoder stack.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
“On Layer Normalization in the Transformer Architecture” (https://arxiv.org/pdf/2002.04745)
“The Curse of Depth in Large Language Models” (https://arxiv.org/pdf/2502.05795)
- forward(
- x: Tensor,
- attention_mask: Tensor | None = None,
- cu_seqlens: Tensor | None = None,
- max_seq_len: int | None = None,
Forward pass of the transformer block.
Applies a sequence of operations: pre-norm -> attention -> residual -> post-norm -> pre-norm -> feed-forward -> residual -> post-norm. Supports both padded and unpadded sequences.
- Parameters:
x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].
attention_mask (Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Boolean or float mask, broadcastable over attention heads. Ignored if cu_seqlens is provided. Defaults to None.
cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. If provided, enables flash attention optimized path. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Required when cu_seqlens is provided. Defaults to None.
- Returns:
A tuple containing:
output (Tensor): Transformed hidden state with same shape and dtype as input.
attention_weights (Tensor | None): Attention weights if returned by backend, otherwise None.
- Return type:
tuple[Tensor, Tensor | None]
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
“On Layer Normalization in the Transformer Architecture” (https://arxiv.org/pdf/2002.04745)
“The Curse of Depth in Large Language Models” (https://arxiv.org/pdf/2502.05795)
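The operation sequence described above can be sketched with stock torch.nn modules standing in for the documented attributes (attn, ffwd, the norm layers, and the dropouts). This illustrates the pre-norm variant, with post-norms falling back to nn.Identity as the attribute docs describe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 16
x = torch.randn(2, 8, hidden_size)

# Stand-ins for the documented submodules (illustrative, not the real Block).
attn = nn.MultiheadAttention(hidden_size, num_heads=4, batch_first=True)
ffwd = nn.Sequential(nn.Linear(hidden_size, 64), nn.GELU(), nn.Linear(64, hidden_size))
pre_norm_attn, pre_norm_ffwd = nn.LayerNorm(hidden_size), nn.LayerNorm(hidden_size)
post_norm_attn, post_norm_ffwd = nn.Identity(), nn.Identity()  # post-norm unused here
attn_drop, ffwd_drop = nn.Dropout(0.1), nn.Dropout(0.1)

# pre-norm -> attention -> residual -> post-norm
h = pre_norm_attn(x)
h, _ = attn(h, h, h)
h = post_norm_attn(x + attn_drop(h))
# pre-norm -> feed-forward -> residual -> post-norm
out = post_norm_ffwd(h + ffwd_drop(ffwd(pre_norm_ffwd(h))))
print(out.shape == x.shape)  # True
```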
- class bertblocks.modeling.block.EnhancedMaskingBlock(
- config: BertBlocksConfig,
- layer_id: int,
- masking_strategy: Literal['random'],
- masking_probability: float = 0.5,
Bases:
Block
A transformer block with enhanced attention masking.
Implements an enhanced masking transformer block which allows for custom modifications of the attention mask.
- Variables:
layer_id (int) – Index position of the layer in the model's encoder stack.
attn (Attention) – Attention module.
ffwd (nn.Module) – Feed-forward module.
pre_norm_attn (nn.Module) – Pre-normalization layer for attention module. Falls back to nn.Identity if not configured.
pre_norm_ffwd (nn.Module) – Pre-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
post_norm_attn (nn.Module) – Post-normalization layer for attention module. Falls back to nn.Identity if not configured.
post_norm_ffwd (nn.Module) – Post-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
attn_drop (nn.Dropout) – Post-attention dropout layer. Falls back to nn.Identity if not configured.
ffwd_drop (nn.Dropout) – Post-feed-forward dropout layer. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: Normalization layer type
attn_dropout_prob: Dropout probability for attention layer
hidden_dropout_prob: Dropout probability for feed-forward layers
layer_id (int) – Layer id indicating index in the encoder stack.
masking_strategy (str) – Masking strategy to use. Available options: “random”.
masking_probability (float) – Probability of masking tokens. Defaults to 0.5.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
“On Layer Normalization in the Transformer Architecture” (https://arxiv.org/pdf/2002.04745)
- forward(
- x: Tensor,
- attention_mask: Tensor | None = None,
- cu_seqlens: Tensor | None = None,
- max_seq_len: int | None = None,
Forward pass of the enhanced masking transformer block.
Applies custom masking strategy to attention before processing through the transformer. Supports random masking with configurable probability.
- Parameters:
x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].
attention_mask (Tensor, shape [batch_size, seq_len], optional) – 2D binary mask indicating which tokens are valid (1) vs padding (0). If None, all tokens are considered valid. Defaults to None.
cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Defaults to None.
- Returns:
A tuple containing:
output (Tensor): Transformed hidden state with same shape and dtype as input.
attention_weights (Tensor | None): Attention weights if returned by backend, otherwise None.
- Return type:
tuple[Tensor, Tensor | None]
Note
Diagonal of attention mask is set to 0 to prevent tokens from attending to themselves.
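A sketch of the "random" strategy consistent with the note above. The actual masking logic in EnhancedMaskingBlock is not shown in this reference; the convention that 1 = attend and 0 = masked out is an assumption:

```python
import torch

torch.manual_seed(0)
seq_len, masking_probability = 5, 0.5

# Each entry is kept with probability (1 - masking_probability).
mask = (torch.rand(seq_len, seq_len) > masking_probability).to(torch.int64)
# Per the note above: zero the diagonal so tokens cannot attend to themselves.
mask.fill_diagonal_(0)
print(mask.diagonal().sum().item())  # 0
```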
- class bertblocks.modeling.block.Encoder(config: BertBlocksConfig)[source]¶
Bases:
Module
Multi-layer transformer encoder.
Uses sequence packing for higher efficiency.
- Variables:
blocks (nn.ModuleList) – Stack of Block modules.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
num_blocks: Number of transformer blocks
num_attention_heads: Number of transformer attention heads
- forward(
- x: Tensor,
- attention_mask: Tensor | None,
- cu_seqlens: Tensor | None,
- max_seq_len: int | None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass of the encoder.
Processes input hidden state sequentially through all transformer blocks. Supports both padded and unpadded (packed) sequences for efficient processing.
- Parameters:
x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].
attention_mask (Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Ignored if cu_seqlens is provided. Defaults to None.
cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Defaults to None.
output_attentions (bool, optional) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool, optional) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
A tuple containing:
last_hidden_state (Tensor): Output of the final transformer layer with same shape as input.
all_hidden_states (tuple[Tensor, …] | None): Tuple of hidden states from all layers (including input embedding). Only returned if output_hidden_states=True, length = num_blocks + 1.
all_attentions (tuple[Tensor, …] | None): Tuple of attention weights from all layers. Only returned if output_attentions=True, length = num_blocks.
- Return type:
tuple[Tensor, tuple[Tensor, …] | None, tuple[Tensor, …] | None]
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
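A sketch of how packed (unpadded) inputs can be derived from a standard 2D attention mask, to clarify the cu_seqlens and max_seq_len arguments. The exact helper BertBlocks uses internally is not shown here:

```python
import torch
import torch.nn.functional as F

# Two sequences of lengths 3 and 2, padded to seq_len = 4.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])
seqlens = attention_mask.sum(dim=1)
# Cumulative sequence lengths, shape [batch_size + 1], starting at 0.
cu_seqlens = F.pad(seqlens.cumsum(dim=0), (1, 0)).to(torch.int32)
max_seq_len = int(seqlens.max())

x = torch.randn(2, 4, 16)
x_unpadded = x[attention_mask.bool()]  # [total_seq_len, hidden_size]
print(cu_seqlens.tolist(), max_seq_len, x_unpadded.shape)
# [0, 3, 5] 3 torch.Size([5, 16])
```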
Attention¶
- class bertblocks.modeling.attention.Attention(config: BertBlocksConfig, layer_id: int)[source]¶
Bases:
Module
Attention with configurable positional encodings.
- Variables:
num_heads (int) – Number of attention heads.
head_dim (int) – Dimension size of attention heads.
max_seq_len (int) – Maximum sequence length.
dropout_p (float) – Dropout probability for attention.
local_attention (tuple[int, int]) – Local attention size, if applied.
deterministic (bool) – Whether to use deterministic attention.
proj (nn.Linear) – Fused QKV projection layer.
ffwd (nn.Linear) – Feed-forward layer to combine heads after attention.
qk_norm (bool) – Whether to apply query-key-normalization.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
num_attention_heads: Number of attention heads in multi-head attention
hidden_size: Dimensionality of hidden layers (must be divisible by num_attention_heads)
max_sequence_length: Maximum sequence length for positional encodings
attn_proj_bias: Whether to include bias in QKV projection
attn_out_bias: Whether to include bias in output projection
attn_dropout_prob: Dropout probability for attention weights
block_pos_enc_kind: Type of positional embedding (“alibi”, “rope”, “relative”, etc.)
layer_id (int) – Layer id indicating index in the encoder stack.
- forward(
- x: Tensor,
- attention_mask: Tensor | None = None,
- cu_seqlens: Tensor | None = None,
- max_seq_len: int | None = None,
Forward pass of the attention mechanism.
Automatically routes to padded or unpadded implementation based on backend capabilities. Supports both standard padded sequences and packed (unpadded) sequences via flash attention.
- Parameters:
x (torch.Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to apply attention to. For padded inputs, use [batch_size, seq_len, hidden_size]. For unpadded inputs, use [total_seq_len, hidden_size].
attention_mask (torch.Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Should be in causal or full attention format. Ignored if cu_seqlens is provided. Defaults to None.
cu_seqlens (torch.Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences in packed format. If provided, enables flash attention optimized path. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Required when cu_seqlens is provided. Defaults to None.
- Returns:
A tuple containing:
output (torch.Tensor): Attention output with shape [batch_size, seq_len, hidden_size] (padded) or [total_seq_len, hidden_size] (unpadded).
attention_weights (torch.Tensor | None): Optional attention weights. None for most backends.
- Return type:
tuple[Tensor, Tensor | None]
- Raises:
ValueError – If neither attention_mask nor cu_seqlens is provided.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
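One way to build the expected 4D mask from a standard 2D padding mask (a sketch; whether the backend expects a boolean mask or an additive float mask may vary):

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])  # [batch_size, seq_len], 1 = valid token
batch_size, seq_len = attention_mask.shape

# Broadcast to [batch_size, 1, seq_len, seq_len]: key position j is attendable
# from every query position i iff token j is valid. The singleton head
# dimension broadcasts across all attention heads.
mask_4d = attention_mask[:, None, None, :].expand(batch_size, 1, seq_len, seq_len).bool()
print(mask_4d.shape)  # torch.Size([2, 1, 4, 4])
```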
- class bertblocks.modeling.attention.AttentionGate(config: BertBlocksConfig)[source]¶
Bases:
Module
A multiplicative attention gate that should be positioned ahead of the final feed-forward module.
Gating values are computed from the query vectors, which act as the input signal.
- Variables:
num_heads (int) – Number of attention heads.
head_dim (int) – Dimension size of attention heads.
attention_gate_type (AttentionGate) – Attention gate type.
gate_proj (nn.Linear) – Gating layer.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters. May be passed to other submodules.
References
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (https://openreview.net/pdf?id=1b7whO4SfY)
- forward(q: Tensor, x: Tensor) Tensor[source]¶
Forward pass of the attention gate.
- Parameters:
q (torch.Tensor, shape [batch_size, seq_len, num_heads, head_dim] or [total_seq_len, num_heads, head_dim]) – Query tensor.
x (torch.Tensor, shape [batch_size, seq_len, num_heads * head_dim] or [total_seq_len, num_heads * head_dim]) – Hidden state after attention.
- Returns:
Hidden state modulated by the query projection.
- Return type:
torch.Tensor
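The gating described above can be sketched as follows (a minimal, self-contained version; `QueryGate` and its shapes are illustrative, while the real `AttentionGate` reads `num_heads` and `head_dim` from the config and supports multiple gate types):

```python
import torch
import torch.nn as nn

class QueryGate(nn.Module):
    """Minimal sketch of a query-conditioned multiplicative gate."""
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        hidden = num_heads * head_dim
        self.gate_proj = nn.Linear(hidden, hidden)

    def forward(self, q: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # q: [..., num_heads, head_dim] -> flatten the head dims to match x
        gate = torch.sigmoid(self.gate_proj(q.flatten(-2)))
        # Elementwise modulation of the attention output by the gate values
        return x * gate
```

The sigmoid keeps gate values in (0, 1), so the gate can only attenuate the attention output, which is what makes it useful against attention sinks per the referenced paper.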
Attention Backends¶
- class bertblocks.modeling.backends.AttentionBackend[source]¶
Abstract base class for attention backends.
- forward_padded(
- q: Tensor,
- k: Tensor,
- v: Tensor,
- attention_mask: Tensor,
- dropout_p: float = 0.0,
- deterministic: bool = False,
Forward pass with padded sequences.
- Parameters:
q (Tensor, shape [batch_size, seq_len, num_heads, head_dim]) – Query tensor.
k (Tensor, shape [batch_size, seq_len, num_kv_heads, head_dim]) – Key tensor.
v (Tensor, shape [batch_size, seq_len, num_kv_heads, head_dim]) – Value tensor.
attention_mask (Tensor) – Attention mask.
dropout_p (float) – Dropout probability.
deterministic (bool) – Whether to use deterministic attention.
- Returns:
- Output tensor [batch_size, seq_len, num_heads * head_dim] and optional
attention weights.
- Return type:
tuple[Tensor, Tensor | None]
- forward_unpadded(
- q: Tensor,
- k: Tensor,
- v: Tensor,
- cu_seqlens: Tensor,
- max_seq_len: int,
- alibi_slopes: Tensor | None = None,
- local_attention: tuple[int, int] = (-1, -1),
- dropout_p: float = 0.0,
- deterministic: bool = False,
Forward pass with unpadded sequences.
- Parameters:
q (Tensor, shape [total_seq_len, num_heads, head_dim]) – Query tensor.
k (Tensor, shape [total_seq_len, num_kv_heads, head_dim]) – Key tensor.
v (Tensor, shape [total_seq_len, num_kv_heads, head_dim]) – Value tensor.
cu_seqlens (Tensor, shape [batch_size + 1]) – Cumulative sequence lengths.
max_seq_len (int) – Maximum sequence length in batch.
alibi_slopes (Tensor, optional) – ALiBi slopes for positional bias.
local_attention (tuple[int, int]) – Local attention window size.
dropout_p (float) – Dropout probability.
deterministic (bool) – Whether to use deterministic attention.
- Returns:
- Output tensor [total_seq_len, num_heads * head_dim] and optional attention
weights.
- Return type:
tuple[Tensor, Tensor | None]
- class bertblocks.modeling.backends.FlashBackend[source]¶
Bases:
AttentionBackend
Flash Attention 2 backend.
- class bertblocks.modeling.backends.SDPABackend[source]¶
Bases:
AttentionBackend
PyTorch SDPA backend. Works efficiently with padded sequences.
- class bertblocks.modeling.backends.EagerBackend[source]¶
Bases:
AttentionBackend
Native PyTorch backend.
- bertblocks.modeling.backends.get_attention(config: BertBlocksConfig) AttentionBackend[source]¶
Get the Attention backend specified in the configuration.
This factory function returns the appropriate attention backend based on the configuration.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
- Returns:
An attention backend module.
- Raises:
ValueError – If the specified attention backend is not supported.
Supported backends:
flash_attention_2: Flash Attention
sdpa: torch scaled dot product attention
eager: native torch attention
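The factory's dispatch-and-raise behavior can be sketched as follows (the mapping values are stand-in strings; the real `get_attention` instantiates the backend classes and reads the backend name from the config object):

```python
# Sketch of the dispatch pattern behind get_attention.
_BACKENDS = {
    "flash_attention_2": "FlashBackend",
    "sdpa": "SDPABackend",
    "eager": "EagerBackend",
}

def resolve_backend(kind: str) -> str:
    """Look up a backend name, raising ValueError for unknown kinds."""
    if kind not in _BACKENDS:
        raise ValueError(f"Unsupported attention backend: {kind!r}")
    return _BACKENDS[kind]
```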
Embeddings¶
- class bertblocks.modeling.embedding.TokenEmbedding(config: BertBlocksConfig)[source]¶
Bases:
Module
Token embedding layer.
Implements the token embedding layer that converts input token IDs to dense vector representations. Optionally applies positional encodings and/or token type encodings.
- Variables:
embd (nn.Embedding) – Token embedding layer.
pose (nn.Module | None) – Positional encoding layer.
tokt (nn.Module | None) – Token type embedding layer.
norm (nn.Module) – Normalization layer. Falls back to nn.Identity if not configured.
drop (nn.Dropout) – Dropout layer. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
vocab_size (int): Size of the vocabulary for token embeddings
hidden_size: Dimensionality of embeddings and hidden states
pad_token_id: Token ID used for padding sequences
emb_pos_enc_kind: Type of positional encoding (“sinusoidal”, “learned”, etc.)
max_sequence_length: Maximum sequence length for positional encodings
add_token_type_emb: Whether to add token type embeddings
norm_kind: When to apply normalization (“post”, “both”, etc.)
emb_dropout_prob: Dropout probability for embedding layer output
- forward(
- input_ids: LongTensor,
- cu_seqlens: LongTensor | None = None,
- token_type_ids: LongTensor | None = None,
Forward pass of the token embedding layer.
Combines token embeddings, optional token type embeddings, optional positional encodings, normalization, and dropout.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len] or [total_seq_len]) – Token IDs to embed. For padded inputs, shape is [batch_size, seq_len]. For unpadded inputs, shape is [total_seq_len].
cu_seqlens (torch.Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Used by positional encodings to compute per-sequence position indices. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len] or [total_seq_len], optional) – Segment IDs indicating token type (e.g., 0 for sentence A, 1 for sentence B in NSP). Only used if add_token_type_emb is True in config. Defaults to None.
- Returns:
- Embedded token representations with shape:
[batch_size, seq_len, hidden_size] for padded inputs
[total_seq_len, hidden_size] for unpadded inputs
- Return type:
torch.Tensor
References
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (https://arxiv.org/abs/1810.04805)
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
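A minimal sketch of the embedding pipeline described above, assuming learned positional encodings and token type embeddings are both enabled (all names and sizes are illustrative defaults, not the library's):

```python
import torch
import torch.nn as nn

class TinyTokenEmbedding(nn.Module):
    """Sketch of embed -> +token type -> +position -> norm -> dropout."""
    def __init__(self, vocab_size=100, hidden=16, max_len=32, p_drop=0.1):
        super().__init__()
        self.embd = nn.Embedding(vocab_size, hidden)
        self.tokt = nn.Embedding(2, hidden)        # token type embeddings
        self.pose = nn.Embedding(max_len, hidden)  # learned positions
        self.norm = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(p_drop)

    def forward(self, input_ids, token_type_ids=None):
        _, s = input_ids.shape
        h = self.embd(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)  # default type 0
        h = h + self.tokt(token_type_ids)
        h = h + self.pose(torch.arange(s, device=input_ids.device))
        return self.drop(self.norm(h))
```

This is the padded path; the unpadded path would instead derive per-sequence positions from `cu_seqlens`.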
- class bertblocks.modeling.embedding.TokenTypeEmbedding(config: BertBlocksConfig)[source]¶
Bases:
Module
Token type embedding layer.
Implements the token type embedding layer that converts token type IDs to dense vector representations.
- Variables:
embd (nn.Embedding) – Token type embedding layer.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
type_vocab_size: Size of the token type vocabulary
hidden_size: Dimensionality of embeddings and hidden states
- forward(x: Tensor, token_type_ids: Tensor | None = None) Tensor[source]¶
Forward pass of the token type embeddings.
Uses supplied token type ids if given, otherwise defaults to constant token type ids.
- Parameters:
x (torch.Tensor, shape [total_seq_len, hidden_size] or [batch_size, seq_len, hidden_size]) – Hidden state to add token type ids to.
token_type_ids (torch.Tensor, shape [total_seq_len] or [batch_size, seq_len], optional) – Indicates the token type of each token in the sequence.
- Returns:
- Hidden state with token type embedding added, shape [total_seq_len, hidden_size] or
[batch_size, seq_len, hidden_size].
- Return type:
torch.Tensor
Positional Encodings¶
- class bertblocks.modeling.position.SinusoidalPositionalEncoding(dim: int, max_seq_len: int = 1024, base: float = 10000.0)[source]¶
Bases:
Module
Implementation of Sinusoidal Positional Encodings.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
- forward(x: Tensor, cu_seqlens: Tensor | None = None) Tensor[source]¶
Add sinusoidal positional encoding to a given tensor.
- Parameters:
x (torch.Tensor) – The tensor to add positional encoding to. - For unpadded: shape [total_seq_len, embedding_dim] - For padded: shape [batch_size, seq_len, embedding_dim]
cu_seqlens (torch.Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths for unpadded sequences. If None, assumes padded format.
- Returns:
The tensor after adding positional encoding, same shape as input.
- Return type:
torch.Tensor
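The table added here can be sketched as the classic sin/cos encoding from "Attention Is All You Need" (a padded-format sketch; per-sequence position handling via `cu_seqlens` is omitted):

```python
import torch

def sinusoidal_table(max_len: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Build a [max_len, dim] table: even dims get sin, odd dims get cos."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [L, 1]
    idx = torch.arange(0, dim, 2, dtype=torch.float32)             # [dim/2]
    angle = pos / base ** (idx / dim)                              # [L, dim/2]
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe
```

The forward pass then adds the first `seq_len` rows of this table to the input.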
- class bertblocks.modeling.position.LearnedPositionalEncoding(dim: int, max_seq_len: int)[source]¶
Bases:
Module
Learned Positional Encodings.
- Variables:
embd (nn.Embedding) – The embedding layer encoding position.
- Parameters:
dim (int) – Dimensionality of the positional embeddings.
max_seq_len (int) – Maximum sequence length.
- forward(x: Tensor, cu_seqlens: Tensor | None = None) Tensor[source]¶
Add learned positional encodings to a given tensor.
- Parameters:
x (torch.Tensor) – The tensor to add positional encodings to. - For unpadded: shape [total_seq_len, embedding_dim] - For padded: shape [batch_size, seq_len, embedding_dim]
cu_seqlens (torch.Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths for unpadded sequences. If None, assumes padded format.
- Returns:
The tensor after adding learned positional encodings, same shape as input.
- Return type:
torch.Tensor
- class bertblocks.modeling.position.AlibiPositionalEncoding(num_heads: int)[source]¶
Bases:
Module
ALiBi Positional Encodings.
- Variables:
slopes (torch.Tensor) – The ALiBi slope tensor indicating the degree of positional bias for each head.
- Parameters:
num_heads (int) – Number of attention heads.
- forward(attention_mask: Tensor) Tensor[source]¶
Add ALiBi biases to a given attention mask.
- Parameters:
attention_mask (torch.Tensor, shape [batch_size, num_heads, seq_len, seq_len]) – The attention mask.
- Returns:
The attention mask after adding ALiBi biases. Same shape as input.
- Return type:
torch.Tensor
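The per-head slopes stored in `slopes` are conventionally computed as powers of two. The sketch below assumes `num_heads` is a power of two; the ALiBi paper interpolates additional slopes otherwise:

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Slopes 2^(-8i/n) for heads i = 1..n, per the ALiBi paper."""
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]
```

Each head's slope multiplies a distance matrix `-(|i - j|)` that is added to the attention mask, so heads with larger slopes attend more locally.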
- class bertblocks.modeling.position.RotaryPositionalEncoding(
- rope_dim: int,
- head_dim: int,
- base: float | None = 10000.0,
- interleaved: bool | None = False,
- max_seq_len: int = 512,
- device: device | str = 'cuda',
Bases:
Module
Implementation of rotary positional encodings.
- Parameters:
rope_dim (int) – dimensionality of positional encoding. Equal to head_dim for full RoPE.
head_dim (int) – dimensionality of attention heads.
base (float, optional) – frequency base for positional encodings. Defaults to 10_000.0
interleaved (bool, optional) – indicates whether to rotate pairs of even and odd dimensions (True, GPT-J style) instead of 1st half and 2nd half (False, GPT-NeoX style). Defaults to False.
device (torch.device | str, optional) – device on which to allocate the frequency buffer. Defaults to 'cuda'.
References
“RoFormer: Enhanced Transformer with Rotary Position Embedding” (https://arxiv.org/abs/2104.09864)
“GPT-NeoX-20B: An Open-Source Autoregressive Language Model” (https://arxiv.org/abs/2204.06745)
“Round and Round We Go! What makes Rotary Positional Encodings useful?” (https://arxiv.org/abs/2410.06205)
- forward( ) tuple[Tensor, Tensor][source]¶
Apply rotary positional encoding to query and key tensors.
- Parameters:
q (Tensor, shape [batch, seqlen, num_heads, head_dim] if padded or [total_seqlen, num_heads, head_dim] if unpadded) – Query tensor.
k (Tensor, shape [batch, seqlen, num_kv_heads, head_dim] if padded or [total_seqlen, num_kv_heads, head_dim] if unpadded) – Key tensor.
cu_seqlens (Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths if unpadded. Defaults to None.
max_seqlen (int, optional) – Maximum sequence length in batch. Defaults to None.
- Returns:
(q, k) with rotary position encoding applied, same shapes as input.
- Return type:
tuple[Tensor, Tensor]
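The non-interleaved (GPT-NeoX style) rotation can be sketched for a single unpadded tensor as follows (full-dimension RoPE is assumed here, i.e. `rope_dim == head_dim`):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate a [seq_len, num_heads, head_dim] tensor, NeoX half-split style."""
    s, _, d = x.shape
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2).float() / d)  # [d/2]
    ang = torch.arange(s).float()[:, None] * inv_freq[None, :]    # [s, d/2]
    cos = ang.cos()[:, None, :]  # broadcast over the head dimension
    sin = ang.sin()[:, None, :]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    # 2D rotation applied pairwise between the first and second halves
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

The interleaved (GPT-J style) variant instead rotates adjacent even/odd dimension pairs; both preserve the property that attention scores depend only on relative position.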
Feed-Forward Networks¶
- class bertblocks.modeling.mlp.MLP(hidden_size: int, intermediate_size: int, actv_fn: str, in_bias: bool = True, out_bias: bool = True)[source]¶
Bases:
Module
Standard Multi-Layer Perceptron for BertBlocks.
This class implements a standard two-layer MLP (feedforward network).
- Variables:
uprj (nn.Linear) – up projection layer, from hidden size to intermediate size.
actv (nn.Module) – Activation function.
dprj (nn.Linear) – down projection layer, from intermediate size to hidden size.
- Parameters:
hidden_size (int) – Dimensionality of hidden layers (input/output dimension).
intermediate_size (int) – Dimensionality of feed-forward layers.
actv_fn (str) – Activation function used in feed-forward networks.
in_bias (bool) – Whether to include bias in the input projection layer. Defaults to True.
out_bias (bool) – Whether to include bias in the output projection layer. Defaults to True.
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the MLP layer.
Applies standard feedforward transformation: activation(W1*x + b1)*W2 + b2 where biases are optional based on configuration.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Input tensor.
- Returns:
- Transformed tensor after two linear projections and activation,
shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- class bertblocks.modeling.mlp.GLU(hidden_size: int, intermediate_size: int, actv_fn: str, in_bias: bool = True, out_bias: bool = True)[source]¶
Bases:
Module
Gated Linear Unit (GLU) implementation for BertBlocks.
This class implements a GLU-style MLP layer that uses gating to control information flow.
- Variables:
uprj (nn.Linear) – up projection layer, from hidden size to 2 * intermediate size.
actv (nn.Module) – Activation function.
dprj (nn.Linear) – down projection layer, from intermediate size to hidden size.
- Parameters:
hidden_size (int) – Dimensionality of hidden layers (input/output dimension).
intermediate_size (int) – Dimensionality of feed-forward layers.
actv_fn (str) – Activation function used in feed-forward networks.
in_bias (bool) – Whether to include bias in the input projection layer. Defaults to True.
out_bias (bool) – Whether to include bias in the output projection layer. Defaults to True.
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the GLU layer.
Implements the gated linear unit computation: value * activation(gate) where both value and gate are linear projections of the input.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Input tensor.
- Returns:
- Transformed tensor after gated projection and down-projection,
shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
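The fused value/gate computation can be sketched as follows (SiLU is assumed for the activation, giving a SwiGLU-style unit; the real module builds its activation from `actv_fn`):

```python
import torch
import torch.nn as nn

class TinyGLU(nn.Module):
    """Sketch of a GLU MLP with a fused value/gate up-projection."""
    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.uprj = nn.Linear(hidden, 2 * intermediate)  # value and gate, fused
        self.actv = nn.SiLU()
        self.dprj = nn.Linear(intermediate, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the fused projection into the value and gate halves
        value, gate = self.uprj(x).chunk(2, dim=-1)
        return self.dprj(value * self.actv(gate))
```

Fusing both halves into a single `uprj` matmul matches the documented `2 * intermediate size` output of the up-projection layer.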
- class bertblocks.modeling.mlp.Linear(hidden_size: int, actv_fn: str, bias: bool = True)[source]¶
Bases:
Module
Linear layer wrapper implementation for BertBlocks.
- Variables:
ffwd (nn.Linear) – linear feed-forward layer.
actv (nn.Module) – activation function.
- Parameters:
hidden_size (int) – Dimensionality of hidden layers (input/output dimension).
actv_fn (str) – Activation function used in the feed-forward layer.
bias (bool) – Whether to include bias in the feed-forward layer. Defaults to True.
- forward(x: torch.Tensor) torch.Tensor[source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- bertblocks.modeling.mlp.get_mlp(config: BertBlocksConfig) nn.Module[source]¶
Get the MLP layer specified in the configuration.
This factory function returns the appropriate MLP architecture based on the configuration. Supports linear, standard MLP, and GLU variants.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
- Returns:
An MLP module (nn.Module) that can transform hidden states.
- Raises:
ValueError – If the specified MLP type is not supported.
- Supported MLP types:
linear: Standard single feed-forward layer.
mlp: Standard two-layer feedforward network
glu: Gated Linear Unit with learned gating mechanism
Normalization¶
- class bertblocks.modeling.norms.DynamicTanhNorm(alpha: float, dim: int)[source]¶
Bases:
Module
Dynamic Tanh normalization.
- Variables:
alpha (nn.Parameter) – learnable scalar input scale parameter.
beta (nn.Parameter) – learnable, per-channel shift parameter.
gamma (nn.Parameter) – learnable, per-channel scale parameter.
- Parameters:
alpha (float) – Initial value of the learnable input scale parameter.
dim (int) – Number of channels for the per-channel shift and scale parameters.
References
Transformers without Normalization (https://arxiv.org/pdf/2503.10622)
- forward(x: Tensor) Tensor[source]¶
Apply dynamic tanh normalization.
- Parameters:
x (torch.Tensor) – Input tensor to normalize.
- Returns:
Normalized tensor.
- Return type:
torch.Tensor
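Following the referenced paper, the computation amounts to `gamma * tanh(alpha * x) + beta`; a sketch (the default `alpha` here is illustrative):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Sketch of Dynamic Tanh from "Transformers without Normalization"."""
    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))  # scalar input scale
        self.gamma = nn.Parameter(torch.ones(dim))      # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))      # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Unlike LayerNorm, no per-token statistics are computed; the bounded tanh plays the role of activation squashing.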
- class bertblocks.modeling.norms.DeepNorm(alpha: float, normalized_shape: int | list[int], eps: float = 1e-05, **norm_kwargs: Any)[source]¶
Bases:
Module
DeepNorm normalization.
References
“DeepNet: Scaling Transformers to 1,000 Layers” (https://ieeexplore.ieee.org/document/10496231)
- forward(x: Tensor, gx: Tensor) Tensor[source]¶
Apply DeepNorm.
- Parameters:
x (torch.Tensor) – Input tensor.
gx (torch.Tensor) – Sublayer output tensor to be combined with the scaled input.
- Returns:
Normalized tensor.
- Return type:
torch.Tensor
- bertblocks.modeling.norms.get_norm(config: BertBlocksConfig) Module[source]¶
Get the normalization layer specified in the configuration.
This factory function returns the appropriate normalization layer based on the configuration. Supports different normalization techniques commonly used in transformer architectures.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
layer_id (int, optional) – Layer ID to index into per-layer config definitions. Unused for scalar config values.
- Returns:
A normalization module (nn.Module) that can normalize tensors.
- Raises:
ValueError – If the specified normalization type is not supported.
Supported normalization types:
group: Group normalization
layer: Layer normalization across the hidden dimension
rms: Root Mean Square layer normalization
deep: DeepNorm
dynamictanh: DynamicTanhNorm
Prediction Heads¶
- class bertblocks.modeling.head.Pooler(config: BertBlocksConfig)[source]¶
Bases:
Module
Pooling layer.
Applies a linear layer and activation function to the first token of the last hidden state.
- Variables:
ffwd – Feed-forward layer from hidden size to hidden size.
actv – Activation function.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
actv_fn: Activation function used in feed-forward networks
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the pooling layer.
- Parameters:
x (torch.Tensor, shape [batch_size, seq_len, hidden_size]) – Padded input hidden states.
- Returns:
Pooled representation of the first token. Shape [batch_size, hidden_size].
- Return type:
torch.Tensor
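First-token pooling can be sketched as follows (tanh is assumed here for concreteness; the real Pooler builds its activation from `actv_fn`):

```python
import torch
import torch.nn as nn

class TinyPooler(nn.Module):
    """Sketch of BERT-style pooling over the first ([CLS]) token."""
    def __init__(self, hidden: int):
        super().__init__()
        self.ffwd = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Select the first token of each sequence, then project and squash
        return torch.tanh(self.ffwd(x[:, 0]))  # [batch, hidden]
```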
- class bertblocks.modeling.head.ProjectionPredictionHead(config: BertBlocksConfig)[source]¶
Bases:
Module
Prediction head with linear projection.
- Variables:
pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.
ffwd (nn.Linear) – Feed-forward projection layer.
post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the prediction head.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.
- Returns:
Transformed hidden state, shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- class bertblocks.modeling.head.GLUPredictionHead(config: BertBlocksConfig)[source]¶
Bases:
Module
Prediction head with gated activation.
- Variables:
pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.
ffwd (nn.Module) – Feed-forward projection layer.
post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the prediction head.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.
- Returns:
Transformed hidden state, shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- class bertblocks.modeling.head.MLPPredictionHead(config: BertBlocksConfig)[source]¶
Bases:
Module
MLP prediction head.
- Variables:
pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.
ffwd (nn.Module) – Feed-forward projection layer.
post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the prediction head.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.
- Returns:
Transformed hidden state, shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- bertblocks.modeling.head.get_prediction_head(config: BertBlocksConfig) Module[source]¶
Get the prediction head layer specified in the configuration.
This factory function returns the appropriate prediction head architecture based on the configuration. Supports projection, MLP, and GLU variants.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
- Returns:
A prediction head module that can transform hidden states.
- Raises:
ValueError – If the specified prediction head type is not supported.
Supported prediction head types:
proj: Projection prediction head.
mlp: Standard two-layer feedforward network
glu: Gated Linear Unit
Activations¶
- bertblocks.modeling.activations.get_actv_fn(actv_fn: str) Module[source]¶
Get the activation function specified in the configuration.
- Parameters:
actv_fn (str) – Kind of activation function.
- Returns:
An activation function module that can be called on tensors.
- Return type:
nn.Module
- Raises:
ValueError – If the specified activation function is not supported.
Supported activation functions:
relu: Rectified Linear Unit
silu: Sigmoid Linear Unit (Swish)
gelu: Gaussian Error Linear Unit
leakyrelu: Leaky Rectified Linear Unit
selu: Scaled Exponential Linear Unit
logsigmoid: Log-sigmoid activation
sigmoid: Standard sigmoid activation
prelu: Parametric Rectified Linear Unit
Loss Functions¶
- bertblocks.modeling.loss.get_loss_function(
- problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] | None,
Return the applicable loss function for a given problem type.
- Parameters:
problem_type (Literal["regression", "single_label_classification", "multi_label_classification"] | None) – The type of problem.
- Returns:
The appropriate loss function module.
- Return type:
nn.Module
- Raises:
ValueError – If the problem type is not supported.
Padding Utilities¶
- bertblocks.modeling.padding.unpad_input( ) tuple[Tensor, Tensor, Tensor, int][source]¶
Remove padding from input sequences.
Automatically detects and handles both standard (binary 0/1) and packed (sequence-indexed) attention mask formats.
- Parameters:
input_ids (torch.Tensor, shape [batch, seqlen, ...]) – tensor of token IDs.
attention_mask (torch.Tensor | None, shape [batch, seqlen]) – token mask. Can be binary (standard) or sequence-indexed (packed).
pad_token_id (int | None) – id of the padding token to remove, optional. Only used if attention_mask is None. If both are None, assumes full inputs.
- Returns:
tuple[torch.Tensor, torch.Tensor, torch.Tensor, int]
unpadded_inputs (torch.Tensor, shape [total_seq_len, …]): the fused unpadded token IDs
indices (torch.Tensor, shape [total_seq_len,]): the indices of the non-padding tokens in the flattened input
cu_seqlens (torch.Tensor, [batch + 1,]): the cumulative sequence lengths
max_seqlen_in_batch (int): the maximum unpadded sequence length encountered in the batch
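For the standard binary-mask case, the unpadding can be sketched as follows (packed sequence-indexed masks, which the real function also handles, are omitted here):

```python
import torch

def unpad_sketch(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Gather non-padding tokens and build cumulative sequence lengths."""
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)       # [batch]
    indices = attention_mask.flatten().nonzero(as_tuple=True)[0]  # [total_seq_len]
    # Prepend a zero so cu_seqlens[i]:cu_seqlens[i+1] spans sequence i
    cu_seqlens = torch.nn.functional.pad(
        seqlens.cumsum(0, dtype=torch.int32), (1, 0)
    )
    unpadded = input_ids.flatten()[indices]
    return unpadded, indices, cu_seqlens, int(seqlens.max())
```

`pad_output` is the inverse operation: it scatters the unpadded tensor back into a `[batch, seqlen, ...]` tensor using the same `indices`.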
- bertblocks.modeling.padding.pad_output( ) Tensor[source]¶
Add padding to sequences.
- Parameters:
inputs (torch.Tensor, shape [total_nnz, ...]) – Input tensor, unpadded.
indices (torch.Tensor, shape [total_nnz,]) – Indices tensor.
batch (int) – batch size
seqlen (int) – sequence length
pad_token_id (int) – token ID to insert for padding.
- Returns:
The padded inputs, shape [batch, seqlen, …]
- Return type:
torch.Tensor
Scaling¶
- class bertblocks.modeling.scale.LayerScaler(layer_id: int)[source]¶
Bases:
Module
Scales an input inversely to the layer depth.
- Variables:
scaling_factor (torch.Tensor) – scaling factor.
- Parameters:
layer_id (int) – layer position in the encoder stack (0-indexed).
References
The Curse of Depth in Large Language Models (https://arxiv.org/pdf/2502.05795)
- forward(x: Tensor) Tensor[source]¶
Apply layer scaling.
- Parameters:
x (torch.Tensor) – Input tensor to scale.
- Returns:
Scaled tensor.
- Return type:
torch.Tensor
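The referenced paper proposes scaling contributions down with depth; a sketch assuming a `1/sqrt(layer_id + 1)` factor (the exact formula baked into `scaling_factor` is not shown in these docs):

```python
import math
import torch

def layer_scale(x: torch.Tensor, layer_id: int) -> torch.Tensor:
    """Scale an input inversely with depth: x / sqrt(layer_id + 1)."""
    return x / math.sqrt(layer_id + 1)
```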
- class bertblocks.modeling.scale.LearnableLayerScaler(layer_id: int)[source]¶
Bases:
Module
Scales an input with a learnable per-layer scale parameter.
Unlike LayerScaler which uses a fixed formula based on depth, this module learns an independent scale parameter for each layer during training.
- Variables:
scale (nn.Parameter) – Learnable scaling parameter.
- Parameters:
layer_id (int) – layer position in the encoder stack (0-indexed). Used to maintain interface compatibility with LayerScaler.
- forward(x: Tensor) Tensor[source]¶
Apply learnable layer scaling.
- Parameters:
x (torch.Tensor) – Input tensor to scale.
- Returns:
Scaled tensor.
- Return type:
torch.Tensor