Modeling¶
The bertblocks.modeling package contains all neural network components.
Models¶
The top-level model classes that combine embedding, encoder, and task head.
- class bertblocks.modeling.model.BertBlocksPreTrainedModel(config: BertBlocksConfig, *args: Any, **kwargs: Any)[source]¶
Bases:
PreTrainedModel
Base class for all BertBlocks models.
This class provides the base configuration and weight initialization for all BertBlocks model variants. It inherits from HuggingFace’s PreTrainedModel to provide compatibility with the transformers library.
- config_class¶
alias of
BertBlocksConfig
- class bertblocks.modeling.model.BertBlocksModel(config: BertBlocksConfig, add_pooling_layer: bool = False)[source]¶
Bases:
BertBlocksPreTrainedModel
Core BertBlocks model for encoding sequences.
This is the base BertBlocks model that outputs hidden states without any task-specific head. It can be used as a feature extractor for downstream tasks.
- Variables:
embd (TokenEmbedding) – Embedding layer.
encd (Encoder) – Encoder stack.
norm (nn.Module) – Normalization layer. Falls back to nn.Identity if not configured.
pool (Pooler | None) – Pooler layer, optional.
pad_token_id (int) – Token ID to insert for padding.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters. Passed to other submodules.
add_pooling_layer (bool) – Whether to add a pooling layer after the encoder layers.
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- output_attentions: bool = False,
- output_hidden_states: bool = False,
Forward pass of the BertBlocks model.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
BaseModelOutput or BaseModelOutputWithPooling containing:
last_hidden_state: Hidden states from the last layer
pooler_output: Pooler output from the last layer (optional)
hidden_states: Hidden states from all layers (optional)
attentions: Attention weights from all layers (optional)
- Return type:
BaseModelOutput or BaseModelOutputWithPooling
- get_input_embeddings() Embedding[source]¶
Get the input token embeddings.
- Returns:
The input token embedding layer.
- Return type:
nn.Embedding
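The forward flow described above (token embedding → encoder stack → normalization → optional pooler) can be sketched with stock torch.nn stand-ins. The variable names mirror the documented attributes (embd, encd, norm), but the stand-in layers are illustrative, not the actual BertBlocks implementations:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size, batch_size, seq_len = 100, 16, 2, 8

# Stand-ins for the documented attributes embd, encd, and norm.
embd = nn.Embedding(vocab_size, hidden_size)
layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
encd = nn.TransformerEncoder(layer, num_layers=2)
norm = nn.LayerNorm(hidden_size)

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
# last_hidden_state keeps one vector per token, suitable for feature extraction.
last_hidden_state = norm(encd(embd(input_ids)))
print(last_hidden_state.shape)  # torch.Size([2, 8, 16])
```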
Task Heads¶
- class bertblocks.modeling.model.BertBlocksForMaskedLM(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksPreTrainedModel
BertBlocks model for masked language modeling tasks.
This model extends the base BertBlocks model with a prediction head and decoder for masked language modeling. It can be used for pre-training or fine-tuning on masked language modeling tasks.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
vocab_size: Size of the vocabulary for token embeddings
hidden_size: Dimensionality of hidden layers
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for masked language modeling.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target token ids for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
MaskedLMOutput
loss: Masked language modeling loss if labels provided
logits: Prediction scores over vocabulary
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
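As a sketch of how the loss in MaskedLMOutput is typically computed from logits and labels: the common convention (used by HuggingFace models) sets non-masked label positions to -100, which cross_entropy ignores by default. Whether BertBlocks uses this exact convention is an assumption:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 6, 50
logits = torch.randn(batch_size, seq_len, vocab_size)  # prediction scores over vocabulary

# Targets only at masked positions; -100 elsewhere is skipped by cross_entropy.
labels = torch.full((batch_size, seq_len), -100)
labels[0, 2] = 7
labels[1, 4] = 31

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss.item() > 0)  # True
```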
- class bertblocks.modeling.model.BertBlocksForSequenceClassification(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForTasksBase
BertBlocks model for sequence classification tasks.
This model extends the base BertBlocks model with a classification head for sequence-level prediction tasks. It supports regression, single-label classification, and multi-label classification.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
num_classes: Number of output labels for classification tasks
problem_type: Problem type for automatic loss selection
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for sequence classification.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size,] or [batch_size, num_classes], optional) – Tensor of target labels for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
SequenceClassifierOutput
loss: Classification loss if labels provided
logits: Classification scores
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
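The three supported modes (regression, single-label, multi-label) map naturally onto different losses. The sketch below mirrors the HuggingFace problem_type convention; whether BertBlocksForSequenceClassification dispatches exactly this way is an assumption:

```python
import torch
import torch.nn as nn

def classification_loss(logits, labels, problem_type):
    # Loss selection by problem_type (HuggingFace-style convention, assumed here).
    if problem_type == "regression":
        return nn.MSELoss()(logits.squeeze(-1), labels.float())
    if problem_type == "single_label_classification":
        return nn.CrossEntropyLoss()(logits, labels)
    if problem_type == "multi_label_classification":
        return nn.BCEWithLogitsLoss()(logits, labels.float())
    raise ValueError(f"unknown problem_type: {problem_type}")

torch.manual_seed(0)
logits = torch.randn(4, 3)  # [batch_size, num_classes]
single = classification_loss(logits, torch.tensor([0, 2, 1, 1]), "single_label_classification")
multi = classification_loss(logits, torch.randint(0, 2, (4, 3)), "multi_label_classification")
print(single.item() > 0, multi.item() > 0)  # True True
```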
- class bertblocks.modeling.model.BertBlocksForTokenClassification(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForTasksBase
BertBlocks model for token classification tasks.
This model extends the base BertBlocks model with a classification head for token-level prediction tasks such as named entity recognition, part-of-speech tagging, and other sequence labeling tasks.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
num_classes: Number of output labels for classification tasks
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for token classification.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target labels for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
TokenClassifierOutput
loss: Token classification loss if labels provided
logits: Classification scores for each token
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
- class bertblocks.modeling.model.BertBlocksForQuestionAnswering(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForTasksBase
BertBlocks model for extractive question answering tasks.
This model extends the base BertBlocks model with a classification head that predicts start and end positions of answers in the input sequence. It is designed for tasks like SQuAD where the answer is a span of text within the provided context.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- start_positions: Tensor | None = None,
- end_positions: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for question answering.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
start_positions (torch.Tensor, shape [batch_size,], optional) – Tensor of start positions for computing loss. Values should be in [0, sequence_length-1]. Defaults to None.
end_positions (torch.Tensor, shape [batch_size,], optional) – Tensor of end positions for computing loss. Values should be in [0, sequence_length-1]. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
QuestionAnsweringModelOutput
loss: Span prediction loss if start_positions and end_positions provided
start_logits: Scores for start position of answer span
end_logits: Scores for end position of answer span
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
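A minimal way to decode an answer span from the returned start_logits and end_logits. This is a sketch: production decoders usually score all valid (start, end) pairs jointly rather than taking independent argmaxes:

```python
import torch

torch.manual_seed(0)
start_logits = torch.randn(2, 10)  # [batch_size, seq_len]
end_logits = torch.randn(2, 10)

start = start_logits.argmax(dim=-1)
end = end_logits.argmax(dim=-1)
# Naive guard against end < start; joint scoring over valid pairs is better.
end = torch.maximum(start, end)

spans = list(zip(start.tolist(), end.tolist()))
print(all(s <= e for s, e in spans))  # True
```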
- class bertblocks.modeling.model.BertBlocksForMaskedDiffusion(config: BertBlocksConfig)[source]¶
Bases:
BertBlocksForMaskedLM, GenerationMixin
Implementation of a masked diffusion model.
Closely follows https://github.com/kuleshov-group/mdlm
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for diffusion language modeling.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids. When training, should be timestep-corrupted token IDs.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None (all tokens are attended to).
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating uncorrupted token IDs.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- generate(
- input_ids: Tensor | None = None,
- attention_mask: Tensor | None = None,
- max_length: int | None = None,
- num_samples: int = 1,
- num_steps: int = 100,
- temperature: float = 1.0,
- eps: float = 1e-05,
- block_size: int | None = None,
Generate samples using iterative denoising from noise to data.
Supports both unconditional generation and prefix-conditioned generation. Compatible with HuggingFace tokenizer output.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Input token IDs to condition on. If provided, tokens where attention_mask=1 will be preserved during sampling. If None, generates unconditionally from scratch.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Mask indicating which input_ids positions to preserve (1) vs denoise (0). If None and input_ids is provided, all input positions are preserved.
max_length (int, optional) – Maximum sequence length to generate. If None, uses self.max_seq_len. If input_ids is shorter than max_length, extends with MASK tokens.
num_samples (int, optional) – Number of sequences to generate when input_ids is None. Defaults to 1.
num_steps (int, optional) – Number of denoising steps (more = higher quality, slower). Defaults to 100.
temperature (float, optional) – Temperature parameter. Defaults to 1.0.
eps (float, optional) – Final noise level. Defaults to 1e-5.
block_size (int, optional) – Size of blocks for block-wise denoising. If None, processes the whole sequence in parallel. Block denoising processes the sequence left-to-right in chunks, which can improve coherence for longer sequences.
- Returns:
Generated token sequences.
- Return type:
torch.Tensor (shape [batch_size, max_length] or [num_samples, max_length])
Examples
>>> # Unconditional generation
>>> sequences = model.generate(num_samples=4, max_length=128)

>>> # Prefix-conditioned generation
>>> inputs = tokenizer("The cat sat on", return_tensors="pt")
>>> sequences = model.generate(**inputs, max_length=128)

>>> # Block denoising for longer sequences
>>> sequences = model.generate(**inputs, max_length=256, block_size=64)
- infill(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- num_steps: int = 100,
- temperature: float = 1.0,
- eps: float = 1e-05,
- block_size: int | None = None,
Fill masked positions in the input using iterative diffusion denoising.
Unlike generate() which extends a prefix, this method fills in MASK tokens at arbitrary positions within the sequence. All non-MASK tokens are preserved.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Input sequences containing MASK tokens at positions to be filled. Non-MASK tokens will be preserved.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Mask indicating which positions are valid (1) vs padding (0). If None, all positions are valid.
num_steps (int, optional) – Number of denoising steps. Defaults to 100.
temperature (float, optional) – Sampling temperature. Defaults to 1.0.
eps (float, optional) – Final noise level. Defaults to 1e-5.
block_size (int, optional) – Size of blocks for block-wise denoising. If None, processes the whole sequence in parallel.
- Returns:
Sequences with MASK positions filled.
- Return type:
torch.Tensor (shape [batch_size, seq_len])
Examples
>>> # Fill middle of sequence
>>> text = "The cat [MASK] [MASK] [MASK] the mat."
>>> inputs = tokenizer(text, return_tensors="pt")
>>> filled = model.infill(inputs["input_ids"])

>>> # Block denoising for longer sequences
>>> filled = model.infill(inputs["input_ids"], block_size=64)
- class bertblocks.modeling.model.BertBlocksForEnhancedMaskedLM(
- config: BertBlocksConfig,
- masking_strategy: Literal['random'] = 'random',
- masking_probability: float = 0.5,
Bases:
BertBlocksForMaskedLM
BertBlocks model for enhanced masked language modeling tasks.
This model extends the base BertBlocks model with a prediction head and decoder for enhanced masked language modeling. It can be used for pre-training or fine-tuning on enhanced masked language modeling tasks. Enhanced masked language modeling uses one additional transformer layer to handle the masking, instead of masking input tokens.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
vocab_size: Size of the vocabulary for token embeddings
hidden_size: Dimensionality of hidden layers
masking_strategy (str) – Masking strategy to use. Available options: “random”.
masking_probability (float) – Probability of masking tokens. Defaults to 0.5.
- forward(
- input_ids: Tensor,
- attention_mask: Tensor | None = None,
- token_type_ids: Tensor | None = None,
- labels: Tensor | None = None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass for masked language modeling.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.
attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.
labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target token ids for computing loss. Defaults to None.
output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
MaskedLMOutput
loss: Masked language modeling loss if labels provided
logits: Prediction scores over vocabulary
hidden_states: Hidden states from all layers if requested
attentions: Attention weights from all layers if requested
Output Types¶
- class bertblocks.modeling.model.MaybeUnpaddedBaseModelOutput(
- last_hidden_state: torch.FloatTensor | None = None,
- hidden_states: tuple[torch.FloatTensor, ...] | None = None,
- attentions: tuple[torch.FloatTensor, ...] | None = None,
- cu_seqlens: torch.FloatTensor | None = None,
- indices: torch.FloatTensor | None = None,
- seq_len: int | None = None,
- batch_size: int | None = None,
- class bertblocks.modeling.model.MaybeUnpaddedBaseModelOutputWithPooling(
- last_hidden_state: torch.FloatTensor | None = None,
- pooler_output: torch.FloatTensor | None = None,
- hidden_states: tuple[torch.FloatTensor, ...] | None = None,
- attentions: tuple[torch.FloatTensor, ...] | None = None,
- cu_seqlens: torch.FloatTensor | None = None,
- indices: torch.FloatTensor | None = None,
- seq_len: int | None = None,
- batch_size: int | None = None,
Transformer Block¶
- class bertblocks.modeling.block.Block(config: BertBlocksConfig, layer_id: int)[source]¶
Bases:
Module
A single transformer block.
Implements a standard transformer block with attention and feed-forward layers, supporting both pre-normalization and post-normalization schemes.
The block consists of:
Multi-head self-attention with residual connection
Feed-forward network with residual connection
Layer normalization (pre/post/both/none)
- Variables:
layer_id (int) – Index position of the layer in the model's encoder stack.
attn (Attention) – Attention module.
ffwd (nn.Module) – Feed-forward module.
pre_norm_attn (nn.Module) – Pre-normalization layer for attention module. Falls back to nn.Identity if not configured.
pre_norm_ffwd (nn.Module) – Pre-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
post_norm_attn (nn.Module) – Post-normalization layer for attention module. Falls back to nn.Identity if not configured.
post_norm_ffwd (nn.Module) – Post-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
attn_drop (nn.Dropout) – Post-attention dropout layer. Falls back to nn.Identity if not configured.
ffwd_drop (nn.Dropout) – Post-feed-forward dropout layer. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: Normalization layer type
attn_dropout_prob: Dropout probability for attention layer
hidden_dropout_prob: Dropout probability for feed-forward layers
layer_id (int) – Zero-indexed layer id indicating position in the encoder stack.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
“On Layer Normalization in the Transformer Architecture” (https://arxiv.org/pdf/2002.04745)
“The Curse of Depth in Large Language Models” (https://arxiv.org/pdf/2502.05795)
- forward(
- x: Tensor,
- attention_mask: Tensor | None = None,
- cu_seqlens: Tensor | None = None,
- max_seq_len: int | None = None,
Forward pass of the transformer block.
Applies a sequence of operations: pre-norm -> attention -> residual -> post-norm -> pre-norm -> feed-forward -> residual -> post-norm. Supports both padded and unpadded sequences.
- Parameters:
x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].
attention_mask (Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Boolean or float mask, broadcastable over attention heads. Ignored if cu_seqlens is provided. Defaults to None.
cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. If provided, enables flash attention optimized path. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Required when cu_seqlens is provided. Defaults to None.
- Returns:
A tuple containing:
output (Tensor): Transformed hidden state with same shape and dtype as input.
attention_weights (Tensor | None): Attention weights if returned by backend, otherwise None.
- Return type:
tuple[Tensor, Tensor | None]
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
“On Layer Normalization in the Transformer Architecture” (https://arxiv.org/pdf/2002.04745)
“The Curse of Depth in Large Language Models” (https://arxiv.org/pdf/2502.05795)
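The operation sequence described above can be sketched with stock torch.nn modules standing in for the documented attributes (attn, ffwd, the norm layers, and the dropouts). This illustrates the pre-norm variant, with post-norms falling back to nn.Identity as the attribute docs describe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 16
x = torch.randn(2, 8, hidden_size)

# Stand-ins for the documented submodules (illustrative, not the real Block).
attn = nn.MultiheadAttention(hidden_size, num_heads=4, batch_first=True)
ffwd = nn.Sequential(nn.Linear(hidden_size, 64), nn.GELU(), nn.Linear(64, hidden_size))
pre_norm_attn, pre_norm_ffwd = nn.LayerNorm(hidden_size), nn.LayerNorm(hidden_size)
post_norm_attn, post_norm_ffwd = nn.Identity(), nn.Identity()  # post-norm unused here
attn_drop, ffwd_drop = nn.Dropout(0.1), nn.Dropout(0.1)

# pre-norm -> attention -> residual -> post-norm
h = pre_norm_attn(x)
h, _ = attn(h, h, h)
h = post_norm_attn(x + attn_drop(h))
# pre-norm -> feed-forward -> residual -> post-norm
out = post_norm_ffwd(h + ffwd_drop(ffwd(pre_norm_ffwd(h))))
print(out.shape == x.shape)  # True
```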
- class bertblocks.modeling.block.EnhancedMaskingBlock(
- config: BertBlocksConfig,
- layer_id: int,
- masking_strategy: Literal['random'],
- masking_probability: float = 0.5,
Bases:
Block
A transformer block with enhanced attention masking.
Implements an enhanced masking transformer block which allows for custom modifications of the attention mask.
- Variables:
layer_id (int) – Index position of the layer in the model's encoder stack.
attn (Attention) – Attention module.
ffwd (nn.Module) – Feed-forward module.
pre_norm_attn (nn.Module) – Pre-normalization layer for attention module. Falls back to nn.Identity if not configured.
pre_norm_ffwd (nn.Module) – Pre-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
post_norm_attn (nn.Module) – Post-normalization layer for attention module. Falls back to nn.Identity if not configured.
post_norm_ffwd (nn.Module) – Post-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.
attn_drop (nn.Dropout) – Post-attention dropout layer. Falls back to nn.Identity if not configured.
ffwd_drop (nn.Dropout) – Post-feed-forward dropout layer. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: Normalization layer type
attn_dropout_prob: Dropout probability for attention layer
hidden_dropout_prob: Dropout probability for feed-forward layers
layer_id (int) – Layer id indicating index in the encoder stack.
masking_strategy (str) – Masking strategy to use. Available options: “random”.
masking_probability (float) – Probability of masking tokens. Defaults to 0.5.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
“On Layer Normalization in the Transformer Architecture” (https://arxiv.org/pdf/2002.04745)
- forward(
- x: Tensor,
- attention_mask: Tensor | None = None,
- cu_seqlens: Tensor | None = None,
- max_seq_len: int | None = None,
Forward pass of the enhanced masking transformer block.
Applies custom masking strategy to attention before processing through the transformer. Supports random masking with configurable probability.
- Parameters:
x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].
attention_mask (Tensor, shape [batch_size, seq_len], optional) – 2D binary mask indicating which tokens are valid (1) vs padding (0). If None, all tokens are considered valid. Defaults to None.
cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Defaults to None.
- Returns:
A tuple containing:
output (Tensor): Transformed hidden state with same shape and dtype as input.
attention_weights (Tensor | None): Attention weights if returned by backend, otherwise None.
- Return type:
tuple[Tensor, Tensor | None]
Note
Diagonal of attention mask is set to 0 to prevent tokens from attending to themselves.
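A sketch of the "random" strategy consistent with the note above. The actual masking logic in EnhancedMaskingBlock is not shown in this reference; the convention that 1 = attend and 0 = masked out is an assumption:

```python
import torch

torch.manual_seed(0)
seq_len, masking_probability = 5, 0.5

# Each entry is kept with probability (1 - masking_probability).
mask = (torch.rand(seq_len, seq_len) > masking_probability).to(torch.int64)
# Per the note above: zero the diagonal so tokens cannot attend to themselves.
mask.fill_diagonal_(0)
print(mask.diagonal().sum().item())  # 0
```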
- class bertblocks.modeling.block.Encoder(config: BertBlocksConfig)[source]¶
Bases:
Module
Multi-layer transformer encoder.
Uses sequence packing for higher efficiency.
- Variables:
blocks (nn.ModuleList) – Stack of Block modules.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
num_blocks: Number of transformer blocks
num_attention_heads: Number of transformer attention heads
- forward(
- x: Tensor,
- attention_mask: Tensor | None,
- cu_seqlens: Tensor | None,
- max_seq_len: int | None,
- output_attentions: bool | None = False,
- output_hidden_states: bool | None = False,
Forward pass of the encoder.
Processes input hidden state sequentially through all transformer blocks. Supports both padded and unpadded (packed) sequences for efficient processing.
- Parameters:
x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].
attention_mask (Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Ignored if cu_seqlens is provided. Defaults to None.
cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Defaults to None.
output_attentions (bool, optional) – Whether to return attention weights from all layers. Defaults to False.
output_hidden_states (bool, optional) – Whether to return hidden states from all layers. Defaults to False.
- Returns:
A tuple containing:
last_hidden_state (Tensor): Output of the final transformer layer with same shape as input.
all_hidden_states (tuple[Tensor, …] | None): Tuple of hidden states from all layers (including input embedding). Only returned if output_hidden_states=True, length = num_blocks + 1.
all_attentions (tuple[Tensor, …] | None): Tuple of attention weights from all layers. Only returned if output_attentions=True, length = num_blocks.
- Return type:
tuple[Tensor, tuple[Tensor, …] | None, tuple[Tensor, …] | None]
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
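A sketch of how packed (unpadded) inputs can be derived from a standard 2D attention mask, to clarify the cu_seqlens and max_seq_len arguments. The exact helper BertBlocks uses internally is not shown here:

```python
import torch
import torch.nn.functional as F

# Two sequences of lengths 3 and 2, padded to seq_len = 4.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])
seqlens = attention_mask.sum(dim=1)
# Cumulative sequence lengths, shape [batch_size + 1], starting at 0.
cu_seqlens = F.pad(seqlens.cumsum(dim=0), (1, 0)).to(torch.int32)
max_seq_len = int(seqlens.max())

x = torch.randn(2, 4, 16)
x_unpadded = x[attention_mask.bool()]  # [total_seq_len, hidden_size]
print(cu_seqlens.tolist(), max_seq_len, x_unpadded.shape)
# [0, 3, 5] 3 torch.Size([5, 16])
```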
Attention¶
- class bertblocks.modeling.attention.Attention(config: BertBlocksConfig, layer_id: int)[source]¶
Bases:
Module
Attention with configurable positional encodings.
- Variables:
num_heads (int) – Number of attention heads.
head_dim (int) – Dimension size of attention heads.
max_seq_len (int) – Maximum sequence length.
dropout_p (float) – Dropout probability for attention.
local_attention (tuple[int, int]) – Local attention size, if applied.
deterministic (bool) – Whether to use deterministic attention.
proj (nn.Linear) – Fused QKV projection layer.
ffwd (nn.Linear) – Feed-forward layer to combine heads after attention.
qk_norm (bool) – Whether to apply query-key-normalization.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
num_attention_heads: Number of attention heads in multi-head attention
hidden_size: Dimensionality of hidden layers (must be divisible by num_attention_heads)
max_sequence_length: Maximum sequence length for positional encodings
attn_proj_bias: Whether to include bias in QKV projection
attn_out_bias: Whether to include bias in output projection
attn_dropout_prob: Dropout probability for attention weights
block_pos_enc_kind: Type of positional embedding (“alibi”, “rope”, “relative”, etc.)
layer_id (int) – Layer id indicating index in the encoder stack.
- forward(
- x: Tensor,
- attention_mask: Tensor | None = None,
- cu_seqlens: Tensor | None = None,
- max_seq_len: int | None = None,
Forward pass of the attention mechanism.
Automatically routes to padded or unpadded implementation based on backend capabilities. Supports both standard padded sequences and packed (unpadded) sequences via flash attention.
- Parameters:
x (torch.Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to apply attention to. For padded inputs, use [batch_size, seq_len, hidden_size]. For unpadded inputs, use [total_seq_len, hidden_size].
attention_mask (torch.Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Should be in causal or full attention format. Ignored if cu_seqlens is provided. Defaults to None.
cu_seqlens (torch.Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences in packed format. If provided, enables flash attention optimized path. Defaults to None.
max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Required when cu_seqlens is provided. Defaults to None.
- Returns:
A tuple containing:
output (torch.Tensor): Attention output with shape [batch_size, seq_len, hidden_size] (padded) or [total_seq_len, hidden_size] (unpadded).
attention_weights (torch.Tensor | None): Optional attention weights. None for most backends.
- Return type:
tuple[Tensor, Tensor | None]
- Raises:
ValueError – If neither attention_mask nor cu_seqlens is provided.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
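One way to build the expected 4D mask from a standard 2D padding mask (a sketch; whether the backend expects a boolean mask or an additive float mask may vary):

```python
import torch

attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])  # [batch_size, seq_len], 1 = valid token
batch_size, seq_len = attention_mask.shape

# Broadcast to [batch_size, 1, seq_len, seq_len]: key position j is attendable
# from every query position i iff token j is valid. The singleton head
# dimension broadcasts across all attention heads.
mask_4d = attention_mask[:, None, None, :].expand(batch_size, 1, seq_len, seq_len).bool()
print(mask_4d.shape)  # torch.Size([2, 1, 4, 4])
```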
- class bertblocks.modeling.attention.AttentionGate(config: BertBlocksConfig)[source]¶
Bases:
Module
A multiplicative attention gate that should be positioned ahead of the final feed-forward module.
Gating values are computed from the query vectors, which act as the input signal.
- Variables:
num_heads (int) – Number of attention heads.
head_dim (int) – Dimension size of attention heads.
attention_gate_type (AttentionGate) – Attention gate type.
gate_proj (nn.Linear) – Gating layer.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters. May be passed to other submodules.
References
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (https://openreview.net/pdf?id=1b7whO4SfY)
- forward(q: Tensor, x: Tensor) Tensor[source]¶
Forward pass of the attention gate.
- Parameters:
q (torch.Tensor, shape [batch_size, seq_len, num_heads, head_dim] or [total_seq_len, num_heads, head_dim]) – Query tensor.
x (torch.Tensor, shape [batch_size, seq_len, num_heads * head_dim] or [total_seq_len, num_heads * head_dim]) – Hidden state after attention.
- Returns:
Hidden state modulated by the query projection.
- Return type:
torch.Tensor
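The gating described above can be sketched as follows (a minimal, self-contained version; `QueryGate` and its shapes are illustrative, while the real `AttentionGate` reads `num_heads` and `head_dim` from the config and supports multiple gate types):

```python
import torch
import torch.nn as nn

class QueryGate(nn.Module):
    """Minimal sketch of a query-conditioned multiplicative gate."""
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        hidden = num_heads * head_dim
        self.gate_proj = nn.Linear(hidden, hidden)

    def forward(self, q: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # q: [..., num_heads, head_dim] -> flatten the head dims to match x
        gate = torch.sigmoid(self.gate_proj(q.flatten(-2)))
        # Elementwise modulation of the attention output by the gate values
        return x * gate
```

The sigmoid keeps gate values in (0, 1), so the gate can only attenuate the attention output, which is what makes it useful against attention sinks per the referenced paper.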
Attention Backends¶
- class bertblocks.modeling.backends.AttentionBackend[source]¶
Abstract base class for attention backends.
- forward_padded(
- q: Tensor,
- k: Tensor,
- v: Tensor,
- attention_mask: Tensor,
- dropout_p: float = 0.0,
- deterministic: bool = False,
Forward pass with padded sequences.
- Parameters:
q (Tensor, shape [batch_size, seq_len, num_heads, head_dim]) – Query tensor.
k (Tensor, shape [batch_size, seq_len, num_kv_heads, head_dim]) – Key tensor.
v (Tensor, shape [batch_size, seq_len, num_kv_heads, head_dim]) – Value tensor.
attention_mask (Tensor) – Attention mask.
dropout_p (float) – Dropout probability.
deterministic (bool) – Whether to use deterministic attention.
- Returns:
- Output tensor [batch_size, seq_len, num_heads * head_dim] and optional
attention weights.
- Return type:
tuple[Tensor, Tensor | None]
- forward_unpadded(
- q: Tensor,
- k: Tensor,
- v: Tensor,
- cu_seqlens: Tensor,
- max_seq_len: int,
- alibi_slopes: Tensor | None = None,
- local_attention: tuple[int, int] = (-1, -1),
- dropout_p: float = 0.0,
- deterministic: bool = False,
Forward pass with unpadded sequences.
- Parameters:
q (Tensor, shape [total_seq_len, num_heads, head_dim]) – Query tensor.
k (Tensor, shape [total_seq_len, num_kv_heads, head_dim]) – Key tensor.
v (Tensor, shape [total_seq_len, num_kv_heads, head_dim]) – Value tensor.
cu_seqlens (Tensor, shape [batch_size + 1]) – Cumulative sequence lengths.
max_seq_len (int) – Maximum sequence length in batch.
alibi_slopes (Tensor, optional) – ALiBi slopes for positional bias.
local_attention (tuple[int, int]) – Local attention window size.
dropout_p (float) – Dropout probability.
deterministic (bool) – Whether to use deterministic attention.
- Returns:
- Output tensor [total_seq_len, num_heads * head_dim] and optional attention
weights.
- Return type:
tuple[Tensor, Tensor | None]
- class bertblocks.modeling.backends.FlashBackend[source]¶
Bases:
AttentionBackend
Flash Attention 2 backend.
- class bertblocks.modeling.backends.SDPABackend[source]¶
Bases:
AttentionBackend
PyTorch SDPA backend. Works efficiently with padded sequences.
- class bertblocks.modeling.backends.EagerBackend[source]¶
Bases:
AttentionBackend
Native PyTorch backend.
- bertblocks.modeling.backends.get_attention(config: BertBlocksConfig) AttentionBackend[source]¶
Get the Attention backend specified in the configuration.
This factory function returns the appropriate attention backend based on the configuration.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
- Returns:
An attention backend module.
- Raises:
ValueError – If the specified attention backend is not supported.
Supported backends:
flash_attention_2: Flash Attention
sdpa: torch scaled dot product attention
eager: native torch attention
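The factory's dispatch-and-raise behavior can be sketched as follows (the mapping values are stand-in strings; the real `get_attention` instantiates the backend classes and reads the backend name from the config object):

```python
# Sketch of the dispatch pattern behind get_attention.
_BACKENDS = {
    "flash_attention_2": "FlashBackend",
    "sdpa": "SDPABackend",
    "eager": "EagerBackend",
}

def resolve_backend(kind: str) -> str:
    """Look up a backend name, raising ValueError for unknown kinds."""
    if kind not in _BACKENDS:
        raise ValueError(f"Unsupported attention backend: {kind!r}")
    return _BACKENDS[kind]
```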
Embeddings¶
- class bertblocks.modeling.embedding.TokenEmbedding(config: BertBlocksConfig)[source]¶
Bases:
Module
Token embedding layer.
Implements the token embedding layer that converts input token IDs to dense vector representations. Optionally applies positional encodings and/or token type encodings.
- Variables:
embd (nn.Embedding) – Token embedding layer.
pose (nn.Module | None) – Positional encoding layer.
tokt (nn.Module | None) – Token type embedding layer.
norm (nn.Module) – Normalization layer. Falls back to nn.Identity if not configured.
drop (nn.Dropout) – Dropout layer. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
vocab_size (int): Size of the vocabulary for token embeddings
hidden_size: Dimensionality of embeddings and hidden states
pad_token_id: Token ID used for padding sequences
emb_pos_enc_kind: Type of positional encoding (“sinusoidal”, “learned”, etc.)
max_sequence_length: Maximum sequence length for positional encodings
add_token_type_emb: Whether to add token type embeddings
norm_kind: When to apply normalization (“post”, “both”, etc.)
emb_dropout_prob: Dropout probability for embedding layer output
- forward(
- input_ids: LongTensor,
- cu_seqlens: LongTensor | None = None,
- token_type_ids: LongTensor | None = None,
Forward pass of the token embedding layer.
Combines token embeddings, optional token type embeddings, optional positional encodings, normalization, and dropout.
- Parameters:
input_ids (torch.Tensor, shape [batch_size, seq_len] or [total_seq_len]) – Token IDs to embed. For padded inputs, shape is [batch_size, seq_len]. For unpadded inputs, shape is [total_seq_len].
cu_seqlens (torch.Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Used by positional encodings to compute per-sequence position indices. Defaults to None.
token_type_ids (torch.Tensor, shape [batch_size, seq_len] or [total_seq_len], optional) – Segment IDs indicating token type (e.g., 0 for sentence A, 1 for sentence B in NSP). Only used if add_token_type_emb is True in config. Defaults to None.
- Returns:
- Embedded token representations with shape:
[batch_size, seq_len, hidden_size] for padded inputs
[total_seq_len, hidden_size] for unpadded inputs
- Return type:
torch.Tensor
References
“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (https://arxiv.org/abs/1810.04805)
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
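A minimal sketch of the embedding pipeline described above, assuming learned positional encodings and token type embeddings are both enabled (all names and sizes are illustrative defaults, not the library's):

```python
import torch
import torch.nn as nn

class TinyTokenEmbedding(nn.Module):
    """Sketch of embed -> +token type -> +position -> norm -> dropout."""
    def __init__(self, vocab_size=100, hidden=16, max_len=32, p_drop=0.1):
        super().__init__()
        self.embd = nn.Embedding(vocab_size, hidden)
        self.tokt = nn.Embedding(2, hidden)        # token type embeddings
        self.pose = nn.Embedding(max_len, hidden)  # learned positions
        self.norm = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(p_drop)

    def forward(self, input_ids, token_type_ids=None):
        _, s = input_ids.shape
        h = self.embd(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)  # default type 0
        h = h + self.tokt(token_type_ids)
        h = h + self.pose(torch.arange(s, device=input_ids.device))
        return self.drop(self.norm(h))
```

This is the padded path; the unpadded path would instead derive per-sequence positions from `cu_seqlens`.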
- class bertblocks.modeling.embedding.TokenTypeEmbedding(config: BertBlocksConfig)[source]¶
Bases:
Module
Token type embedding layer.
Implements the token type embedding layer that converts token type IDs to dense vector representations.
- Variables:
embd (nn.Embedding) – Token type embedding layer.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
type_vocab_size: Size of the token type vocabulary
hidden_size: Dimensionality of embeddings and hidden states
- forward(x: Tensor, token_type_ids: Tensor | None = None) Tensor[source]¶
Forward pass of the token type embeddings.
Uses supplied token type ids if given, otherwise defaults to constant token type ids.
- Parameters:
x (torch.Tensor, shape [total_seq_len, hidden_size] or [batch_size, seq_len, hidden_size]) – Hidden state to add token type ids to.
token_type_ids (torch.Tensor, shape [total_seq_len] or [batch_size, seq_len], optional) – Indicates the token type of each token in the sequence.
- Returns:
- Hidden state with token type embedding added, shape [total_seq_len, hidden_size] or
[batch_size, seq_len, hidden_size].
- Return type:
torch.Tensor
Positional Encodings¶
- class bertblocks.modeling.position.SinusoidalPositionalEncoding(dim: int, max_seq_len: int = 1024, base: float = 10000.0)[source]¶
Bases:
Module
Implementation of Sinusoidal Positional Encodings.
References
“Attention Is All You Need” (https://arxiv.org/pdf/1706.03762)
- forward(x: Tensor, cu_seqlens: Tensor | None = None) Tensor[source]¶
Add sinusoidal positional encoding to a given tensor.
- Parameters:
x (torch.Tensor) – The tensor to add positional encoding to. - For unpadded: shape [total_seq_len, embedding_dim] - For padded: shape [batch_size, seq_len, embedding_dim]
cu_seqlens (torch.Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths for unpadded sequences. If None, assumes padded format.
- Returns:
The tensor after adding positional encoding, same shape as input.
- Return type:
torch.Tensor
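The table added here can be sketched as the classic sin/cos encoding from "Attention Is All You Need" (a padded-format sketch; per-sequence position handling via `cu_seqlens` is omitted):

```python
import torch

def sinusoidal_table(max_len: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Build a [max_len, dim] table: even dims get sin, odd dims get cos."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [L, 1]
    idx = torch.arange(0, dim, 2, dtype=torch.float32)             # [dim/2]
    angle = pos / base ** (idx / dim)                              # [L, dim/2]
    pe = torch.zeros(max_len, dim)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe
```

The forward pass then adds the first `seq_len` rows of this table to the input.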
- class bertblocks.modeling.position.LearnedPositionalEncoding(dim: int, max_seq_len: int)[source]¶
Bases:
Module
Learned Positional Encodings.
- Variables:
embd (nn.Embedding) – The embedding layer encoding position.
- Parameters:
dim (int) – Dimensionality of the positional embeddings.
max_seq_len (int) – Maximum sequence length.
- forward(x: Tensor, cu_seqlens: Tensor | None = None) Tensor[source]¶
Add learned positional encodings to a given tensor.
- Parameters:
x (torch.Tensor) – The tensor to add positional encodings to. - For unpadded: shape [total_seq_len, embedding_dim] - For padded: shape [batch_size, seq_len, embedding_dim]
cu_seqlens (torch.Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths for unpadded sequences. If None, assumes padded format.
- Returns:
The tensor after adding learned positional encodings, same shape as input.
- Return type:
torch.Tensor
- class bertblocks.modeling.position.AlibiPositionalEncoding(num_heads: int)[source]¶
Bases:
Module
ALiBi Positional Encodings.
- Variables:
slopes (torch.Tensor) – The ALiBi slope tensor indicating the degree of positional bias for each head.
- Parameters:
num_heads (int) – Number of attention heads.
- forward(attention_mask: Tensor) Tensor[source]¶
Add ALiBi biases to a given attention mask.
- Parameters:
attention_mask (torch.Tensor, shape [batch_size, num_heads, seq_len, seq_len]) – The attention mask.
- Returns:
The attention mask after adding ALiBi biases. Same shape as input.
- Return type:
torch.Tensor
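The per-head slopes stored in `slopes` are conventionally computed as powers of two. The sketch below assumes `num_heads` is a power of two; the ALiBi paper interpolates additional slopes otherwise:

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Slopes 2^(-8i/n) for heads i = 1..n, per the ALiBi paper."""
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]
```

Each head's slope multiplies a distance matrix `-(|i - j|)` that is added to the attention mask, so heads with larger slopes attend more locally.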
- class bertblocks.modeling.position.RotaryPositionalEncoding(
- rope_dim: int,
- head_dim: int,
- base: float | None = 10000.0,
- interleaved: bool | None = False,
- max_seq_len: int = 512,
- device: device | str = 'cuda',
Bases:
Module
Implementation of rotary positional encodings.
- Parameters:
rope_dim (int) – dimensionality of positional encoding. Equal to head_dim for full RoPE.
head_dim (int) – dimensionality of attention heads.
base (float, optional) – frequency base for positional encodings. Defaults to 10_000.0
interleaved (bool, optional) – indicates whether to rotate pairs of even and odd dimensions (True, GPT-J style) instead of 1st half and 2nd half (False, GPT-NeoX style). Defaults to False.
device (torch.device | str, optional) – device on which to allocate the frequency buffer. Defaults to 'cuda'.
References
“RoFormer: Enhanced Transformer with Rotary Position Embedding” (https://arxiv.org/abs/2104.09864)
“GPT-NeoX-20B: An Open-Source Autoregressive Language Model” (https://arxiv.org/abs/2204.06745)
“Round and Round We Go! What makes Rotary Positional Encodings useful?” (https://arxiv.org/abs/2410.06205)
- forward( ) tuple[Tensor, Tensor][source]¶
Apply rotary positional encoding to query and key tensors.
- Parameters:
q (Tensor, shape [batch, seqlen, num_heads, head_dim] if padded or [total_seqlen, num_heads, head_dim] if unpadded) – Query tensor.
k (Tensor, shape [batch, seqlen, num_kv_heads, head_dim] if padded or [total_seqlen, num_kv_heads, head_dim] if unpadded) – Key tensor.
cu_seqlens (Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths if unpadded. Defaults to None.
max_seqlen (int, optional) – Maximum sequence length in batch. Defaults to None.
- Returns:
(q, k) with rotary position encoding applied, same shapes as input.
- Return type:
tuple[Tensor, Tensor]
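The non-interleaved (GPT-NeoX style) rotation can be sketched for a single unpadded tensor as follows (full-dimension RoPE is assumed here, i.e. `rope_dim == head_dim`):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate a [seq_len, num_heads, head_dim] tensor, NeoX half-split style."""
    s, _, d = x.shape
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2).float() / d)  # [d/2]
    ang = torch.arange(s).float()[:, None] * inv_freq[None, :]    # [s, d/2]
    cos = ang.cos()[:, None, :]  # broadcast over the head dimension
    sin = ang.sin()[:, None, :]
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    # 2D rotation applied pairwise between the first and second halves
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

The interleaved (GPT-J style) variant instead rotates adjacent even/odd dimension pairs; both preserve the property that attention scores depend only on relative position.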
Feed-Forward Networks¶
- class bertblocks.modeling.mlp.MLP(hidden_size: int, intermediate_size: int, actv_fn: str, in_bias: bool = True, out_bias: bool = True)[source]¶
Bases:
Module
Standard Multi-Layer Perceptron for BertBlocks.
This class implements a standard two-layer MLP (feedforward network).
- Variables:
uprj (nn.Linear) – up projection layer, from hidden size to intermediate size.
actv (nn.Module) – Activation function.
dprj (nn.Linear) – down projection layer, from intermediate size to hidden size.
- Parameters:
hidden_size (int) – Dimensionality of hidden layers (input/output dimension).
intermediate_size (int) – Dimensionality of feed-forward layers.
actv_fn (str) – Activation function used in feed-forward networks.
in_bias (bool) – Whether to include bias in the input projection layer. Defaults to True.
out_bias (bool) – Whether to include bias in the output projection layer. Defaults to True.
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the MLP layer.
Applies standard feedforward transformation: activation(W1*x + b1)*W2 + b2 where biases are optional based on configuration.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Input tensor.
- Returns:
- Transformed tensor after two linear projections and activation,
shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- class bertblocks.modeling.mlp.GLU(hidden_size: int, intermediate_size: int, actv_fn: str, in_bias: bool = True, out_bias: bool = True)[source]¶
Bases:
Module
Gated Linear Unit (GLU) implementation for BertBlocks.
This class implements a GLU-style MLP layer that uses gating to control information flow.
- Variables:
uprj (nn.Linear) – up projection layer, from hidden size to 2 * intermediate size.
actv (nn.Module) – Activation function.
dprj (nn.Linear) – down projection layer, from intermediate size to hidden size.
- Parameters:
hidden_size (int) – Dimensionality of hidden layers (input/output dimension).
intermediate_size (int) – Dimensionality of feed-forward layers.
actv_fn (str) – Activation function used in feed-forward networks.
in_bias (bool) – Whether to include bias in the input projection layer. Defaults to True.
out_bias (bool) – Whether to include bias in the output projection layer. Defaults to True.
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the GLU layer.
Implements the gated linear unit computation: value * activation(gate) where both value and gate are linear projections of the input.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Input tensor.
- Returns:
- Transformed tensor after gated projection and down-projection,
shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
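The fused value/gate computation can be sketched as follows (SiLU is assumed for the activation, giving a SwiGLU-style unit; the real module builds its activation from `actv_fn`):

```python
import torch
import torch.nn as nn

class TinyGLU(nn.Module):
    """Sketch of a GLU MLP with a fused value/gate up-projection."""
    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.uprj = nn.Linear(hidden, 2 * intermediate)  # value and gate, fused
        self.actv = nn.SiLU()
        self.dprj = nn.Linear(intermediate, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the fused projection into the value and gate halves
        value, gate = self.uprj(x).chunk(2, dim=-1)
        return self.dprj(value * self.actv(gate))
```

Fusing both halves into a single `uprj` matmul matches the documented `2 * intermediate size` output of the up-projection layer.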
- class bertblocks.modeling.mlp.Linear(hidden_size: int, actv_fn: str, bias: bool = True)[source]¶
Bases:
Module
Linear layer wrapper implementation for BertBlocks.
- Variables:
ffwd (nn.Linear) – linear feed-forward layer.
actv (nn.Module) – activation function.
- Parameters:
hidden_size (int) – Dimensionality of hidden layers (input/output dimension).
actv_fn (str) – Activation function used in the feed-forward layer.
bias (bool) – Whether to include bias in the feed-forward layer. Defaults to True.
- forward(x: torch.Tensor) torch.Tensor[source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- bertblocks.modeling.mlp.get_mlp(config: BertBlocksConfig) nn.Module[source]¶
Get the MLP layer specified in the configuration.
This factory function returns the appropriate MLP architecture based on the configuration. Supports linear, standard MLP, and GLU variants.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
- Returns:
An MLP module (nn.Module) that can transform hidden states.
- Raises:
ValueError – If the specified MLP type is not supported.
- Supported MLP types:
linear: Standard single feed-forward layer.
mlp: Standard two-layer feedforward network
glu: Gated Linear Unit with learned gating mechanism
Normalization¶
- class bertblocks.modeling.norms.DynamicTanhNorm(alpha: float, dim: int)[source]¶
Bases:
Module
Dynamic Tanh normalization.
- Variables:
alpha (nn.Parameter) – learnable scalar input scale parameter.
beta (nn.Parameter) – learnable, per-channel shift parameter.
gamma (nn.Parameter) – learnable, per-channel scale parameter.
- Parameters:
alpha (float) – Initial value of the learnable input scale parameter.
dim (int) – Number of channels for the per-channel shift and scale parameters.
References
Transformers without Normalization (https://arxiv.org/pdf/2503.10622)
- forward(x: Tensor) Tensor[source]¶
Apply dynamic tanh normalization.
- Parameters:
x (torch.Tensor) – Input tensor to normalize.
- Returns:
Normalized tensor.
- Return type:
torch.Tensor
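Following the referenced paper, the computation amounts to `gamma * tanh(alpha * x) + beta`; a sketch (the default `alpha` here is illustrative):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Sketch of Dynamic Tanh from "Transformers without Normalization"."""
    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))  # scalar input scale
        self.gamma = nn.Parameter(torch.ones(dim))      # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))      # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Unlike LayerNorm, no per-token statistics are computed; the bounded tanh plays the role of activation squashing.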
- class bertblocks.modeling.norms.DeepNorm(alpha: float, normalized_shape: int | list[int], eps: float = 1e-05, **norm_kwargs: Any)[source]¶
Bases:
Module
DeepNorm normalization.
References
“DeepNet: Scaling Transformers to 1,000 Layers” (https://ieeexplore.ieee.org/document/10496231)
- forward(x: Tensor, gx: Tensor) Tensor[source]¶
Apply DeepNorm.
- Parameters:
x (torch.Tensor) – Input tensor.
gx (torch.Tensor) – Sublayer output tensor to be combined with the scaled input.
- Returns:
Normalized tensor.
- Return type:
torch.Tensor
- bertblocks.modeling.norms.get_norm(config: BertBlocksConfig) Module[source]¶
Get the normalization layer specified in the configuration.
This factory function returns the appropriate normalization layer based on the configuration. Supports different normalization techniques commonly used in transformer architectures.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
layer_id (int, optional) – Layer ID to index into per-layer config definitions. Unused for scalar config values.
- Returns:
A normalization module (nn.Module) that can normalize tensors.
- Raises:
ValueError – If the specified normalization type is not supported.
Supported normalization types:
group: Group normalization
layer: Layer normalization across the hidden dimension
rms: Root Mean Square layer normalization
deep: DeepNorm
dynamictanh: DynamicTanhNorm
Prediction Heads¶
- class bertblocks.modeling.head.Pooler(config: BertBlocksConfig)[source]¶
Bases:
Module
Pooling layer.
Applies a linear layer and activation function to the first token of the last hidden state.
- Variables:
ffwd – Feed-forward layer from hidden size to hidden size.
actv – Activation function.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
hidden_size: Dimensionality of hidden layers
actv_fn: Activation function used in feed-forward networks
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the pooling layer.
- Parameters:
x (torch.Tensor, shape [batch_size, seq_len, hidden_size]) – Padded input hidden states.
- Returns:
Pooled representation of the first token. Shape [batch_size, hidden_size].
- Return type:
torch.Tensor
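First-token pooling can be sketched as follows (tanh is assumed here for concreteness; the real Pooler builds its activation from `actv_fn`):

```python
import torch
import torch.nn as nn

class TinyPooler(nn.Module):
    """Sketch of BERT-style pooling over the first ([CLS]) token."""
    def __init__(self, hidden: int):
        super().__init__()
        self.ffwd = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Select the first token of each sequence, then project and squash
        return torch.tanh(self.ffwd(x[:, 0]))  # [batch, hidden]
```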
- class bertblocks.modeling.head.ProjectionPredictionHead(config: BertBlocksConfig)[source]¶
Bases:
Module
Prediction head with linear projection.
- Variables:
pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.
ffwd (nn.Linear) – Feed-forward projection layer.
post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the prediction head.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.
- Returns:
Transformed hidden state, shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- class bertblocks.modeling.head.GLUPredictionHead(config: BertBlocksConfig)[source]¶
Bases:
Module
Prediction head with gated activation.
- Variables:
pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.
ffwd (nn.Module) – Feed-forward projection layer.
post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the prediction head.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.
- Returns:
Transformed hidden state, shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- class bertblocks.modeling.head.MLPPredictionHead(config: BertBlocksConfig)[source]¶
Bases:
Module
MLP prediction head.
- Variables:
pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.
ffwd (nn.Module) – Feed-forward projection layer.
post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.
- Parameters:
config (BertBlocksConfig) –
Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:
norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)
- forward(x: torch.Tensor) torch.Tensor[source]¶
Forward pass of the prediction head.
- Parameters:
x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.
- Returns:
Transformed hidden state, shape [batch_size, sequence_length, hidden_size].
- Return type:
torch.Tensor
- bertblocks.modeling.head.get_prediction_head(config: BertBlocksConfig) Module[source]¶
Get the prediction head layer specified in the configuration.
This factory function returns the appropriate prediction head architecture based on the configuration. Supports projection, MLP, and GLU variants.
- Parameters:
config (BertBlocksConfig) – Configuration object determining model hyperparameters.
- Returns:
A prediction head module that can transform hidden states.
- Raises:
ValueError – If the specified prediction head type is not supported.
Supported prediction head types:
proj: Projection prediction head.
mlp: Standard two-layer feedforward network
glu: Gated Linear Unit
Activations¶
- bertblocks.modeling.activations.get_actv_fn(actv_fn: str) Module[source]¶
Get the activation function specified in the configuration.
- Parameters:
actv_fn (str) – Kind of activation function.
- Returns:
An activation function module that can be called on tensors.
- Return type:
nn.Module
- Raises:
ValueError – If the specified activation function is not supported.
Supported activation functions:
relu: Rectified Linear Unit
silu: Sigmoid Linear Unit (Swish)
gelu: Gaussian Error Linear Unit
leakyrelu: Leaky Rectified Linear Unit
selu: Scaled Exponential Linear Unit
logsigmoid: Log-sigmoid activation
sigmoid: Standard sigmoid activation
prelu: Parametric Rectified Linear Unit
Loss Functions¶
- bertblocks.modeling.loss.get_loss_function(
- problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] | None,
Return the applicable loss function for a given problem type.
- Parameters:
problem_type (Literal["regression", "single_label_classification", "multi_label_classification"] | None) – The type of problem.
- Returns:
The appropriate loss function module.
- Return type:
nn.Module
- Raises:
ValueError – If the problem type is not supported.
Padding Utilities¶
- bertblocks.modeling.padding.unpad_input( ) tuple[Tensor, Tensor, Tensor, int][source]¶
Remove padding from input sequences.
Automatically detects and handles both standard (binary 0/1) and packed (sequence-indexed) attention mask formats.
- Parameters:
input_ids (torch.Tensor, shape [batch, seqlen, ...]) – tensor of token IDs.
attention_mask (torch.Tensor | None, shape [batch, seqlen]) – token mask. Can be binary (standard) or sequence-indexed (packed).
pad_token_id (int | None) – id of the padding token to remove, optional. Only used if attention_mask is None. If both are None, assumes full inputs.
- Returns:
tuple[torch.Tensor, torch.Tensor, torch.Tensor, int]
unpadded_inputs (torch.Tensor, shape [total_seq_len, …]): the fused unpadded token IDs
indices (torch.Tensor, shape [total_seq_len,]): the indices of the non-padding tokens in the flattened input
cu_seqlens (torch.Tensor, [batch + 1,]): the cumulative sequence lengths
max_seqlen_in_batch (int): the maximum unpadded sequence length encountered in the batch
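For the standard binary-mask case, the unpadding can be sketched as follows (packed sequence-indexed masks, which the real function also handles, are omitted here):

```python
import torch

def unpad_sketch(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Gather non-padding tokens and build cumulative sequence lengths."""
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)       # [batch]
    indices = attention_mask.flatten().nonzero(as_tuple=True)[0]  # [total_seq_len]
    # Prepend a zero so cu_seqlens[i]:cu_seqlens[i+1] spans sequence i
    cu_seqlens = torch.nn.functional.pad(
        seqlens.cumsum(0, dtype=torch.int32), (1, 0)
    )
    unpadded = input_ids.flatten()[indices]
    return unpadded, indices, cu_seqlens, int(seqlens.max())
```

`pad_output` is the inverse operation: it scatters the unpadded tensor back into a `[batch, seqlen, ...]` tensor using the same `indices`.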
- bertblocks.modeling.padding.pad_output( ) Tensor[source]¶
Add padding to sequences.
- Parameters:
inputs (torch.Tensor, shape [total_nnz, ...]) – Input tensor, unpadded.
indices (torch.Tensor, shape [total_nnz,]) – Indices tensor.
batch (int) – batch size
seqlen (int) – sequence length
pad_token_id (int) – token ID to insert for padding.
- Returns:
The padded inputs, shape [batch, seqlen, …]
- Return type:
torch.Tensor
Scaling¶
- class bertblocks.modeling.scale.LayerScaler(layer_id: int)[source]¶
Bases:
Module
Scales an input inversely to the layer depth.
- Variables:
scaling_factor (torch.Tensor) – scaling factor.
- Parameters:
layer_id (int) – layer position in the encoder stack (0-indexed).
References
The Curse of Depth in Large Language Models (https://arxiv.org/pdf/2502.05795)
- forward(x: Tensor) Tensor[source]¶
Apply layer scaling.
- Parameters:
x (torch.Tensor) – Input tensor to scale.
- Returns:
Scaled tensor.
- Return type:
torch.Tensor
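The referenced paper proposes scaling contributions down with depth; a sketch assuming a `1/sqrt(layer_id + 1)` factor (the exact formula baked into `scaling_factor` is not shown in these docs):

```python
import math
import torch

def layer_scale(x: torch.Tensor, layer_id: int) -> torch.Tensor:
    """Scale an input inversely with depth: x / sqrt(layer_id + 1)."""
    return x / math.sqrt(layer_id + 1)
```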
- class bertblocks.modeling.scale.LearnableLayerScaler(layer_id: int)[source]¶
Bases:
Module
Scales an input with a learnable per-layer scale parameter.
Unlike LayerScaler which uses a fixed formula based on depth, this module learns an independent scale parameter for each layer during training.
- Variables:
scale (nn.Parameter) – Learnable scaling parameter.
- Parameters:
layer_id (int) – layer position in the encoder stack (0-indexed). Used to maintain interface compatibility with LayerScaler.
- forward(x: Tensor) Tensor[source]¶
Apply learnable layer scaling.
- Parameters:
x (torch.Tensor) – Input tensor to scale.
- Returns:
Scaled tensor.
- Return type:
torch.Tensor