Modeling

The bertblocks.modeling package contains all neural network components.

Models

The top-level model classes that combine embedding, encoder, and task head.

class bertblocks.modeling.model.BertBlocksPreTrainedModel(config: BertBlocksConfig, *args: Any, **kwargs: Any)[source]

Bases: PreTrainedModel

Base class for all BertBlocks models.

This class provides the base configuration and weight initialization for all BertBlocks model variants. It inherits from HuggingFace’s PreTrainedModel to provide compatibility with the transformers library.

config_class

alias of BertBlocksConfig

class bertblocks.modeling.model.BertBlocksModel(config: BertBlocksConfig, add_pooling_layer: bool = False)[source]

Bases: BertBlocksPreTrainedModel

Core BertBlocks model for encoding sequences.

This is the base BertBlocks model that outputs hidden states without any task-specific head. It can be used as a feature extractor for downstream tasks.

Variables:
  • embd (TokenEmbedding) – Embedding layer.

  • encd (Encoder) – Encoder stack.

  • norm (nn.Module) – Normalization layer. Falls back to nn.Identity if not configured.

  • pool (Pooler | None) – Pooler layer, optional.

  • pad_token_id (int) – Token ID to insert for padding.

Parameters:
  • config (BertBlocksConfig) – Configuration object determining model hyperparameters. Passed to other submodules.

  • add_pooling_layer (bool) – Whether to add a pooling layer after the encoder layers.

property device: device

Get the device of the model parameters.

property dtype: dtype

Get the dtype of the model parameters.

forward(
input_ids: Tensor,
attention_mask: Tensor | None = None,
token_type_ids: Tensor | None = None,
output_attentions: bool = False,
output_hidden_states: bool = False,
) MaybeUnpaddedBaseModelOutput | MaybeUnpaddedBaseModelOutputWithPooling[source]

Forward pass of the BertBlocks model.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.

  • output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states – Whether to return hidden states from all layers. Defaults to False.

Returns:

  • last_hidden_state: Hidden states from the last layer

  • pooler_output: Pooler output from the last layer (optional)

  • hidden_states: Hidden states from all layers (optional)

  • attentions: Attention weights from all layers (optional)

Return type:

MaybeUnpaddedBaseModelOutput or MaybeUnpaddedBaseModelOutputWithPooling

get_input_embeddings() Embedding[source]

Get the input token embeddings.

Returns:

The input token embedding layer.

Return type:

nn.Embedding

set_input_embeddings(value: Embedding) None[source]

Set the input token embeddings.

Parameters:

value – The new input token embedding layer to use.

unpad_input(
input_ids: Tensor,
attention_mask: Tensor | None,
) tuple[Tensor, Tensor, Tensor, int][source]

Unpad input tensors, removing padding tokens so sequences can be processed in packed form. Returns the unpadded inputs, the indices of the non-padding positions, the cumulative sequence lengths, and the maximum sequence length in the batch.
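Unpadding concatenates the valid tokens of all sequences into one flat tensor and records sequence boundaries via cumulative lengths. A pure-Python sketch of that bookkeeping (function name hypothetical; the actual method operates on torch tensors):

```python
def build_unpad_metadata(attention_mask):
    """Given a 2D binary mask [batch, seq_len] (as nested lists), return
    the flat indices of valid tokens, the cumulative sequence lengths,
    and the longest sequence in the batch."""
    seq_len = len(attention_mask[0])
    indices = []      # flat positions of non-padding tokens
    cu_seqlens = [0]  # prefix sums of per-sequence lengths
    max_len = 0
    for b, row in enumerate(attention_mask):
        length = sum(row)
        max_len = max(max_len, length)
        cu_seqlens.append(cu_seqlens[-1] + length)
        indices.extend(b * seq_len + t for t, v in enumerate(row) if v)
    return indices, cu_seqlens, max_len

# Two sequences of lengths 3 and 2, padded to length 4:
mask = [[1, 1, 1, 0],
        [1, 1, 0, 0]]
indices, cu_seqlens, max_len = build_unpad_metadata(mask)
# indices == [0, 1, 2, 4, 5]; cu_seqlens == [0, 3, 5]; max_len == 3
```

With cu_seqlens == [0, 3, 5], tokens 0–2 belong to the first sequence and tokens 3–4 to the second, which is the packed layout the flash-attention path consumes.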

Task Heads

class bertblocks.modeling.model.BertBlocksForMaskedLM(config: BertBlocksConfig)[source]

Bases: BertBlocksPreTrainedModel

BertBlocks model for masked language modeling tasks.

This model extends the base BertBlocks model with a prediction head and decoder for masked language modeling. It can be used for pre-training or fine-tuning on masked language modeling tasks.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • vocab_size: Size of the vocabulary for token embeddings

  • hidden_size: Dimensionality of hidden layers

forward(
input_ids: Tensor,
attention_mask: Tensor | None = None,
token_type_ids: Tensor | None = None,
labels: Tensor | None = None,
output_attentions: bool | None = False,
output_hidden_states: bool | None = False,
) MaskedLMOutput[source]

Forward pass for masked language modeling.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.

  • labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target token ids for computing loss. Defaults to None.

  • output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.

Returns:

MaskedLMOutput

  • loss: Masked language modeling loss if labels provided

  • logits: Prediction scores over vocabulary

  • hidden_states: Hidden states from all layers if requested

  • attentions: Attention weights from all layers if requested
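At inference time, the logits at each [MASK] position are typically converted to token predictions with an argmax over the vocabulary. A pure-Python sketch of that post-processing step (toy numbers; the real model returns torch tensors of shape [batch_size, seq_len, vocab_size]):

```python
def predict_masked_tokens(logits, input_ids, mask_token_id):
    """For each [MASK] position in input_ids, pick the highest-scoring
    vocabulary id from the corresponding row of logits."""
    predictions = []
    for pos, token_id in enumerate(input_ids):
        if token_id == mask_token_id:
            row = logits[pos]
            predictions.append((pos, max(range(len(row)), key=row.__getitem__)))
    return predictions

MASK = 103  # hypothetical mask token id; use tokenizer.mask_token_id in practice
input_ids = [101, 2009, MASK, 102]  # one [MASK] at position 2
logits = [[0.0] * 5, [0.0] * 5, [0.1, 0.2, 0.9, 0.3, 0.1], [0.0] * 5]
print(predict_masked_tokens(logits, input_ids, MASK))  # [(2, 2)]
```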

get_input_embeddings() Module[source]

Return the encoder input embeddings.

get_output_embeddings() Module[source]

Return the decoder embeddings.

set_output_embeddings(new_embeddings: Linear) None[source]

Replace the decoder embeddings with the given one (e.g., the encoder side).

class bertblocks.modeling.model.BertBlocksForSequenceClassification(config: BertBlocksConfig)[source]

Bases: BertBlocksForTasksBase

BertBlocks model for sequence classification tasks.

This model extends the base BertBlocks model with a classification head for sequence-level prediction tasks. It supports regression, single-label classification, and multi-label classification.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • hidden_size: Dimensionality of hidden layers

  • num_classes: Number of output labels for classification tasks

  • problem_type: Problem type for automatic loss selection

forward(
input_ids: Tensor,
attention_mask: Tensor | None = None,
token_type_ids: Tensor | None = None,
labels: Tensor | None = None,
output_attentions: bool | None = False,
output_hidden_states: bool | None = False,
) SequenceClassifierOutput[source]

Forward pass for sequence classification.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.

  • labels (torch.Tensor, shape [batch_size,] or [batch_size, num_classes], optional) – Tensor of target labels for computing loss. Defaults to None.

  • output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.

Returns:

SequenceClassifierOutput

  • loss: Classification loss if labels provided

  • logits: Classification scores

  • hidden_states: Hidden states from all layers if requested

  • attentions: Attention weights from all layers if requested
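The class docstring notes support for regression, single-label, and multi-label classification, with problem_type driving automatic loss selection. A sketch of that dispatch, assuming the same convention used in the transformers library (MSE for regression, cross-entropy for single-label, BCE-with-logits for multi-label):

```python
def select_loss(problem_type, num_classes):
    """Map a problem_type string to the loss typically used for it.
    When problem_type is None, infer regression for a single output,
    otherwise single-label classification (a simplification of the
    transformers heuristic, which also inspects label dtype)."""
    if problem_type is None:
        problem_type = ("regression" if num_classes == 1
                        else "single_label_classification")
    return {
        "regression": "MSELoss",
        "single_label_classification": "CrossEntropyLoss",
        "multi_label_classification": "BCEWithLogitsLoss",
    }[problem_type]

print(select_loss(None, 1))                           # MSELoss
print(select_loss(None, 3))                           # CrossEntropyLoss
print(select_loss("multi_label_classification", 5))   # BCEWithLogitsLoss
```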

class bertblocks.modeling.model.BertBlocksForTokenClassification(config: BertBlocksConfig)[source]

Bases: BertBlocksForTasksBase

BertBlocks model for token classification tasks.

This model extends the base BertBlocks model with a classification head for token-level prediction tasks such as named entity recognition, part-of-speech tagging, and other sequence labeling tasks.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • hidden_size: Dimensionality of hidden layers

  • num_classes: Number of output labels for classification tasks

forward(
input_ids: Tensor,
attention_mask: Tensor | None = None,
token_type_ids: Tensor | None = None,
labels: Tensor | None = None,
output_attentions: bool | None = False,
output_hidden_states: bool | None = False,
) TokenClassifierOutput[source]

Forward pass for token classification.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.

  • labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target labels for computing loss. Defaults to None.

  • output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.

Returns:

TokenClassifierOutput

  • loss: Token classification loss if labels provided

  • logits: Classification scores for each token

  • hidden_states: Hidden states from all layers if requested

  • attentions: Attention weights from all layers if requested

class bertblocks.modeling.model.BertBlocksForQuestionAnswering(config: BertBlocksConfig)[source]

Bases: BertBlocksForTasksBase

BertBlocks model for extractive question answering tasks.

This model extends the base BertBlocks model with a classification head that predicts start and end positions of answers in the input sequence. It is designed for tasks like SQuAD where the answer is a span of text within the provided context.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • hidden_size: Dimensionality of hidden layers

forward(
input_ids: Tensor,
attention_mask: Tensor | None = None,
token_type_ids: Tensor | None = None,
start_positions: Tensor | None = None,
end_positions: Tensor | None = None,
output_attentions: bool | None = False,
output_hidden_states: bool | None = False,
) QuestionAnsweringModelOutput[source]

Forward pass for question answering.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.

  • start_positions (torch.Tensor, shape [batch_size,], optional) – Tensor of start positions for computing loss. Values should be in [0, sequence_length-1]. Defaults to None.

  • end_positions (torch.Tensor, shape [batch_size,], optional) – Tensor of end positions for computing loss. Values should be in [0, sequence_length-1]. Defaults to None.

  • output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.

Returns:

QuestionAnsweringModelOutput

  • loss: Span prediction loss if start_positions and end_positions provided

  • start_logits: Scores for start position of answer span

  • end_logits: Scores for end position of answer span

  • hidden_states: Hidden states from all layers if requested

  • attentions: Attention weights from all layers if requested
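After the forward pass, the predicted answer span is usually recovered by taking the argmax of start_logits and end_logits and slicing the input tokens. A pure-Python sketch of that decoding step (toy logits; real outputs are torch tensors of shape [batch_size, seq_len]):

```python
def extract_span(start_logits, end_logits, tokens):
    """Pick the most likely start and end positions and return the
    token span between them (inclusive)."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__)
    if end < start:  # degenerate prediction: fall back to the start token
        end = start
    return tokens[start:end + 1]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
start_logits = [0.1, 0.2, 0.1, 0.1, 0.9, 0.1]
end_logits   = [0.1, 0.1, 0.1, 0.1, 0.2, 0.9]
print(extract_span(start_logits, end_logits, tokens))  # ['the', 'mat']
```

Production decoders typically also restrict the search to valid (start, end) pairs within a maximum answer length rather than taking independent argmaxes.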

class bertblocks.modeling.model.BertBlocksForMaskedDiffusion(config: BertBlocksConfig)[source]

Bases: BertBlocksForMaskedLM, GenerationMixin

Implementation of a masked diffusion model.

Closely follows https://github.com/kuleshov-group/mdlm

forward(
input_ids: Tensor,
attention_mask: Tensor | None = None,
token_type_ids: Tensor | None = None,
labels: Tensor | None = None,
output_attentions: bool | None = False,
output_hidden_states: bool | None = False,
) MaskedLMOutput[source]

Forward pass for diffusion language modeling.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids. When training, should be timestep-corrupted token IDs.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None (all tokens are attended to).

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.

  • labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating uncorrupted token IDs.

  • output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.

generate(
input_ids: Tensor | None = None,
attention_mask: Tensor | None = None,
max_length: int | None = None,
num_samples: int = 1,
num_steps: int = 100,
temperature: float = 1.0,
eps: float = 1e-05,
block_size: int | None = None,
) Tensor[source]

Generate samples using iterative denoising from noise to data.

Supports both unconditional generation and prefix-conditioned generation. Compatible with HuggingFace tokenizer output.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Input token IDs to condition on. If provided, tokens where attention_mask=1 will be preserved during sampling. If None, generates unconditionally from scratch.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Mask indicating which input_ids positions to preserve (1) vs denoise (0). If None and input_ids is provided, all input positions are preserved.

  • max_length (int, optional) – Maximum sequence length to generate. If None, uses self.max_seq_len. If input_ids is shorter than max_length, extends with MASK tokens.

  • num_samples (int, optional) – Number of sequences to generate when input_ids is None. Defaults to 1.

  • num_steps (int, optional) – Number of denoising steps (more = higher quality, slower). Defaults to 100.

  • temperature (float, optional) – Temperature parameter. Defaults to 1.0.

  • eps (float, optional) – Final noise level. Defaults to 1e-5.

  • block_size (int, optional) – Size of blocks for block-wise denoising. If None, processes the whole sequence in parallel. Block denoising processes the sequence left-to-right in chunks, which can improve coherence for longer sequences.

Returns:

Generated token sequences.

Return type:

torch.Tensor (shape [batch_size, max_length] or [num_samples, max_length])

Examples

>>> # Unconditional generation
>>> sequences = model.generate(num_samples=4, max_length=128)

>>> # Prefix-conditioned generation
>>> inputs = tokenizer("The cat sat on", return_tensors="pt")
>>> sequences = model.generate(**inputs, max_length=128)

>>> # Block denoising for longer sequences
>>> sequences = model.generate(**inputs, max_length=256, block_size=64)

get_input_embeddings() Module[source]

Return the encoder input embeddings.

get_output_embeddings() Module[source]

Return the decoder embeddings.

infill(
input_ids: Tensor,
attention_mask: Tensor | None = None,
num_steps: int = 100,
temperature: float = 1.0,
eps: float = 1e-05,
block_size: int | None = None,
) Tensor[source]

Fill masked positions in the input using iterative diffusion denoising.

Unlike generate() which extends a prefix, this method fills in MASK tokens at arbitrary positions within the sequence. All non-MASK tokens are preserved.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Input sequences containing MASK tokens at positions to be filled. Non-MASK tokens will be preserved.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Mask indicating which positions are valid (1) vs padding (0). If None, all positions are valid.

  • num_steps (int, optional) – Number of denoising steps. Defaults to 100.

  • temperature (float, optional) – Sampling temperature. Defaults to 1.0.

  • eps (float, optional) – Final noise level. Defaults to 1e-5.

  • block_size (int, optional) – Size of blocks for block-wise denoising. If None, processes the whole sequence in parallel.

Returns:

Sequences with MASK positions filled.

Return type:

torch.Tensor (shape [batch_size, seq_len])

Examples

>>> # Fill middle of sequence
>>> text = "The cat [MASK] [MASK] [MASK] the mat."
>>> inputs = tokenizer(text, return_tensors="pt")
>>> filled = model.infill(inputs["input_ids"])

>>> # Block denoising for longer sequences
>>> filled = model.infill(inputs["input_ids"], block_size=64)

prepare_inputs_for_generation(
input_ids: Tensor,
attention_mask: Tensor | None = None,
target_length: int | None = None,
) tuple[Tensor, Tensor][source]

Modify input arguments to be ready for generation.

set_input_embeddings(value: Module) None[source]

Update the encoder input embeddings.

set_output_embeddings(new_embeddings: Linear) None[source]

Replace the decoder embeddings with the given one (e.g., the encoder side).

class bertblocks.modeling.model.BertBlocksForEnhancedMaskedLM(
config: BertBlocksConfig,
masking_strategy: Literal['random'] = 'random',
masking_probability: float = 0.5,
)[source]

Bases: BertBlocksForMaskedLM

BertBlocks model for enhanced masked language modeling tasks.

This model extends the base BertBlocks model with a prediction head and decoder for enhanced masked language modeling. It can be used for pre-training or fine-tuning on such tasks. Instead of masking the input tokens directly, enhanced masked language modeling applies the masking inside one additional transformer layer.

Parameters:
  • config (BertBlocksConfig) –

    Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

    • vocab_size: Size of the vocabulary for token embeddings

    • hidden_size: Dimensionality of hidden layers

  • masking_strategy (str) – Masking strategy to use. Available options: “random”.

  • masking_probability (float) – Probability of masking tokens. Defaults to 0.5.

forward(
input_ids: Tensor,
attention_mask: Tensor | None = None,
token_type_ids: Tensor | None = None,
labels: Tensor | None = None,
output_attentions: bool | None = False,
output_hidden_states: bool | None = False,
) MaskedLMOutput[source]

Forward pass for masked language modeling.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len]) – Tensor of token ids.

  • attention_mask (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating which tokens should be attended to. Defaults to None.

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor indicating type of tokens. Defaults to None.

  • labels (torch.Tensor, shape [batch_size, seq_len], optional) – Tensor of target token ids for computing loss. Defaults to None.

  • output_attentions (bool) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states (bool) – Whether to return hidden states from all layers. Defaults to False.

Returns:

MaskedLMOutput

  • loss: Masked language modeling loss if labels provided

  • logits: Prediction scores over vocabulary

  • hidden_states: Hidden states from all layers if requested

  • attentions: Attention weights from all layers if requested

Output Types

class bertblocks.modeling.model.MaybeUnpaddedBaseModelOutput(
last_hidden_state: torch.FloatTensor | None = None,
hidden_states: tuple[torch.FloatTensor, ...] | None = None,
attentions: tuple[torch.FloatTensor, ...] | None = None,
cu_seqlens: torch.FloatTensor | None = None,
indices: torch.FloatTensor | None = None,
seq_len: int | None = None,
batch_size: int | None = None,
)[source]
class bertblocks.modeling.model.MaybeUnpaddedBaseModelOutputWithPooling(
last_hidden_state: torch.FloatTensor | None = None,
pooler_output: torch.FloatTensor | None = None,
hidden_states: tuple[torch.FloatTensor, ...] | None = None,
attentions: tuple[torch.FloatTensor, ...] | None = None,
cu_seqlens: torch.FloatTensor | None = None,
indices: torch.FloatTensor | None = None,
seq_len: int | None = None,
batch_size: int | None = None,
)[source]

Transformer Block

class bertblocks.modeling.block.Block(config: BertBlocksConfig, layer_id: int)[source]

Bases: Module

A single transformer block.

Implements a standard transformer block with attention and feed-forward layers, supporting both pre-normalization and post-normalization schemes.

The block consists of:

  • Multi-head self-attention with residual connection

  • Feed-forward network with residual connection

  • Layer normalization (pre/post/both/none)

Variables:
  • layer_id (int) – Index position of the layer in the model's encoder stack.

  • attn (Attention) – Attention module.

  • ffwd (nn.Module) – Feed-forward module.

  • pre_norm_attn (nn.Module) – Pre-normalization layer for attention module. Falls back to nn.Identity if not configured.

  • pre_norm_ffwd (nn.Module) – Pre-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.

  • post_norm_attn (nn.Module) – Post-normalization layer for attention module. Falls back to nn.Identity if not configured.

  • post_norm_ffwd (nn.Module) – Post-normalization function for feed-forward module. Falls back to nn.Identity if not configured.

  • attn_drop (nn.Dropout) – Post-attention dropout layer. Falls back to nn.Identity if not configured.

  • ffwd_drop (nn.Dropout) – Post-feed-forward dropout layer. Falls back to nn.Identity if not configured.

Parameters:
  • config (BertBlocksConfig) –

    Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

    • norm_kind: Normalization layer type

    • attn_dropout_prob: Dropout probability for attention layer

    • hidden_dropout_prob: Dropout probability for feed-forward layers

  • layer_id (int) – zero-indexed layer id indicating index in the encoder stack.


forward(
x: Tensor,
attention_mask: Tensor | None = None,
cu_seqlens: Tensor | None = None,
max_seq_len: int | None = None,
) tuple[Tensor, Tensor | None][source]

Forward pass of the transformer block.

Applies a sequence of operations: pre-norm -> attention -> residual -> post-norm -> pre-norm -> feed-forward -> residual -> post-norm. Supports both padded and unpadded sequences.

Parameters:
  • x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].

  • attention_mask (Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Boolean or float mask broadcastable across attention heads. Ignored if cu_seqlens is provided. Defaults to None.

  • cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. If provided, enables flash attention optimized path. Defaults to None.

  • max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Required when cu_seqlens is provided. Defaults to None.

Returns:

A tuple containing:

  • output (Tensor): Transformed hidden state with same shape and dtype as input.

  • attention_weights (Tensor | None): Attention weights if returned by backend, otherwise None.

Return type:

tuple[Tensor, Tensor | None]
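The operation order described above can be sketched with plain callables standing in for the norm, attention, and feed-forward modules (a sketch of the control flow only, not the actual implementation):

```python
def block_forward(x, attn, ffwd, pre_norm_attn, post_norm_attn,
                  pre_norm_ffwd, post_norm_ffwd):
    """Pre-norm -> attention -> residual -> post-norm, then the same
    pattern for the feed-forward sublayer. In a pure pre-norm scheme
    the post_* callables are identities, and vice versa."""
    h = post_norm_attn(x + attn(pre_norm_attn(x)))
    h = post_norm_ffwd(h + ffwd(pre_norm_ffwd(h)))
    return h

identity = lambda v: v
# Toy scalar "modules": attention doubles its input, ffwd adds one.
out = block_forward(1.0,
                    attn=lambda v: 2 * v, ffwd=lambda v: v + 1.0,
                    pre_norm_attn=identity, post_norm_attn=identity,
                    pre_norm_ffwd=identity, post_norm_ffwd=identity)
# x=1 -> 1 + 2*1 = 3 -> 3 + (3 + 1) = 7
print(out)  # 7.0
```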


class bertblocks.modeling.block.EnhancedMaskingBlock(
config: BertBlocksConfig,
layer_id: int,
masking_strategy: Literal['random'],
masking_probability: float = 0.5,
)[source]

Bases: Block

A single transformer block with enhanced attention masking.

Implements an enhanced masking transformer block which allows for custom modifications of the attention mask.

Variables:
  • layer_id (int) – Index position of the layer in the model's encoder stack.

  • attn (Attention) – Attention module.

  • ffwd (nn.Module) – Feed-forward module.

  • pre_norm_attn (nn.Module) – Pre-normalization layer for attention module. Falls back to nn.Identity if not configured.

  • pre_norm_ffwd (nn.Module) – Pre-normalization layer for feed-forward module. Falls back to nn.Identity if not configured.

  • post_norm_attn (nn.Module) – Post-normalization layer for attention module. Falls back to nn.Identity if not configured.

  • post_norm_ffwd (nn.Module) – Post-normalization function for feed-forward module. Falls back to nn.Identity if not configured.

  • attn_drop (nn.Dropout) – Post-attention dropout layer. Falls back to nn.Identity if not configured.

  • ffwd_drop (nn.Dropout) – Post-feed-forward dropout layer. Falls back to nn.Identity if not configured.

Parameters:
  • config (BertBlocksConfig) –

    Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

    • norm_kind: Normalization layer type

    • attn_dropout_prob: Dropout probability for attention layer

    • hidden_dropout_prob: Dropout probability for feed-forward layers

  • layer_id (int) – layer id indicating index in the encoder stack.

  • masking_strategy (str) – Masking strategy to use. Available options: “random”.

  • masking_probability (float) – Probability of masking tokens. Defaults to 0.5.


forward(
x: Tensor,
attention_mask: Tensor | None = None,
cu_seqlens: Tensor | None = None,
max_seq_len: int | None = None,
) tuple[Tensor, Tensor | None][source]

Forward pass of the enhanced masking transformer block.

Applies custom masking strategy to attention before processing through the transformer. Supports random masking with configurable probability.

Parameters:
  • x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].

  • attention_mask (Tensor, shape [batch_size, seq_len], optional) – 2D binary mask indicating which tokens are valid (1) vs padding (0). If None, all tokens are considered valid. Defaults to None.

  • cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Defaults to None.

  • max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Defaults to None.

Returns:

A tuple containing:

  • output (Tensor): Transformed hidden state with same shape and dtype as input.

  • attention_weights (Tensor | None): Attention weights if returned by backend, otherwise None.

Return type:

tuple[Tensor, Tensor | None]

Note

The diagonal of the attention mask is set to 0 to prevent tokens from attending to themselves.
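A pure-Python sketch of the "random" strategy as described: drop attention-mask entries with probability masking_probability, then force the diagonal to 0 (function name hypothetical; the real block operates on torch tensors):

```python
import random

def random_attention_mask(seq_len, masking_probability, rng):
    """Build a seq_len x seq_len binary mask where each entry is
    dropped (0) with the given probability, and the diagonal is
    always 0 so tokens cannot attend to themselves."""
    mask = [[0 if rng.random() < masking_probability else 1
             for _ in range(seq_len)]
            for _ in range(seq_len)]
    for i in range(seq_len):
        mask[i][i] = 0
    return mask

rng = random.Random(0)  # seeded for reproducibility
m = random_attention_mask(4, 0.5, rng)
assert all(m[i][i] == 0 for i in range(4))  # diagonal always masked
```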

class bertblocks.modeling.block.Encoder(config: BertBlocksConfig)[source]

Bases: Module

Multi-layer transformer encoder.

Uses sequence packing for higher efficiency.

Variables:

blocks (nn.ModuleList) – Stack of Block modules.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • num_blocks: Number of transformer blocks

  • num_attention_heads: Number of transformer attention heads

forward(
x: Tensor,
attention_mask: Tensor | None,
cu_seqlens: Tensor | None,
max_seq_len: int | None,
output_attentions: bool | None = False,
output_hidden_states: bool | None = False,
) tuple[Tensor, tuple[Tensor, ...] | None, tuple[Tensor, ...] | None][source]

Forward pass of the encoder.

Processes input hidden state sequentially through all transformer blocks. Supports both padded and unpadded (packed) sequences for efficient processing.

Parameters:
  • x (Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to process. For padded sequences, use [batch_size, seq_len, hidden_size]. For unpadded sequences, use [total_seq_len, hidden_size].

  • attention_mask (Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Ignored if cu_seqlens is provided. Defaults to None.

  • cu_seqlens (Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Defaults to None.

  • max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Defaults to None.

  • output_attentions (bool, optional) – Whether to return attention weights from all layers. Defaults to False.

  • output_hidden_states (bool, optional) – Whether to return hidden states from all layers. Defaults to False.

Returns:

A tuple containing:

  • last_hidden_state (Tensor): Output of the final transformer layer with same shape as input.

  • all_hidden_states (tuple[Tensor, …] | None): Tuple of hidden states from all layers (including input embedding). Only returned if output_hidden_states=True, length = num_blocks + 1.

  • all_attentions (tuple[Tensor, …] | None): Tuple of attention weights from all layers. Only returned if output_attentions=True, length = num_blocks.

Return type:

tuple[Tensor, tuple[Tensor, …] | None, tuple[Tensor, …] | None]


bertblocks.modeling.block.convert_to_4d_attention_mask(attention_mask: Tensor) Tensor[source]

Convert a 2D attention mask to 4D.

Parameters:

attention_mask (Tensor, shape [batch_size, seq_length]) – The input attention mask.

Returns:

The converted 4D attention mask.

Return type:

Tensor
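The conversion expands a [batch_size, seq_length] mask so it broadcasts over heads and query positions, yielding shape [batch_size, 1, seq_length, seq_length]. A pure-Python sketch of the shape logic (the actual function uses torch broadcasting and may additionally encode masked positions additively):

```python
def to_4d_mask(attention_mask):
    """Expand a 2D binary mask [batch, seq] to a 4D mask
    [batch, 1, seq, seq] where entry (i, j) is valid only if
    key token j is valid for that sequence."""
    return [[[[row[j] for j in range(len(row))]
              for _ in range(len(row))]]
            for row in attention_mask]

mask_2d = [[1, 1, 0]]          # one sequence, last position is padding
mask_4d = to_4d_mask(mask_2d)  # "shape" [1, 1, 3, 3]
# Every query row sees the same key validity pattern:
assert mask_4d[0][0][0] == [1, 1, 0]
```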

Attention

class bertblocks.modeling.attention.Attention(config: BertBlocksConfig, layer_id: int)[source]

Bases: Module

Attention with configurable positional encodings.

Variables:
  • num_heads (int) – Number of attention heads.

  • head_dim (int) – Dimension size of attention heads.

  • max_seq_len (int) – Maximum sequence length.

  • dropout_p (float) – Dropout probability for attention.

  • local_attention (tuple[int, int]) – Local attention size, if applied.

  • deterministic (bool) – Whether to use deterministic attention.

  • proj (nn.Linear) – Fused QKV projection layer.

  • ffwd (nn.Linear) – Feed-forward layer to combine heads after attention.

  • qk_norm (bool) – Whether to apply query-key-normalization.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • num_attention_heads: Number of attention heads in multi-head attention

  • hidden_size: Dimensionality of hidden layers (must be divisible by num_attention_heads)

  • max_sequence_length: Maximum sequence length for positional encodings

  • attn_proj_bias: Whether to include bias in QKV projection

  • attn_out_bias: Whether to include bias in output projection

  • attn_dropout_prob: Dropout probability for attention weights

  • block_pos_enc_kind: Type of positional embedding (“alibi”, “rope”, “relative”, etc.)

layer_id (int) – Layer id indicating index in the encoder stack.

forward(
x: Tensor,
attention_mask: Tensor | None = None,
cu_seqlens: Tensor | None = None,
max_seq_len: int | None = None,
) tuple[Tensor, Tensor | None][source]

Forward pass of the attention mechanism.

Automatically routes to padded or unpadded implementation based on backend capabilities. Supports both standard padded sequences and packed (unpadded) sequences via flash attention.

Parameters:
  • x (torch.Tensor, shape [batch_size, seq_len, hidden_size] or [total_seq_len, hidden_size]) – Hidden state to apply attention to. For padded inputs, use [batch_size, seq_len, hidden_size]. For unpadded inputs, use [total_seq_len, hidden_size].

  • attention_mask (torch.Tensor, shape [batch_size, 1, seq_len, seq_len], optional) – 4D attention mask for padded sequences. Should be in causal or full attention format. Ignored if cu_seqlens is provided. Defaults to None.

  • cu_seqlens (torch.Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences in packed format. If provided, enables flash attention optimized path. Defaults to None.

  • max_seq_len (int, optional) – Maximum sequence length in the batch when using unpadded format. Required when cu_seqlens is provided. Defaults to None.

Returns:

A tuple containing:
  • output (torch.Tensor): Attention output with shape [batch_size, seq_len, hidden_size] (padded) or [total_seq_len, hidden_size] (unpadded).

  • attention_weights (torch.Tensor | None): Optional attention weights. None for most backends.

Return type:

tuple

Raises:

ValueError – If neither attention_mask nor cu_seqlens is provided.
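For the padded path, a 4D additive mask can be derived from a standard 2D binary mask along these lines. This is a hedged sketch: the helper name and exact masking convention are assumptions, not part of the bertblocks API.

```python
import torch

def expand_attention_mask(mask_2d: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Expand a [batch_size, seq_len] binary mask into an additive
    [batch_size, 1, seq_len, seq_len] mask: 0 where attended, a large
    negative value where masked."""
    batch_size, seq_len = mask_2d.shape
    mask = mask_2d[:, None, None, :].expand(batch_size, 1, seq_len, seq_len).to(dtype)
    return (1.0 - mask) * torch.finfo(dtype).min
```

The result can then be passed as the 4D attention_mask for padded sequences.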


class bertblocks.modeling.attention.AttentionGate(config: BertBlocksConfig)[source]

Bases: Module

A multiplicative attention gate that should be positioned ahead of the final feed-forward module.

Gating values are computed from the query vectors, which act as the input signal.

Variables:
  • num_heads (int) – Number of attention heads.

  • head_dim (int) – Dimension size of attention heads.

  • attention_gate_type (AttentionGate) – Attention gate type.

  • gate_proj (nn.Linear) – Gating layer.

Parameters:

config (BertBlocksConfig) – Configuration object determining model hyperparameters. May be passed to other submodules.


forward(q: Tensor, x: Tensor) Tensor[source]

Forward pass of the attention gate.

Parameters:
  • q (torch.Tensor, shape [batch_size, seq_len, num_heads, head_dim] or [total_seq_len, num_heads, head_dim]) – Query tensor.

  • x (torch.Tensor, shape [batch_size, seq_len, num_heads * head_dim] or [total_seq_len, num_heads * head_dim]) – Hidden state after attention.

Returns:

Hidden state modulated by query projection.

Return type:

torch.Tensor
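A hypothetical sketch of the multiplicative gate. The sigmoid nonlinearity and the head-flattening step are assumptions (the actual gate depends on attention_gate_type in the config):

```python
import torch
import torch.nn as nn

class SimpleAttentionGate(nn.Module):
    """Illustrative gate: modulate the post-attention hidden state x
    by a gating signal computed from the flattened query heads."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(num_heads * head_dim, num_heads * head_dim)

    def forward(self, q: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # q: [..., num_heads, head_dim] -> flatten heads to match x: [..., num_heads * head_dim]
        gate = torch.sigmoid(self.gate_proj(q.flatten(-2)))
        return x * gate
```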

Attention Backends

class bertblocks.modeling.backends.AttentionBackend[source]

Abstract base class for attention backends.

forward_padded(
q: Tensor,
k: Tensor,
v: Tensor,
attention_mask: Tensor,
dropout_p: float = 0.0,
deterministic: bool = False,
) tuple[Tensor, Tensor | None][source]

Forward pass with padded sequences.

Parameters:
  • q (Tensor, shape [batch_size, seq_len, num_heads, head_dim]) – Query tensor.

  • k (Tensor, shape [batch_size, seq_len, num_kv_heads, head_dim]) – Key tensor.

  • v (Tensor, shape [batch_size, seq_len, num_kv_heads, head_dim]) – Value tensor.

  • attention_mask (Tensor) – Attention mask.

  • dropout_p (float) – Dropout probability.

  • deterministic (bool) – Whether to use deterministic attention.

Returns:

Output tensor [batch_size, seq_len, num_heads * head_dim] and optional attention weights.

Return type:

tuple[Tensor, Tensor | None]

forward_unpadded(
q: Tensor,
k: Tensor,
v: Tensor,
cu_seqlens: Tensor,
max_seq_len: int,
alibi_slopes: Tensor | None = None,
local_attention: tuple[int, int] = (-1, -1),
dropout_p: float = 0.0,
deterministic: bool = False,
) tuple[Tensor, Tensor | None][source]

Forward pass with unpadded sequences.

Parameters:
  • q (Tensor, shape [total_seq_len, num_heads, head_dim]) – Query tensor.

  • k (Tensor, shape [total_seq_len, num_kv_heads, head_dim]) – Key tensor.

  • v (Tensor, shape [total_seq_len, num_kv_heads, head_dim]) – Value tensor.

  • cu_seqlens (Tensor, shape [batch_size + 1]) – Cumulative sequence lengths.

  • max_seq_len (int) – Maximum sequence length in batch.

  • alibi_slopes (Tensor, optional) – ALiBi slopes for positional bias.

  • local_attention (tuple[int, int]) – Local attention window size.

  • dropout_p (float) – Dropout probability.

  • deterministic (bool) – Whether to use deterministic attention.

Returns:

Output tensor [total_seq_len, num_heads * head_dim] and optional attention weights.

Return type:

tuple[Tensor, Tensor | None]

class bertblocks.modeling.backends.FlashBackend[source]

Bases: AttentionBackend

Flash Attention 2 backend.

class bertblocks.modeling.backends.SDPABackend[source]

Bases: AttentionBackend

PyTorch SDPA backend - works efficiently with padded sequences.

class bertblocks.modeling.backends.EagerBackend[source]

Bases: AttentionBackend

Native PyTorch backend.

bertblocks.modeling.backends.get_attention(config: BertBlocksConfig) AttentionBackend[source]

Get the Attention backend specified in the configuration.

This factory function returns the appropriate attention backend based on the configuration.

Parameters:

config (BertBlocksConfig) – Configuration object determining model hyperparameters.

Returns:

An attention backend module.

Raises:

ValueError – If the specified attention backend is not supported.

Supported backends:

  • flash_attention_2: Flash Attention

  • sdpa: torch scaled dot product attention

  • eager: native torch attention
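The eager backend corresponds to the textbook scaled dot-product computation. A minimal sketch, assuming a [batch, num_heads, seq_len, head_dim] layout and an additive mask (the backend classes above use a [batch, seq_len, num_heads, head_dim] layout, so a transpose would be needed first):

```python
import math
import torch

def eager_attention(q, k, v, attention_mask=None):
    """Plain PyTorch scaled dot-product attention.
    q, k, v: [batch, num_heads, seq_len, head_dim]; attention_mask is additive."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if attention_mask is not None:
        scores = scores + attention_mask
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights
```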

Embeddings

class bertblocks.modeling.embedding.TokenEmbedding(config: BertBlocksConfig)[source]

Bases: Module

Token embedding layer.

Implements the token embedding layer that converts input token IDs to dense vector representations. Optionally applies positional encodings and/or token type encodings.

Variables:
  • embd (nn.Embedding) – Token embedding layer.

  • pose (nn.Module | None) – Positional encoding layer.

  • tokt (nn.Module | None) – Token type embedding layer.

  • norm (nn.Module) – Normalization layer. Falls back to nn.Identity if not configured.

  • drop (nn.Dropout) – Dropout layer. Falls back to nn.Identity if not configured.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • vocab_size (int): Size of the vocabulary for token embeddings

  • hidden_size: Dimensionality of embeddings and hidden states

  • pad_token_id: Token ID used for padding sequences

  • emb_pos_enc_kind: Type of positional encoding (“sinusoidal”, “learned”, etc.)

  • max_sequence_length: Maximum sequence length for positional encodings

  • add_token_type_emb: Whether to add token type embeddings

  • norm_kind: When to apply normalization (“post”, “both”, etc.)

  • emb_dropout_prob: Dropout probability for embedding layer output

forward(
input_ids: LongTensor,
cu_seqlens: LongTensor | None = None,
token_type_ids: LongTensor | None = None,
) Tensor[source]

Forward pass of the token embedding layer.

Combines token embeddings, optional token type embeddings, optional positional encodings, normalization, and dropout.

Parameters:
  • input_ids (torch.Tensor, shape [batch_size, seq_len] or [total_seq_len]) – Token IDs to embed. For padded inputs, shape is [batch_size, seq_len]. For unpadded inputs, shape is [total_seq_len].

  • cu_seqlens (torch.Tensor, shape [batch_size + 1], optional) – Cumulative sequence lengths for unpadded sequences. Used by positional encodings to compute per-sequence position indices. Defaults to None.

  • token_type_ids (torch.Tensor, shape [batch_size, seq_len] or [total_seq_len], optional) – Segment IDs indicating token type (e.g., 0 for sentence A, 1 for sentence B in NSP). Only used if add_token_type_emb is True in config. Defaults to None.

Returns:

Embedded token representations with shape:
  • [batch_size, seq_len, hidden_size] for padded inputs

  • [total_seq_len, hidden_size] for unpadded inputs

Return type:

torch.Tensor


class bertblocks.modeling.embedding.TokenTypeEmbedding(config: BertBlocksConfig)[source]

Bases: Module

Token type embedding layer.

Implements the token type embedding layer that converts token type IDs to dense vector representations.

Variables:

embd (nn.Embedding) – Token type embedding layer.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • type_vocab_size: Size of the token type vocabulary

  • hidden_size: Dimensionality of embeddings and hidden states

forward(x: Tensor, token_type_ids: Tensor | None = None) Tensor[source]

Forward pass of the token type embeddings.

Uses supplied token type ids if given, otherwise defaults to constant token type ids.

Parameters:
  • x (torch.Tensor, shape [total_seq_len, hidden_size] or [batch_size, seq_len, hidden_size]) – Hidden state to add token type ids to.

  • token_type_ids (torch.Tensor, shape [total_seq_len] or [batch_size, seq_len], optional) – Indicates the token type of each token in the sequence. Defaults to None.

Returns:

Hidden state with token type embedding added, shape [total_seq_len, hidden_size] or [batch_size, seq_len, hidden_size].

Return type:

torch.Tensor

Positional Encodings

class bertblocks.modeling.position.SinusoidalPositionalEncoding(dim: int, max_seq_len: int = 1024, base: float = 10000.0)[source]

Bases: Module

Implementation of Sinusoidal Positional Encodings.


forward(x: Tensor, cu_seqlens: Tensor | None = None) Tensor[source]

Add sinusoidal positional encoding to a given tensor.

Parameters:
  • x (torch.Tensor) – The tensor to add positional encoding to. - For unpadded: shape [total_seq_len, embedding_dim] - For padded: shape [batch_size, seq_len, embedding_dim]

  • cu_seqlens (torch.Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths for unpadded sequences. If None, assumes padded format.

Returns:

The tensor after adding positional encoding, same shape as input.

Return type:

torch.Tensor
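The encoding follows the standard sinusoidal formulation, PE[pos, 2i] = sin(pos / base^(2i/dim)) and PE[pos, 2i+1] = cos(pos / base^(2i/dim)). A sketch of the table construction, assuming an even dim (independent of how bertblocks actually caches the buffer):

```python
import torch

def sinusoidal_table(max_seq_len: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard sinusoidal positional table, shape [max_seq_len, dim]. Assumes even dim."""
    pos = torch.arange(max_seq_len, dtype=torch.float32)[:, None]
    i = torch.arange(0, dim, 2, dtype=torch.float32)[None, :]
    angles = pos / base ** (i / dim)              # [max_seq_len, dim / 2]
    table = torch.zeros(max_seq_len, dim)
    table[:, 0::2] = torch.sin(angles)            # even indices: sine
    table[:, 1::2] = torch.cos(angles)            # odd indices: cosine
    return table
```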

class bertblocks.modeling.position.LearnedPositionalEncoding(dim: int, max_seq_len: int)[source]

Bases: Module

Learned Positional Encodings.

Variables:

embd (nn.Embedding) – The embedding layer encoding position.

Parameters:
  • dim (int) – Hidden size of the model.

  • max_seq_len (int) – Maximum sequence length for the model.

forward(x: Tensor, cu_seqlens: Tensor | None = None) Tensor[source]

Add learned positional encodings to a given tensor.

Parameters:
  • x (torch.Tensor) – The tensor to add positional encodings to. - For unpadded: shape [total_seq_len, embedding_dim] - For padded: shape [batch_size, seq_len, embedding_dim]

  • cu_seqlens (torch.Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths for unpadded sequences. If None, assumes padded format.

Returns:

The tensor after adding learned positional encodings, same shape as input.

Return type:

torch.Tensor

class bertblocks.modeling.position.AlibiPositionalEncoding(num_heads: int)[source]

Bases: Module

Alibi Positional Encodings.

Variables:

slopes (torch.Tensor) – The alibi slope tensor indicating degree of positional bias for each head.

Parameters:

num_heads (int) – Number of attention heads.

forward(attention_mask: Tensor) Tensor[source]

Add AliBi biases to a given attention mask.

Parameters:

attention_mask (torch.Tensor, shape [batch_size, num_heads, seq_len, seq_len]) – The attention mask.

Returns:

The attention mask after adding alibi biases. Same shape as input.

Return type:

torch.Tensor

static get_slopes(num_heads: int) Tensor[source]

Construct ALiBi slopes.
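For a power-of-two head count, the ALiBi slopes form a geometric sequence starting at 2^(-8/num_heads). A sketch of the standard construction (the interpolation the actual implementation may use for non-power-of-two head counts is omitted):

```python
def alibi_slopes(num_heads: int) -> list:
    """ALiBi head slopes for a power-of-two head count: slope_i = 2^(-8 * (i + 1) / num_heads)."""
    assert num_heads & (num_heads - 1) == 0, "sketch assumes a power-of-two head count"
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]
```

For example, with 8 heads the slopes run from 1/2 down to 1/256, so later heads apply a weaker positional bias.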

class bertblocks.modeling.position.RotaryPositionalEncoding(
rope_dim: int,
head_dim: int,
base: float | None = 10000.0,
interleaved: bool | None = False,
max_seq_len: int = 512,
device: device | str = 'cuda',
)[source]

Bases: Module

Implementation of rotary positional encodings.

Parameters:
  • rope_dim (int) – dimensionality of positional encoding. Equal to head_dim for full RoPE.

  • head_dim (int) – dimensionality of attention heads.

  • base (float, optional) – frequency base for positional encodings. Defaults to 10_000.0

  • interleaved (bool, optional) – indicates whether to rotate pairs of even and odd dimensions (True, GPT-J style) instead of 1st half and 2nd half (False, GPT-NeoX style). Defaults to False.

  • max_seq_len (int, optional) – maximum sequence length for the precomputed frequency buffer. Defaults to 512.

  • device (torch.device | str, optional) – device on which to allocate the frequency buffer. Defaults to 'cuda'.


forward(
q: Tensor,
k: Tensor,
cu_seqlens: Tensor | None = None,
max_seqlen: int | None = None,
) tuple[Tensor, Tensor][source]

Apply rotary positional encoding to query and key tensors.

Parameters:
  • q (Tensor, shape [batch, seqlen, num_heads, head_dim] if padded or [total_seqlen, num_heads, head_dim] if unpadded) – Query tensor.

  • k (Tensor, shape [batch, seqlen, num_kv_heads, head_dim] if padded or [total_seqlen, num_kv_heads, head_dim] if unpadded) – Key tensor.

  • cu_seqlens (Tensor, shape [batch_size + 1,], optional) – Cumulative sequence lengths if unpadded. Defaults to None.

  • max_seqlen (int, optional) – Maximum sequence length in batch. Defaults to None.

Returns:

(q, k) with rotary position encoding applied, same shapes as input.

Return type:

tuple[Tensor, Tensor]
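For the non-interleaved (GPT-NeoX style) case, the rotation can be sketched as follows. Function names and the cos/sin caching strategy here are illustrative, not bertblocks API:

```python
import torch

def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute cos/sin tables of shape [seq_len, 1, head_dim]."""
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)            # [seq_len, head_dim / 2]
    emb = torch.cat((freqs, freqs), dim=-1)     # [seq_len, head_dim]
    return emb.cos()[:, None, :], emb.sin()[:, None, :]

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # GPT-NeoX style: rotate the first half of the last dim against the second half.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Apply rotary encoding to x of shape [seq_len, num_heads, head_dim]."""
    return x * cos + rotate_half(x) * sin
```

Position 0 has angle zero, so the first row is left unchanged; every other position is rotated by a position-dependent angle per frequency pair.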

Feed-Forward Networks

class bertblocks.modeling.mlp.MLP(hidden_size: int, intermediate_size: int, actv_fn: str, in_bias: bool = True, out_bias: bool = True)[source]

Bases: Module

Standard Multi-Layer Perceptron for BertBlocks.

This class implements a standard two-layer MLP (feedforward network).

Variables:
  • uprj (nn.Linear) – up projection layer, from hidden size to intermediate size.

  • actv (nn.Module) – Activation function.

  • dprj (nn.Linear) – down projection layer, from intermediate size to hidden size.

Parameters:
  • hidden_size (int) – Dimensionality of hidden layers (input/output dimension).

  • intermediate_size (int) – Dimensionality of feed-forward layers.

  • actv_fn (str) – Activation function used in feed-forward networks.

  • in_bias (bool) – Whether to include bias in the input projection layer. Defaults to True.

  • out_bias (bool) – Whether to include bias in the output projection layer. Defaults to True.

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass of the MLP layer.

Applies standard feedforward transformation: activation(W1*x + b1)*W2 + b2 where biases are optional based on configuration.

Parameters:

x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Input tensor.

Returns:

Transformed tensor after two linear projections, activation, and dropout, shape [batch_size, sequence_length, hidden_size].

Return type:

torch.Tensor

class bertblocks.modeling.mlp.GLU(hidden_size: int, intermediate_size: int, actv_fn: str, in_bias: bool = True, out_bias: bool = True)[source]

Bases: Module

Gated Linear Unit (GLU) implementation for BertBlocks.

This class implements a GLU-style MLP layer that uses gating to control information flow.

Variables:
  • uprj (nn.Linear) – up projection layer, from hidden size to 2 * intermediate size.

  • actv (nn.Module) – Activation function.

  • dprj (nn.Linear) – down projection layer, from intermediate size to hidden size.

Parameters:
  • hidden_size (int) – Dimensionality of hidden layers (input/output dimension).

  • intermediate_size (int) – Dimensionality of feed-forward layers.

  • actv_fn (str) – Activation function used in feed-forward networks.

  • in_bias (bool) – Whether to include bias in the input projection layer. Defaults to True.

  • out_bias (bool) – Whether to include bias in the output projection layer. Defaults to True.

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass of the GLU layer.

Implements the gated linear unit computation: value * activation(gate) where both value and gate are linear projections of the input.

Parameters:

x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Input tensor.

Returns:

Transformed tensor after gated projection, down-projection, and dropout, shape [batch_size, sequence_length, hidden_size].

Return type:

torch.Tensor
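The gated computation can be sketched as below. The GELU activation and the value/gate split order are assumptions; the real class takes actv_fn from the arguments and may split in the other order:

```python
import torch
import torch.nn as nn

class SketchGLU(nn.Module):
    """Illustrative GLU block: one up-projection produces both the value
    and gate halves, recombined as value * actv(gate)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.uprj = nn.Linear(hidden_size, 2 * intermediate_size)
        self.actv = nn.GELU()
        self.dprj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.uprj(x).chunk(2, dim=-1)
        return self.dprj(value * self.actv(gate))
```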

class bertblocks.modeling.mlp.Linear(hidden_size: int, actv_fn: str, bias: bool = True)[source]

Bases: Module

Linear layer wrapper implementation for BertBlocks.

Variables:
  • ffwd (nn.Linear) – linear feed-forward layer.

  • actv (nn.Module) – activation function.

Parameters:
  • hidden_size (int) – Dimensionality of hidden layers (input/output dimension).

  • actv_fn (str) – Activation function.

  • bias (bool) – Whether to include bias in the layer. Defaults to True.

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass of the linear layer.

Applies the feed-forward layer followed by the activation function.

Parameters:

x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Input tensor.

Returns:

Transformed tensor, shape [batch_size, sequence_length, hidden_size].

Return type:

torch.Tensor

bertblocks.modeling.mlp.get_mlp(config: BertBlocksConfig) nn.Module[source]

Get the MLP layer specified in the configuration.

This factory function returns the appropriate MLP architecture based on the configuration. Supports both standard MLP and GLU variants.

Parameters:

config (BertBlocksConfig) – Configuration object determining model hyperparameters.

Returns:

An MLP module (nn.Module) that can transform hidden states.

Raises:

ValueError – If the specified MLP type is not supported.

Supported MLP types:
  • linear: Standard single feed-forward layer.

  • mlp: Standard two-layer feedforward network

  • glu: Gated Linear Unit with learned gating mechanism

Normalization

class bertblocks.modeling.norms.DynamicTanhNorm(alpha: float, dim: int)[source]

Bases: Module

Dynamic Tanh normalization.

Variables:
  • alpha (nn.Parameter) – learnable scalar input scale parameter.

  • beta (nn.Parameter) – learnable, per-channel shift parameter.

  • gamma (nn.Parameter) – learnable, per-channel scale parameter.

Parameters:
  • alpha (float) – Initial alpha value.

  • dim (int) – Dimensionality of the input.


forward(x: Tensor) Tensor[source]

Apply dynamic tanh normalization.

Parameters:

x (torch.Tensor) – Input tensor to normalize.

Returns:

Normalized tensor.

Return type:

torch.Tensor
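A sketch of the computation implied by the variables above, y = gamma * tanh(alpha * x) + beta, with a scalar alpha and per-channel gamma and beta (initialization here is an assumption):

```python
import torch
import torch.nn as nn

class DyTSketch(nn.Module):
    """Dynamic tanh normalization: y = gamma * tanh(alpha * x) + beta."""

    def __init__(self, alpha: float, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))  # learnable scalar input scale
        self.gamma = nn.Parameter(torch.ones(dim))      # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))      # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Because tanh saturates, the output is bounded regardless of the input scale, which is what lets this module stand in for a normalization layer.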

class bertblocks.modeling.norms.DeepNorm(alpha: float, normalized_shape: int | list[int], eps: float = 1e-05, **norm_kwargs: Any)[source]

Bases: Module

DeepNorm normalization.

References:

  • DeepNet: Scaling Transformers to 1,000 Layers (https://ieeexplore.ieee.org/document/10496231)

forward(x: Tensor, gx: Tensor) Tensor[source]

Apply DeepNorm.

Parameters:
  • x (torch.Tensor) – Residual input tensor.

  • gx (torch.Tensor) – Sublayer output combined with the scaled residual.

Returns:

Normalized tensor.

Return type:

torch.Tensor
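A sketch of the DeepNorm residual connection, LayerNorm(alpha * x + gx), where x is the residual stream and gx is the sublayer output (the choice of nn.LayerNorm as the inner norm is an assumption):

```python
import torch
import torch.nn as nn

class DeepNormSketch(nn.Module):
    """DeepNorm residual: LayerNorm(alpha * x + gx)."""

    def __init__(self, alpha: float, dim: int, eps: float = 1e-5):
        super().__init__()
        self.alpha = alpha
        self.norm = nn.LayerNorm(dim, eps=eps)

    def forward(self, x: torch.Tensor, gx: torch.Tensor) -> torch.Tensor:
        # Up-weight the residual stream before normalizing, which is what
        # stabilizes very deep stacks in the DeepNet recipe.
        return self.norm(self.alpha * x + gx)
```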

bertblocks.modeling.norms.get_norm(config: BertBlocksConfig) Module[source]

Get the normalization layer specified in the configuration.

This factory function returns the appropriate normalization layer based on the configuration. Supports different normalization techniques commonly used in transformer architectures.

Parameters:
  • config (BertBlocksConfig) – Configuration object determining model hyperparameters.

  • layer_id (int, optional) – Layer ID to index into per-layer config definitions. Unused for scalar config values.

Returns:

A normalization module (nn.Module) that can normalize tensors.

Raises:

ValueError – If the specified normalization type is not supported.

Supported normalization types:

  • group: Group normalization

  • layer: Layer normalization across the hidden dimension

  • rms: Root Mean Square layer normalization

  • deep: DeepNorm

  • dynamictanh: DynamicTanhNorm

Prediction Heads

class bertblocks.modeling.head.Pooler(config: BertBlocksConfig)[source]

Bases: Module

Pooling layer.

Applies a linear layer and activation function to the first token of the last hidden state.

Variables:
  • ffwd – Feed-forward layer from hidden size to hidden size.

  • actv – Activation function.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • hidden_size: Dimensionality of hidden layers

  • actv_fn: Activation function used in feed-forward networks

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass of the pooling layer.

Parameters:

x (torch.Tensor, shape [batch_size, seq_len, hidden_size]) – Padded input hidden states.

Returns:

Pooled representation of the first token. Shape [batch_size, hidden_size].

Return type:

torch.Tensor
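A sketch of the pooling computation: a dense layer plus activation applied to the first (CLS-style) token. The tanh activation is an assumption; the real class takes actv_fn from the config:

```python
import torch
import torch.nn as nn

class PoolerSketch(nn.Module):
    """CLS-style pooling over the first token of the last hidden state."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.ffwd = nn.Linear(hidden_size, hidden_size)
        self.actv = nn.Tanh()  # activation kind comes from config in practice

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch_size, seq_len, hidden_size] -> [batch_size, hidden_size]
        return self.actv(self.ffwd(x[:, 0]))
```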

class bertblocks.modeling.head.ProjectionPredictionHead(config: BertBlocksConfig)[source]

Bases: Module

Prediction head with linear projection.

Variables:
  • pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.

  • ffwd (nn.Linear) – Feed-forward projection layer.

  • post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass of the prediction head.

Parameters:

x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.

Returns:

Transformed hidden state, shape [batch_size, sequence_length, hidden_size].

Return type:

torch.Tensor

class bertblocks.modeling.head.GLUPredictionHead(config: BertBlocksConfig)[source]

Bases: Module

Prediction head with gated activation.

Variables:
  • pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.

  • ffwd (nn.Module) – Feed-forward projection layer.

  • post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass of the prediction head.

Parameters:

x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.

Returns:

Transformed hidden state, shape [batch_size, sequence_length, hidden_size].

Return type:

torch.Tensor

class bertblocks.modeling.head.MLPPredictionHead(config: BertBlocksConfig)[source]

Bases: Module

MLP Prediction head.

Variables:
  • pre_norm (nn.Module) – Pre-norm function. Falls back to nn.Identity if not configured.

  • ffwd (nn.Module) – Feed-forward projection layer.

  • post_norm (nn.Module) – Post-norm function. Falls back to nn.Identity if not configured.

Parameters:

config (BertBlocksConfig) –

Configuration object determining model hyperparameters. May be passed to other submodules. Keys used at top level:

  • norm_kind: When to apply normalization (“pre”, “post”, “both”, “none”)

forward(x: torch.Tensor) torch.Tensor[source]

Forward pass of the prediction head.

Parameters:

x (torch.Tensor, shape [batch_size, sequence_length, hidden_size]) – Padded input hidden state.

Returns:

Transformed hidden state, shape [batch_size, sequence_length, hidden_size].

Return type:

torch.Tensor

bertblocks.modeling.head.get_prediction_head(config: BertBlocksConfig) Module[source]

Get the prediction head layer specified in the configuration.

This factory function returns the appropriate prediction head architecture based on the configuration. Supports both standard MLP and GLU variants.

Parameters:

config (BertBlocksConfig) – Configuration object determining model hyperparameters.

Returns:

A prediction head module that can transform hidden states.

Raises:

ValueError – If the specified prediction head type is not supported.

Supported prediction head types:

  • proj: Projection prediction head.

  • mlp: Standard two-layer feedforward network

  • glu: Gated Linear Unit

Activations

bertblocks.modeling.activations.get_actv_fn(actv_fn: str) Module[source]

Get the activation function specified in the configuration.

Parameters:

actv_fn (str) – Kind of activation function.

Returns:

An activation function module that can be called on tensors.

Return type:

nn.Module

Raises:

ValueError – If the specified activation function is not supported.

Supported activation functions:

  • relu: Rectified Linear Unit

  • silu: Sigmoid Linear Unit (Swish)

  • gelu: Gaussian Error Linear Unit

  • leakyrelu: Leaky Rectified Linear Unit

  • selu: Scaled Exponential Linear Unit

  • logsigmoid: Log-sigmoid activation

  • sigmoid: Standard sigmoid activation

  • prelu: Parametric Rectified Linear Unit

Loss Functions

bertblocks.modeling.loss.get_loss_function(
problem_type: Literal['regression', 'single_label_classification', 'multi_label_classification'] | None,
) Module[source]

Return the applicable loss function for a given problem type.

Parameters:

problem_type (Literal["regression", "single_label_classification", "multi_label_classification"] | None) – The type of problem.

Returns:

The appropriate loss function module.

Return type:

nn.Module

Raises:

ValueError – If the problem type is not supported.
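A plausible sketch of the dispatch, following the convention used by transformers-style models; the exact mapping bertblocks uses is an assumption:

```python
from typing import Optional

import torch.nn as nn

def loss_for(problem_type: Optional[str]) -> nn.Module:
    """Map a problem type to a loss module (assumed mapping)."""
    losses = {
        "regression": nn.MSELoss(),
        "single_label_classification": nn.CrossEntropyLoss(),
        "multi_label_classification": nn.BCEWithLogitsLoss(),
    }
    if problem_type not in losses:
        raise ValueError(f"Unsupported problem type: {problem_type}")
    return losses[problem_type]
```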

Padding Utilities

bertblocks.modeling.padding.unpad_input(
input_ids: Tensor,
attention_mask: Tensor | None,
pad_token_id: int | None = None,
) tuple[Tensor, Tensor, Tensor, int][source]

Remove padding from input sequences.

Automatically detects and handles both standard (binary 0/1) and packed (sequence-indexed) attention mask formats.

Parameters:
  • input_ids (torch.Tensor, shape [batch, seqlen, ...]) – tensor of token IDs.

  • attention_mask (torch.Tensor | None, shape [batch, seqlen]) – token mask. Can be binary (standard) or sequence-indexed (packed).

  • pad_token_id (int | None) – id of the padding token to remove, optional. Only used if attention_mask is None. If both are None, assumes full inputs.

Returns:

tuple[torch.Tensor, torch.Tensor, torch.Tensor, int]

  • unpadded_inputs (torch.Tensor, shape [total_seq_len, …]): the fused unpadded token IDs

  • indices (torch.Tensor, shape [total_seq_len,]): the indices of non-padding tokens in the flattened input

  • cu_seqlens (torch.Tensor, [batch + 1,]): the cumulative sequence lengths

  • max_seqlen_in_batch (int): the maximum unpadded sequence length encountered in the batch
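The standard-mask path can be sketched as follows. Names are illustrative and only the binary-mask case is handled (the packed, sequence-indexed format is omitted):

```python
import torch
import torch.nn.functional as F

def unpad_sketch(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Flatten a padded [batch, seqlen] batch into packed (unpadded) form,
    given a standard binary attention mask."""
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)                     # [batch]
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten() # non-pad positions
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)) # [batch + 1]
    unpadded = input_ids.flatten()[indices]
    return unpadded, indices, cu_seqlens, int(seqlens.max())
```

The indices returned here are what pad_output below would use to scatter the packed tokens back into a padded [batch, seqlen] layout.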

bertblocks.modeling.padding.pad_output(
inputs: Tensor,
indices: Tensor,
batch: int,
seqlen: int,
pad_token_id: int | None = None,
) Tensor[source]

Add padding to sequences.

Parameters:
  • inputs (torch.Tensor, shape [total_nnz, ...]) – Input tensor, unpadded.

  • indices (torch.Tensor, shape [total_nnz,]) – Indices tensor.

  • batch (int) – batch size

  • seqlen (int) – sequence length

  • pad_token_id (int | None, optional) – token ID to insert for padding. Defaults to None.

Returns:

The padded inputs, shape [batch, seqlen, …].

Return type:

torch.Tensor

Scaling

class bertblocks.modeling.scale.LayerScaler(layer_id: int)[source]

Bases: Module

Scales an input inversely to the layer depth.

Variables:

scaling_factor (torch.Tensor) – scaling factor.

Parameters:

layer_id (int) – layer position in the encoder stack (0-indexed).


forward(x: Tensor) Tensor[source]

Apply layer scaling.

Parameters:

x (torch.Tensor) – Input tensor to scale.

Returns:

Scaled tensor.

Return type:

torch.Tensor

class bertblocks.modeling.scale.LearnableLayerScaler(layer_id: int)[source]

Bases: Module

Scales an input with a learnable per-layer scale parameter.

Unlike LayerScaler which uses a fixed formula based on depth, this module learns an independent scale parameter for each layer during training.

Variables:

scale (nn.Parameter) – Learnable scaling parameter.

Parameters:

layer_id (int) – layer position in the encoder stack (0-indexed). Used to maintain interface compatibility with LayerScaler.

forward(x: Tensor) Tensor[source]

Apply learnable layer scaling.

Parameters:

x (torch.Tensor) – Input tensor to scale.

Returns:

Scaled tensor.

Return type:

torch.Tensor