
pmetal embed-train

Sentence-transformer fine-tuning for BERT/encoder models with contrastive losses.

Fine-tune sentence embedding models (BERT and other encoder architectures) with contrastive learning objectives. The command accepts pair and triplet datasets and lets you configure the pooling strategy and embedding normalization.

pmetal embed-train \
  --model <MODEL> \
  --dataset <DATASET> \
  --output <OUTPUT_DIR> \
  [OPTIONS]

# InfoNCE contrastive training with pairs
pmetal embed-train \
  --model BAAI/bge-small-en-v1.5 \
  --dataset pairs.jsonl \
  --output ./output/embeddings \
  --loss infonce

# Triplet training with margin
pmetal embed-train \
  --model BAAI/bge-small-en-v1.5 \
  --dataset triplets.jsonl \
  --loss triplet

# CoSENT with mean pooling
pmetal embed-train \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --dataset pairs.jsonl \
  --loss cosent --pooling mean
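
For intuition about the three `--loss` options, the sketch below computes each objective on plain Python lists. It follows the commonly used formulations (InfoNCE with in-batch negatives, a margin-based triplet loss on cosine distance, and the CoSENT pairwise ranking loss). The `temperature`, `margin`, and `scale` values are illustrative assumptions, and pmetal's internal implementation and defaults may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def infonce_loss(a_embs, b_embs, temperature=0.05):
    """InfoNCE with in-batch negatives.

    b_embs[i] is the positive for a_embs[i]; every other b_embs[j] acts as
    a negative. The temperature is an assumed value, not a pmetal default.
    """
    losses = []
    for i, a in enumerate(a_embs):
        logits = [cosine(a, b) / temperature for b in b_embs]
        # Cross-entropy with the matching index i as the target class.
        losses.append(logsumexp(logits) - logits[i])
    return sum(losses) / len(losses)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin triplet loss on cosine distance (1 - cosine). Margin is assumed."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def cosent_loss(sims, labels, scale=20.0):
    """CoSENT ranking loss over a batch of pair similarities.

    Computes log(1 + sum(exp(scale * (s_low - s_high)))) over every ordered
    combination where the first pair has a higher label than the second.
    The scale is an assumed value.
    """
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the log
    for s_high, y_high in zip(sims, labels):
        for s_low, y_low in zip(sims, labels):
            if y_high > y_low:
                terms.append(scale * (s_low - s_high))
    return logsumexp(terms)
```

In each case the loss shrinks as matching texts move closer together than non-matching ones, which is the property all three options share.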

Pair JSONL (for InfoNCE/CoSENT):

{"text_a": "What is ML?", "text_b": "Machine learning is...", "label": 1}

Triplet JSONL (for Triplet loss):

{"anchor": "What is ML?", "positive": "Machine learning is...", "negative": "Cooking recipes..."}
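
Before launching a run, it can help to confirm that every line of the dataset matches one of the two formats above. The sketch below is a standalone helper, not part of pmetal. The names `check_record` and `check_file` are hypothetical, and the rule that text fields must be non-empty strings is an assumption. Because the source does not say whether `label` is mandatory for pair data, the check does not require it.

```python
import json

# Field names taken from the pair and triplet examples above.
PAIR_KEYS = {"text_a", "text_b"}
TRIPLET_KEYS = {"anchor", "positive", "negative"}

def check_record(line):
    """Return "pair" or "triplet" for a valid JSONL line, else raise ValueError."""
    record = json.loads(line)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    keys = set(record)
    if TRIPLET_KEYS <= keys:
        kind, text_fields = "triplet", sorted(TRIPLET_KEYS)
    elif PAIR_KEYS <= keys:
        kind, text_fields = "pair", sorted(PAIR_KEYS)
    else:
        raise ValueError(f"unrecognized fields: {sorted(keys)}")
    for field in text_fields:
        value = record[field]
        # Assumed rule: text fields should be non-empty strings.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"field {field!r} must be a non-empty string")
    return kind

def check_file(path):
    """Count record kinds in a JSONL file, skipping blank lines."""
    counts = {"pair": 0, "triplet": 0}
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                counts[check_record(line)] += 1
            except ValueError as err:
                raise ValueError(f"{path}:{line_no}: {err}") from err
    return counts
```

A file that reports both kinds is probably a mix-up, since each example above pairs one dataset format with one loss.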

| Parameter | Default | Description |
|---|---|---|
| --model | required | BERT/encoder model path or HuggingFace ID |
| --dataset | required | Training dataset (pair or triplet JSONL) |
| --output | ./output | Output directory |
| --loss | infonce | Loss function: infonce, triplet, cosent |
| --pooling | cls | Pooling strategy: cls, mean, last_token |
| --normalize | true | L2 normalize embeddings |
| --learning-rate | 2e-5 | Learning rate |
| --batch-size | 32 | Batch size |
| --epochs | 3 | Training epochs |
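
The `--pooling` and `--normalize` options control how per-token encoder outputs become a single sentence vector. The sketch below shows the conventional meaning of each strategy on plain Python lists. It assumes right-padded inputs described by an attention mask and a CLS token at position 0. The function names are hypothetical and nothing here is taken from pmetal's source.

```python
import math

def pool(token_embs, mask, strategy="cls"):
    """Reduce per-token vectors to one sentence vector.

    token_embs: list of token vectors, one per position.
    mask: 1 for real tokens, 0 for padding (same length as token_embs).
    """
    real = [vec for vec, m in zip(token_embs, mask) if m]
    if not real:
        raise ValueError("mask selects no tokens")
    if strategy == "cls":
        # Assumes the first position holds the [CLS] token.
        return list(token_embs[0])
    if strategy == "mean":
        dim = len(real[0])
        return [sum(vec[d] for vec in real) / len(real) for d in range(dim)]
    if strategy == "last_token":
        # Last non-padding position.
        return list(real[-1])
    raise ValueError(f"unknown pooling strategy: {strategy}")

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length; eps guards against zero norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity, which is one common reason to normalize before computing a contrastive loss.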