Hyperparameter guidance
The minimalist design of CONCORD reduces the number of tunable parameters, though several hyperparameters remain critical for optimal performance. While the default settings have been validated across diverse datasets, users are encouraged to explore the following parameters to further fine-tune performance.
To check the default parameters, use the following code:
import concord as ccd
import scanpy as sc
adata = sc.datasets.pbmc3k_processed() # Use this example data or replace this with your own anndata object
adata = adata.raw.to_adata()
model = ccd.Concord(adata)
model.get_default_params()
concord - WARNING - No input feature list provided. It is recommended to first select features using the command `concord.ul.select_features()`.
concord - WARNING - domain/batch information not found, all samples will be treated as from single domain/batch.
concord - WARNING - Only one domain found in the data. Setting p_intra_domain to 1.0.
{'seed': 0, 'input_feature': None, 'normalize_total': False, 'log1p': False, 'batch_size': 256, 'n_epochs': 15, 'lr': 0.01, 'schedule_ratio': 0.97, 'train_frac': 1.0, 'latent_dim': 100, 'encoder_dims': [1000], 'decoder_dims': [1000], 'element_mask_prob': 0.4, 'feature_mask_prob': 0.3, 'domain_key': None, 'class_key': None, 'domain_embedding_dim': 8, 'covariate_embedding_dims': {}, 'use_decoder': False, 'decoder_final_activation': 'relu', 'decoder_weight': 1.0, 'clr_temperature': 0.4, 'clr_beta': 1.0, 'clr_weight': 1.0, 'use_classifier': False, 'classifier_weight': 1.0, 'unlabeled_class': None, 'use_importance_mask': False, 'importance_penalty_weight': 0, 'importance_penalty_type': 'L1', 'dropout_prob': 0.0, 'norm_type': 'layer_norm', 'knn_warmup_epochs': 2, 'sampler_knn': None, 'sampler_emb': None, 'sampler_domain_minibatch_strategy': 'proportional', 'domain_coverage': None, 'dist_metric': 'euclidean', 'p_intra_knn': 0.0, 'p_intra_domain': 1.0, 'use_faiss': True, 'use_ivf': True, 'ivf_nprobe': 10, 'pretrained_model': None, 'preload_dense': False, 'num_workers': None, 'chunked': False, 'chunk_size': 10000, 'device': device(type='cpu')}
Only a few of the above hyperparameters significantly influence the latent representation, and we recommend keeping most defaults as they are. You can override any default hyperparameter by supplying the corresponding argument when constructing the Concord object. For example, here's a typical workflow:
feature_list = ccd.ul.select_features(adata, n_top_features=2000, flavor='seurat_v3') # Loosely select features based on the Seurat v3 method (so that enough information is preserved)
# Make sure adata.X is log-normalized (or use your preferred normalization method) before running Concord, otherwise specify 'normalize_total': True, 'log1p': True,
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
concord_args = {
'adata': adata,
'input_feature': feature_list,
'batch_size':128, # Batch size for training, adjust as needed
'latent_dim': 300, # Latent dimension size, adjust as needed
'encoder_dims':[1000], # Encoder dimensions, recommended to be larger than latent_dim
'element_mask_prob': 0.4, # Probability of masking elements, recommended to be between 0.2 and 0.6
'feature_mask_prob': 0.2, # Probability of masking features, recommended to be between 0.0 and 0.5
'clr_temperature': 0.4, # Temperature for contrastive loss, recommended to be between 0.1 and 0.5
'clr_beta': 1.0, # Concentration parameter for HCL hard-negative sampling, recommended to be between 0.5 and 2.0
'p_intra_domain': 1.0, # Enrichment probability for intra-domain sampling, recommended to be between 0.9 and 1.0, note the lower the value, the more dataset-specific information (may contain batch effects) is preserved
'n_epochs': 15, # Number of epochs for training, adjust as needed
'domain_key': 'batch', # Key in adata.obs for batch labels
'verbose': True, # Verbosity level, set to True for more detailed output
'preload_dense': True, # Whether to preload dense matrix, set to False for very large datasets or if using multi-worker data loading
'seed': 0, # random seed for reproducibility
'device': 'cpu', # Device for training, can be 'cpu', 'cuda' ('cuda:0,1,...'), or 'mps'
'save_dir': '../../save' # Directory to save the model and results
}
cur_ccd = ccd.Concord(**concord_args)
cur_ccd.fit_transform(output_key='Concord')
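After fitting, the latent representation can be used for standard downstream analysis. A minimal sketch with scanpy (assuming fit_transform(output_key='Concord') wrote the embedding to adata.obsm['Concord'] and that adata.obs contains the 'batch' column used above):
import scanpy as sc
# Assumes the CONCORD embedding was stored in adata.obsm['Concord'] by fit_transform
sc.pp.neighbors(adata, use_rep='Concord')   # kNN graph on the CONCORD latent space
sc.tl.umap(adata)                           # 2-D visualization of the embedding
sc.pl.umap(adata, color='batch')            # color by batch to inspect integration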
Below are a few key hyperparameters we recommend users pay particular attention to (also see the supplemental figure in our paper for our hyperparameter benchmarking results):
General hyperparameters
input_feature
: As highlighted in a recent study by Zappia et al., feature selection can significantly influence the performance of data integration and, more broadly, dimensionality reduction. Although CONCORD is generally less sensitive to the input feature set compared to other methods, we still recommend applying mild feature selection to enrich for informative features. CONCORD currently supports two flavors of feature selection: the Seurat method and informative feature selection (Zhu et al., 2020). See the select_features API for details. Note that you may also supply your own feature set, generated by any method, via the input_feature argument, which simply takes a list of feature names (strings); see the sketch below.
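As a hedged sketch, supplying your own gene list looks like this (the placeholder list here is purely illustrative, not a recommended selection):
my_genes = list(adata.var_names[:2000])  # illustrative placeholder; substitute any list of gene-name strings
cur_ccd = ccd.Concord(adata=adata, input_feature=my_genes, domain_key='batch')
cur_ccd.fit_transform(output_key='Concord_customfeatures')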
latent_dim
: Sets the dimensionality of CONCORD’s latent space. This should be large enough to capture the biological heterogeneity in your dataset. In general, there is little downside to using a moderately high value (e.g., ~30-100), especially if you have a reasonable number of cells (e.g., >10,000). For instance, if your dataset contains at least 10 distinct cell types, consider setting latent_dim to a value above 10. In our experience, 30-50 dimensions suffice for the intestinal developmental atlas, while we used 300 dimensions for the C. elegans and C. briggsae atlas to capture variation across more than 600 lineages. For highly heterogeneous datasets, therefore, consider increasing this value. If you do, we recommend setting encoder_dims to a value between the input dimension (i.e., the number of genes) and the latent dimension, e.g., encoder_dims = [500] (see the sketch below).
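For example, a configuration for a highly heterogeneous atlas might look like the sketch below, reusing the concord_args dictionary from the example above (the values are illustrative, not prescriptive):
atlas_args = {**concord_args,
              'latent_dim': 300,      # large latent space for hundreds of cell types/lineages
              'encoder_dims': [500]}  # between the input dimension (~2000 selected genes here) and latent_dim
cur_ccd = ccd.Concord(**atlas_args)
cur_ccd.fit_transform(output_key='Concord_latent300')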
element_mask_prob and feature_mask_prob
: Specifies the fraction of elements and features masked during contrastive learning. These parameters control the level of corruption applied to each input view and have a direct impact on the balance between robustness and resolution in the learned embedding. Masking a larger fraction of genes encourages the model to rely on coarse-grained co-expression programs. This typically improves robustness to noise and dropout but may reduce sensitivity to finer gene-level distinctions.
Conversely, masking fewer genes allows the model to focus on smaller co-expression modules and subtle signals, which can enhance resolution. However, it also increases the risk of overfitting to noise and reduces the variation between augmented views, potentially weakening the contrastive learning signal and requiring more training epochs for convergence.
We recommend setting element_mask_prob between 0.2 and 0.6, and feature_mask_prob between 0.0 and 0.4, which empirically works well for most single-cell datasets. Users are encouraged to tune these hyperparameters based on dataset sparsity and heterogeneity; see the sketch below.
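For instance, a sketch of two contrasting settings built on the concord_args dictionary from the example above (values are illustrative):
# Sparse or noisy data: heavier masking for robustness to dropout
robust_args = {**concord_args, 'element_mask_prob': 0.6, 'feature_mask_prob': 0.4}
# Deeply sequenced data where fine distinctions matter: lighter masking (may need more epochs)
fine_args = {**concord_args, 'element_mask_prob': 0.2, 'feature_mask_prob': 0.0}
cur_ccd = ccd.Concord(**robust_args)
cur_ccd.fit_transform(output_key='Concord_robust')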
clr_temperature
: Controls the sharpness of the similarity distribution in the contrastive loss, as described by Wang & Liu, 2021. This parameter influences the geometry of the learned latent space. Lower temperatures increase the emphasis on distinguishing between samples, leading to tighter alignment of positive pairs and stronger repulsion of negatives. This often results in more distinct and compact representations, which is ideal for discrete clusters, but may lead to over-separation in datasets with continuous structures (e.g., differentiation trajectories).
In contrast, higher temperatures yield smoother embeddings that better preserve gradual transitions or continuous trajectories, albeit potentially at the cost of reduced cluster separation.
We recommend experimenting with clr_temperature within the range of 0.1 to 1.0; a small sweep like the one sketched below can help choose a suitable value.
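A simple way to explore this is a small temperature sweep that stores each embedding under its own key (a sketch, reusing concord_args from the example above; the candidate values are illustrative):
for t in [0.1, 0.3, 0.5, 1.0]:
    ccd_t = ccd.Concord(**{**concord_args, 'clr_temperature': t})
    ccd_t.fit_transform(output_key=f'Concord_temp{t}')  # compare the resulting embeddings downstream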
batch_size
: Increasing this value above the default (256) can substantially improve training speed. However, overly large batch sizes may degrade performance on small datasets. As a rule of thumb, set batch_size to between 1/100 and 1/1000 of your total number of cells and adjust based on your GPU memory capacity.
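One way to express this rule of thumb in code (a sketch; the clamping bounds are illustrative and should be adjusted for your GPU memory):
n_cells = adata.n_obs
batch_size = int(min(max(n_cells // 500, 64), 4096))  # ~1/500 of cells, clamped to [64, 4096]
cur_ccd = ccd.Concord(**{**concord_args, 'batch_size': batch_size})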
n_epochs
: The default value of 15 is sufficient for most cases, as CONCORD generally converges quickly. Reducing this value can decrease runtime, potentially at the cost of slightly reduced resolution. To monitor the contrastive loss convergence pattern, set verbose=True.
p_intra_domain
: Controls the probability of sampling cells from within the same dataset or domain (e.g., batch, condition, technology, or species). The default value of 1.0 generally works well for most cases. If you want the latent representation to capture more batch-specific signals, you may reduce this value to allow more inter-dataset sampling and contrasting.
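For example, relaxing p_intra_domain slightly to retain more dataset-specific signal (a sketch reusing concord_args from above; the value 0.95 is illustrative):
relaxed_args = {**concord_args, 'p_intra_domain': 0.95}  # <1.0 allows some inter-domain sampling, preserving more batch-specific signal
cur_ccd = ccd.Concord(**relaxed_args)
cur_ccd.fit_transform(output_key='Concord_p095')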
HCL and kNN modes for hard-negative sampling
CONCORD supports two modes of hard-negative sampling: the hcl method from Robinson et al., and the kNN method implemented by us. For details, refer to the Methods section of our paper. The default mode is hcl, but you can specify kNN or a hybrid of the two (though the hybrid has not been benchmarked). The sections below describe how to configure each mode.
hcl mode: Enabled by setting a non-zero value for clr_beta, which acts as a concentration parameter. Larger values of clr_beta focus sampling on points whose embeddings are already close to the anchor (resulting in harder negatives), which can lead to refined resolution. Setting clr_beta = 0 reverts to uniform sampling. The default value is 1.0; we do not recommend values larger than 5, as they may obscure global structure.
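A sketch of adjusting clr_beta within the recommended range, again building on concord_args from the example above (values are illustrative):
harder_args = {**concord_args, 'clr_beta': 2.0}   # focus on harder negatives for finer resolution
uniform_args = {**concord_args, 'clr_beta': 0.0}  # revert to uniform negative sampling
cur_ccd = ccd.Concord(**harder_args)
cur_ccd.fit_transform(output_key='Concord_beta2')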
kNN mode: Enabled by setting a non-zero value for p_intra_knn, which controls the probability of sampling cells from within the same k-nearest-neighbor (kNN) neighborhood. The neighborhood size is defined by sampler_knn (default: 1/10 of the total number of cells, capped at a maximum of 1000). Neighborhoods are computed using the embedding specified by sampler_emb.
The default for sampler_emb is None, which means CONCORD first computes an initial embedding using only dataset-aware sampling (without hard-negative sampling) for knn_warmup_epochs epochs (default: 2). After this warmup, it switches to joint dataset-aware and hard-negative sampling. Alternatively, you can provide a custom embedding (e.g., sampler_emb = 'X_pca') from adata.obsm.
Higher values of p_intra_knn encourage the model to contrast cells from nearby states, helping capture subtle local differences. However, we do not recommend setting this value above 0.5, as it may limit global sampling and hinder the model's ability to learn broad cell type distinctions.
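A sketch of a pure kNN configuration built on concord_args from the example above (the exact values are illustrative; supplying sampler_emb='X_pca' assumes adata.obsm['X_pca'] exists):
knn_args = {**concord_args,
            'clr_beta': 0.0,          # disable hcl so only kNN-based hard negatives are used
            'p_intra_knn': 0.3,       # keep below 0.5 to preserve global sampling
            'sampler_knn': 500,       # neighborhood size (default: ~1/10 of cells, capped at 1000)
            'knn_warmup_epochs': 2,   # epochs of dataset-aware-only sampling before kNN kicks in
            'sampler_emb': 'X_pca'}   # or None to use the warmup embedding; assumes adata.obsm['X_pca'] exists
cur_ccd = ccd.Concord(**knn_args)
cur_ccd.fit_transform(output_key='Concord_knn')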
hcl and kNN hybrid mode: This can be enabled by setting non-zero values for both clr_beta and p_intra_knn. Note that this configuration has not been benchmarked.