Hyperparameter guidance
The minimalist design of CONCORD reduces the number of tunable parameters, though several hyperparameters remain critical for optimal performance. While the default settings have been validated across diverse datasets, users are encouraged to explore the following parameters to further fine-tune performance.
To check the default parameters, use the following code:
import concord as ccd
import scanpy as sc
adata = sc.datasets.pbmc3k_processed() # Use this example data or replace this with your own anndata object
adata = adata.raw.to_adata()
model = ccd.Concord(adata)
model.get_default_params()
Concord - WARNING - No input feature list provided. It is recommended to first select features using the command `concord.ul.select_features()`.
Concord - WARNING - domain/batch information not found, all samples will be treated as from single domain/batch.
{'seed': 0, 'project_name': 'concord', 'input_feature': None, 'batch_size': 64, 'n_epochs': 10, 'lr': 0.01, 'schedule_ratio': 0.97, 'train_frac': 1.0, 'latent_dim': 100, 'encoder_dims': [512], 'decoder_dims': [512], 'augmentation_mask_prob': 0.3, 'domain_key': None, 'class_key': None, 'domain_embedding_dim': 8, 'covariate_embedding_dims': {}, 'use_decoder': False, 'decoder_final_activation': 'relu', 'decoder_weight': 1.0, 'clr_mode': 'aug', 'clr_temperature': 0.3, 'clr_weight': 1.0, 'use_classifier': False, 'classifier_weight': 1.0, 'unlabeled_class': None, 'use_importance_mask': False, 'importance_penalty_weight': 0, 'importance_penalty_type': 'L1', 'dropout_prob': 0.1, 'norm_type': 'layer_norm', 'sampler_emb': 'X_pca', 'sampler_knn': None, 'dist_metric': 'euclidean', 'p_intra_knn': 0.3, 'p_intra_domain': 0.95, 'min_p_intra_domain': 0.9, 'max_p_intra_domain': 1.0, 'pca_n_comps': 50, 'use_faiss': True, 'use_ivf': True, 'ivf_nprobe': 10, 'pretrained_model': None, 'classifier_freeze_param': False, 'chunked': False, 'chunk_size': 10000, 'device': device(type='cpu')}
Only a few of the above hyperparameters significantly influence the latent representation, and we recommend keeping most of them at their default values. You can override any default by passing it as a keyword argument when constructing the Concord object. For example, here's a typical workflow:
feature_list = ccd.ul.select_features(adata, n_top_features=2000, flavor='seurat_v3') # Loosely select features based on Seurat v3 method (so that enough information is preserved)
concord_args = {
'input_feature': feature_list,
'latent_dim': 100, # Latent dimension size, adjust as needed
'encoder_dims':[1000], # Encoder dimensions, recommended to be larger than latent_dim
'use_decoder': False, # Whether to use a decoder, set to True if you want to use the decoder
'decoder_dims':[1000], # Decoder dimensions, ignored if use_decoder is False
'augmentation_mask_prob': 0.3, # Probability of masking features, recommended to be between 0.2 and 0.6
'clr_temperature': 0.3, # Temperature for contrastive loss, recommended to be between 0.1 and 0.8
'p_intra_knn': 0.3, # Probability of intra-neighborhood sampling, values above 0.5 are not recommended
'sampler_knn': 300, # Size of neighborhood for intra-neighborhood sampling
'p_intra_domain': .95, # Enrichment probability for intra-domain sampling, recommended to be between 0.85 and 1.0, note the lower the value, the more dataset-specific information (may contain batch effects) is preserved
'domain_key': None, # Key in adata.obs for batch labels
'verbose': False, # Verbosity level, set to True for more detailed output
'inplace': True, # Whether to modify adata in place; if False, copy the embedding out of cur_ccd.adata.obsm
'seed': 1, # random seed for reproducibility
}
cur_ccd = ccd.Concord(adata, **concord_args)
cur_ccd.encode_adata(input_layer_key='X_log1p', output_key='Concord')
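The ranges recommended in this guide can be checked before starting a run. The following is a minimal sketch and not part of the concord package: `check_recommended_ranges` and `RECOMMENDED_RANGES` are hypothetical names, and the bounds are copied from the recommendations in this guide (for clr_temperature, the 0.1 to 1.0 range from its description below).

```python
# Hypothetical helper, not part of the concord package: compares a settings
# dictionary against the ranges recommended in this guide.
RECOMMENDED_RANGES = {
    "augmentation_mask_prob": (0.2, 0.6),
    "clr_temperature": (0.1, 1.0),
    "p_intra_knn": (0.0, 0.5),
    "p_intra_domain": (0.85, 1.0),
}


def check_recommended_ranges(args):
    """Return one message per value that falls outside its recommended range."""
    messages = []
    for name, (low, high) in RECOMMENDED_RANGES.items():
        if name not in args:
            continue  # not overridden, so the package default applies
        value = args[name]
        if not low <= value <= high:
            messages.append(
                f"{name}={value} is outside the recommended range [{low}, {high}]"
            )
    return messages


for message in check_recommended_ranges({"augmentation_mask_prob": 0.3, "p_intra_knn": 0.7}):
    print(message)
```

Values outside these ranges are not necessarily wrong; the check is only a reminder that a setting departs from the recommendations.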
Below are a few key hyperparameters we recommend users pay particular attention to:
input_feature
: As highlighted in a recent study by Zappia et al., feature selection can significantly influence the performance of data integration and, more broadly, dimensionality reduction. Although CONCORD is generally less sensitive to the input feature set than other methods, we still recommend applying mild feature selection to enrich for informative features.
CONCORD supports two flavors of feature selection: the Seurat method and informative feature selection (Zhu et al., 2020). See the select_features API for details.
Note that you may also supply your own feature set, generated by any method, via the input_feature argument, which simply takes a list of feature names (strings).
latent_dim
: Sets the dimensionality of CONCORD’s latent space. This should be large enough to capture the biological heterogeneity in your dataset. In general, there's little downside to using a higher value, especially if you have a moderate number of cells (e.g., >10,000). For instance, if your dataset contains at least 10 distinct cell types, consider setting latent_dim to a value above 10. In our experience, 30-50 dimensions suffice for the intestinal developmental atlas, while we used 300 dimensions for the C. elegans and C. briggsae atlas to capture variation across more than 600 lineages. Therefore, for highly heterogeneous datasets, consider increasing this value. If you do, we recommend adjusting encoder_dims to fall between the input dimension (i.e., number of genes) and the latent dimension.
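As one way to pick an encoder width between the input and latent dimensions, the sketch below uses their geometric mean. This heuristic and the name `suggest_encoder_width` are assumptions made for illustration; the guide only asks that the width fall between the two sizes.

```python
import math


def suggest_encoder_width(n_input_features, latent_dim):
    """Geometric mean of the input and latent sizes (illustrative heuristic only)."""
    if latent_dim >= n_input_features:
        raise ValueError("latent_dim should be smaller than the number of input features")
    return round(math.sqrt(n_input_features * latent_dim))


# Example: 2000 selected features and a 300-dimensional latent space.
width = suggest_encoder_width(2000, 300)
overrides = {"latent_dim": 300, "encoder_dims": [width]}
print(overrides)
```

Any other value between the two sizes satisfies the same recommendation; the geometric mean is simply a convenient midpoint on a log scale.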
augmentation_mask_prob
: Specifies the fraction of features masked during contrastive learning. This parameter controls the level of corruption applied to each input view and has a direct impact on the balance between robustness and resolution in the learned embedding.
Masking a larger fraction of genes encourages the model to rely on broader contextual signals, enabling it to learn more stable and coarse-grained co-expression programs. This typically improves robustness to noise and dropout but may reduce sensitivity to finer gene-level distinctions.
Conversely, masking fewer genes allows the model to focus on smaller co-expression modules and subtle signals, which can enhance resolution. However, it also increases the risk of overfitting to noise and reduces the variation between augmented views, potentially weakening the contrastive learning signal and requiring more training epochs for convergence.
We recommend setting this value between 0.2 and 0.6, which empirically works well for most single-cell datasets. Users are encouraged to tune this hyperparameter based on dataset sparsity and heterogeneity.
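To make the effect of the masking fraction concrete, here is a toy sketch of random feature masking using only the Python standard library. It is an illustration and not CONCORD's implementation: the function `mask_features` is hypothetical, and setting masked features to zero is an assumption.

```python
import random


def mask_features(expression, mask_prob, rng):
    """Zero each feature independently with probability mask_prob."""
    return [0.0 if rng.random() < mask_prob else value for value in expression]


rng = random.Random(0)
cell = [1.0] * 1000  # toy cell in which every feature is expressed

# Two independently masked views of the same cell, as used for a positive pair.
view_a = mask_features(cell, 0.3, rng)
view_b = mask_features(cell, 0.3, rng)

masked_fraction = view_a.count(0.0) / len(view_a)
shared_fraction = sum(
    1 for a, b in zip(view_a, view_b) if a != 0.0 and b != 0.0
) / len(cell)
print(f"masked in view A: {masked_fraction:.2f}")
print(f"visible in both views: {shared_fraction:.2f}")
```

With a masking probability p, each feature stays visible in both views with probability (1 - p)^2, so raising p quickly reduces the information the two views share, while lowering p makes the views nearly identical.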
clr_temperature
: Controls the sharpness of the similarity distribution in the contrastive loss, as described by Wang & Liu, 2021.
This parameter influences the geometry of the learned latent space. Lower temperatures increase the emphasis on distinguishing between samples, leading to tighter alignment of positive pairs and stronger repulsion of negatives. This often results in more distinct and compact representations, which is ideal for discrete clusters, but may lead to over-separation in datasets with continuous structures (e.g., differentiation trajectories).
In contrast, higher temperatures yield smoother embeddings that better preserve gradual transitions or continuous trajectories, albeit potentially at the cost of reduced cluster separation.
We recommend tuning clr_temperature within the range of 0.1 to 1.0.
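The sharpening effect of the temperature can be seen in a small, generic example. The sketch below applies a temperature-scaled softmax to a few made-up similarity scores; it illustrates the general mechanism in contrastive losses and is not taken from CONCORD's code.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities divided by the temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up similarities between one cell and four candidates, most similar first.
similarities = [0.9, 0.7, 0.2, -0.1]

sharp = softmax_with_temperature(similarities, 0.1)
smooth = softmax_with_temperature(similarities, 1.0)
print("temperature 0.1:", [round(p, 3) for p in sharp])
print("temperature 1.0:", [round(p, 3) for p in smooth])
```

At the lower temperature nearly all of the probability mass goes to the most similar candidate, whereas at the higher temperature it is spread more evenly, which corresponds to the smoother embeddings described above.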
p_intra_knn
: Controls the probability of sampling cells from within the same k-nearest-neighbor (KNN) neighborhood. The neighborhood size is defined by sampler_knn (default: 1/10 of the total number of cells), and neighborhoods are computed using the embedding specified by sampler_emb (default: PCA, though custom embeddings from adata.obsm can also be used).
Higher values of p_intra_knn encourage the model to contrast cells from nearby states, helping it capture subtle, local differences. However, we do not recommend setting this value above 0.5, as doing so may limit global sampling. Maintaining a balance between local and global sampling is important for learning both fine-grained variation and broad cell type distinctions.
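The balance between local and global sampling can be illustrated with a toy sampler. This is a simplified sketch under stated assumptions, not CONCORD's sampler: `sample_batch` is a hypothetical function, the neighborhood is a fixed list of indices, and the local-versus-global choice is made independently for each cell.

```python
import random


def sample_batch(n_cells, neighborhood, batch_size, p_intra_knn, rng):
    """Draw each index from `neighborhood` with probability p_intra_knn,
    otherwise uniformly from all cells."""
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_intra_knn:
            batch.append(rng.choice(neighborhood))  # local draw
        else:
            batch.append(rng.randrange(n_cells))  # global draw
    return batch


rng = random.Random(0)
n_cells = 10_000
neighborhood = list(range(1000))  # toy neighborhood covering 1/10 of the cells

batch = sample_batch(n_cells, neighborhood, 2000, 0.3, rng)
local_fraction = sum(1 for i in batch if i < 1000) / len(batch)
print(f"fraction of batch inside the neighborhood: {local_fraction:.2f}")
```

In this toy setup the expected fraction is 0.3 + 0.7 × 0.1 = 0.37, because global draws also land in the neighborhood occasionally. Pushing p_intra_knn toward 1 would leave few cells from outside the neighborhood to contrast against.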
p_intra_domain
: Controls the probability of sampling cells from within the same dataset or domain (e.g., batch, condition, technology, or species). For datasets with mild to moderate batch effects, the default value of 0.95 generally works well. If your data exhibit strong batch effects, consider increasing this value to 1.0 to avoid cross-dataset comparisons. Conversely, if you want the model to capture more batch-specific signals, you may reduce this value to allow more inter-dataset sampling and contrasting.
batch_size
: Increasing this value above the default (64) can substantially improve training speed. However, overly large batch sizes may degrade performance on small datasets. As a rule of thumb, set batch_size to between 1/1000 and 1/100 of your total number of cells, and adjust based on your GPU memory capacity.
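The rule of thumb can be turned into a quick calculation. The helper below is a hypothetical sketch; using the default of 64 as a lower bound for small datasets is an assumption made here, not a documented behavior.

```python
def batch_size_range(n_cells, minimum=64):
    """Return (low, high) batch sizes at 1/1000 and 1/100 of the cell count,
    never going below `minimum` (assumed floor: the default of 64)."""
    low = max(minimum, n_cells // 1000)
    high = max(minimum, n_cells // 100)
    return low, high


for n_cells in (3_000, 100_000, 1_000_000):
    low, high = batch_size_range(n_cells)
    print(f"{n_cells} cells: batch_size between {low} and {high}")
```

The upper end of the range still has to fit in GPU memory, so treat these numbers as a starting point for tuning.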