Utilities

The `utils` submodule is also available under the alias `ul`; for example, `concord.utils.list_adata_files` can be written as `concord.ul.list_adata_files`.

`concord.utils.benchmark_topology(diagrams, expected_betti_numbers=[1, 0, 0], n_bins=100, save_dir=None, file_suffix=None)`

Benchmark the topological properties of persistence diagrams.

Parameters:

Name	Description	Default
`diagrams`	dict A dictionary where keys are method names and values are persistence diagrams.	required
`expected_betti_numbers`	list, optional A list specifying the expected Betti numbers for different homology dimensions. Default is [1, 0, 0].	`[1, 0, 0]`
`n_bins`	int, optional Number of bins to use for Betti curve calculations. Default is 100.	`100`
`save_dir`	str, optional Directory to save benchmarking results as CSV files. If None, results are not saved.	`None`
`file_suffix`	str, optional Suffix to append to saved filenames.	`None`

Returns:

Type	Description
	dict A dictionary containing: - `'betti_stats'`: DataFrame summarizing Betti statistics. - `'distance_metrics'`: DataFrame of computed distances between Betti curves. - `'combined_metrics'`: DataFrame of entropy, variance, and L1 distance metrics.

`concord.utils.compute_persistent_homology(adata, key='X_pca', homology_dimensions=[0, 1, 2])`

Computes persistent homology using Vietoris-Rips complex.

Parameters:

Name	Description	Default
`adata`	anndata.AnnData The AnnData object containing the data.	required
`key`	str, optional The key in `adata.obsm` specifying the embedding to use. Default is 'X_pca'.	`'X_pca'`
`homology_dimensions`	list, optional List of homology dimensions to compute. Default is [0, 1, 2].	`[0, 1, 2]`

Returns:

Type	Description
	np.ndarray Persistence diagrams representing homology classes across filtration values.

`concord.utils.compute_betti_median_or_mode(betti_values, statistic='median')`

Computes the median or mode of Betti numbers.

Parameters:

Name	Type	Description	Default
`betti_values`	`np.ndarray`	Array of Betti numbers across filtration values.	required
`statistic`	`str`	Statistic to compute ('median' or 'mode'). Default is 'median'.	`'median'`

Returns:

Type	Description
	float The computed median or mode of the Betti numbers.
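
As an illustration of what this summary statistic means, the sketch below reduces a Betti curve to its median or its most frequent value using only the standard library. The function name `betti_median_or_mode_sketch` and the tie-breaking rule (smallest value wins) are assumptions made for this example and are not taken from the library.

```python
from statistics import median, multimode


def betti_median_or_mode_sketch(betti_values, statistic="median"):
    """Summarise a Betti curve by its median or its most frequent value."""
    values = list(betti_values)
    if statistic == "median":
        return float(median(values))
    if statistic == "mode":
        # multimode returns every most-frequent value; taking the smallest
        # is a tie-breaking choice made for this sketch.
        return float(min(multimode(values)))
    raise ValueError("statistic must be 'median' or 'mode'")


# A toy Betti-1 curve sampled at seven filtration values.
curve = [0, 1, 1, 1, 2, 1, 0]
curve_median = betti_median_or_mode_sketch(curve, "median")
curve_mode = betti_median_or_mode_sketch(curve, "mode")
```

Both statistics describe the Betti number that dominates the filtration range, which is why they are natural candidates for the "observed" Betti number compared against an expected one.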

`concord.utils.compute_betti_entropy(betti_values)`

Computes the entropy of the Betti curve.

Parameters:

Name	Type	Description	Default
`betti_values`	`np.ndarray`	Array of Betti numbers across filtration values.	required

Returns:

Type	Description
	float The entropy of the Betti curve.
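
The reference does not state how the entropy is defined. One plausible reading, sketched here as an assumption, is to normalise the Betti curve so that it sums to 1 and take the Shannon entropy of the result. The name `betti_curve_entropy_sketch` and the use of natural logarithms are choices made for this example only.

```python
import math


def betti_curve_entropy_sketch(betti_values):
    """Shannon entropy (in nats) of a Betti curve normalised to sum to 1.

    Assumption: the curve is treated as an unnormalised probability
    distribution over filtration values.
    """
    total = sum(betti_values)
    if total == 0:
        return 0.0  # an all-zero curve has no mass to distribute
    probs = [b / total for b in betti_values if b > 0]
    return -sum(p * math.log(p) for p in probs)


flat = betti_curve_entropy_sketch([1, 1, 1, 1])    # mass spread evenly
peaked = betti_curve_entropy_sketch([0, 4, 0, 0])  # mass at a single value
```

Under this reading, a curve whose features persist evenly across the filtration has high entropy, while a curve concentrated at one filtration value has entropy 0.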

`concord.utils.interpolate_betti_curve(betti_values, original_sampling, common_sampling)`

Interpolates Betti curve onto a common filtration grid.

Parameters:

Name	Description	Default
`betti_values`	np.ndarray Array of Betti numbers.	required
`original_sampling`	np.ndarray The original filtration values associated with the Betti numbers.	required
`common_sampling`	np.ndarray The target filtration values for interpolation.	required

Returns:

Type	Description
	np.ndarray Interpolated Betti curve.
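
Curves computed on different filtration grids have to be resampled onto one grid before they can be compared point by point. The sketch below shows piecewise-linear resampling with endpoint clamping in plain Python. Whether the library interpolates linearly or treats the curve as a step function is not stated here, so the scheme, like the name `interpolate_curve_sketch`, is an assumption.

```python
from bisect import bisect_right


def interpolate_curve_sketch(betti_values, original_sampling, common_sampling):
    """Piecewise-linear interpolation onto a new grid.

    Assumes original_sampling is strictly increasing. Targets outside the
    original range are clamped to the first or last value.
    """
    xs, ys = list(original_sampling), list(betti_values)
    out = []
    for x in common_sampling:
        if x <= xs[0]:
            out.append(float(ys[0]))
        elif x >= xs[-1]:
            out.append(float(ys[-1]))
        else:
            i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            out.append(ys[i - 1] + t * (ys[i] - ys[i - 1]))
    return out


resampled = interpolate_curve_sketch(
    betti_values=[0, 2, 2],
    original_sampling=[0.0, 1.0, 2.0],
    common_sampling=[0.0, 0.5, 1.5, 3.0],
)
```

With NumPy available, `np.interp(common_sampling, original_sampling, betti_values)` performs the same linear interpolation with the same default clamping.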

`concord.utils.compute_betti_statistics(diagram, expected_betti_numbers, n_bins=100)`

Computes Betti statistics given a persistence diagram.

Parameters:

Name	Description	Default
`diagram`	np.ndarray Persistence diagram from Giotto-TDA.	required
`expected_betti_numbers`	np.ndarray Expected Betti numbers for different homology dimensions.	required
`n_bins`	int, optional Number of bins for the Betti curve computation. Default is 100.	`100`

Returns:

Type	Description
	dict A dictionary containing: - `'betti_stats'`: Dictionary of Betti statistics. - `'observed_betti_numbers'`: Observed Betti numbers. - `'expected_betti_numbers'`: Expected Betti numbers. - `'l1_distance'`: L1 distance between observed and expected Betti numbers. - `'l2_distance'`: L2 distance between observed and expected Betti numbers. - `'total_relative_error'`: Total relative error.
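
The L1 and L2 distances in the returned dictionary compare the observed Betti numbers with the expected ones, one entry per homology dimension. A minimal sketch of those two quantities follows; the helper name `betti_distances_sketch` is hypothetical, and the total relative error is left out because its exact definition is not given in this reference.

```python
import math


def betti_distances_sketch(observed, expected):
    """L1 and L2 distances between observed and expected Betti numbers."""
    diffs = [o - e for o, e in zip(observed, expected)]
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))
    return l1, l2


# A single connected component is expected, but two spurious loops are observed.
l1, l2 = betti_distances_sketch(observed=[1, 2, 0], expected=[1, 0, 0])
```

Both distances are 0 only when every homology dimension matches the expectation. L2 penalises one large mismatch more heavily than several small ones.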

`concord.utils.summarize_betti_statistics(betti_stats)`

Summarizes Betti statistics into pandas DataFrames.

Parameters:

Name	Type	Description	Default
`betti_stats`	`dict`	Dictionary containing Betti statistics for different methods.	required

Returns:

Type	Description
	tuple - `betti_stats_pivot`: DataFrame of Betti statistics. - `distance_metrics_df`: DataFrame of distance metrics.

`concord.utils.benchmark_geometry(adata, keys, eval_metrics=['pseudotime', 'cell_distance_corr', 'local_distal_corr', 'trustworthiness', 'state_distance_corr', 'state_dispersion_corr', 'state_batch_distance_ratio'], dist_metric='cosine', groundtruth_key='PCA_no_noise', state_key='cluster', batch_key='batch', groundtruth_dispersion=None, ground_truth_dispersion_key='wt_noise', corr_types=['pearsonr', 'spearmanr', 'kendalltau'], trustworthiness_n_neighbors=np.arange(10, 101, 10), dispersion_metric='var', return_type='dataframe', local_percentile=0.1, distal_percentile=0.9, start_point=0, end_point=None, pseudotime_k=30, truetime_key='time', verbose=True, save_dir=None, file_suffix=None)`

Benchmark the geometric properties of different embeddings.

Parameters:

Name	Description	Default
`adata`	anndata.AnnData The AnnData object containing cell embeddings.	required
`keys`	list List of embeddings (keys in `adata.obsm`) to evaluate.	required
`eval_metrics`	list, optional Metrics to compute, such as 'pseudotime', 'cell_distance_corr', etc. Default includes multiple metrics.	`['pseudotime', 'cell_distance_corr', 'local_distal_corr', 'trustworthiness', 'state_distance_corr', 'state_dispersion_corr', 'state_batch_distance_ratio']`
`dist_metric`	str, optional Distance metric for computing cell distances. Default is 'cosine'.	`'cosine'`
`groundtruth_key`	str, optional Key in `adata.obsm` containing the ground truth embedding. Default is 'PCA_no_noise'.	`'PCA_no_noise'`
`state_key`	str, optional Key in `adata.obs` representing cell states or clusters.	`'cluster'`
`batch_key`	str, optional Key in `adata.obs` representing batch information.	`'batch'`
`groundtruth_dispersion`	dict, optional Precomputed dispersion values for ground truth, if available.	`None`
`ground_truth_dispersion_key`	str, optional Key used when computing dispersion correlations. Default is 'wt_noise'.	`'wt_noise'`
`corr_types`	list, optional List of correlation methods to compute. Default includes 'pearsonr', 'spearmanr', and 'kendalltau'.	`['pearsonr', 'spearmanr', 'kendalltau']`
`trustworthiness_n_neighbors`	np.ndarray, optional Range of neighborhood sizes for trustworthiness computation. Default is `np.arange(10, 101, 10)`.	`np.arange(10, 101, 10)`
`dispersion_metric`	str, optional Metric to compute dispersion, e.g., 'var' (variance). Default is 'var'.	`'var'`
`return_type`	str, optional If 'dataframe', returns summary statistics; if 'full', returns additional details. Default is 'dataframe'.	`'dataframe'`
`local_percentile`	float, optional Percentile threshold for local distance correlations. Default is 0.1.	`0.1`
`distal_percentile`	float, optional Percentile threshold for distal distance correlations. Default is 0.9.	`0.9`
`start_point`	int, optional Index of the starting cell for pseudotime computation. Must be specified.	`0`
`end_point`	int, optional Index of the ending cell for pseudotime computation. Must be specified.	`None`
`pseudotime_k`	int, optional Number of neighbors used in k-NN graph for pseudotime computation. Default is 30.	`30`
`truetime_key`	str, optional Key in `adata.obs` representing ground truth time. Default is 'time'.	`'time'`
`verbose`	bool, optional Whether to enable logging. Default is True.	`True`
`save_dir`	str, optional Directory to save benchmarking results. If None, results are not saved.	`None`
`file_suffix`	str, optional Suffix for saved filenames.	`None`

Returns:

Type	Description
	pd.DataFrame or tuple If `return_type='dataframe'`, returns a DataFrame summarizing benchmark results. If `return_type='full'`, returns both the DataFrame and a detailed results dictionary.

`concord.utils.pairwise_distance(adata, keys, metric='cosine')`

`concord.utils.local_vs_distal_corr(X_high, X_low, local_percentile=25, distal_percentile=75, method='pearsonr')`

Computes correlation between local and distal pairwise distances.

Parameters:

Name	Type	Description	Default
`X_high`	`ndarray`	High-dimensional data matrix.	required
`X_low`	`ndarray`	Low-dimensional embedding matrix.	required
`local_percentile`	`int`	Percentile threshold for local distances. Defaults to 25.	`25`
`distal_percentile`	`int`	Percentile threshold for distal distances. Defaults to 75.	`75`
`method`	`str`	Correlation method; 'pearsonr', 'spearmanr', or 'kendalltau'. Defaults to 'pearsonr'.	`'pearsonr'`

Returns:

Name	Type	Description
`float`		Correlation for local distances.
`float`		Correlation for distal distances.
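
The idea behind this metric can be sketched as follows: compute all pairwise distances in both spaces, split the pairs into a "local" set (high-dimensional distance at or below the local percentile) and a "distal" set (at or above the distal percentile), and correlate high- and low-dimensional distances within each set. The code below is a standard-library sketch under those assumptions. It uses Euclidean distance and Pearson correlation only, and all helper names are hypothetical.

```python
import math
import random
from itertools import combinations


def pairwise_distances(X):
    """Euclidean distances for every unordered pair of rows."""
    return [math.dist(a, b) for a, b in combinations(X, 2)]


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def local_vs_distal_sketch(X_high, X_low, local_percentile=25, distal_percentile=75):
    d_high, d_low = pairwise_distances(X_high), pairwise_distances(X_low)
    local_cut = percentile(d_high, local_percentile)
    distal_cut = percentile(d_high, distal_percentile)
    # Pairs are assigned to a set by their high-dimensional distance (assumption).
    local = [(h, l) for h, l in zip(d_high, d_low) if h <= local_cut]
    distal = [(h, l) for h, l in zip(d_high, d_low) if h >= distal_cut]
    return pearson(*zip(*local)), pearson(*zip(*distal))


rng = random.Random(0)
X_high = [[rng.random() for _ in range(5)] for _ in range(30)]
X_low = [row[:2] for row in X_high]  # a lossy toy "embedding"
local_r, distal_r = local_vs_distal_sketch(X_high, X_low)
```

Reporting the two correlations separately shows whether an embedding preserves fine neighbourhood structure, global layout, or both.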

`concord.utils.compute_state_batch_distance_ratio(adata, basis='X_latent', batch_key='batch', state_key='cluster', metric='cosine')`

Computes the Batch-to-State Distance Ratio using centroids to evaluate batch correction.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData object containing latent embeddings.	required
`basis`	`str`	Key for latent embeddings in adata.obsm. Defaults to 'X_latent'.	`'X_latent'`
`batch_key`	`str`	Key for batch labels in adata.obs. Defaults to 'batch'.	`'batch'`
`state_key`	`str`	Key for cell state labels in adata.obs. Defaults to 'cluster'.	`'cluster'`
`metric`	`str`	Distance metric to use, e.g., 'cosine' or 'euclidean'. Defaults to 'cosine'.	`'cosine'`

Returns:

Name	Type	Description
`float`		Ratio of average batch distance to average state distance.
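
A simplified sketch of a centroid-based ratio is shown below: average the distance between batch centroids, average the distance between state centroids, and divide the first by the second. This is an assumption about the general shape of the metric, not the library's exact procedure. It uses Euclidean distance instead of the default cosine metric, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def centroids(points, labels):
    """Mean coordinate vector for each label."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: [sum(col) / len(col) for col in zip(*pts)]
        for label, pts in groups.items()
    }


def mean_pairwise_distance(cents):
    pairs = list(combinations(cents.values(), 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def batch_to_state_ratio_sketch(points, batches, states):
    batch_dist = mean_pairwise_distance(centroids(points, batches))
    state_dist = mean_pairwise_distance(centroids(points, states))
    return batch_dist / state_dist


# Two states far apart along x, two batches slightly offset along y.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
batches = ["b1", "b2", "b1", "b2"]
states = ["A", "A", "B", "B"]
ratio = batch_to_state_ratio_sketch(points, batches, states)
```

A ratio well below 1 indicates that batches sit much closer together than biological states do, which is the outcome batch correction aims for.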

`concord.utils.compute_trustworthiness(adata, embedding_keys, groundtruth, metric='euclidean', n_neighbors=10)`

Evaluates trustworthiness of embeddings in an AnnData object.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData object containing embeddings in adata.obsm.	required
`embedding_keys`	`list`	List of keys in adata.obsm to evaluate (e.g., ['X_umap', 'X_tsne']).	required
`groundtruth`	`str or ndarray`	Key in adata.obsm or adata.layers for ground truth data, or a precomputed matrix.	required
`metric`	`str`	Distance metric for trustworthiness calculation, e.g., 'euclidean' or 'cosine'. Defaults to 'euclidean'.	`'euclidean'`
`n_neighbors`	`int or list`	Neighborhood sizes for trustworthiness evaluation. Defaults to 10.	`10`

Returns:

Type	Description
	pandas.DataFrame: Trustworthiness scores for each embedding at each neighborhood size.
	pandas.DataFrame: Summary statistics with average trustworthiness and decay rate.
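
Trustworthiness penalises points that appear among the k nearest neighbours in the embedding but are not among the k nearest in the reference space, weighted by how far down the reference ranking they are. The sketch below implements the standard formula with Euclidean distances in plain Python. It illustrates the metric only: it is not the function documented here, the helper names are hypothetical, and it assumes k < n / 2.

```python
import math


def neighbor_order(X):
    """For each point, the other points' indices sorted by distance."""
    n = len(X)
    return [
        sorted((j for j in range(n) if j != i), key=lambda j: math.dist(X[i], X[j]))
        for i in range(n)
    ]


def trustworthiness_sketch(X_high, X_low, k):
    n = len(X_high)
    high_order, low_order = neighbor_order(X_high), neighbor_order(X_low)
    penalty = 0
    for i in range(n):
        # 1-based rank of every other point in the reference space.
        rank_high = {j: r + 1 for r, j in enumerate(high_order[i])}
        for j in low_order[i][:k]:
            if rank_high[j] > k:  # an "intruder" in the embedded neighbourhood
                penalty += rank_high[j] - k
    return 1 - 2 * penalty / (n * k * (2 * n - 3 * k - 1))


line = [[float(i)] for i in range(10)]
shuffled = [[float(p)] for p in [0, 5, 1, 6, 2, 7, 3, 8, 4, 9]]
perfect = trustworthiness_sketch(line, line, k=2)
degraded = trustworthiness_sketch(line, shuffled, k=2)
```

Evaluating the score over a range of k values, as `n_neighbors` allows, shows how quickly neighbourhood preservation decays from local to broader scales.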

`concord.utils.Simulation`

A class for simulating single-cell gene expression data with various structures and batch effects.

Parameters:

Name	Type	Description	Default
`n_cells`	`int`	Number of cells to simulate. Defaults to 1000.	`1000`
`n_genes`	`int`	Number of genes to simulate. Defaults to 1000.	`1000`
`n_batches`	`int`	Number of batches to simulate. Defaults to 2.	`2`
`n_states`	`int`	Number of states (e.g., clusters, trajectories). Defaults to 3.	`3`
`state_type`	`str`	Type of state to simulate; options include 'cluster', 'trajectory', 'tree', etc. Defaults to 'cluster'.	`'cluster'`
`batch_type`	`str or list`	Type of batch effect; options include 'batch_specific_features', 'variance_inflation', etc. Defaults to 'batch_specific_features'.	`'batch_specific_features'`
`state_distribution`	`str`	Distribution type for states; e.g., 'normal', 'poisson'. Defaults to 'normal'.	`'normal'`
`state_level`	`float`	Mean expression level for states. Defaults to 1.0.	`1.0`
`state_min_level`	`float`	Minimum expression level. Defaults to 0.0.	`0.0`
`state_dispersion`	`float`	Dispersion of state expression. Defaults to 0.1.	`0.1`
`program_structure`	`str`	Gene expression program structure; e.g., 'linear', 'bidirectional'. Defaults to "linear".	`'linear'`
`program_on_time_fraction`	`float`	Fraction of time the program is on. Defaults to 0.3.	`0.3`
`program_gap_size`	`int`	Size of gaps in expression programs. Defaults to 1.	`1`
`program_noise_in_block`	`bool`	Whether to add noise within each expression block. Defaults to True.	`True`
`trajectory_program_num`	`int`	Number of programs in a trajectory simulation. Defaults to 3.	`3`
`trajectory_cell_block_size_ratio`	`float`	Ratio of cell block sizes in a trajectory. Defaults to 0.3.	`0.3`
`trajectory_loop_to`	`int or list`	Loop connection in trajectory simulations. Defaults to None.	`None`
`tree_branching_factor`	`int`	Number of branches per tree level. Defaults to 2.	`2`
`tree_depth`	`int`	Depth of the simulated tree. Defaults to 3.	`3`
`tree_program_decay`	`float`	Decay factor for tree programs across branches. Defaults to 0.5.	`0.5`
`tree_cellcount_decay`	`float`	Decay factor for cell numbers across tree branches. Defaults to 1.0.	`1.0`
`batch_distribution`	`str or list`	Distribution for batch effects. Defaults to 'normal'.	`'normal'`
`batch_level`	`float or list`	Magnitude of batch effects. Defaults to 1.0.	`1.0`
`batch_dispersion`	`float or list`	Dispersion of batch effects. Defaults to 0.1.	`0.1`
`batch_cell_proportion`	`list`	Proportion of cells per batch. Defaults to None.	`None`
`batch_feature_frac`	`float or list`	Fraction of genes affected by batch effects. Defaults to 0.1.	`0.1`
`global_non_specific_gene_fraction`	`float`	Fraction of genes that are globally non-specific. Defaults to 0.1.	`0.1`
`pairwise_non_specific_gene_fraction`	`dict`	Pairwise-specific gene fraction between state pairs. Defaults to None.	`None`
`universal_gene_fraction`	`float`	Fraction of universal genes expressed across all cells. Defaults to 0.0.	`0.0`
`non_neg`	`bool`	Whether to enforce non-negative expression values. Defaults to False.	`False`
`to_int`	`bool`	Whether to convert expression values to integers. Defaults to False.	`False`
`seed`	`int`	Random seed for reproducibility. Defaults to 0.	`0`

Methods:

Name	Description
`simulate_data`	Simulates gene expression data, including batch effects.
`simulate_state`	Simulates cell state-specific gene expression patterns.
`simulate_batch`	Simulates batch-specific effects on gene expression.
`simulate_clusters`	Simulates gene expression in discrete clusters.
`simulate_trajectory`	Simulates continuous gene expression trajectories.
`simulate_tree`	Simulates hierarchical branching gene expression.
`simulate_gatto`	Simulates expression patterns similar to Gatto et al., 2023.
`simulate_s_curve`	Simulates an S-curve structure in gene expression.
`simulate_swiss_roll`	Simulates a Swiss roll structure with optional hole.
`simulate_expression_block`	Generates structured gene expression within a cell population.
`simulate_dropout`	Simulates dropout in gene expression data.
`downsample_mtx_umi`	Performs UMI count downsampling.
`simulate_distribution`	Samples values from specified distributions.

`downsample_mtx_umi(mtx, ratio=0.1, seed=1)` `staticmethod`

Simulates downsampling of a gene expression matrix (UMI counts) by a given ratio.

Parameters:

Name	Type	Description	Default
`mtx`	`ndarray`	The input matrix where rows represent genes and columns represent cells.	required
`ratio`	`float`	The downsampling ratio (default 0.1).	`0.1`
`seed`	`int`	Random seed for reproducibility (default 1).	`1`

Returns:

Type	Description
	numpy.ndarray: The downsampled matrix.
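
One common way to downsample UMI counts is binomial thinning: each observed molecule is kept independently with probability `ratio`. The sketch below shows that technique on nested lists. Whether the library uses exactly this scheme is an assumption, and `downsample_counts_sketch` is a hypothetical name.

```python
import random


def downsample_counts_sketch(mtx, ratio=0.1, seed=1):
    """Keep each UMI independently with probability `ratio` (binomial thinning).

    Assumes every entry of mtx is a non-negative integer count.
    """
    rng = random.Random(seed)
    return [
        [sum(rng.random() < ratio for _ in range(count)) for count in row]
        for row in mtx
    ]


counts = [[10, 0, 3], [250, 7, 0]]
thinned = downsample_counts_sketch(counts, ratio=0.1, seed=1)
```

Thinning never increases a count and leaves zeros at zero, so the sparsity pattern can only become sparser.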

`rnegbin(mu, theta, size)` `staticmethod`

Generate random numbers from a negative binomial distribution.

Parameters:

Name	Description	Default
`mu`	Mean of the distribution.	required
`theta`	Dispersion parameter.	required
`size`	Number of random numbers to generate.	required
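
A negative binomial with mean `mu` and dispersion `theta` can be sampled as a gamma-Poisson mixture: draw a rate from a gamma distribution with shape `theta` and scale `mu / theta`, then draw a Poisson count at that rate. The sketch below does this with the standard library. The function names and the `seed` argument are additions for this example, and the simple Poisson sampler is only suitable for moderate rates.

```python
import math
import random


def poisson_sketch(rng, lam):
    """Knuth's multiplication method; adequate for small to moderate rates."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def rnegbin_sketch(mu, theta, size, seed=0):
    rng = random.Random(seed)
    # Gamma(shape=theta, scale=mu/theta) has mean mu. Mixing a Poisson over it
    # yields a negative binomial with variance mu + mu**2 / theta.
    return [poisson_sketch(rng, rng.gammavariate(theta, mu / theta)) for _ in range(size)]


samples = rnegbin_sketch(mu=5.0, theta=2.0, size=20000)
sample_mean = sum(samples) / len(samples)
```

With NumPy, `Generator.negative_binomial(n=theta, p=theta / (theta + mu), size=size)` samples the same distribution directly.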

`simulate_batch(adata, cell_indices=None, cell_proportion=0.3, batch_name='batch_1', effect_type='batch_specific_features', distribution='normal', level=1.0, dispersion=0.1, batch_feature_frac=0.1, seed=42)`

Applies batch-specific effects to an existing simulated dataset.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Base dataset to apply batch effects.	required
`cell_indices`	`array-like`	Indices of cells to modify. Defaults to None.	`None`
`cell_proportion`	`float`	Proportion of cells affected. Defaults to 0.3.	`0.3`
`batch_name`	`str`	Name of the batch. Defaults to 'batch_1'.	`'batch_1'`
`effect_type`	`str`	Type of batch effect (e.g., 'batch_specific_features', 'variance_inflation'). Defaults to 'batch_specific_features'.	`'batch_specific_features'`
`distribution`	`str`	Distribution type for batch effects (e.g., 'normal'). Defaults to 'normal'.	`'normal'`
`level`	`float`	Effect level (e.g., scaling factor). Defaults to 1.0.	`1.0`
`dispersion`	`float`	Dispersion of batch effects. Defaults to 0.1.	`0.1`
`batch_feature_frac`	`float`	Fraction of genes affected by batch effects. Defaults to 0.1.	`0.1`
`seed`	`int`	Random seed for reproducibility. Defaults to 42.	`42`

Returns:

Name	Type	Description
`tuple`		batch_adata (AnnData): Modified dataset with batch effects. batch_adata_pre (AnnData): Dataset before applying batch effects.

`simulate_clusters(n_genes=6, n_cells=12, num_clusters=2, program_structure='uniform', program_on_time_fraction=0.3, distribution='normal', mean_expression=10, min_expression=1, dispersion=1.0, global_non_specific_gene_fraction=0.1, pairwise_non_specific_gene_fraction=None, cluster_key='cluster', permute=False, seed=42)`

Simulates gene expression for discrete cell clusters.

Parameters:

Name	Type	Description	Default
`n_genes`	`int or list`	Number of genes per cluster or total genes. Defaults to 6.	`6`
`n_cells`	`int or list`	Number of cells per cluster or total cells. Defaults to 12.	`12`
`num_clusters`	`int`	Number of clusters to simulate. Defaults to 2.	`2`
`program_structure`	`str`	Expression program structure ('linear', 'uniform', etc.). Defaults to 'uniform'.	`'uniform'`
`program_on_time_fraction`	`float`	Fraction of program duration. Defaults to 0.3.	`0.3`
`distribution`	`str`	Type of distribution for gene expression. Defaults to 'normal'.	`'normal'`
`mean_expression`	`float`	Mean expression level. Defaults to 10.	`10`
`min_expression`	`float`	Minimum expression level. Defaults to 1.	`1`
`dispersion`	`float`	Dispersion in expression levels. Defaults to 1.0.	`1.0`
`global_non_specific_gene_fraction`	`float`	Fraction of globally expressed genes. Defaults to 0.1.	`0.1`
`pairwise_non_specific_gene_fraction`	`dict`	Pairwise-specific genes between cluster pairs. Defaults to None.	`None`
`cluster_key`	`str`	Key for cluster labeling. Defaults to 'cluster'.	`'cluster'`
`permute`	`bool`	Whether to shuffle cells. Defaults to False.	`False`
`seed`	`int`	Random seed. Defaults to 42.	`42`

Returns:

Name	Type	Description
`AnnData`		Simulated dataset with clustered gene expression.

`simulate_data()`

Simulates single-cell gene expression data, integrating state-based and batch effects.

Returns:

Name	Type	Description
`tuple`		adata (AnnData): Simulated gene expression data with batch effects. adata_pre (AnnData): Pre-batch effect simulated data.

`simulate_dropout(mtx, dropout_lambda=1.0, seed=None)` `staticmethod`

Simulates dropout in UMI counts based on the specified dropout lambda.

Parameters:

Name	Type	Description	Default
`mtx`	`ndarray`	The actual UMI counts matrix (genes x cells).	required
`dropout_lambda`	`float`	The lambda parameter controlling the dropout probability.	`1.0`
`seed`	`int`	Seed for the random number generator for reproducibility.	`None`

Returns:

Type	Description
	numpy.ndarray: The UMI counts matrix after applying dropout.
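
The reference does not give the formula that links `dropout_lambda` to the dropout probability. The sketch below assumes an exponential decay, `p_drop = exp(-dropout_lambda * count)`, so that low counts are dropped more often than high counts. Both the functional form and the name `dropout_sketch` are assumptions made purely for illustration.

```python
import math
import random


def dropout_sketch(mtx, dropout_lambda=1.0, seed=None):
    """Zero out entries with probability exp(-dropout_lambda * count)."""
    rng = random.Random(seed)
    out = []
    for row in mtx:
        new_row = []
        for count in row:
            p_drop = math.exp(-dropout_lambda * count)  # assumed functional form
            new_row.append(0 if rng.random() < p_drop else count)
        out.append(new_row)
    return out


umi = [[0, 1, 5], [2, 0, 40]]
with_dropout = dropout_sketch(umi, dropout_lambda=0.5, seed=0)
```

Under this assumed form, a larger `dropout_lambda` means fewer dropouts, and an entry is either kept unchanged or set to zero.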

`simulate_state()`

Simulates gene expression profiles for different cell states.

Returns:

Name	Type	Description
`AnnData`		An AnnData object containing simulated state-specific expression data.

`simulate_trajectory(n_genes=10, n_cells=100, cell_block_size_ratio=0.3, program_num=3, program_structure='linear', program_on_time_fraction=0.3, distribution='normal', mean_expression=10, min_expression=0, dispersion=1.0, seed=42, loop_to=None)`

Simulates a continuous trajectory of gene expression.

Parameters:

Name	Type	Description	Default
`n_genes`	`int`	Number of genes. Defaults to 10.	`10`
`n_cells`	`int`	Number of cells. Defaults to 100.	`100`
`cell_block_size_ratio`	`float`	Ratio of cell blocks. Defaults to 0.3.	`0.3`
`program_num`	`int`	Number of gene programs in the trajectory. Defaults to 3.	`3`
`program_structure`	`str`	Structure of gene programs ('linear', 'bidirectional'). Defaults to 'linear'.	`'linear'`
`program_on_time_fraction`	`float`	Fraction of time the program is on. Defaults to 0.3.	`0.3`
`distribution`	`str`	Distribution type. Defaults to 'normal'.	`'normal'`
`mean_expression`	`float`	Mean expression level. Defaults to 10.	`10`
`min_expression`	`float`	Minimum expression level. Defaults to 0.	`0`
`dispersion`	`float`	Dispersion of expression. Defaults to 1.0.	`1.0`
`seed`	`int`	Random seed. Defaults to 42.	`42`
`loop_to`	`int or list`	Defines looping relationships in the trajectory. Defaults to None.	`None`

Returns:

Name	Type	Description
`AnnData`		Simulated dataset with continuous gene expression patterns.

`simulate_tree(n_genes=10, n_cells=100, branching_factor=2, depth=3, program_structure='linear_increasing', program_on_time_fraction=0.3, program_gap_size=1, program_decay=0.5, cellcount_decay=1.0, distribution='normal', mean_expression=10, min_expression=0, dispersion=1.0, seed=42, noise_in_block=True)`

Simulates hierarchical branching gene expression patterns.

Parameters:

Name	Type	Description	Default
`n_genes`	`int`	Number of genes. Defaults to 10.	`10`
`n_cells`	`int`	Number of cells. Defaults to 100.	`100`
`branching_factor`	`int`	Number of branches per level. Defaults to 2.	`2`
`depth`	`int`	Depth of the branching tree. Defaults to 3.	`3`
`program_structure`	`str`	Gene program structure. Defaults to 'linear_increasing'.	`'linear_increasing'`
`program_on_time_fraction`	`float`	Program activation time fraction. Defaults to 0.3.	`0.3`
`program_gap_size`	`int`	Gap size between programs. Defaults to 1.	`1`
`program_decay`	`float`	Decay factor for program effects. Defaults to 0.5.	`0.5`
`cellcount_decay`	`float`	Decay factor for cell counts. Defaults to 1.0.	`1.0`
`distribution`	`str`	Expression distribution type. Defaults to 'normal'.	`'normal'`
`mean_expression`	`float`	Mean gene expression level. Defaults to 10.	`10`
`min_expression`	`float`	Minimum gene expression level. Defaults to 0.	`0`
`dispersion`	`float`	Dispersion of expression. Defaults to 1.0.	`1.0`
`seed`	`int`	Random seed. Defaults to 42.	`42`
`noise_in_block`	`bool`	Whether to add noise within expression blocks. Defaults to True.	`True`

Returns:

Name	Type	Description
`AnnData`		Simulated dataset with hierarchical tree-like gene expression.

`concord.utils.select_features(adata, n_top_features=2000, flavor='seurat_v3', filter_gene_by_counts=False, normalize=False, log1p=False, grouping='cluster', emb_key='X_pca', k=512, knn_samples=100, gini_cut_qt=None, save_path=None, figsize=(10, 3), subsample_frac=1.0, random_state=0)`

Selects top informative features from an AnnData object.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData object containing gene expression data.	required
`n_top_features`	`int`	Number of top features to select. Defaults to 2000.	`2000`
`flavor`	`str`	Feature selection method. Options: - 'seurat_v3': Highly variable gene selection based on Seurat v3. - 'iff': Uses Informative Feature Filtering (IFF) method. Defaults to "seurat_v3".	`'seurat_v3'`
`filter_gene_by_counts`	`Union[int, bool]`	Minimum count threshold for feature filtering. Defaults to False.	`False`
`normalize`	`bool`	Whether to normalize the data before feature selection. Defaults to False.	`False`
`log1p`	`bool`	Whether to apply log1p transformation before feature selection. Defaults to False.	`False`
`grouping`	`Union[str, Series, List[str]]`	Clustering/grouping strategy for IFF method. Defaults to 'cluster'.	`'cluster'`
`emb_key`	`str`	Embedding key in `adata.obsm` used for clustering. Defaults to 'X_pca'.	`'X_pca'`
`k`	`int`	Number of neighbors for k-NN if `grouping='knn'`. Defaults to 512.	`512`
`knn_samples`	`int`	Number of k-NN samples if `grouping='knn'`. Defaults to 100.	`100`
`gini_cut_qt`	`float`	Quantile threshold for selecting features by Gini coefficient in IFF. Defaults to None.	`None`
`save_path`	`Optional[Union[str, Path]]`	Path to save Gini coefficient plot. Defaults to None.	`None`
`figsize`	`tuple`	Size of Gini coefficient plot. Defaults to (10, 3).	`(10, 3)`
`subsample_frac`	`float`	Fraction of data to subsample for feature selection. Defaults to 1.0.	`1.0`
`random_state`	`int`	Random seed for reproducibility. Defaults to 0.	`0`

Returns:

Type	Description
`List[str]`	List of selected feature names.
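
The `gini_cut_qt` parameter refers to a Gini coefficient per feature. The details of the IFF method are not described in this reference, but the Gini coefficient itself is standard: it measures how unevenly a non-negative quantity is distributed. The sketch below computes it for one gene's mean expression across groups. Applying it to per-group means is an assumption made for this example, and `gini_sketch` is a hypothetical name.

```python
def gini_sketch(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Mean expression of two genes across four groups of cells.
marker_gene = [0.1, 0.1, 9.0, 0.2]        # concentrated in one group
housekeeping_gene = [5.0, 5.2, 4.9, 5.1]  # roughly even everywhere
marker_gini = gini_sketch(marker_gene)
housekeeping_gini = gini_sketch(housekeeping_gene)
```

A gene expressed in only a few groups scores close to the maximum, which is why a quantile cut on this score can favour group-specific features.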

`concord.utils.generate_synthetic_doublets(adata, doublet_synth_ratio, seed, batch_key, droplet_type_key, mean=0.5, var=0.1, clip_range=(0.2, 0.8), plot_histogram=True, combine_with_original=False)`

Generate synthetic doublets from singlet data in an AnnData object within each batch.

Parameters:

Name	Description	Default
`adata`	AnnData object containing the singlet data (possibly including unclassified doublets)	required
`doublet_synth_ratio`	float, the ratio of synthetic doublets to true singlets	required
`seed`	int, random seed for reproducibility	required
`batch_key`	str, the key in .obs indicating batch information	required
`droplet_type_key`	str, the key in .obs indicating droplet type	required
`mean`	float, mean of the normal distribution for generating fractions (default: 0.5)	`0.5`
`var`	float, variance of the normal distribution for generating fractions (default: 0.1)	`0.1`
`clip_range`	tuple, range to clip the generated fractions (default: (0.2, 0.8))	`(0.2, 0.8)`
`plot_histogram`	bool, whether to plot the histogram of synthetic doublet fractions	`True`

Returns:

Name	Type	Description
`adata_synthetic_doublets`		AnnData object containing the synthetic doublets
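
The `mean`, `var`, and `clip_range` parameters suggest that each synthetic doublet mixes two cells using a fraction drawn from a normal distribution and clipped to a range. The sketch below shows that mixing step for a single pair of expression vectors. It is a guess at the general idea, not the library's procedure, and it leaves out per-batch pairing and AnnData handling. `synth_doublet_sketch` is a hypothetical helper.

```python
import math
import random


def synth_doublet_sketch(cell_a, cell_b, rng, mean=0.5, var=0.1, clip_range=(0.2, 0.8)):
    """Mix two expression vectors with a clipped, normally distributed fraction."""
    frac = rng.gauss(mean, math.sqrt(var))  # var is a variance, so take its root
    frac = min(max(frac, clip_range[0]), clip_range[1])
    mixed = [frac * a + (1 - frac) * b for a, b in zip(cell_a, cell_b)]
    return frac, mixed


rng = random.Random(0)
cell_a = [10.0, 0.0, 3.0]
cell_b = [0.0, 8.0, 3.0]
draws = [synth_doublet_sketch(cell_a, cell_b, rng) for _ in range(200)]
fractions = [frac for frac, _ in draws]
```

Clipping keeps every synthetic doublet a real mixture: neither parent cell contributes less than the lower bound of `clip_range`.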

`concord.utils.list_adata_files(folder_path, substring=None, extension='*.h5ad')`

List all .h5ad files in a directory (recursively) that match a given substring.

Parameters:

Name	Description	Default
`folder_path`	str Path to the folder where `.h5ad` files are located.	required
`substring`	str, optional A substring to filter filenames (default is None, meaning no filtering).	`None`
`extension`	str, optional File extension to search for (default is "*.h5ad").	`'*.h5ad'`

Returns:

Type	Description
	list A list of file paths matching the criteria.
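
The behaviour described above (a recursive search by extension pattern with an optional substring filter) can be sketched with `pathlib`. The function below is an illustrative stand-in with the hypothetical name `list_files_sketch`. Matching the substring against the file name rather than the full path, and returning sorted paths, are assumptions.

```python
import tempfile
from pathlib import Path


def list_files_sketch(folder_path, substring=None, extension="*.h5ad"):
    """Recursively collect files matching a glob pattern, optionally filtered."""
    paths = sorted(str(p) for p in Path(folder_path).rglob(extension))
    if substring is not None:
        # Assumption: the substring is matched against the file name only.
        paths = [p for p in paths if substring in Path(p).name]
    return paths


# Build a throwaway directory tree to demonstrate the search.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a_liver.h5ad").touch()
    (Path(tmp) / "sub" / "b_lung.h5ad").touch()
    (Path(tmp) / "notes.txt").touch()
    all_files = [Path(p).name for p in list_files_sketch(tmp)]
    lung_files = [Path(p).name for p in list_files_sketch(tmp, substring="lung")]
```

The resulting list of paths has the shape that `read_and_concatenate_adata` expects for its `adata_files` argument.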

`concord.utils.read_and_concatenate_adata(adata_files, merge='unique', add_dataset_col=False, dataset_col_name='dataset', output_file=None)`

Read and concatenate multiple AnnData .h5ad files into a single AnnData object.

Parameters:

Name	Description	Default
`adata_files`	list List of file paths to `.h5ad` files to be concatenated.	required
`merge`	str, optional How to handle conflicting columns, e.g., 'unique' (default), 'first', etc.	`'unique'`
`add_dataset_col`	bool, optional Whether to add a new column in `adata.obs` identifying the source dataset.	`False`
`dataset_col_name`	str, optional Name of the new column storing dataset names.	`'dataset'`
`output_file`	str, optional Path to save the concatenated AnnData object. If None, the object is not saved.	`None`

Returns:

Type	Description
	ad.AnnData The concatenated AnnData object.

`concord.utils.filter_and_copy_attributes(adata_target, adata_source)`

Filter adata_target to match the cells in adata_source, then copy .obs and .obsm.

Parameters:

Name	Type	Description	Default
`adata_target`	`ad.AnnData`	The AnnData object to be filtered.	required
`adata_source`	`ad.AnnData`	The reference AnnData object containing the desired cells and attributes.	required

Returns:

Type	Description
	ad.AnnData The filtered AnnData object with updated `.obs` and `.obsm`.

`concord.utils.ensure_categorical(adata, obs_key=None, drop_unused=True)`

Convert an .obs column to categorical dtype.

Parameters:

Name	Description	Default
`adata`	ad.AnnData The AnnData object.	required
`obs_key`	str Column in `.obs` to be converted to categorical.	`None`
`drop_unused`	bool, optional Whether to remove unused categories (default is True).	`True`

`concord.utils.save_obsm_to_hdf5(adata, filename)`

Save the .obsm attribute of an AnnData object to an HDF5 file.

Parameters:

Name	Type	Description	Default
`adata`	`anndata.AnnData`	The AnnData object containing the `.obsm` attribute to be saved.	required
`filename`	`str`	The path to the HDF5 file where `.obsm` data will be stored.	required

Returns:

Type	Description
	None Saves `.obsm` data to the specified HDF5 file.

`concord.utils.load_obsm_from_hdf5(filename)`

Load the .obsm attribute from an HDF5 file.

Parameters:

Name	Type	Description	Default
`filename`	`str`	Path to the HDF5 file containing `.obsm` data.	required

Returns:

Type	Description
	dict A dictionary where keys are `.obsm` names and values are corresponding matrices.

`concord.utils.subset_adata_to_obsm_indices(adata, obsm)`

Subset an AnnData object to match the indices present in .obsm.

Parameters:

Name	Type	Description	Default
`adata`	`anndata.AnnData`	The original AnnData object.	required
`obsm`	`dict`	A dictionary containing `.obsm` data, where keys are embedding names and values are arrays.	required

Returns:

Type	Description
	anndata.AnnData A subsetted AnnData object that contains only the indices available in `.obsm`.

`concord.utils.anndata_to_viscello(adata, output_dir, project_name='MyProject', organism='hsa', clist_only=False)`

Converts an AnnData object to a VisCello project directory.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData object containing single-cell data.	required
`output_dir`	`str`	Directory where the VisCello project will be created.	required
`project_name`	`str`	Name of the project. Defaults to "MyProject".	`'MyProject'`
`organism`	`str`	Organism code (e.g., 'hsa' for human). Defaults to 'hsa'.	`'hsa'`
`clist_only`	`bool`	Whether to generate only the clist file. Defaults to False.	`False`

Returns:

Type	Description
	None

Side Effects

Creates a directory with the necessary files for VisCello.
Saves eset.rds (ExpressionSet), config.yml, and clist.rds.

`concord.utils.update_clist_with_subsets(global_adata, adata_subsets, viscello_dir, cluster_key=None)`

Updates an existing VisCello clist with new subsets.

Parameters:

Name	Type	Description	Default
`global_adata`	`AnnData`	The full AnnData object.	required
`adata_subsets`	`dict`	Dictionary mapping subset names to AnnData objects.	required
`viscello_dir`	`str`	Path to the existing VisCello directory.	required
`cluster_key`	`str`	Key in `adata.obs` for cluster assignments. Defaults to None.	`None`

Returns:

Type	Description
	None

Side Effects

Reads the existing clist.rds file from viscello_dir.
Adds new subsets as Cello objects to the clist.
Saves the updated clist.rds file in viscello_dir.

