
Utilities

The utils submodule is also available under the shorter alias ul; for example, concord.utils.list_adata_files can be called as concord.ul.list_adata_files.

concord.utils.benchmark_topology(diagrams, expected_betti_numbers=[1, 0, 0], n_bins=100, save_dir=None, file_suffix=None)

Benchmark the topological properties of persistence diagrams.

Parameters:

Name Type Description Default
diagrams

dict A dictionary where keys are method names and values are persistence diagrams.

required
expected_betti_numbers

list, optional A list specifying the expected Betti numbers for different homology dimensions. Default is [1, 0, 0].

[1, 0, 0]
n_bins

int, optional Number of bins to use for Betti curve calculations. Default is 100.

100
save_dir

str, optional Directory to save benchmarking results as CSV files. If None, results are not saved.

None
file_suffix

str, optional Suffix to append to saved filenames.

None

Returns:

Type Description

dict A dictionary containing:
  • 'betti_stats': DataFrame summarizing Betti statistics.
  • 'distance_metrics': DataFrame of computed distances between Betti curves.
  • 'combined_metrics': DataFrame of entropy, variance, and L1 distance metrics.

concord.utils.compute_persistent_homology(adata, key='X_pca', homology_dimensions=[0, 1, 2])

Computes persistent homology using Vietoris-Rips complex.

Parameters:

Name Type Description Default
adata

anndata.AnnData The AnnData object containing the data.

required
key

str, optional The key in adata.obsm specifying the embedding to use. Default is 'X_pca'.

'X_pca'
homology_dimensions

list, optional List of homology dimensions to compute. Default is [0, 1, 2].

[0, 1, 2]

Returns:

Type Description

np.ndarray Persistence diagrams representing homology classes across filtration values.

concord.utils.compute_betti_median_or_mode(betti_values, statistic='median')

Computes the median or mode of Betti numbers.

Parameters:

Name Type Description Default
betti_values

np.ndarray Array of Betti numbers across filtration values.

required
statistic

str, optional Statistic to compute ('median' or 'mode'). Default is 'median'.

'median'

Returns:

Type Description

float The computed median or mode of the Betti numbers.
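As a rough illustration of what this summary amounts to, the following standard-library sketch reduces a Betti curve to its median or its most frequent value. The helper name median_or_mode and the tie-breaking rule (smallest value wins) are assumptions made for this example, not the library's implementation.

```python
from statistics import median, multimode


def median_or_mode(betti_values, statistic="median"):
    """Summarize a Betti curve by its median or its most frequent value."""
    values = list(betti_values)
    if statistic == "median":
        return float(median(values))
    if statistic == "mode":
        # multimode returns every most-frequent value; on ties this
        # sketch picks the smallest one (an arbitrary choice).
        return float(min(multimode(values)))
    raise ValueError("statistic must be 'median' or 'mode'")


# A toy Betti-0 curve: many components early, one component for most
# of the filtration, none at the end.
curve = [5, 3, 1, 1, 1, 1, 0, 0]
print(median_or_mode(curve, "median"))
print(median_or_mode(curve, "mode"))
```

The mode tends to be the more robust choice when a curve spends most of the filtration at a single value, since it ignores the short transient at small scales.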

concord.utils.compute_betti_entropy(betti_values)

Computes the entropy of the Betti curve.

Parameters:

Name Type Description Default
betti_values

np.ndarray Array of Betti numbers across filtration values.

required

Returns:

Type Description

float The entropy of the Betti curve.
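The documentation does not say which entropy is used. One plausible reading, sketched below with the standard library only, is the Shannon entropy of the Betti curve after normalizing it to sum to one. The function name betti_curve_entropy and the choice of natural logarithm are assumptions for illustration.

```python
import math


def betti_curve_entropy(betti_values):
    """Shannon entropy (natural log) of a Betti curve treated as a
    probability distribution over filtration bins."""
    total = sum(betti_values)
    if total == 0:
        # An all-zero curve carries no mass; report zero entropy.
        return 0.0
    probs = [b / total for b in betti_values if b > 0]
    return -sum(p * math.log(p) for p in probs)


flat = [2, 2, 2, 2]      # mass spread evenly over four bins
spiky = [0, 0, 7, 0]     # all mass in a single bin
print(betti_curve_entropy(flat), betti_curve_entropy(spiky))
```

Under this definition a curve that is concentrated in few bins scores low, while a curve spread evenly across the filtration scores high.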

concord.utils.interpolate_betti_curve(betti_values, original_sampling, common_sampling)

Interpolates Betti curve onto a common filtration grid.

Parameters:

Name Type Description Default
betti_values

np.ndarray Array of Betti numbers.

required
original_sampling

np.ndarray The original filtration values associated with the Betti numbers.

required
common_sampling

np.ndarray The target filtration values for interpolation.

required

Returns:

Type Description

np.ndarray Interpolated Betti curve.
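To make the resampling step concrete, here is a minimal sketch that linearly interpolates a curve onto a new grid and clamps values outside the original range. It assumes original_sampling is strictly increasing; the helper name interpolate_curve is hypothetical, and the library may use a different scheme (for example step-wise interpolation, since Betti curves are piecewise constant).

```python
from bisect import bisect_right


def interpolate_curve(values, original_sampling, common_sampling):
    """Linearly interpolate `values` from one grid onto another."""
    out = []
    for x in common_sampling:
        if x <= original_sampling[0]:
            out.append(float(values[0]))       # clamp on the left
        elif x >= original_sampling[-1]:
            out.append(float(values[-1]))      # clamp on the right
        else:
            # j is the first grid index strictly greater than x.
            j = bisect_right(original_sampling, x)
            x0, x1 = original_sampling[j - 1], original_sampling[j]
            y0, y1 = values[j - 1], values[j]
            out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out


resampled = interpolate_curve([4, 2, 0], [0.0, 1.0, 2.0], [0.0, 0.5, 1.0, 1.5, 3.0])
print(resampled)  # [4.0, 3.0, 2.0, 1.0, 0.0]
```

Putting every method's curve on one shared grid is what makes bin-by-bin comparisons between methods meaningful.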

concord.utils.compute_betti_statistics(diagram, expected_betti_numbers, n_bins=100)

Computes Betti statistics given a persistence diagram.

Parameters:

Name Type Description Default
diagram

np.ndarray Persistence diagram from Giotto-TDA.

required
expected_betti_numbers

np.ndarray Expected Betti numbers for different homology dimensions.

required
n_bins

int, optional Number of bins for the Betti curve computation. Default is 100.

100

Returns:

Type Description

dict A dictionary containing:
  • 'betti_stats': Dictionary of Betti statistics.
  • 'observed_betti_numbers': Observed Betti numbers.
  • 'expected_betti_numbers': Expected Betti numbers.
  • 'l1_distance': L1 distance between observed and expected Betti numbers.
  • 'l2_distance': L2 distance between observed and expected Betti numbers.
  • 'total_relative_error': Total relative error.
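The distance entries in the returned dictionary can be illustrated with a few lines of plain Python. The L1 and L2 distances are standard; the relative-error formula here (absolute difference divided by max(expected, 1), summed over dimensions) is only an assumed definition, since the documentation does not spell it out. The function name betti_distances is hypothetical.

```python
import math


def betti_distances(observed, expected):
    """Compare observed and expected Betti numbers per homology dimension."""
    diffs = [o - e for o, e in zip(observed, expected)]
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))
    # Assumed definition: guard against division by zero when the
    # expected Betti number is 0 by dividing by at least 1.
    rel = sum(abs(d) / max(e, 1) for d, e in zip(diffs, expected))
    return {"l1_distance": l1, "l2_distance": l2, "total_relative_error": rel}


# Expecting one connected component and no loops or voids, but
# observing two spurious loops.
print(betti_distances(observed=[1, 2, 0], expected=[1, 0, 0]))
```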

concord.utils.summarize_betti_statistics(betti_stats)

Summarizes Betti statistics into pandas DataFrames.

Parameters:

Name Type Description Default
betti_stats

dict Dictionary containing Betti statistics for different methods.

required

Returns:

Type Description

tuple
  • betti_stats_pivot: DataFrame of Betti statistics.
  • distance_metrics_df: DataFrame of distance metrics.

concord.utils.benchmark_geometry(adata, keys, eval_metrics=['pseudotime', 'cell_distance_corr', 'local_distal_corr', 'trustworthiness', 'state_distance_corr', 'state_dispersion_corr', 'state_batch_distance_ratio'], dist_metric='cosine', groundtruth_key='PCA_no_noise', state_key='cluster', batch_key='batch', groundtruth_dispersion=None, ground_truth_dispersion_key='wt_noise', corr_types=['pearsonr', 'spearmanr', 'kendalltau'], trustworthiness_n_neighbors=np.arange(10, 101, 10), dispersion_metric='var', return_type='dataframe', local_percentile=0.1, distal_percentile=0.9, start_point=0, end_point=None, pseudotime_k=30, truetime_key='time', verbose=True, save_dir=None, file_suffix=None)

Benchmark the geometric properties of different embeddings.

Parameters:

Name Type Description Default
adata

anndata.AnnData The AnnData object containing cell embeddings.

required
keys

list List of embeddings (keys in adata.obsm) to evaluate.

required
eval_metrics

list, optional Metrics to compute, such as 'pseudotime', 'cell_distance_corr', etc. Default includes multiple metrics.

['pseudotime', 'cell_distance_corr', 'local_distal_corr', 'trustworthiness', 'state_distance_corr', 'state_dispersion_corr', 'state_batch_distance_ratio']
dist_metric

str, optional Distance metric for computing cell distances. Default is 'cosine'.

'cosine'
groundtruth_key

str, optional Key in adata.obsm containing the ground truth embedding. Default is 'PCA_no_noise'.

'PCA_no_noise'
state_key

str, optional Key in adata.obs representing cell states or clusters.

'cluster'
batch_key

str, optional Key in adata.obs representing batch information.

'batch'
groundtruth_dispersion

dict, optional Precomputed dispersion values for ground truth, if available.

None
ground_truth_dispersion_key

str, optional Key used when computing dispersion correlations. Default is 'wt_noise'.

'wt_noise'
corr_types

list, optional List of correlation methods to compute. Default includes 'pearsonr', 'spearmanr', and 'kendalltau'.

['pearsonr', 'spearmanr', 'kendalltau']
trustworthiness_n_neighbors

np.ndarray, optional Range of neighborhood sizes for trustworthiness computation. Default is np.arange(10, 101, 10).

arange(10, 101, 10)
dispersion_metric

str, optional Metric to compute dispersion, e.g., 'var' (variance). Default is 'var'.

'var'
return_type

str, optional If 'dataframe', returns summary statistics; if 'full', returns additional details. Default is 'dataframe'.

'dataframe'
local_percentile

float, optional Percentile threshold for local distance correlations. Default is 0.1.

0.1
distal_percentile

float, optional Percentile threshold for distal distance correlations. Default is 0.9.

0.9
start_point

int, optional Index of the starting cell for pseudotime computation. Default is 0.

0
end_point

int, optional Index of the ending cell for pseudotime computation. Defaults to None, so it must be set explicitly for pseudotime computation.

None
pseudotime_k

int, optional Number of neighbors used in k-NN graph for pseudotime computation. Default is 30.

30
truetime_key

str, optional Key in adata.obs representing ground truth time. Default is 'time'.

'time'
verbose

bool, optional Whether to enable logging. Default is True.

True
save_dir

str, optional Directory to save benchmarking results. If None, results are not saved.

None
file_suffix

str, optional Suffix for saved filenames.

None

Returns:

Type Description

pd.DataFrame or tuple If return_type='dataframe', returns a DataFrame summarizing benchmark results. If return_type='full', returns both the DataFrame and a detailed results dictionary.

concord.utils.pairwise_distance(adata, keys, metric='cosine')

concord.utils.local_vs_distal_corr(X_high, X_low, local_percentile=25, distal_percentile=75, method='pearsonr')

Computes correlation between local and distal pairwise distances.

Parameters:

Name Type Description Default
X_high ndarray

High-dimensional data matrix.

required
X_low ndarray

Low-dimensional embedding matrix.

required
local_percentile int

Percentile threshold for local distances. Defaults to 25.

25
distal_percentile int

Percentile threshold for distal distances. Defaults to 75.

75
method str

Correlation method; 'pearsonr', 'spearmanr', or 'kendalltau'. Defaults to 'pearsonr'.

'pearsonr'

Returns:

Name Type Description
float

Correlation for local distances.

float

Correlation for distal distances.
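One way to read this metric is sketched below: compute all pairwise distances in both spaces, use percentiles of the high-dimensional distances to pick the "local" and "distal" pairs, and correlate the two distance lists within each group. This is an assumption about the procedure, not the library's code; it uses Euclidean distance and Pearson correlation only, and all helper names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def percentile(sorted_vals, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def local_distal_corr(X_high, X_low, local_percentile=25, distal_percentile=75):
    pairs = list(combinations(range(len(X_high)), 2))
    d_high = [math.dist(X_high[i], X_high[j]) for i, j in pairs]
    d_low = [math.dist(X_low[i], X_low[j]) for i, j in pairs]
    ordered = sorted(d_high)
    local_cut = percentile(ordered, local_percentile)
    distal_cut = percentile(ordered, distal_percentile)
    # Split pairs by their distance in the high-dimensional space.
    local = [(h, l) for h, l in zip(d_high, d_low) if h <= local_cut]
    distal = [(h, l) for h, l in zip(d_high, d_low) if h >= distal_cut]
    return pearson(*zip(*local)), pearson(*zip(*distal))
```

An embedding that is a uniformly scaled copy of the original data would score 1.0 on both values; embeddings that preserve neighborhoods but distort global layout would typically score higher on the local value than on the distal one.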

concord.utils.compute_state_batch_distance_ratio(adata, basis='X_latent', batch_key='batch', state_key='cluster', metric='cosine')

Computes the Batch-to-State Distance Ratio using centroids to evaluate batch correction.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing latent embeddings.

required
basis str

Key for latent embeddings in adata.obsm. Defaults to 'X_latent'.

'X_latent'
batch_key str

Key for batch labels in adata.obs. Defaults to 'batch'.

'batch'
state_key str

Key for cell state labels in adata.obs. Defaults to 'cluster'.

'cluster'
metric str

Distance metric to use, e.g., 'cosine' or 'euclidean'. Defaults to 'cosine'.

'cosine'

Returns:

Name Type Description
float

Ratio of average batch distance to average state distance.
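A minimal sketch of a centroid-based ratio follows. It takes the simplest reading of the description: the mean pairwise distance between batch centroids divided by the mean pairwise distance between state centroids. The real function may instead compare batch centroids within each state, and its documented default metric is cosine, whereas this sketch uses Euclidean distance. Helper names are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def mean_centroid_distance(embedding, labels):
    """Mean pairwise Euclidean distance between per-label centroids."""
    groups = {}
    for point, label in zip(embedding, labels):
        groups.setdefault(label, []).append(point)
    centroids = [centroid(pts) for pts in groups.values()]
    dists = [math.dist(a, b) for a, b in combinations(centroids, 2)]
    return sum(dists) / len(dists)


def batch_to_state_ratio(embedding, batches, states):
    return (mean_centroid_distance(embedding, batches)
            / mean_centroid_distance(embedding, states))


embedding = [(0, 0), (0, 1), (10, 0), (10, 1)]
states = ["A", "A", "B", "B"]
batches = ["b1", "b2", "b1", "b2"]
print(batch_to_state_ratio(embedding, batches, states))
```

A small ratio suggests that batches sit close together relative to the separation between cell states, which is the outcome batch correction aims for.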

concord.utils.compute_trustworthiness(adata, embedding_keys, groundtruth, metric='euclidean', n_neighbors=10)

Evaluates trustworthiness of embeddings in an AnnData object.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing embeddings in adata.obsm.

required
embedding_keys list

List of keys in adata.obsm to evaluate (e.g., ['X_umap', 'X_tsne']).

required
groundtruth str or ndarray

Key in adata.obsm or adata.layers for ground truth data, or a precomputed matrix.

required
metric str

Distance metric for trustworthiness calculation, e.g., 'euclidean' or 'cosine'. Defaults to 'euclidean'.

'euclidean'
n_neighbors int or list

Neighborhood sizes for trustworthiness evaluation. Defaults to 10.

10

Returns:

Type Description

pandas.DataFrame: Trustworthiness scores for each embedding at each neighborhood size.

pandas.DataFrame: Summary statistics with average trustworthiness and decay rate.

concord.utils.Simulation

A class for simulating single-cell gene expression data with various structures and batch effects.

Parameters:

Name Type Description Default
n_cells int

Number of cells to simulate. Defaults to 1000.

1000
n_genes int

Number of genes to simulate. Defaults to 1000.

1000
n_batches int

Number of batches to simulate. Defaults to 2.

2
n_states int

Number of states (e.g., clusters, trajectories). Defaults to 3.

3
state_type str

Type of state to simulate; options include 'cluster', 'trajectory', 'tree', etc. Defaults to 'cluster'.

'cluster'
batch_type str or list

Type of batch effect; options include 'batch_specific_features', 'variance_inflation', etc. Defaults to 'batch_specific_features'.

'batch_specific_features'
state_distribution str

Distribution type for states; e.g., 'normal', 'poisson'. Defaults to 'normal'.

'normal'
state_level float

Mean expression level for states. Defaults to 1.0.

1.0
state_min_level float

Minimum expression level. Defaults to 0.0.

0.0
state_dispersion float

Dispersion of state expression. Defaults to 0.1.

0.1
program_structure str

Gene expression program structure; e.g., 'linear', 'bidirectional'. Defaults to "linear".

'linear'
program_on_time_fraction float

Fraction of time the program is on. Defaults to 0.3.

0.3
program_gap_size int

Size of gaps in expression programs. Defaults to 1.

1
program_noise_in_block bool

Whether to add noise within each expression block. Defaults to True.

True
trajectory_program_num int

Number of programs in a trajectory simulation. Defaults to 3.

3
trajectory_cell_block_size_ratio float

Ratio of cell block sizes in a trajectory. Defaults to 0.3.

0.3
trajectory_loop_to int or list

Loop connection in trajectory simulations. Defaults to None.

None
tree_branching_factor int

Number of branches per tree level. Defaults to 2.

2
tree_depth int

Depth of the simulated tree. Defaults to 3.

3
tree_program_decay float

Decay factor for tree programs across branches. Defaults to 0.5.

0.5
tree_cellcount_decay float

Decay factor for cell numbers across tree branches. Defaults to 1.0.

1.0
batch_distribution str or list

Distribution for batch effects. Defaults to 'normal'.

'normal'
batch_level float or list

Magnitude of batch effects. Defaults to 1.0.

1.0
batch_dispersion float or list

Dispersion of batch effects. Defaults to 0.1.

0.1
batch_cell_proportion list

Proportion of cells per batch. Defaults to None.

None
batch_feature_frac float or list

Fraction of genes affected by batch effects. Defaults to 0.1.

0.1
global_non_specific_gene_fraction float

Fraction of genes that are globally non-specific. Defaults to 0.1.

0.1
pairwise_non_specific_gene_fraction dict

Pairwise-specific gene fraction between state pairs. Defaults to None.

None
universal_gene_fraction float

Fraction of universal genes expressed across all cells. Defaults to 0.0.

0.0
non_neg bool

Whether to enforce non-negative expression values. Defaults to False.

False
to_int bool

Whether to convert expression values to integers. Defaults to False.

False
seed int

Random seed for reproducibility. Defaults to 0.

0

Methods:

Name Description
simulate_data

Simulates gene expression data, including batch effects.

simulate_state

Simulates cell state-specific gene expression patterns.

simulate_batch

Simulates batch-specific effects on gene expression.

simulate_clusters

Simulates gene expression in discrete clusters.

simulate_trajectory

Simulates continuous gene expression trajectories.

simulate_tree

Simulates hierarchical branching gene expression.

simulate_gatto

Simulates expression patterns similar to Gatto et al., 2023.

simulate_s_curve

Simulates an S-curve structure in gene expression.

simulate_swiss_roll

Simulates a Swiss roll structure with optional hole.

simulate_expression_block

Generates structured gene expression within a cell population.

simulate_dropout

Simulates dropout in gene expression data.

downsample_mtx_umi

Performs UMI count downsampling.

simulate_distribution

Samples values from specified distributions.

downsample_mtx_umi(mtx, ratio=0.1, seed=1) staticmethod

Simulates downsampling of a gene expression matrix (UMI counts) by a given ratio.

Parameters:

Name Type Description Default
mtx ndarray

The input matrix where rows represent genes and columns represent cells.

required
ratio float

The downsampling ratio (default 0.1).

0.1
seed int

Random seed for reproducibility (default 1).

1

Returns:

Type Description

numpy.ndarray: The downsampled matrix.
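Downsampling UMI counts is commonly done by binomial thinning: each individual UMI is kept independently with probability ratio. The sketch below shows that idea on nested lists with the standard library; it is an assumption that the method works this way, and the function name downsample_counts is hypothetical.

```python
import random


def downsample_counts(mtx, ratio=0.1, seed=1):
    """Keep each UMI independently with probability `ratio`."""
    rng = random.Random(seed)
    return [
        # Summing booleans counts how many of the `count` UMIs survive.
        [sum(rng.random() < ratio for _ in range(count)) for count in row]
        for row in mtx
    ]


counts = [[10, 0, 3], [5, 20, 1]]   # genes x cells
thinned = downsample_counts(counts, ratio=0.5, seed=1)
print(thinned)
```

Thinned counts never exceed the originals, zeros stay zero, and a fixed seed makes the result reproducible.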

rnegbin(mu, theta, size) staticmethod

Generate random numbers from a negative binomial distribution.

Parameters: mu: Mean of the distribution. theta: Dispersion parameter. size: Number of random numbers to generate.
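A negative binomial with mean mu and dispersion theta can be drawn as a gamma-Poisson mixture: sample a rate from Gamma(shape=theta, scale=mu/theta), then sample a Poisson count with that rate, giving variance mu + mu²/theta. The sketch below assumes this parameterization (the documentation does not state it) and uses only the standard library; rnegbin_sketch and poisson are hypothetical names.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's multiplication method; suitable for small to moderate lam."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def rnegbin_sketch(mu, theta, size, seed=0):
    """Draw `size` negative binomial samples via a gamma-Poisson mixture."""
    rng = random.Random(seed)
    # gammavariate(alpha, beta) has mean alpha * beta, here equal to mu.
    return [poisson(rng, rng.gammavariate(theta, mu / theta)) for _ in range(size)]


draws = rnegbin_sketch(mu=5.0, theta=2.0, size=5000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)  # mean near mu, variance clearly above the mean
```

Smaller theta gives stronger overdispersion; as theta grows, the distribution approaches a Poisson with mean mu.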

simulate_batch(adata, cell_indices=None, cell_proportion=0.3, batch_name='batch_1', effect_type='batch_specific_features', distribution='normal', level=1.0, dispersion=0.1, batch_feature_frac=0.1, seed=42)

Applies batch-specific effects to an existing simulated dataset.

Parameters:

Name Type Description Default
adata AnnData

Base dataset to apply batch effects.

required
cell_indices array-like

Indices of cells to modify. Defaults to None.

None
cell_proportion float

Proportion of cells affected. Defaults to 0.3.

0.3
batch_name str

Name of the batch. Defaults to 'batch_1'.

'batch_1'
effect_type str

Type of batch effect (e.g., 'batch_specific_features', 'variance_inflation'). Defaults to 'batch_specific_features'.

'batch_specific_features'
distribution str

Distribution type for batch effects (e.g., 'normal'). Defaults to 'normal'.

'normal'
level float

Effect level (e.g., scaling factor). Defaults to 1.0.

1.0
dispersion float

Dispersion of batch effects. Defaults to 0.1.

0.1
batch_feature_frac float

Fraction of genes affected by batch effects. Defaults to 0.1.

0.1
seed int

Random seed for reproducibility. Defaults to 42.

42

Returns:

Name Type Description
tuple
  • batch_adata (AnnData): Modified dataset with batch effects.
  • batch_adata_pre (AnnData): Dataset before applying batch effects.

simulate_clusters(n_genes=6, n_cells=12, num_clusters=2, program_structure='uniform', program_on_time_fraction=0.3, distribution='normal', mean_expression=10, min_expression=1, dispersion=1.0, global_non_specific_gene_fraction=0.1, pairwise_non_specific_gene_fraction=None, cluster_key='cluster', permute=False, seed=42)

Simulates gene expression for discrete cell clusters.

Parameters:

Name Type Description Default
n_genes int or list

Number of genes per cluster or total genes. Defaults to 6.

6
n_cells int or list

Number of cells per cluster or total cells. Defaults to 12.

12
num_clusters int

Number of clusters to simulate. Defaults to 2.

2
program_structure str

Expression program structure ('linear', 'uniform', etc.). Defaults to 'uniform'.

'uniform'
program_on_time_fraction float

Fraction of program duration. Defaults to 0.3.

0.3
distribution str

Type of distribution for gene expression. Defaults to 'normal'.

'normal'
mean_expression float

Mean expression level. Defaults to 10.

10
min_expression float

Minimum expression level. Defaults to 1.

1
dispersion float

Dispersion in expression levels. Defaults to 1.0.

1.0
global_non_specific_gene_fraction float

Fraction of globally expressed genes. Defaults to 0.1.

0.1
pairwise_non_specific_gene_fraction dict

Pairwise-specific genes between cluster pairs. Defaults to None.

None
cluster_key str

Key for cluster labeling. Defaults to 'cluster'.

'cluster'
permute bool

Whether to shuffle cells. Defaults to False.

False
seed int

Random seed. Defaults to 42.

42

Returns:

Name Type Description
AnnData

Simulated dataset with clustered gene expression.

simulate_data()

Simulates single-cell gene expression data, integrating state-based and batch effects.

Returns:

Name Type Description
tuple
  • adata (AnnData): Simulated gene expression data with batch effects.
  • adata_pre (AnnData): Pre-batch effect simulated data.

simulate_dropout(mtx, dropout_lambda=1.0, seed=None) staticmethod

Simulates dropout in UMI counts based on the specified dropout lambda.

Parameters:

Name Type Description Default
mtx ndarray

The actual UMI counts matrix (genes x cells).

required
dropout_lambda float

The lambda parameter controlling the dropout probability.

1.0
seed int

Seed for the random number generator for reproducibility.

None

Returns:

Type Description

numpy.ndarray: The UMI counts matrix after applying dropout.

simulate_state()

Simulates gene expression profiles for different cell states.

Returns:

Name Type Description
AnnData

An AnnData object containing simulated state-specific expression data.

simulate_trajectory(n_genes=10, n_cells=100, cell_block_size_ratio=0.3, program_num=3, program_structure='linear', program_on_time_fraction=0.3, distribution='normal', mean_expression=10, min_expression=0, dispersion=1.0, seed=42, loop_to=None)

Simulates a continuous trajectory of gene expression.

Parameters:

Name Type Description Default
n_genes int

Number of genes. Defaults to 10.

10
n_cells int

Number of cells. Defaults to 100.

100
cell_block_size_ratio float

Ratio of cell blocks. Defaults to 0.3.

0.3
program_num int

Number of gene programs in the trajectory. Defaults to 3.

3
program_structure str

Structure of gene programs ('linear', 'bidirectional'). Defaults to 'linear'.

'linear'
program_on_time_fraction float

Fraction of time the program is on. Defaults to 0.3.

0.3
distribution str

Distribution type. Defaults to 'normal'.

'normal'
mean_expression float

Mean expression level. Defaults to 10.

10
min_expression float

Minimum expression level. Defaults to 0.

0
dispersion float

Dispersion of expression. Defaults to 1.0.

1.0
seed int

Random seed. Defaults to 42.

42
loop_to int or list

Defines looping relationships in the trajectory. Defaults to None.

None

Returns:

Name Type Description
AnnData

Simulated dataset with continuous gene expression patterns.

simulate_tree(n_genes=10, n_cells=100, branching_factor=2, depth=3, program_structure='linear_increasing', program_on_time_fraction=0.3, program_gap_size=1, program_decay=0.5, cellcount_decay=1.0, distribution='normal', mean_expression=10, min_expression=0, dispersion=1.0, seed=42, noise_in_block=True)

Simulates hierarchical branching gene expression patterns.

Parameters:

Name Type Description Default
n_genes int

Number of genes. Defaults to 10.

10
n_cells int

Number of cells. Defaults to 100.

100
branching_factor int

Number of branches per level. Defaults to 2.

2
depth int

Depth of the branching tree. Defaults to 3.

3
program_structure str

Gene program structure. Defaults to 'linear_increasing'.

'linear_increasing'
program_on_time_fraction float

Program activation time fraction. Defaults to 0.3.

0.3
program_gap_size int

Gap size between programs. Defaults to 1.

1
program_decay float

Decay factor for program effects. Defaults to 0.5.

0.5
cellcount_decay float

Decay factor for cell counts. Defaults to 1.0.

1.0
distribution str

Expression distribution type. Defaults to 'normal'.

'normal'
mean_expression float

Mean gene expression level. Defaults to 10.

10
min_expression float

Minimum gene expression level. Defaults to 0.

0
dispersion float

Dispersion of expression. Defaults to 1.0.

1.0
seed int

Random seed. Defaults to 42.

42
noise_in_block bool

Whether to add noise within expression blocks. Defaults to True.

True

Returns:

Name Type Description
AnnData

Simulated dataset with hierarchical tree-like gene expression.

concord.utils.select_features(adata, n_top_features=2000, flavor='seurat_v3', filter_gene_by_counts=False, normalize=False, log1p=False, grouping='cluster', emb_key='X_pca', k=512, knn_samples=100, gini_cut_qt=None, save_path=None, figsize=(10, 3), subsample_frac=1.0, random_state=0)

Selects top informative features from an AnnData object.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing gene expression data.

required
n_top_features int

Number of top features to select. Defaults to 2000.

2000
flavor str

Feature selection method. Defaults to "seurat_v3". Options:
  • 'seurat_v3': Highly variable gene selection based on Seurat v3.
  • 'iff': Uses Informative Feature Filtering (IFF) method.

'seurat_v3'
filter_gene_by_counts Union[int, bool]

Minimum count threshold for feature filtering. Defaults to False.

False
normalize bool

Whether to normalize the data before feature selection. Defaults to False.

False
log1p bool

Whether to apply log1p transformation before feature selection. Defaults to False.

False
grouping Union[str, Series, List[str]]

Clustering/grouping strategy for IFF method. Defaults to 'cluster'.

'cluster'
emb_key str

Embedding key in adata.obsm used for clustering. Defaults to 'X_pca'.

'X_pca'
k int

Number of neighbors for k-NN if grouping='knn'. Defaults to 512.

512
knn_samples int

Number of k-NN samples if grouping='knn'. Defaults to 100.

100
gini_cut_qt float

Quantile threshold for selecting features by Gini coefficient in IFF. Defaults to None.

None
save_path Optional[Union[str, Path]]

Path to save Gini coefficient plot. Defaults to None.

None
figsize tuple

Size of Gini coefficient plot. Defaults to (10, 3).

(10, 3)
subsample_frac float

Fraction of data to subsample for feature selection. Defaults to 1.0.

1.0
random_state int

Random seed for reproducibility. Defaults to 0.

0

Returns:

Type Description
List[str]

List of selected feature names.

concord.utils.generate_synthetic_doublets(adata, doublet_synth_ratio, seed, batch_key, droplet_type_key, mean=0.5, var=0.1, clip_range=(0.2, 0.8), plot_histogram=True, combine_with_original=False)

Generate synthetic doublets from singlet data in an AnnData object within each batch.

Parameters:

Name Type Description Default
adata

AnnData object containing the singlet data (possibly including unclassified doublets)

required
doublet_synth_ratio

float, the ratio of synthetic doublets to true singlets

required
seed

int, random seed for reproducibility

required
batch_key

str, the key in .obs indicating batch information

required
droplet_type_key

str, the key in .obs indicating droplet type

required
mean

float, mean of the normal distribution for generating fractions (default: 0.5)

0.5
var

float, variance of the normal distribution for generating fractions (default: 0.1)

0.1
clip_range

tuple, range to clip the generated fractions (default: (0.2, 0.8))

(0.2, 0.8)
plot_histogram

bool, whether to plot the histogram of synthetic doublet fractions

True

Returns:

Name Type Description
adata_synthetic_doublets

AnnData object containing the synthetic doublets
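The mean, var, and clip_range parameters suggest that each synthetic doublet mixes two singlets with a fraction drawn from a clipped normal distribution. The sketch below illustrates that idea on plain lists; the weighted-sum mixing rule and both helper names are assumptions, not the library's implementation.

```python
import random


def sample_mixing_fraction(rng, mean=0.5, var=0.1, clip_range=(0.2, 0.8)):
    """Draw a mixing fraction from a normal distribution, then clip it."""
    # `var` is a variance, so the standard deviation is its square root.
    frac = rng.gauss(mean, var ** 0.5)
    low, high = clip_range
    return min(max(frac, low), high)


def make_doublet(cell_a, cell_b, frac):
    """Combine two expression profiles as a weighted sum."""
    return [frac * a + (1 - frac) * b for a, b in zip(cell_a, cell_b)]


rng = random.Random(0)
frac = sample_mixing_fraction(rng)
print(frac, make_doublet([10, 0, 4], [0, 10, 4], frac))
```

Clipping keeps every doublet a genuine mixture: with the default range, neither parent contributes less than 20 percent.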

concord.utils.list_adata_files(folder_path, substring=None, extension='*.h5ad')

List all .h5ad files in a directory (recursively) that match a given substring.

Parameters:

Name Type Description Default
folder_path

str Path to the folder where .h5ad files are located.

required
substring

str, optional A substring to filter filenames (default is None, meaning no filtering).

None
extension

str, optional File extension to search for (default is "*.h5ad").

'*.h5ad'

Returns:

Type Description

list A list of file paths matching the criteria.
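The behavior described here (recursive search, optional substring filter, glob-style extension) can be approximated with pathlib alone. The following is a sketch of that behavior under the name list_matching_files, which is hypothetical; matching the substring against the file name rather than the full path is also an assumption.

```python
from pathlib import Path


def list_matching_files(folder_path, substring=None, extension="*.h5ad"):
    """Recursively collect files matching a glob pattern, optionally
    keeping only those whose file name contains `substring`."""
    paths = sorted(Path(folder_path).rglob(extension))
    return [
        str(p) for p in paths
        if substring is None or substring in p.name
    ]


# Example: every .h5ad file under the current directory whose name
# contains "sample".
print(list_matching_files(".", substring="sample"))
```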

concord.utils.read_and_concatenate_adata(adata_files, merge='unique', add_dataset_col=False, dataset_col_name='dataset', output_file=None)

Read and concatenate multiple AnnData .h5ad files into a single AnnData object.

Parameters:

Name Type Description Default
adata_files

list List of file paths to .h5ad files to be concatenated.

required
merge

str, optional How to handle conflicting columns, e.g., 'unique' (default), 'first', etc.

'unique'
add_dataset_col

bool, optional Whether to add a new column in adata.obs identifying the source dataset.

False
dataset_col_name

str, optional Name of the new column storing dataset names.

'dataset'
output_file

str, optional Path to save the concatenated AnnData object. If None, the object is not saved.

None

Returns:

Type Description

ad.AnnData The concatenated AnnData object.

concord.utils.filter_and_copy_attributes(adata_target, adata_source)

Filter adata_target to match the cells in adata_source, then copy .obs and .obsm.

Parameters:

Name Type Description Default
adata_target

ad.AnnData The AnnData object to be filtered.

required
adata_source

ad.AnnData The reference AnnData object containing the desired cells and attributes.

required

Returns:

Type Description

ad.AnnData The filtered AnnData object with updated .obs and .obsm.

concord.utils.ensure_categorical(adata, obs_key=None, drop_unused=True)

Convert an .obs column to categorical dtype.

Parameters:

Name Type Description Default
adata

ad.AnnData The AnnData object.

required
obs_key

str Column in .obs to be converted to categorical.

None
drop_unused

bool, optional Whether to remove unused categories (default is True).

True

concord.utils.save_obsm_to_hdf5(adata, filename)

Save the .obsm attribute of an AnnData object to an HDF5 file.

Parameters:

Name Type Description Default
adata

anndata.AnnData The AnnData object containing the .obsm attribute to be saved.

required
filename

str The path to the HDF5 file where .obsm data will be stored.

required

Returns:

Type Description

None Saves .obsm data to the specified HDF5 file.

concord.utils.load_obsm_from_hdf5(filename)

Load the .obsm attribute from an HDF5 file.

Parameters:

Name Type Description Default
filename

str Path to the HDF5 file containing .obsm data.

required

Returns:

Type Description

dict A dictionary where keys are .obsm names and values are corresponding matrices.

concord.utils.subset_adata_to_obsm_indices(adata, obsm)

Subset an AnnData object to match the indices present in .obsm.

Parameters:

Name Type Description Default
adata

anndata.AnnData The original AnnData object.

required
obsm

dict A dictionary containing .obsm data, where keys are embedding names, and values are arrays.

required

Returns:

Type Description

anndata.AnnData A subsetted AnnData object that contains only the indices available in .obsm.

concord.utils.anndata_to_viscello(adata, output_dir, project_name='MyProject', organism='hsa', clist_only=False)

Converts an AnnData object to a VisCello project directory.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing single-cell data.

required
output_dir str

Directory where the VisCello project will be created.

required
project_name str

Name of the project. Defaults to "MyProject".

'MyProject'
organism str

Organism code (e.g., 'hsa' for human). Defaults to 'hsa'.

'hsa'
clist_only bool

Whether to generate only the clist file. Defaults to False.

False

Returns:

Type Description

None

Side Effects
  • Creates a directory with the necessary files for VisCello.
  • Saves eset.rds (ExpressionSet), config.yml, and clist.rds.

concord.utils.update_clist_with_subsets(global_adata, adata_subsets, viscello_dir, cluster_key=None)

Updates an existing VisCello clist with new subsets.

Parameters:

Name Type Description Default
global_adata AnnData

The full AnnData object.

required
adata_subsets dict

Dictionary mapping subset names to AnnData objects.

required
viscello_dir str

Path to the existing VisCello directory.

required
cluster_key str

Key in adata.obs for cluster assignments. Defaults to None.

None

Returns:

Type Description

None

Side Effects
  • Reads the existing clist.rds file from viscello_dir.
  • Adds new subsets as Cello objects to the clist.
  • Saves the updated clist.rds file in viscello_dir.