Utilities
`utils` can be replaced by `ul`; e.g., `concord.utils.list_adata_files` can be written as `concord.ul.list_adata_files`.
concord.utils.benchmark_topology(diagrams, expected_betti_numbers=[1, 0, 0], n_bins=100, save_dir=None, file_suffix=None)
Benchmark the topological properties of persistence diagrams.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
diagrams | dict | A dictionary where keys are method names and values are persistence diagrams. | required |
expected_betti_numbers | list, optional | A list specifying the expected Betti numbers for different homology dimensions. Default is [1, 0, 0]. | [1, 0, 0] |
n_bins | int, optional | Number of bins to use for Betti curve calculations. Default is 100. | 100 |
save_dir | str, optional | Directory to save benchmarking results as CSV files. If None, results are not saved. | None |
file_suffix | str, optional | Suffix to append to saved filenames. | None |
Returns:
Type | Description |
---|---|
dict | A dictionary containing: |
concord.utils.compute_persistent_homology(adata, key='X_pca', homology_dimensions=[0, 1, 2])
Computes persistent homology using a Vietoris-Rips complex.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | anndata.AnnData | The AnnData object containing the data. | required |
key | str, optional | The key in | 'X_pca' |
homology_dimensions | list, optional | List of homology dimensions to compute. Default is [0, 1, 2]. | [0, 1, 2] |
Returns:
Type | Description |
---|---|
np.ndarray | Persistence diagrams representing homology classes across filtration values. |
concord.utils.compute_betti_median_or_mode(betti_values, statistic='median')
Computes the median or mode of Betti numbers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
betti_values | np.ndarray | Array of Betti numbers across filtration values. | required |
statistic | str, optional | Statistic to compute ('median' or 'mode'). Default is 'median'. | 'median' |
Returns:
Type | Description |
---|---|
float | The computed median or mode of the Betti numbers. |
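A minimal sketch of the two summary statistics, assuming the Betti curve is a plain sequence of counts. The helper name `betti_median_or_mode` is hypothetical and uses only the Python standard library; it illustrates the idea and is not the library's implementation.

```python
from statistics import median, mode

def betti_median_or_mode(betti_values, statistic="median"):
    """Summarize a Betti curve by its median or its most frequent value."""
    values = list(betti_values)
    if statistic == "median":
        return float(median(values))
    if statistic == "mode":
        return float(mode(values))
    raise ValueError("statistic must be 'median' or 'mode'")

# A curve that starts with many components and quickly collapses to zero.
curve = [5, 4, 3, 0, 0]
summary = {s: betti_median_or_mode(curve, s) for s in ("median", "mode")}
```

The median is less sensitive to a long flat tail than the mode, which is one reason to expose both.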
concord.utils.compute_betti_entropy(betti_values)
Computes the entropy of the Betti curve.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
betti_values | np.ndarray | Array of Betti numbers across filtration values. | required |
Returns:
Type | Description |
---|---|
float | The entropy of the Betti curve. |
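One plausible reading of "entropy of the Betti curve" is the Shannon entropy of the curve after normalizing it to sum to 1. The sketch below makes that assumption explicit; the function name `betti_entropy` and the choice of natural logarithm are illustrative, and the library may define the quantity differently.

```python
import math

def betti_entropy(betti_values):
    """Shannon entropy (natural log) of a Betti curve normalized to sum to 1."""
    total = sum(betti_values)
    if total == 0:
        return 0.0  # an all-zero curve carries no mass to spread
    probs = [b / total for b in betti_values if b > 0]
    return -sum(p * math.log(p) for p in probs)

flat_entropy = betti_entropy([1, 1, 1, 1])    # mass spread evenly over the filtration
peaked_entropy = betti_entropy([5, 0, 0, 0])  # mass concentrated at one value
```

Under this definition a flat curve has maximal entropy and a curve concentrated at a single filtration value has entropy zero.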
concord.utils.interpolate_betti_curve(betti_values, original_sampling, common_sampling)
Interpolates Betti curve onto a common filtration grid.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
betti_values | np.ndarray | Array of Betti numbers. | required |
original_sampling | np.ndarray | The original filtration values associated with the Betti numbers. | required |
common_sampling | np.ndarray | The target filtration values for interpolation. | required |
Returns:
Type | Description |
---|---|
np.ndarray | Interpolated Betti curve. |
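The sketch below shows what resampling a curve onto a common grid can look like, assuming piecewise-linear interpolation with values clamped outside the original range and a strictly increasing original grid. The name `interpolate_curve` is hypothetical; the interpolation scheme used by the library is not stated in this reference.

```python
from bisect import bisect_right

def interpolate_curve(betti_values, original_sampling, common_sampling):
    """Linearly interpolate a curve onto new x-values, clamping outside the range."""
    xs, ys = list(original_sampling), list(betti_values)
    out = []
    for x in common_sampling:
        if x <= xs[0]:
            out.append(float(ys[0]))
        elif x >= xs[-1]:
            out.append(float(ys[-1]))
        else:
            i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            out.append(ys[i - 1] + t * (ys[i] - ys[i - 1]))
    return out

resampled = interpolate_curve([4, 2, 0], [0.0, 1.0, 2.0], [0.0, 0.5, 1.5, 3.0])
```

Putting every method's curve on the same grid is what makes the curves comparable point by point.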
concord.utils.compute_betti_statistics(diagram, expected_betti_numbers, n_bins=100)
Computes Betti statistics given a persistence diagram.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
diagram | np.ndarray | Persistence diagram from Giotto-TDA. | required |
expected_betti_numbers | np.ndarray | Expected Betti numbers for different homology dimensions. | required |
n_bins | int, optional | Number of bins for the Betti curve computation. Default is 100. | 100 |
Returns:
Type | Description |
---|---|
dict | A dictionary containing: |
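For orientation, a Betti curve counts how many homology classes of a given dimension are alive at each filtration value. The sketch below assumes the diagram is a sequence of (birth, death, dimension) triples and samples the curve on an evenly spaced grid of `n_bins` points (at least 2); the function name `betti_curve` and the half-open interval convention are assumptions, not the library's code.

```python
def betti_curve(diagram, dimension, n_bins=100):
    """Count intervals of one homology dimension alive at each grid value."""
    intervals = [(b, d) for b, d, q in diagram if q == dimension]
    if not intervals:
        return [], []
    lo = min(b for b, _ in intervals)
    hi = max(d for _, d in intervals)
    grid = [lo + (hi - lo) * i / (n_bins - 1) for i in range(n_bins)]
    # An interval [birth, death) is alive at t when birth <= t < death.
    counts = [sum(1 for b, d in intervals if b <= t < d) for t in grid]
    return grid, counts

toy_diagram = [(0.0, 1.0, 0), (0.0, 0.4, 0), (0.5, 0.9, 1)]
grid, counts = betti_curve(toy_diagram, dimension=0, n_bins=5)
```

Statistics such as the median, mode, or entropy can then be computed on `counts` and compared with the expected Betti number for that dimension.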
concord.utils.summarize_betti_statistics(betti_stats)
Summarizes Betti statistics into pandas DataFrames.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
betti_stats | dict | Dictionary containing Betti statistics for different methods. | required |
Returns:
Type | Description |
---|---|
tuple | |
concord.utils.benchmark_geometry(adata, keys, eval_metrics=['pseudotime', 'cell_distance_corr', 'local_distal_corr', 'trustworthiness', 'state_distance_corr', 'state_dispersion_corr', 'state_batch_distance_ratio'], dist_metric='cosine', groundtruth_key='PCA_no_noise', state_key='cluster', batch_key='batch', groundtruth_dispersion=None, ground_truth_dispersion_key='wt_noise', corr_types=['pearsonr', 'spearmanr', 'kendalltau'], trustworthiness_n_neighbors=np.arange(10, 101, 10), dispersion_metric='var', return_type='dataframe', local_percentile=0.1, distal_percentile=0.9, start_point=0, end_point=None, pseudotime_k=30, truetime_key='time', verbose=True, save_dir=None, file_suffix=None)
Benchmark the geometric properties of different embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | anndata.AnnData | The AnnData object containing cell embeddings. | required |
keys | list | List of embeddings (keys in | required |
eval_metrics | list, optional | Metrics to compute, such as 'pseudotime', 'cell_distance_corr', etc. Default includes multiple metrics. | ['pseudotime', 'cell_distance_corr', 'local_distal_corr', 'trustworthiness', 'state_distance_corr', 'state_dispersion_corr', 'state_batch_distance_ratio'] |
dist_metric | str, optional | Distance metric for computing cell distances. Default is 'cosine'. | 'cosine' |
groundtruth_key | str, optional | Key in | 'PCA_no_noise' |
state_key | str, optional | Key in | 'cluster' |
batch_key | str, optional | Key in | 'batch' |
groundtruth_dispersion | dict, optional | Precomputed dispersion values for ground truth, if available. | None |
ground_truth_dispersion_key | str, optional | Key used when computing dispersion correlations. Default is 'wt_noise'. | 'wt_noise' |
corr_types | list, optional | List of correlation methods to compute. Default includes 'pearsonr', 'spearmanr', and 'kendalltau'. | ['pearsonr', 'spearmanr', 'kendalltau'] |
trustworthiness_n_neighbors | np.ndarray, optional | Range of neighborhood sizes for trustworthiness computation. Default is | arange(10, 101, 10) |
dispersion_metric | str, optional | Metric to compute dispersion, e.g., 'var' (variance). Default is 'var'. | 'var' |
return_type | str, optional | If 'dataframe', returns summary statistics; if 'full', returns additional details. Default is 'dataframe'. | 'dataframe' |
local_percentile | float, optional | Percentile threshold for local distance correlations. Default is 0.1. | 0.1 |
distal_percentile | float, optional | Percentile threshold for distal distance correlations. Default is 0.9. | 0.9 |
start_point | int, optional | Index of the starting cell for pseudotime computation. Must be specified. | 0 |
end_point | int, optional | Index of the ending cell for pseudotime computation. Must be specified. | None |
pseudotime_k | int, optional | Number of neighbors used in k-NN graph for pseudotime computation. Default is 30. | 30 |
truetime_key | str, optional | Key in | 'time' |
verbose | bool, optional | Whether to enable logging. Default is True. | True |
save_dir | str, optional | Directory to save benchmarking results. If None, results are not saved. | None |
file_suffix | str, optional | Suffix for saved filenames. | None |
Returns:
Type | Description |
---|---|
pd.DataFrame or tuple | If |
concord.utils.pairwise_distance(adata, keys, metric='cosine')
concord.utils.local_vs_distal_corr(X_high, X_low, local_percentile=25, distal_percentile=75, method='pearsonr')
Computes correlation between local and distal pairwise distances.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X_high | ndarray | High-dimensional data matrix. | required |
X_low | ndarray | Low-dimensional embedding matrix. | required |
local_percentile | int | Percentile threshold for local distances. Defaults to 25. | 25 |
distal_percentile | int | Percentile threshold for distal distances. Defaults to 75. | 75 |
method | str | Correlation method; 'pearsonr', 'spearmanr', or 'kendalltau'. Defaults to 'pearsonr'. | 'pearsonr' |
Returns:
Type | Description |
---|---|
float | Correlation for local distances. |
float | Correlation for distal distances. |
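The idea can be sketched as follows: compute all pairwise distances in both spaces, split the pairs into a "local" set (smallest high-dimensional distances) and a "distal" set (largest), and correlate high- against low-dimensional distances within each set. This sketch assumes Euclidean distances, nearest-rank percentile cutoffs taken on the high-dimensional distances, and Pearson correlation; the helper names are hypothetical and the library's exact thresholds may differ.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def local_vs_distal(points_high, points_low, local_percentile=25, distal_percentile=75):
    """Correlate high- and low-dimensional distances for near and far pairs."""
    pairs = list(combinations(range(len(points_high)), 2))
    d_high = [math.dist(points_high[i], points_high[j]) for i, j in pairs]
    d_low = [math.dist(points_low[i], points_low[j]) for i, j in pairs]
    ranked = sorted(d_high)
    last = len(ranked) - 1
    # Nearest-rank cutoffs on the high-dimensional distances (an assumption).
    local_cut = ranked[min(int(len(ranked) * local_percentile / 100), last)]
    distal_cut = ranked[min(int(len(ranked) * distal_percentile / 100), last)]
    local = [(h, l) for h, l in zip(d_high, d_low) if h <= local_cut]
    distal = [(h, l) for h, l in zip(d_high, d_low) if h >= distal_cut]
    return pearson(*zip(*local)), pearson(*zip(*distal))

high = [(0, 0), (1, 0), (3, 0), (7, 0), (15, 0)]
low = [(x / 2, y / 2) for x, y in high]  # a perfectly scaled "embedding"
local_r, distal_r = local_vs_distal(high, low)
```

An embedding that preserves neighborhoods but distorts global layout would score high on the local correlation and low on the distal one.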
concord.utils.compute_state_batch_distance_ratio(adata, basis='X_latent', batch_key='batch', state_key='cluster', metric='cosine')
Computes the Batch-to-State Distance Ratio using centroids to evaluate batch correction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | AnnData | AnnData object containing latent embeddings. | required |
basis | str | Key for latent embeddings in adata.obsm. Defaults to 'X_latent'. | 'X_latent' |
batch_key | str | Key for batch labels in adata.obs. Defaults to 'batch'. | 'batch' |
state_key | str | Key for cell state labels in adata.obs. Defaults to 'cluster'. | 'cluster' |
metric | str | Distance metric to use, e.g., 'cosine' or 'euclidean'. Defaults to 'cosine'. | 'cosine' |
Returns:
Type | Description |
---|---|
float | Ratio of average batch distance to average state distance. |
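A rough sketch of a centroid-based batch-to-state ratio, under several assumptions not confirmed by this reference: batch centroids are computed within each state, distances are averaged over all centroid pairs, and Euclidean distance is used instead of the default cosine metric for brevity. All helper names are hypothetical.

```python
import math
from itertools import combinations

def centroid(rows):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def mean_pairwise(points):
    """Average Euclidean distance over all pairs of points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def batch_state_ratio(embedding, batches, states):
    by_state, by_state_batch = {}, {}
    for vec, b, s in zip(embedding, batches, states):
        by_state.setdefault(s, []).append(vec)
        by_state_batch.setdefault(s, {}).setdefault(b, []).append(vec)
    # Average distance between state centroids.
    state_dist = mean_pairwise([centroid(v) for v in by_state.values()])
    # Average distance between batch centroids, computed inside each state.
    batch_dists = [mean_pairwise([centroid(v) for v in groups.values()])
                   for groups in by_state_batch.values() if len(groups) > 1]
    return (sum(batch_dists) / len(batch_dists)) / state_dist

embedding = [(0, 0), (0, 1), (10, 0), (10, 1)]
ratio = batch_state_ratio(embedding, ["b1", "b2", "b1", "b2"], ["A", "A", "B", "B"])
```

A small ratio means batches of the same state sit much closer together than different states do, which is the outcome expected after successful batch correction.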
concord.utils.compute_trustworthiness(adata, embedding_keys, groundtruth, metric='euclidean', n_neighbors=10)
Evaluates trustworthiness of embeddings in an AnnData object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | AnnData | AnnData object containing embeddings in adata.obsm. | required |
embedding_keys | list | List of keys in adata.obsm to evaluate (e.g., ['X_umap', 'X_tsne']). | required |
groundtruth | str or ndarray | Key in adata.obsm or adata.layers for ground truth data, or a precomputed matrix. | required |
metric | str | Distance metric for trustworthiness calculation, e.g., 'euclidean' or 'cosine'. Defaults to 'euclidean'. | 'euclidean' |
n_neighbors | int or list | Neighborhood sizes for trustworthiness evaluation. Defaults to 10. | 10 |
Returns:
Type | Description |
---|---|
pandas.DataFrame | Trustworthiness scores for each embedding at each neighborhood size. |
pandas.DataFrame | Summary statistics with average trustworthiness and decay rate. |
concord.utils.Simulation
A class for simulating single-cell gene expression data with various structures and batch effects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_cells | int | Number of cells to simulate. Defaults to 1000. | 1000 |
n_genes | int | Number of genes to simulate. Defaults to 1000. | 1000 |
n_batches | int | Number of batches to simulate. Defaults to 2. | 2 |
n_states | int | Number of states (e.g., clusters, trajectories). Defaults to 3. | 3 |
state_type | str | Type of state to simulate; options include 'cluster', 'trajectory', 'tree', etc. Defaults to 'cluster'. | 'cluster' |
batch_type | str or list | Type of batch effect; options include 'batch_specific_features', 'variance_inflation', etc. Defaults to 'batch_specific_features'. | 'batch_specific_features' |
state_distribution | str | Distribution type for states; e.g., 'normal', 'poisson'. Defaults to 'normal'. | 'normal' |
state_level | float | Mean expression level for states. Defaults to 1.0. | 1.0 |
state_min_level | float | Minimum expression level. Defaults to 0.0. | 0.0 |
state_dispersion | float | Dispersion of state expression. Defaults to 0.1. | 0.1 |
program_structure | str | Gene expression program structure; e.g., 'linear', 'bidirectional'. Defaults to 'linear'. | 'linear' |
program_on_time_fraction | float | Fraction of time the program is on. Defaults to 0.3. | 0.3 |
program_gap_size | int | Size of gaps in expression programs. Defaults to 1. | 1 |
program_noise_in_block | bool | Whether to add noise within each expression block. Defaults to True. | True |
trajectory_program_num | int | Number of programs in a trajectory simulation. Defaults to 3. | 3 |
trajectory_cell_block_size_ratio | float | Ratio of cell block sizes in a trajectory. Defaults to 0.3. | 0.3 |
trajectory_loop_to | int or list | Loop connection in trajectory simulations. Defaults to None. | None |
tree_branching_factor | int | Number of branches per tree level. Defaults to 2. | 2 |
tree_depth | int | Depth of the simulated tree. Defaults to 3. | 3 |
tree_program_decay | float | Decay factor for tree programs across branches. Defaults to 0.5. | 0.5 |
tree_cellcount_decay | float | Decay factor for cell numbers across tree branches. Defaults to 1.0. | 1.0 |
batch_distribution | str or list | Distribution for batch effects. Defaults to 'normal'. | 'normal' |
batch_level | float or list | Magnitude of batch effects. Defaults to 1.0. | 1.0 |
batch_dispersion | float or list | Dispersion of batch effects. Defaults to 0.1. | 0.1 |
batch_cell_proportion | list | Proportion of cells per batch. Defaults to None. | None |
batch_feature_frac | float or list | Fraction of genes affected by batch effects. Defaults to 0.1. | 0.1 |
global_non_specific_gene_fraction | float | Fraction of genes that are globally non-specific. Defaults to 0.1. | 0.1 |
pairwise_non_specific_gene_fraction | dict | Pairwise-specific gene fraction between state pairs. Defaults to None. | None |
universal_gene_fraction | float | Fraction of universal genes expressed across all cells. Defaults to 0.0. | 0.0 |
non_neg | bool | Whether to enforce non-negative expression values. Defaults to False. | False |
to_int | bool | Whether to convert expression values to integers. Defaults to False. | False |
seed | int | Random seed for reproducibility. Defaults to 0. | 0 |
Methods:
Name | Description |
---|---|
simulate_data | Simulates gene expression data, including batch effects. |
simulate_state | Simulates cell state-specific gene expression patterns. |
simulate_batch | Simulates batch-specific effects on gene expression. |
simulate_clusters | Simulates gene expression in discrete clusters. |
simulate_trajectory | Simulates continuous gene expression trajectories. |
simulate_tree | Simulates hierarchical branching gene expression. |
simulate_gatto | Simulates expression patterns similar to Gatto et al., 2023. |
simulate_s_curve | Simulates an S-curve structure in gene expression. |
simulate_swiss_roll | Simulates a Swiss roll structure with optional hole. |
simulate_expression_block | Generates structured gene expression within a cell population. |
simulate_dropout | Simulates dropout in gene expression data. |
downsample_mtx_umi | Performs UMI count downsampling. |
simulate_distribution | Samples values from specified distributions. |
downsample_mtx_umi(mtx, ratio=0.1, seed=1)
staticmethod
Simulates downsampling of a gene expression matrix (UMI counts) by a given ratio.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mtx | ndarray | The input matrix where rows represent genes and columns represent cells. | required |
ratio | float | The downsampling ratio (default 0.1). | 0.1 |
seed | int | Random seed for reproducibility (default 1). | 1 |
Returns:
Type | Description |
---|---|
numpy.ndarray | The downsampled matrix. |
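One common way to downsample UMI counts is binomial thinning: every observed molecule is kept independently with probability `ratio`. The sketch below assumes that model and works on nested lists with the standard library only; the name `downsample_counts` is hypothetical, and the method documented here may sample differently.

```python
import random

def downsample_counts(mtx, ratio=0.1, seed=1):
    """Keep each UMI independently with probability `ratio` (binomial thinning)."""
    rng = random.Random(seed)
    return [[sum(rng.random() < ratio for _ in range(count)) for count in row]
            for row in mtx]

counts = [[10, 0, 3], [200, 5, 1]]  # genes x cells
thinned = downsample_counts(counts, ratio=0.5, seed=1)
```

Thinned counts can never exceed the originals, zeros stay zero, and a fixed seed makes the result reproducible.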
rnegbin(mu, theta, size)
staticmethod
Generate random numbers from a negative binomial distribution.
Parameters: mu: Mean of the distribution. theta: Dispersion parameter. size: Number of random numbers to generate.
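A negative binomial with mean `mu` and dispersion `theta` can be drawn as a gamma-Poisson mixture: sample a rate from a gamma distribution with shape `theta` and scale `mu / theta`, then sample a Poisson count with that rate. The sketch below assumes this parameterization (variance `mu + mu**2 / theta`); the names `poisson` and `rnegbin_sketch` and the `seed` argument are illustrative, not part of the documented signature.

```python
import math
import random

def poisson(rng, lam):
    """Knuth's multiplication method; adequate for small rates."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def rnegbin_sketch(mu, theta, size, seed=0):
    """Draw `size` negative binomial counts via a gamma-Poisson mixture."""
    rng = random.Random(seed)
    # A gamma rate with shape theta and scale mu / theta has mean mu.
    return [poisson(rng, rng.gammavariate(theta, mu / theta)) for _ in range(size)]

draws = rnegbin_sketch(mu=5.0, theta=2.0, size=2000)
sample_mean = sum(draws) / len(draws)
```

Smaller `theta` values give heavier overdispersion relative to a Poisson with the same mean.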
simulate_batch(adata, cell_indices=None, cell_proportion=0.3, batch_name='batch_1', effect_type='batch_specific_features', distribution='normal', level=1.0, dispersion=0.1, batch_feature_frac=0.1, seed=42)
Applies batch-specific effects to an existing simulated dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | AnnData | Base dataset to apply batch effects. | required |
cell_indices | array-like | Indices of cells to modify. Defaults to None. | None |
cell_proportion | float | Proportion of cells affected. Defaults to 0.3. | 0.3 |
batch_name | str | Name of the batch. Defaults to 'batch_1'. | 'batch_1' |
effect_type | str | Type of batch effect (e.g., 'batch_specific_features', 'variance_inflation'). Defaults to 'batch_specific_features'. | 'batch_specific_features' |
distribution | str | Distribution type for batch effects (e.g., 'normal'). Defaults to 'normal'. | 'normal' |
level | float | Effect level (e.g., scaling factor). Defaults to 1.0. | 1.0 |
dispersion | float | Dispersion of batch effects. Defaults to 0.1. | 0.1 |
batch_feature_frac | float | Fraction of genes affected by batch effects. Defaults to 0.1. | 0.1 |
seed | int | Random seed for reproducibility. Defaults to 42. | 42 |
Returns:
Type | Description |
---|---|
tuple | |
simulate_clusters(n_genes=6, n_cells=12, num_clusters=2, program_structure='uniform', program_on_time_fraction=0.3, distribution='normal', mean_expression=10, min_expression=1, dispersion=1.0, global_non_specific_gene_fraction=0.1, pairwise_non_specific_gene_fraction=None, cluster_key='cluster', permute=False, seed=42)
Simulates gene expression for discrete cell clusters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_genes | int or list | Number of genes per cluster or total genes. Defaults to 6. | 6 |
n_cells | int or list | Number of cells per cluster or total cells. Defaults to 12. | 12 |
num_clusters | int | Number of clusters to simulate. Defaults to 2. | 2 |
program_structure | str | Expression program structure ('linear', 'uniform', etc.). Defaults to 'uniform'. | 'uniform' |
program_on_time_fraction | float | Fraction of program duration. Defaults to 0.3. | 0.3 |
distribution | str | Type of distribution for gene expression. Defaults to 'normal'. | 'normal' |
mean_expression | float | Mean expression level. Defaults to 10. | 10 |
min_expression | float | Minimum expression level. Defaults to 1. | 1 |
dispersion | float | Dispersion in expression levels. Defaults to 1.0. | 1.0 |
global_non_specific_gene_fraction | float | Fraction of globally expressed genes. Defaults to 0.1. | 0.1 |
pairwise_non_specific_gene_fraction | dict | Pairwise-specific genes between cluster pairs. Defaults to None. | None |
cluster_key | str | Key for cluster labeling. Defaults to 'cluster'. | 'cluster' |
permute | bool | Whether to shuffle cells. Defaults to False. | False |
seed | int | Random seed. Defaults to 42. | 42 |
Returns:
Type | Description |
---|---|
AnnData | Simulated dataset with clustered gene expression. |
simulate_data()
Simulates single-cell gene expression data, integrating state-based and batch effects.
Returns:
Type | Description |
---|---|
tuple | |
simulate_dropout(mtx, dropout_lambda=1.0, seed=None)
staticmethod
Simulates dropout in UMI counts based on the specified dropout lambda.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mtx | ndarray | The actual UMI counts matrix (genes x cells). | required |
dropout_lambda | float | The lambda parameter controlling the dropout probability. | 1.0 |
seed | int | Seed for the random number generator for reproducibility. | None |
Returns:
Type | Description |
---|---|
numpy.ndarray | The UMI counts matrix after applying dropout. |
simulate_state()
Simulates gene expression profiles for different cell states.
Returns:
Type | Description |
---|---|
AnnData | An AnnData object containing simulated state-specific expression data. |
simulate_trajectory(n_genes=10, n_cells=100, cell_block_size_ratio=0.3, program_num=3, program_structure='linear', program_on_time_fraction=0.3, distribution='normal', mean_expression=10, min_expression=0, dispersion=1.0, seed=42, loop_to=None)
Simulates a continuous trajectory of gene expression.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_genes | int | Number of genes. Defaults to 10. | 10 |
n_cells | int | Number of cells. Defaults to 100. | 100 |
cell_block_size_ratio | float | Ratio of cell blocks. Defaults to 0.3. | 0.3 |
program_num | int | Number of gene programs in the trajectory. Defaults to 3. | 3 |
program_structure | str | Structure of gene programs ('linear', 'bidirectional'). Defaults to 'linear'. | 'linear' |
program_on_time_fraction | float | Fraction of time the program is on. Defaults to 0.3. | 0.3 |
distribution | str | Distribution type. Defaults to 'normal'. | 'normal' |
mean_expression | float | Mean expression level. Defaults to 10. | 10 |
min_expression | float | Minimum expression level. Defaults to 0. | 0 |
dispersion | float | Dispersion of expression. Defaults to 1.0. | 1.0 |
seed | int | Random seed. Defaults to 42. | 42 |
loop_to | int or list | Defines looping relationships in the trajectory. Defaults to None. | None |
Returns:
Type | Description |
---|---|
AnnData | Simulated dataset with continuous gene expression patterns. |
simulate_tree(n_genes=10, n_cells=100, branching_factor=2, depth=3, program_structure='linear_increasing', program_on_time_fraction=0.3, program_gap_size=1, program_decay=0.5, cellcount_decay=1.0, distribution='normal', mean_expression=10, min_expression=0, dispersion=1.0, seed=42, noise_in_block=True)
Simulates hierarchical branching gene expression patterns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_genes | int | Number of genes. Defaults to 10. | 10 |
n_cells | int | Number of cells. Defaults to 100. | 100 |
branching_factor | int | Number of branches per level. Defaults to 2. | 2 |
depth | int | Depth of the branching tree. Defaults to 3. | 3 |
program_structure | str | Gene program structure. Defaults to 'linear_increasing'. | 'linear_increasing' |
program_on_time_fraction | float | Program activation time fraction. Defaults to 0.3. | 0.3 |
program_gap_size | int | Gap size between programs. Defaults to 1. | 1 |
program_decay | float | Decay factor for program effects. Defaults to 0.5. | 0.5 |
cellcount_decay | float | Decay factor for cell counts. Defaults to 1.0. | 1.0 |
distribution | str | Expression distribution type. Defaults to 'normal'. | 'normal' |
mean_expression | float | Mean gene expression level. Defaults to 10. | 10 |
min_expression | float | Minimum gene expression level. Defaults to 0. | 0 |
dispersion | float | Dispersion of expression. Defaults to 1.0. | 1.0 |
seed | int | Random seed. Defaults to 42. | 42 |
noise_in_block | bool | Whether to add noise within expression blocks. Defaults to True. | True |
Returns:
Type | Description |
---|---|
AnnData | Simulated dataset with hierarchical tree-like gene expression. |
concord.utils.select_features(adata, n_top_features=2000, flavor='seurat_v3', filter_gene_by_counts=False, normalize=False, log1p=False, grouping='cluster', emb_key='X_pca', k=512, knn_samples=100, gini_cut_qt=None, save_path=None, figsize=(10, 3), subsample_frac=1.0, random_state=0)
Selects top informative features from an AnnData object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | AnnData | AnnData object containing gene expression data. | required |
n_top_features | int | Number of top features to select. Defaults to 2000. | 2000 |
flavor | str | Feature selection method. Options: 'seurat_v3' (highly variable gene selection based on Seurat v3); 'iff' (uses the Informative Feature Filtering (IFF) method). Defaults to 'seurat_v3'. | 'seurat_v3' |
filter_gene_by_counts | Union[int, bool] | Minimum count threshold for feature filtering. Defaults to False. | False |
normalize | bool | Whether to normalize the data before feature selection. Defaults to False. | False |
log1p | bool | Whether to apply log1p transformation before feature selection. Defaults to False. | False |
grouping | Union[str, Series, List[str]] | Clustering/grouping strategy for IFF method. Defaults to 'cluster'. | 'cluster' |
emb_key | str | Embedding key in | 'X_pca' |
k | int | Number of neighbors for k-NN if | 512 |
knn_samples | int | Number of k-NN samples if | 100 |
gini_cut_qt | float | Quantile threshold for selecting features by Gini coefficient in IFF. Defaults to None. | None |
save_path | Optional[Union[str, Path]] | Path to save Gini coefficient plot. Defaults to None. | None |
figsize | tuple | Size of Gini coefficient plot. Defaults to (10, 3). | (10, 3) |
subsample_frac | float | Fraction of data to subsample for feature selection. Defaults to 1.0. | 1.0 |
random_state | int | Random seed for reproducibility. Defaults to 0. | 0 |
Returns:
Type | Description |
---|---|
List[str] | List of selected feature names. |
concord.utils.generate_synthetic_doublets(adata, doublet_synth_ratio, seed, batch_key, droplet_type_key, mean=0.5, var=0.1, clip_range=(0.2, 0.8), plot_histogram=True, combine_with_original=False)
Generate synthetic doublets from singlet data in an AnnData object within each batch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | AnnData | AnnData object containing the singlet data (possibly including unclassified doublets). | required |
doublet_synth_ratio | float | Ratio of synthetic doublets to true singlets. | required |
seed | int | Random seed for reproducibility. | required |
batch_key | str | Key in .obs indicating batch information. | required |
droplet_type_key | str | Key in .obs indicating droplet type. | required |
mean | float | Mean of the normal distribution for generating fractions. Defaults to 0.5. | 0.5 |
var | float | Variance of the normal distribution for generating fractions. Defaults to 0.1. | 0.1 |
clip_range | tuple | Range to clip the generated fractions. Defaults to (0.2, 0.8). | (0.2, 0.8) |
plot_histogram | bool | Whether to plot the histogram of synthetic doublet fractions. | True |
Returns:
Name | Type | Description |
---|---|---|
adata_synthetic_doublets | AnnData | AnnData object containing the synthetic doublets. |
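The role of `mean`, `var`, and `clip_range` can be illustrated with a single synthetic doublet. The sketch assumes the mixing fraction is drawn from a normal distribution, clipped to `clip_range`, and used to form a weighted combination of two singlet profiles; the function `make_doublet` and the weighted-average rule are assumptions for illustration, and the library may combine cells differently.

```python
import random

def make_doublet(cell_a, cell_b, rng, mean=0.5, var=0.1, clip_range=(0.2, 0.8)):
    """Mix two singlet profiles using a clipped, normally distributed fraction."""
    frac = rng.gauss(mean, var ** 0.5)  # gauss expects a standard deviation
    frac = min(max(frac, clip_range[0]), clip_range[1])
    profile = [frac * a + (1 - frac) * b for a, b in zip(cell_a, cell_b)]
    return frac, profile

rng = random.Random(0)
frac, doublet = make_doublet([10, 0, 4], [0, 6, 4], rng)
```

Clipping keeps every synthetic doublet a genuine mixture, so neither parent cell contributes less than the lower bound of `clip_range`.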
concord.utils.list_adata_files(folder_path, substring=None, extension='*.h5ad')
List all `.h5ad` files in a directory (recursively) that match a given substring.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder_path | str | Path to the folder where | required |
substring | str, optional | A substring to filter filenames (default is None, meaning no filtering). | None |
extension | str, optional | File extension to search for (default is "*.h5ad"). | '*.h5ad' |
Returns:
Type | Description |
---|---|
list | A list of file paths matching the criteria. |
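The behavior described above (recursive search by glob pattern, optional substring filter on the filename) can be approximated with `pathlib`. The function `list_files` below is a hypothetical stand-in written against the standard library only; it is not the library's code, and the file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def list_files(folder_path, substring=None, extension="*.h5ad"):
    """Recursively collect files matching `extension`, optionally filtered by name."""
    paths = sorted(Path(folder_path).rglob(extension))
    return [str(p) for p in paths if substring is None or substring in p.name]

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run1").mkdir()
    (Path(tmp) / "run1" / "pbmc_raw.h5ad").touch()
    (Path(tmp) / "pbmc_filtered.h5ad").touch()
    (Path(tmp) / "notes.txt").touch()
    all_files = sorted(Path(p).name for p in list_files(tmp))
    raw_only = [Path(p).name for p in list_files(tmp, substring="raw")]
```

The resulting list of paths is the kind of input that `read_and_concatenate_adata` takes as `adata_files`.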
concord.utils.read_and_concatenate_adata(adata_files, merge='unique', add_dataset_col=False, dataset_col_name='dataset', output_file=None)
Read and concatenate multiple AnnData `.h5ad` files into a single AnnData object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata_files | list | List of file paths to | required |
merge | str, optional | How to handle conflicting columns, e.g., 'unique' (default), 'first', etc. | 'unique' |
add_dataset_col | bool, optional | Whether to add a new column in | False |
dataset_col_name | str, optional | Name of the new column storing dataset names. | 'dataset' |
output_file | str, optional | Path to save the concatenated AnnData object. If None, the object is not saved. | None |
Returns:
Type | Description |
---|---|
ad.AnnData | The concatenated AnnData object. |
concord.utils.filter_and_copy_attributes(adata_target, adata_source)
Filter `adata_target` to match the cells in `adata_source`, then copy `.obs` and `.obsm`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata_target | ad.AnnData | The AnnData object to be filtered. | required |
adata_source | ad.AnnData | The reference AnnData object containing the desired cells and attributes. | required |
Returns:
Type | Description |
---|---|
ad.AnnData | The filtered AnnData object with updated |
concord.utils.ensure_categorical(adata, obs_key=None, drop_unused=True)
Convert an `.obs` column to categorical dtype.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | ad.AnnData | The AnnData object. | required |
obs_key | str | Column in | None |
drop_unused | bool, optional | Whether to remove unused categories (default is True). | True |
concord.utils.save_obsm_to_hdf5(adata, filename)
Save the `.obsm` attribute of an AnnData object to an HDF5 file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | anndata.AnnData | The AnnData object containing the | required |
filename | str | The path to the HDF5 file where | required |
Returns:
Type | Description |
---|---|
None | Saves |
concord.utils.load_obsm_from_hdf5(filename)
Load the `.obsm` attribute from an HDF5 file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | Path to the HDF5 file containing | required |
Returns:
Type | Description |
---|---|
dict | A dictionary where keys are |
concord.utils.subset_adata_to_obsm_indices(adata, obsm)
Subset an AnnData object to match the indices present in `.obsm`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | anndata.AnnData | The original AnnData object. | required |
obsm | dict | A dictionary containing | required |
Returns:
Type | Description |
---|---|
anndata.AnnData | A subsetted AnnData object that contains only the indices available in |
concord.utils.anndata_to_viscello(adata, output_dir, project_name='MyProject', organism='hsa', clist_only=False)
Converts an AnnData object to a VisCello project directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
adata | AnnData | AnnData object containing single-cell data. | required |
output_dir | str | Directory where the VisCello project will be created. | required |
project_name | str | Name of the project. Defaults to 'MyProject'. | 'MyProject' |
organism | str | Organism code (e.g., 'hsa' for human). Defaults to 'hsa'. | 'hsa' |
clist_only | bool | Whether to generate only the clist file. Defaults to False. | False |
Returns:
Type | Description |
---|---|
None |
Side Effects
- Creates a directory with the necessary files for VisCello.
- Saves `eset.rds` (ExpressionSet), `config.yml`, and `clist.rds`.
concord.utils.update_clist_with_subsets(global_adata, adata_subsets, viscello_dir, cluster_key=None)
Updates an existing VisCello clist with new subsets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
global_adata | AnnData | The full AnnData object. | required |
adata_subsets | dict | Dictionary mapping subset names to AnnData objects. | required |
viscello_dir | str | Path to the existing VisCello directory. | required |
cluster_key | str | Key in | None |
Returns:
Type | Description |
---|---|
None |
Side Effects
- Reads the existing `clist.rds` file from `viscello_dir`.
- Adds new subsets as `Cello` objects to the clist.
- Saves the updated `clist.rds` file in `viscello_dir`.