
Utilities

The utils module is also available under the shorter alias ul; for example, concord.utils.list_adata_files can be called as concord.ul.list_adata_files.

concord.utils.select_features(adata, n_top_features=2000, flavor='seurat_v3', filter_gene_by_counts=False, normalize=False, log1p=False, grouping='cluster', emb_key='X_pca', k=512, knn_samples=100, gini_cut_qt=None, save_path=None, figsize=(10, 3), subsample_frac=1.0, random_state=0)

Selects top informative features from an AnnData object.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing gene expression data.

required
n_top_features int

Number of top features to select. Defaults to 2000.

2000
flavor str

Feature selection method. Options: 'seurat_v3' (highly variable gene selection based on Seurat v3) or 'iff' (Informative Feature Filtering, IFF). Defaults to 'seurat_v3'.

'seurat_v3'
filter_gene_by_counts Union[int, bool]

Minimum count threshold for feature filtering. Defaults to False.

False
normalize bool

Whether to normalize the data before feature selection. Defaults to False.

False
log1p bool

Whether to apply log1p transformation before feature selection. Defaults to False.

False
grouping Union[str, Series, List[str]]

Clustering/grouping strategy for IFF method. Defaults to 'cluster'.

'cluster'
emb_key str

Embedding key in adata.obsm used for clustering. Defaults to 'X_pca'.

'X_pca'
k int

Number of neighbors for k-NN if grouping='knn'. Defaults to 512.

512
knn_samples int

Number of k-NN samples if grouping='knn'. Defaults to 100.

100
gini_cut_qt float

Quantile threshold for selecting features by Gini coefficient in IFF. Defaults to None.

None
save_path Optional[Union[str, Path]]

Path to save Gini coefficient plot. Defaults to None.

None
figsize tuple

Size of Gini coefficient plot. Defaults to (10, 3).

(10, 3)
subsample_frac float

Fraction of data to subsample for feature selection. Defaults to 1.0.

1.0
random_state int

Random seed for reproducibility. Defaults to 0.

0

Returns:

Type Description
List[str]

List of selected feature names.
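
The gini_cut_qt parameter above refers to a quantile threshold on Gini coefficients. The source does not say what the coefficient is computed over or which side of the threshold is kept, so the following is only a minimal sketch of the general idea, assuming the coefficient is taken over a gene's mean expression across groups and that genes at or above the quantile are retained. The names gini, quantile_cutoff, and group_means are hypothetical and are not part of concord.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = uniform, larger = more concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def quantile_cutoff(scores, q):
    """Linear-interpolation quantile of a list of scores, for 0 <= q <= 1."""
    xs = sorted(scores)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


# Hypothetical per-group mean expression for three genes across four groups.
group_means = {
    "gene_uniform": [1.0, 1.0, 1.0, 1.0],
    "gene_marker": [0.0, 0.0, 0.0, 1.0],
    "gene_mixed": [1.0, 2.0, 3.0, 4.0],
}
gini_scores = {gene: gini(means) for gene, means in group_means.items()}

# Assumption: keep genes whose score is at or above the chosen quantile.
cutoff = quantile_cutoff(list(gini_scores.values()), 0.5)
selected = [gene for gene, score in gini_scores.items() if score >= cutoff]
```

A gene expressed evenly across groups scores 0, while a gene confined to a single group scores close to 1, which is why a Gini-based cut tends to favor group-specific features.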

concord.utils.generate_synthetic_doublets(adata, doublet_synth_ratio, seed, batch_key, droplet_type_key, mean=0.5, var=0.1, clip_range=(0.2, 0.8), plot_histogram=True, combine_with_original=False)

Generate synthetic doublets from singlet data in an AnnData object within each batch.

Parameters:

Name Type Description Default
adata

AnnData object containing the singlet data (possibly including unclassified doublets)

required
doublet_synth_ratio

float, the ratio of synthetic doublets to true singlets

required
seed

int, random seed for reproducibility

required
batch_key

str, the key in .obs indicating batch information

required
droplet_type_key

str, the key in .obs indicating droplet type

required
mean

float, mean of the normal distribution for generating fractions (default: 0.5)

0.5
var

float, variance of the normal distribution for generating fractions (default: 0.1)

0.1
clip_range

tuple, range to clip the generated fractions (default: (0.2, 0.8))

(0.2, 0.8)
plot_histogram

bool, whether to plot the histogram of synthetic doublet fractions

True
combine_with_original

bool, whether to combine the synthetic doublets with the original data (default: False)

False

Returns:

Name Type Description
adata_synthetic_doublets

AnnData object containing the synthetic doublets
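
The mean, var, and clip_range parameters describe how the doublet fractions are drawn: from a normal distribution, then clipped to a range. The sketch below illustrates that sampling step with the standard library only. It assumes that each fraction is used as a linear mixing weight between two singlet profiles, which the source does not state, and the function names sample_mixing_fractions and mix_profiles are hypothetical rather than concord's own.

```python
import random


def sample_mixing_fractions(n, mean=0.5, var=0.1, clip_range=(0.2, 0.8), seed=0):
    """Draw n fractions from Normal(mean, var) and clip them to clip_range."""
    rng = random.Random(seed)
    sd = var ** 0.5  # random.gauss takes a standard deviation, not a variance
    low, high = clip_range
    return [min(max(rng.gauss(mean, sd), low), high) for _ in range(n)]


def mix_profiles(cell_a, cell_b, fraction):
    """Assumed mixing rule: fraction * a + (1 - fraction) * b, element-wise."""
    return [fraction * a + (1.0 - fraction) * b for a, b in zip(cell_a, cell_b)]


fractions = sample_mixing_fractions(5)
# Two toy singlet count vectors combined into one synthetic doublet.
doublet = mix_profiles([10, 0, 4], [0, 6, 4], fractions[0])
```

Clipping keeps every synthetic doublet a genuine mixture: with the default range, neither parent cell contributes less than 20% of the profile.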

concord.utils.list_adata_files(folder_path, substring=None, extension='*.h5ad')

Recursively list all files in a directory that match the given extension pattern (by default .h5ad) and, optionally, a given substring.

Parameters:

Name Type Description Default
folder_path

str Path to the folder where .h5ad files are located.

required
substring

str, optional A substring to filter filenames (default is None, meaning no filtering).

None
extension

str, optional File extension to search for (default is "*.h5ad").

'*.h5ad'

Returns:

Type Description

list A list of file paths matching the criteria.
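
The behavior described above (recursive search by extension pattern, optional substring filter) can be approximated with pathlib. This is a sketch of equivalent logic, not concord's implementation: the function name list_files_sketch is hypothetical, and matching the substring against the file name rather than the full path is an assumption.

```python
import tempfile
from pathlib import Path


def list_files_sketch(folder_path, substring=None, extension="*.h5ad"):
    """Recursively collect files matching a glob pattern, optionally filtered by substring."""
    paths = sorted(Path(folder_path).rglob(extension))
    if substring is not None:
        # Assumption: the substring is matched against the file name only.
        paths = [p for p in paths if substring in p.name]
    return [str(p) for p in paths]


# Demonstration on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("pbmc_batch1.h5ad", "nested/pbmc_batch2.h5ad",
                 "nested/liver.h5ad", "notes.txt"):
        (root / name).touch()
    all_files = [Path(p).name for p in list_files_sketch(root)]
    pbmc_files = [Path(p).name for p in list_files_sketch(root, substring="pbmc")]
```

Here all_files holds the three .h5ad files from both directory levels, and pbmc_files holds only the two whose names contain "pbmc".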

concord.utils.read_and_concatenate_adata(adata_files, merge='unique', add_dataset_col=False, dataset_col_name='dataset', output_file=None)

Read and concatenate multiple AnnData .h5ad files into a single AnnData object.

Parameters:

Name Type Description Default
adata_files

list List of file paths to .h5ad files to be concatenated.

required
merge

str, optional How to handle conflicting columns, e.g., 'unique' (default), 'first', etc.

'unique'
add_dataset_col

bool, optional Whether to add a new column in adata.obs identifying the source dataset.

False
dataset_col_name

str, optional Name of the new column storing dataset names.

'dataset'
output_file

str, optional Path to save the concatenated AnnData object. If None, the object is not saved.

None

Returns:

Type Description

ad.AnnData The concatenated AnnData object.

concord.utils.filter_and_copy_attributes(adata_target, adata_source)

Filter adata_target to match the cells in adata_source, then copy .obs and .obsm.

Parameters:

Name Type Description Default
adata_target

ad.AnnData The AnnData object to be filtered.

required
adata_source

ad.AnnData The reference AnnData object containing the desired cells and attributes.

required

Returns:

Type Description

ad.AnnData The filtered AnnData object with updated .obs and .obsm.

concord.utils.ensure_categorical(adata, obs_key=None, drop_unused=True)

Convert an .obs column to categorical dtype.

Parameters:

Name Type Description Default
adata

ad.AnnData The AnnData object.

required
obs_key

str Column in .obs to be converted to categorical.

None
drop_unused

bool, optional Whether to remove unused categories (default is True).

True

concord.utils.save_obsm_to_hdf5(adata, filename)

Save the .obsm attribute of an AnnData object to an HDF5 file.

Parameters:

Name Type Description Default
adata

anndata.AnnData The AnnData object containing the .obsm attribute to be saved.

required
filename

str The path to the HDF5 file where .obsm data will be stored.

required

Returns:

Type Description

None Saves .obsm data to the specified HDF5 file.

concord.utils.load_obsm_from_hdf5(filename)

Load the .obsm attribute from an HDF5 file.

Parameters:

Name Type Description Default
filename

str Path to the HDF5 file containing .obsm data.

required

Returns:

Type Description

dict A dictionary where keys are .obsm names and values are corresponding matrices.

concord.utils.subset_adata_to_obsm_indices(adata, obsm)

Subset an AnnData object to match the indices present in .obsm.

Parameters:

Name Type Description Default
adata

anndata.AnnData The original AnnData object.

required
obsm

dict A dictionary containing .obsm data, where keys are embedding names, and values are arrays.

required

Returns:

Type Description

anndata.AnnData A subsetted AnnData object that contains only the indices available in .obsm.

concord.utils.anndata_to_viscello(adata, output_dir, project_name='MyProject', organism='hsa', clist_only=False)

Converts an AnnData object to a VisCello project directory.

Parameters:

Name Type Description Default
adata AnnData

AnnData object containing single-cell data.

required
output_dir str

Directory where the VisCello project will be created.

required
project_name str

Name of the project. Defaults to "MyProject".

'MyProject'
organism str

Organism code (e.g., 'hsa' for human). Defaults to 'hsa'.

'hsa'
clist_only bool

Whether to generate only the clist file. Defaults to False.

False

Returns:

Type Description

None

Side Effects
  • Creates a directory with the necessary files for VisCello.
  • Saves eset.rds (ExpressionSet), config.yml, and clist.rds.

concord.utils.update_clist_with_subsets(global_adata, adata_subsets, viscello_dir, cluster_key=None)

Updates an existing VisCello clist with new subsets.

Parameters:

Name Type Description Default
global_adata AnnData

The full AnnData object.

required
adata_subsets dict

Dictionary mapping subset names to AnnData objects.

required
viscello_dir str

Path to the existing VisCello directory.

required
cluster_key str

Key in adata.obs for cluster assignments. Defaults to None.

None

Returns:

Type Description

None

Side Effects
  • Reads the existing clist.rds file from viscello_dir.
  • Adds new subsets as Cello objects to the clist.
  • Saves the updated clist.rds file in viscello_dir.