Miscellaneous Utilities

These are miscellaneous utility functions.

Dictionary utilities

These are utilities for manipulating dictionaries. See notebook4 for an example of why/how to use them.

are_dicts_equal(dict1: dict, dict2: dict, keys_to_include: List[str] = None, keys_to_exclude: List[str] = None) -> bool

Compare two dictionaries. Returns True if all the compared entries are identical.

Parameters:
  • dict1 – first dictionary to compare

  • dict2 – second dictionary to compare

  • keys_to_include – list of keys to use for the comparison. If None (default), the union of the keys in the two dictionaries is used.

  • keys_to_exclude – list of keys to exclude. If None (default), no keys are excluded.

Returns:

result – True if all the entries corresponding to keys_to_include are identical.

Note

float(1.0) is considered different from int(1).
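
Example (illustrative sketch)

The comparison rules above can be sketched in plain Python. The helper name dicts_equal_sketch is hypothetical and this is not the library's implementation; it only assumes that a type check is what makes float(1.0) differ from int(1).

```python
from typing import List, Optional


def dicts_equal_sketch(
    dict1: dict,
    dict2: dict,
    keys_to_include: Optional[List[str]] = None,
    keys_to_exclude: Optional[List[str]] = None,
) -> bool:
    """Hypothetical re-implementation of the documented behaviour."""
    if keys_to_include is None:
        # Default: compare over the union of the keys of both dictionaries.
        keys = set(dict1) | set(dict2)
    else:
        keys = set(keys_to_include)
    keys -= set(keys_to_exclude or [])

    missing = object()  # sentinel: a missing key never equals a real value
    for key in keys:
        v1 = dict1.get(key, missing)
        v2 = dict2.get(key, missing)
        # Assumption: requiring identical types is what separates 1.0 from 1.
        if type(v1) is not type(v2) or v1 != v2:
            return False
    return True


same = dicts_equal_sketch({"a": 1, "b": 2}, {"a": 1, "b": 3}, keys_to_exclude=["b"])
type_sensitive = dicts_equal_sketch({"a": 1}, {"a": 1.0})
```

Here same is True because the only differing key is excluded, while type_sensitive is False because the two values differ in type.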

concatenate_list_of_dict(list_of_dict) -> dict

Concatenate dictionaries with the same set of keys.

Parameters:

list_of_dict – list of dictionaries to concatenate

Returns:

output_dict – the concatenated dictionary
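
Example (illustrative sketch)

A minimal sketch of key-wise concatenation, assuming every value is a Python list. The helper name concatenate_sketch is hypothetical, and the library may handle other value types (for example tensors or arrays) differently.

```python
from typing import Dict, List


def concatenate_sketch(list_of_dict: List[Dict[str, list]]) -> Dict[str, list]:
    """Hypothetical helper: concatenate list-valued entries key by key."""
    if not list_of_dict:
        return {}
    keys = set(list_of_dict[0])
    output: Dict[str, list] = {key: [] for key in list_of_dict[0]}
    for d in list_of_dict:
        if set(d) != keys:
            raise ValueError("all dictionaries must share the same set of keys")
        for key, value in d.items():
            output[key].extend(value)
    return output


merged = concatenate_sketch([{"x": [1, 2], "y": ["a"]}, {"x": [3], "y": ["b", "c"]}])
# merged == {"x": [1, 2, 3], "y": ["a", "b", "c"]}
```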

flatten_dict(input_dict: dict, separator: str = '_', prefix: str = '')

Flatten a (possibly nested) dictionary

Parameters:
  • input_dict – the input dictionary to flatten

  • separator – string used to merge nested keys. It defaults to “_”

  • prefix – used in the recursive calls. Do not set manually
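
Example (illustrative sketch)

The recursive use of prefix can be illustrated as follows. The helper name flatten_sketch is hypothetical and the exact key format produced by the library is an assumption (nested keys joined by the separator).

```python
def flatten_sketch(input_dict: dict, separator: str = "_", prefix: str = "") -> dict:
    """Hypothetical helper: flatten nested dictionaries into a single level."""
    flat = {}
    for key, value in input_dict.items():
        # Prepend the accumulated prefix, if any, to the current key.
        new_key = f"{prefix}{separator}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse, passing the key built so far as the new prefix.
            flat.update(flatten_sketch(value, separator=separator, prefix=new_key))
        else:
            flat[new_key] = value
    return flat


nested = {"model": {"lr": 0.1, "optim": {"name": "adam"}}, "seed": 0}
flat = flatten_sketch(nested)
# flat == {"model_lr": 0.1, "model_optim_name": "adam", "seed": 0}
```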

inspect_dict(d, prefix: str = '')

Inspect the content of the dictionary

Parameters:
  • d – the dictionary to inspect

  • prefix – used recursively in case of nested dictionary. Do not set it directly.

sort_dict_according_to_indices(input_dict: dict, list_of_indices: List[int]) -> dict

Sort dictionaries w.r.t. a list of indices.

Parameters:
  • input_dict – the dictionary to sort

  • list_of_indices – the indices to use in the sorting.

Returns:

output_dict – the sorted dictionary.

Example

>>> input_dict = {'key': ['b', 'c', 'a']}
>>> list_of_indices = [3, 1, 2]
>>> output_dict = sort_dict_according_to_indices(input_dict, list_of_indices)
>>> print(output_dict) # will be a,b,c

subset_dict(input_dict: dict, mask: torch.Tensor)

Subset all the elements of a dictionary according to a mask

Parameters:
  • input_dict – dictionary with multiple entries in the form of list, numpy.array or torch.Tensor, all with the same leading dimension (N)

  • mask – boolean tensor of shape (N).

Returns:

output_dict – a new dictionary with the subset values
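
Example (illustrative sketch)

A plain-Python sketch of mask-based subsetting. The real function takes a boolean torch.Tensor as mask; here a list of booleans stands in for it, and the helper name subset_sketch is hypothetical.

```python
from typing import Dict, Sequence


def subset_sketch(input_dict: Dict[str, Sequence], mask: Sequence[bool]) -> Dict[str, list]:
    """Hypothetical helper: keep the i-th element of every entry where mask[i] is True."""
    n = len(mask)
    output = {}
    for key, values in input_dict.items():
        if len(values) != n:
            raise ValueError(f"entry {key!r} has length {len(values)}, expected {n}")
        output[key] = [v for v, keep in zip(values, mask) if keep]
    return output


data = {"id": ["p0", "p1", "p2"], "score": [0.1, 0.7, 0.4]}
kept = subset_sketch(data, [True, False, True])
# kept == {"id": ["p0", "p2"], "score": [0.1, 0.4]}; data is left unchanged
```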

subset_dict_non_overlapping_patches(input_dict: dict, key_tissue: str, key_patch_xywh: str = 'patches_xywh', iom_threshold: float = 0.0) -> dict

Subset a dictionary containing overlapping patches to a smaller dictionary containing only (weakly) overlapping ones.

Parameters:
  • input_dict – the dictionary to subset.

  • key_tissue – the dictionary key corresponding to the tissue identifier.

  • key_patch_xywh – the dictionary key corresponding to the coordinates (i.e. x,y,w,h) of the patches.

  • iom_threshold – Threshold value for Intersection Over Minimum (IoM). If two patches have \(\text{IoM} > \text{threshold}\) only one will survive the filtering process. Set iom_threshold = 0 to have a collection of strictly non-overlapping patches.

Returns:

output_dict – Dictionary containing only patches with overlap less than threshold.

Note

The original dictionary will NOT be overwritten.
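
Example (illustrative sketch)

Intersection Over Minimum is the intersection area of two patches divided by the area of the smaller one. The sketch below computes it for (x, y, w, h) boxes and applies a greedy filter. Both helper names are hypothetical, and the keep-the-first-patch order is an assumption: the library may choose which patch survives differently, and it also groups patches by tissue, which is omitted here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), with w > 0 and h > 0


def intersection_over_minimum(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return (inter_w * inter_h) / min(aw * ah, bw * bh)


def greedy_filter(boxes: List[Box], iom_threshold: float = 0.0) -> List[int]:
    """Assumed strategy: keep a box only if its IoM with every kept box is <= threshold."""
    kept: List[int] = []
    for i, box in enumerate(boxes):
        if all(intersection_over_minimum(box, boxes[j]) <= iom_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (5, 5, 10, 10), (20, 20, 10, 10)]
iom_01 = intersection_over_minimum(boxes[0], boxes[1])  # 25 / 100 = 0.25
kept_strict = greedy_filter(boxes, iom_threshold=0.0)   # [0, 2]
kept_loose = greedy_filter(boxes, iom_threshold=0.5)    # [0, 1, 2]
```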

transfer_annotations_between_dict(source_dict: dict, dest_dict: dict, annotation_keys: List[Any], anchor_key: Any, metric: str = 'euclidean') -> dict

Transfer the annotations from the source dictionary to the destination dictionary. For each element in the destination dictionary it finds the closest element in the source dictionary and copies the annotations from there. Closeness is defined as the metric distance between the anchor elements.

Parameters:
  • source_dict – source dictionary from which the annotations will be read

  • dest_dict – destination dictionary where the annotation will be written

  • annotation_keys – List of keys. It is assumed that these keys are present in the source dictionary.

  • anchor_key – The key of the element to be used to measure distances. It must be present in BOTH source and destination dictionaries.

  • metric – the distance metric to measure distance between elements in the source and destination dictionaries. It defaults to ‘euclidean’.

Returns:

dict – The updated destination dictionary
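
Example (illustrative sketch)

The nearest-neighbour transfer described above can be sketched with the Euclidean metric only. The helper name transfer_sketch and the keys "xy" and "cell_type" are hypothetical, and the brute-force search is an assumption made for brevity, not the library's method.

```python
import math
from typing import Any, Dict, List


def transfer_sketch(
    source: Dict[Any, list],
    dest: Dict[Any, list],
    annotation_keys: List[Any],
    anchor_key: Any,
) -> Dict[Any, list]:
    """Hypothetical helper: copy annotations from the nearest source element."""
    updated = dict(dest)  # shallow copy, so dest itself gains no new keys
    for key in annotation_keys:
        updated[key] = []
    for anchor in dest[anchor_key]:
        # Brute-force Euclidean nearest neighbour among the source anchors.
        distances = [math.dist(anchor, candidate) for candidate in source[anchor_key]]
        nearest = distances.index(min(distances))
        for key in annotation_keys:
            updated[key].append(source[key][nearest])
    return updated


source = {"xy": [(0.0, 0.0), (10.0, 10.0)], "cell_type": ["neuron", "glia"]}
dest = {"xy": [(9.0, 9.5), (1.0, -1.0), (6.0, 6.0)]}
result = transfer_sketch(source, dest, annotation_keys=["cell_type"], anchor_key="xy")
# result["cell_type"] == ["glia", "neuron", "glia"]
```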

Validation utilities

These are utilities used during validation to analyze the embeddings. See notebook4 for an example of why/how to use them.

class SmartUmap(*args: Any, **kwargs: Any)

Wrapper around standard UMAP with get_graph() exposed.

__init__(preprocess_strategy: str, compute_all_pairwise_distances: bool = False, **kargs)
Parameters:
  • preprocess_strategy – str, can be ‘center’, ‘z_score’, ‘raw’. This is the operation to perform before UMAP

  • compute_all_pairwise_distances – bool, if True (default is False) compute all pairwise distances

  • **kargs – All the arguments that standard UMAP can accept

fit(data, y=None) -> SmartUmap

Fit the Umap given the data

Parameters:

data – array of shape \((n, p)\) where n are the points and p the features

fit_transform(data, y=None) -> numpy.ndarray

Utility method which internally calls fit() and transform()

get_distances() -> torch.Tensor

Returns the symmetric (dense) matrix with the DISTANCES between elements

get_graph() -> scipy.sparse.coo_matrix

Returns the symmetric (sparse) matrix with the SIMILARITIES between elements

transform(data) -> numpy.ndarray

Use the previously fitted model (including the mean and std for centering and scaling the data) to transform the embeddings.

Parameters:

data – array of shape \((n, p)\) to transform

Returns:

embeddings – numpy.ndarray of shape (n_sample, n_components)

class SmartLeiden(graph: coo_matrix, directed: bool = True)

Wrapper around standard Leiden algorithm. It can be initialized using the output of the SmartUmap.get_graph()

__init__(graph: coo_matrix, directed: bool = True)
Parameters:
  • graph – Usually a sparse matrix with the similarities among nodes describing the graph

  • directed – if True (default) builds a directed graph.

Note

The matrix obtained by the UMAP algorithm is symmetric; in that case directed should be set to True.

cluster(resolution: float = 1.0, use_weights: bool = True, random_state: int = 0, n_iterations: int = -1, partition_type: str = 'RBC') -> numpy.ndarray

Find the clusters in the data

Parameters:
  • resolution – resolution parameter controlling (indirectly) the number of clusters

  • use_weights – if True (default) the graph is weighted, i.e. the edges have different strengths

  • random_state – controls the random state, for reproducibility

  • n_iterations – how many iterations of the greedy algorithm to perform. If -1 (default) it iterates until convergence.

  • partition_type – The metric to optimize to find clusters. Either ‘CPM’ or ‘RBC’.

Returns:

labels – the integer cluster labels

class SmartPca(preprocess_strategy: str)

Return the PCA embeddings.

__init__(preprocess_strategy: str)
Parameters:

preprocess_strategy – str, can be ‘center’, ‘z_score’, ‘raw’. This is the operation to perform before PCA

property explained_variance_

For compatibility with scikit-learn.

property explained_variance_ratio_

For compatibility with scikit-learn.

fit(data) -> SmartPca

Fit the PCA given the data. It automatically selects the algorithm based on the number of features.

Parameters:

data – array of shape \((n, p)\) where n are the points and p the features

fit_transform(data, n_components: int | float = None) -> numpy.ndarray

Utility method which internally calls fit() and transform().

Parameters:
  • data – tensor of shape \((n, p)\)

  • n_components – If an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None (default), the previously used value is reused.

Returns:

data_transformed – array of shape \((n, q)\)

transform(data, n_components: int | float = None) -> numpy.ndarray

Use a previously fitted model to transform the data.

Parameters:
  • data – tensor of shape \((n, p)\) where n is the number of points and p is the number of features

  • n_components – If an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None, the previously used value is reused.
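
Example (illustrative sketch)

The float form of n_components can be read as "keep the smallest number of leading components whose cumulative explained variance ratio reaches the requested value". The helper name select_n_components is hypothetical and this is an assumption about the selection rule, not the library's code.

```python
from typing import List, Union


def select_n_components(explained_variance_ratio: List[float], n_components: Union[int, float]) -> int:
    """Hypothetical helper: resolve n_components into an integer dimensionality."""
    if isinstance(n_components, int):
        return n_components
    cumulative = 0.0
    for q, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= n_components:
            return q  # smallest q reaching the requested explained variance
    return len(explained_variance_ratio)


ratios = [0.60, 0.25, 0.10, 0.05]  # made-up values, sorted in decreasing order
q_80 = select_n_components(ratios, 0.80)  # 2, since 0.60 + 0.25 >= 0.80
q_90 = select_n_components(ratios, 0.90)  # 3
q_int = select_n_components(ratios, 3)    # 3, integers are used as-is
```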

class SmartScaler(quantiles: Tuple[float, float], clamp: bool)

Scale the values using the median and quantiles (which are robust versions of the mean and variance). \(data = (data - median) / scale\)

If clamp=True, each feature is clamped to the quantile range before applying the transformation. This is a simple way to deal with the outliers.

It does not deal with the situation in which outliers are inside the “box” of the acceptable range but far from the reduced manifold (for example, a single point o lying apart from the band formed by the points x).
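
Example (illustrative sketch)

A single-feature sketch of the scaling formula. Two things are assumptions here: the scale is taken to be the inter-quantile range (q_high - q_low), and the quantiles use linear interpolation. The helper names quantile and robust_scale are hypothetical.

```python
import statistics
from typing import List, Tuple


def quantile(sorted_values: List[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted list."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] * (1.0 - fraction) + sorted_values[high] * fraction


def robust_scale(values: List[float], quantiles: Tuple[float, float], clamp: bool) -> List[float]:
    """Hypothetical helper: (value - median) / scale, optionally clamping first."""
    ordered = sorted(values)
    q_low = quantile(ordered, quantiles[0])
    q_high = quantile(ordered, quantiles[1])
    median = statistics.median(values)
    scale = q_high - q_low  # assumption: the scale is the inter-quantile range
    if clamp:
        # Clamp each value to [q_low, q_high] before applying the transformation.
        values = [min(max(v, q_low), q_high) for v in values]
    return [(v - median) / scale for v in values]


feature = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an obvious outlier
scaled_clamped = robust_scale(feature, quantiles=(0.25, 0.75), clamp=True)
scaled_raw = robust_scale(feature, quantiles=(0.25, 0.75), clamp=False)
# scaled_clamped == [-0.5, -0.5, 0.0, 0.5, 0.5]; without clamping the outlier maps to 48.5
```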

__init__(quantiles: Tuple[float, float], clamp: bool)
Parameters:
  • quantiles – The lower and upper quantiles used to scale the data. Both must be in (0.0, 1.0)

  • clamp – If True, the data is clamped to the range [q_low, q_high] before scaling.

fit(data) -> SmartScaler

Fit the data (i.e. computes quantiles and median)

fit_transform(data) -> numpy.ndarray

Utility method which internally calls fit() and transform()

transform(data) -> numpy.ndarray

Transform the data

Parameters:

data – tensor of shape \((n, p)\)

Returns:

out – tensor of the same shape as data with the scaled values.

compute_distance_embedding(ref_embeddings: torch.Tensor, other_embeddings: torch.Tensor, metric: str, temperature: float = 0.5) -> torch.Tensor

Compute distance between embeddings

Parameters:
  • ref_embeddings – torch.Tensor of shape \((*, k)\) where k is the dimension of the embedding

  • other_embeddings – torch.Tensor of shape \((n, k)\)

  • temperature – float, the temperature used to compute contrastive distance

  • metric – Can be either ‘contrastive’ or ‘euclidean’

Returns:

dist – distance of shape \((*, n)\)
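
Example (illustrative sketch)

For the ‘euclidean’ metric, the output shape \((*, n)\) can be illustrated with nested lists: one row per reference embedding and one column per other embedding. The helper name pairwise_euclidean is hypothetical. The ‘contrastive’ metric is not sketched, since its formula is not given here.

```python
import math
from typing import List


def pairwise_euclidean(ref: List[List[float]], other: List[List[float]]) -> List[List[float]]:
    """Hypothetical helper: distance matrix of shape (len(ref), len(other))."""
    return [[math.dist(r, o) for o in other] for r in ref]


ref = [[0.0, 0.0], [3.0, 4.0]]                  # shape (2, k) with k = 2
other = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]    # shape (n, k) with n = 3
dist = pairwise_euclidean(ref, other)           # shape (2, 3)
# approximately [[0, 5, 10], [5, 0, 5]]
```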

get_percentile(data: torch.Tensor | numpy.ndarray, dim: int) -> torch.Tensor | numpy.ndarray

Takes some data and converts it into a percentile (in [0.0, 1.0]) along a specified dimension. Useful to convert a tensor into the range [0.0, 1.0] for visualization.

Parameters:
  • data – input data to convert to percentile in [0,1].

  • dim – the dimension along which to compute the quantiles

Returns:

percentile – torch.tensor or numpy.array (depending on the input type) with the same shape as the input with the percentile values. A percentile of 0.9 means that 90% of the input values were smaller.
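
Example (illustrative sketch)

A one-dimensional sketch of the statement "a percentile of 0.9 means that 90% of the input values were smaller". The helper name percentile_sketch is hypothetical, and the normalization (dividing by the number of values, with ties counted as not smaller) is an assumption; the library may normalize so that the largest value maps exactly to 1.0.

```python
from typing import List


def percentile_sketch(values: List[float]) -> List[float]:
    """Hypothetical helper: fraction of the inputs strictly smaller than each value."""
    n = len(values)
    return [sum(other < v for other in values) / n for v in values]


p = percentile_sketch([10.0, 30.0, 20.0, 40.0])
# p == [0.0, 0.5, 0.25, 0.75]
```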

get_z_score(x: torch.Tensor, dim: int) -> torch.Tensor

Standardize vector by removing the mean and scaling to unit variance

Parameters:
  • x – torch.Tensor

  • dim – the dimension along which to compute the mean and std

Returns:

The z-score, i.e. z = (x - mean) / std
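
Example (illustrative sketch)

The formula z = (x - mean) / std for a single vector, in plain Python. The helper name z_score_sketch is hypothetical, and using the sample standard deviation (n - 1 in the denominator) is an assumption.

```python
import statistics
from typing import List


def z_score_sketch(x: List[float]) -> List[float]:
    """Hypothetical helper: standardize a vector to zero mean and unit variance."""
    mean = statistics.fmean(x)
    std = statistics.stdev(x)  # assumption: sample standard deviation (n - 1)
    return [(v - mean) / std for v in x]


z = z_score_sketch([2.0, 4.0, 6.0])
# mean is 4 and the sample std is 2, so z is approximately [-1, 0, 1]
```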

inverse_one_hot(image_in, bg_label: int = -1, dim: int = -3, threshold: float = 0.1)

Takes a float tensor and computes the argmax and max_value along the specified dimension. Returns an integer tensor of the same shape as the input tensor but with the dim removed. If the max_value is less than the threshold, the bg_label is assigned.

Note

It can take an image of size \((C, W, H)\) and generate an integer mask of size \((W, H)\). This operation can be thought of as the inverse of the one-hot operation, which takes an integer tensor of size (n) and returns a float tensor with an extra dimension, for example (n, num_classes).

Parameters:
  • image_in – any float tensor

  • bg_label – integer, the value assigned to the entries whose max_value is smaller than the threshold

  • dim – int, the dimension along which to compute the max. For images this is usually the channel dimension, i.e. -3.

  • threshold – float, the value of the threshold. Entries whose max_value is smaller than this are assigned to the background

Returns:

out – An integer mask with the same size as the input tensor but with the dim removed.
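
Example (illustrative sketch)

The argmax-with-threshold rule on a tiny \((C, W, H)\) image, written with nested lists instead of tensors. The helper name inverse_one_hot_sketch is hypothetical, and the channel dimension is fixed to the first axis for simplicity.

```python
from typing import List


def inverse_one_hot_sketch(
    image: List[List[List[float]]],
    bg_label: int = -1,
    threshold: float = 0.1,
) -> List[List[int]]:
    """Hypothetical helper: (C, W, H) float image -> (W, H) integer mask."""
    n_channels = len(image)
    width, height = len(image[0]), len(image[0][0])
    mask = []
    for i in range(width):
        row = []
        for j in range(height):
            values = [image[c][i][j] for c in range(n_channels)]
            best = max(range(n_channels), key=values.__getitem__)  # argmax over channels
            # Assign the background label when the max value is below the threshold.
            row.append(best if values[best] >= threshold else bg_label)
        mask.append(row)
    return mask


image = [
    [[0.9, 0.05], [0.2, 0.0]],  # channel 0
    [[0.1, 0.02], [0.7, 0.0]],  # channel 1
]
mask = inverse_one_hot_sketch(image)
# mask == [[0, -1], [1, -1]]
```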