Miscellaneous Utilities¶
These are miscellaneous utility functions.
Dictionary utilities¶
These are utilities for manipulating dictionaries. See notebook4 for an example of why/how to use them.
- are_dicts_equal(dict1: dict, dict2: dict, keys_to_include: List[str] = None, keys_to_exclude: List[str] = None) bool[source]¶
Compare two dictionaries. Returns True if all compared entries are identical.
- Parameters:
dict1 – first dictionary to compare
dict2 – second dictionary to compare
keys_to_include – list of keys to use for the comparison. If None (default), the union of the keys in the two dictionaries is used.
keys_to_exclude – list of keys to exclude. If None (default), no keys are excluded.
- Returns:
result – True if all the entries corresponding to keys_to_include are identical.
Note
float(1.0) is considered different from int(1).
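The comparison rules above can be illustrated with a minimal sketch. The function name `dicts_equal_sketch` is hypothetical, and treating a key that is missing from either dictionary as a mismatch is an assumption, not documented behaviour.

```python
from typing import List, Optional


def dicts_equal_sketch(
    dict1: dict,
    dict2: dict,
    keys_to_include: Optional[List[str]] = None,
    keys_to_exclude: Optional[List[str]] = None,
) -> bool:
    """Hypothetical re-implementation of the comparison described above."""
    if keys_to_include is None:
        # Default: compare over the union of the keys of both dictionaries.
        keys = set(dict1) | set(dict2)
    else:
        keys = set(keys_to_include)
    keys -= set(keys_to_exclude or [])

    for key in keys:
        # Assumption: a key missing from either dictionary is a mismatch.
        if key not in dict1 or key not in dict2:
            return False
        a, b = dict1[key], dict2[key]
        # Type-strict comparison: float(1.0) and int(1) are different.
        if type(a) is not type(b) or a != b:
            return False
    return True
```

Excluding a key that differs, for instance `dicts_equal_sketch({"a": 1, "b": 2}, {"a": 1, "b": 3}, keys_to_exclude=["b"])`, makes the two dictionaries compare as equal.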
- concatenate_list_of_dict(list_of_dict) dict[source]¶
Concatenate dictionaries that share the same set of keys
- Parameters:
list_of_dict – list of dictionaries to concatenate
- Returns:
output_dict – the concatenated dictionary
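As a rough illustration, the sketch below concatenates dictionaries key by key. It assumes every value is a plain list. The real function may also accept arrays or tensors, and the name `concatenate_sketch` is hypothetical.

```python
from typing import Dict, List


def concatenate_sketch(list_of_dict: List[Dict[str, list]]) -> Dict[str, list]:
    """Hypothetical sketch: join list-valued dictionaries key by key."""
    if not list_of_dict:
        return {}

    expected_keys = set(list_of_dict[0])
    output: Dict[str, list] = {key: [] for key in list_of_dict[0]}

    for d in list_of_dict:
        # All dictionaries must share the same set of keys.
        if set(d) != expected_keys:
            raise ValueError("all dictionaries must share the same keys")
        for key, values in d.items():
            output[key].extend(values)
    return output
```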
- flatten_dict(input_dict: dict, separator: str = '_', prefix: str = '')[source]¶
Flatten a (possibly nested) dictionary
- Parameters:
input_dict – the input dictionary to flatten
separator – string used to merge nested keys. It defaults to “_”
prefix – used in the recursive calls. Do not set manually
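A minimal sketch of the flattening behaviour described above. The name `flatten_sketch` is hypothetical, and joining nested keys as `parent + separator + child` is an assumption based on the parameter descriptions.

```python
def flatten_sketch(input_dict: dict, separator: str = "_", prefix: str = "") -> dict:
    """Hypothetical sketch: flatten nested dictionaries into a single level."""
    flat = {}
    for key, value in input_dict.items():
        # Prepend the accumulated prefix (empty at the top level).
        new_key = f"{prefix}{separator}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the merged key along.
            flat.update(flatten_sketch(value, separator, new_key))
        else:
            flat[new_key] = value
    return flat
```

Under these assumptions, `flatten_sketch({"a": 1, "b": {"c": 2, "d": {"e": 3}}})` gives `{"a": 1, "b_c": 2, "b_d_e": 3}`.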
- inspect_dict(d, prefix: str = '')[source]¶
Inspect the content of the dictionary
- Parameters:
d – the dictionary to inspect
prefix – used recursively in case of nested dictionary. Do not set it directly.
- sort_dict_according_to_indices(input_dict: dict, list_of_indices: List[int]) dict[source]¶
Sort dictionaries w.r.t. a list of indices.
- Parameters:
input_dict – the dictionary to sort
list_of_indices – the indices to use in the sorting.
- Returns:
output_dict – the sorted dictionary.
Example
>>> input_dict = {'key': ['b', 'c', 'a']}
>>> list_of_indices = [3, 1, 2]
>>> output_dict = sort_dict_according_to_indices(input_dict, list_of_indices)
>>> print(output_dict)  # will be a,b,c
- subset_dict(input_dict: dict, mask: torch.Tensor)[source]¶
Subset all the elements of a dictionary according to a mask
- Parameters:
input_dict – dictionary with multiple entries in the form of list, numpy.array or torch.Tensor with the same leading dimension (N)
mask – boolean tensor of shape (N).
- Returns:
output_dict – a new dictionary with the subset values
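The masking step can be sketched with plain Python lists. This stand-in uses a list of booleans instead of a `torch.Tensor` mask, and the name `subset_sketch` is hypothetical.

```python
from typing import Dict, Sequence


def subset_sketch(input_dict: Dict[str, Sequence], mask: Sequence[bool]) -> Dict[str, list]:
    """Hypothetical sketch: keep, for every key, the elements where mask is True."""
    n = len(mask)
    output = {}
    for key, values in input_dict.items():
        # Every entry must have the same leading dimension as the mask.
        if len(values) != n:
            raise ValueError(f"entry {key!r} has length {len(values)}, expected {n}")
        output[key] = [value for value, keep in zip(values, mask) if keep]
    return output
```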
- subset_dict_non_overlapping_patches(input_dict: dict, key_tissue: str, key_patch_xywh: str = 'patches_xywh', iom_threshold: float = 0.0) dict[source]¶
Subset a dictionary containing overlapping patches to a smaller dictionary containing only (weakly) overlapping ones.
- Parameters:
input_dict – the dictionary to subset.
key_tissue – the dictionary key corresponding to the tissue identifier.
key_patch_xywh – the dictionary key corresponding to the coordinates (i.e. x,y,w,h) of the patches.
iom_threshold – Threshold value for Intersection Over Minimum (IoM). If two patches have \(\text{IoM} > \text{threshold}\) only one will survive the filtering process. Set iom_threshold = 0 to have a collection of strictly non-overlapping patches.
- Returns:
output_dict – Dictionary containing only patches with overlap less than threshold.
Note
The original dictionary will NOT be overwritten.
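For intuition, the sketch below computes IoM for two axis-aligned boxes given as (x, y, w, h) and applies a simple greedy filter. The greedy first-come strategy and both function names are assumptions for illustration. The library may choose the surviving patch differently.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def intersection_over_minimum(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    min_area = min(aw * ah, bw * bh)
    return (inter_w * inter_h) / min_area if min_area > 0 else 0.0


def filter_patches_sketch(patches: List[Box], iom_threshold: float = 0.0) -> List[Box]:
    """Greedy filter (assumption): keep a patch only if its IoM with every
    previously kept patch does not exceed the threshold."""
    kept: List[Box] = []
    for patch in patches:
        if all(intersection_over_minimum(patch, other) <= iom_threshold for other in kept):
            kept.append(patch)
    return kept
```

With `iom_threshold=0.0` only patches that do not overlap any kept patch survive, which matches the "strictly non-overlapping" case described above.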
- transfer_annotations_between_dict(source_dict: dict, dest_dict: dict, annotation_keys: List[Any], anchor_key: Any, metric: str = 'euclidean') dict[source]¶
Transfer the annotations from the source dictionary to the destination dictionary. For each element in the destination dictionary, it finds the closest element in the source dictionary and copies the annotations from there. Closeness is defined as the metric distance between the anchor elements.
- Parameters:
source_dict – source dictionary from which the annotations will be read
dest_dict – destination dictionary where the annotation will be written
annotation_keys – List of keys. It is assumed that these keys are present in the source_dictionary
anchor_key – The key of the element to be used to measure distances. It must be present in BOTH source and destination dictionaries.
metric – the distance metric used to measure the distance between elements in the source and destination dictionaries. It defaults to ‘euclidean’.
- Returns:
dict – The updated destination dictionary
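The nearest-neighbour transfer can be sketched as follows, using NumPy and the Euclidean metric only. The name `transfer_sketch`, the list-based annotations, and returning a shallow copy of the destination dictionary are all assumptions made for this illustration.

```python
import numpy as np


def transfer_sketch(source_dict: dict, dest_dict: dict, annotation_keys: list, anchor_key) -> dict:
    """Hypothetical sketch: copy annotations from the nearest source element."""
    src = np.asarray(source_dict[anchor_key], dtype=float)  # shape (n_src, k)
    dst = np.asarray(dest_dict[anchor_key], dtype=float)    # shape (n_dst, k)

    # Pairwise Euclidean distances, shape (n_dst, n_src).
    dist = np.linalg.norm(dst[:, None, :] - src[None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)  # index of the closest source element

    output = dict(dest_dict)  # shallow copy; the input is left untouched
    for key in annotation_keys:
        output[key] = [source_dict[key][i] for i in nearest]
    return output
```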
Validation utilities¶
These are utilities used during validation to analyze the embeddings. See notebook4 for an example of why/how to use them.
- class SmartUmap(*args: Any, **kwargs: Any)[source]¶
Wrapper around standard UMAP with get_graph() exposed.
- __init__(preprocess_strategy: str, compute_all_pairwise_distances: bool = False, **kargs)[source]¶
- Parameters:
preprocess_strategy – str, can be ‘center’, ‘z_score’, ‘raw’. This is the operation to perform before UMAP
compute_all_pairwise_distances – bool, if True (default is False) compute all pairwise distances
**kargs – All the arguments that standard UMAP can accept
- fit(data, y=None) SmartUmap[source]¶
Fit the Umap given the data
- Parameters:
data – array of shape \((n, p)\) where n are the points and p the features
- fit_transform(data, y=None) numpy.ndarray[source]¶
Utility method which internally calls fit() and transform().
- get_distances() torch.Tensor[source]¶
Returns the symmetric (dense) matrix with the DISTANCES between elements
- class SmartLeiden(graph: coo_matrix, directed: bool = True)[source]¶
Wrapper around standard Leiden algorithm. It can be initialized using the output of SmartUmap.get_graph().
- __init__(graph: coo_matrix, directed: bool = True)[source]¶
- Parameters:
graph – Usually a sparse matrix with the similarities among nodes describing the graph
directed – if True (default) builds a directed graph.
Note
The matrix obtained by the UMAP algorithm is symmetric, in that case directed should be set to True
- cluster(resolution: float = 1.0, use_weights: bool = True, random_state: int = 0, n_iterations: int = -1, partition_type: str = 'RBC') numpy.ndarray[source]¶
Find the clusters in the data
- Parameters:
resolution – resolution parameter controlling (indirectly) the number of clusters
use_weights – if True (default) the graph is weighted, i.e. the edges have different strengths
random_state – controls the random state, for reproducibility
n_iterations – how many iterations of the greedy algorithm to perform. If -1 (default) it iterates until convergence.
partition_type – The metric to optimize to find clusters. Either ‘CPM’ or ‘RBC’.
- Returns:
labels – the integer cluster labels
- class SmartPca(preprocess_strategy: str)[source]¶
Return the PCA embeddings.
- __init__(preprocess_strategy: str)[source]¶
- Parameters:
preprocess_strategy – str, can be ‘center’, ‘z_score’, ‘raw’. This is the operation to perform before PCA
- property explained_variance_¶
For compatibility with scikit-learn
- property explained_variance_ratio_¶
For compatibility with scikit-learn
- fit(data) SmartPca[source]¶
Fit the PCA given the data. It automatically selects the algorithm based on the number of features.
- Parameters:
data – array of shape \((n, p)\) where n are the points and p the features
- fit_transform(data, n_components: int | float = None) numpy.ndarray[source]¶
Utility method which internally calls fit() and transform().
- Parameters:
data – tensor of shape \((n, p)\)
n_components – If an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None (default), the previously used value is reused.
- Returns:
data_transformed – array of shape \((n, q)\)
- transform(data, n_components: int | float = None) numpy.ndarray[source]¶
Use a previously fitted model to transform the data.
- Parameters:
data – tensor of shape \((n, p)\) where n is the number of points and p are the features
n_components – If an integer, it specifies the dimensionality of the data after PCA. If a float in (0, 1), the dimensionality is selected automatically so that the explained variance is at least that value. If None, the previously used value is reused.
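How a float n_components could map to a dimensionality is sketched below. This is an illustration under the assumption that the selection picks the smallest number of components whose cumulative explained-variance ratio reaches the requested value. The helper name `select_n_components` is hypothetical and not part of SmartPca.

```python
import numpy as np


def select_n_components(explained_variance_ratio, n_components) -> int:
    """Hypothetical sketch of the int/float handling of n_components."""
    ratios = np.asarray(explained_variance_ratio, dtype=float)
    if isinstance(n_components, int):
        # Integer: use it directly, capped at the number of available components.
        return min(n_components, ratios.size)

    # Float in (0, 1): smallest q whose cumulative explained variance >= value.
    cumulative = np.cumsum(ratios)
    q = int(np.searchsorted(cumulative, n_components, side="left")) + 1
    return min(q, ratios.size)
```

For ratios `[0.5, 0.3, 0.15, 0.05]`, a request of `0.9` selects three components because the first two explain only about 80% of the variance.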
- class SmartScaler(quantiles: Tuple[float, float], clamp: bool)[source]¶
Scale the values using the median and quantiles (which are robust versions of the mean and variance). \(data = (data - median) / scale\)
If clamp=True, each feature is clamped to the quantile range before applying the transformation. This is a simple way to deal with the outliers.
It does not deal with the situation in which outliers are inside the “box” of acceptable range but far from the reduced manifold. For example, points marked x may lie along a narrow band while a point marked o sits away from that band but still within the per-feature quantile range.
- __init__(quantiles: Tuple[float, float], clamp: bool)[source]¶
- Parameters:
quantiles – The lower and upper quantiles used to scale the data. Both must be in (0.0, 1.0)
clamp – If True, the data is clamped to the range [q_low, q_high] before scaling.
- fit(data) SmartScaler[source]¶
Fit the data (i.e. computes quantiles and median)
- fit_transform(data) numpy.ndarray[source]¶
Utility method which internally calls fit() and transform().
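The scaling formula can be sketched with NumPy as below. Taking `scale` to be the difference between the upper and lower quantile is an assumption, since the text does not define it. The function name `robust_scale_sketch` is hypothetical, and the sketch does not guard against a zero scale.

```python
import numpy as np


def robust_scale_sketch(data, quantiles=(0.25, 0.75), clamp=False):
    """Hypothetical sketch: per-feature (data - median) / scale."""
    data = np.asarray(data, dtype=float)               # shape (n, p)
    q_low, q_high = np.quantile(data, quantiles, axis=0)
    median = np.median(data, axis=0)
    scale = q_high - q_low                             # assumption: inter-quantile range

    if clamp:
        # Clamp every feature to its quantile range before scaling.
        data = np.clip(data, q_low, q_high)
    return (data - median) / scale
```

With `clamp=True`, a single extreme value is pulled back to the upper quantile, so it no longer dominates the scaled output.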
- compute_distance_embedding(ref_embeddings: torch.Tensor, other_embeddings: torch.Tensor, metric: str, temperature: float = 0.5) torch.Tensor[source]¶
Compute distance between embeddings
- Parameters:
ref_embeddings – torch.Tensor of shape \((*, k)\) where k is the dimension of the embedding
other_embeddings – torch.Tensor of shape \((n, k)\)
temperature – float, the temperature used to compute contrastive distance
metric – Can be either ‘contrastive’ or ‘euclidean’
- Returns:
dist – distance of shape \((*, n)\)
- get_percentile(data: torch.Tensor | numpy.ndarray, dim: int) torch.Tensor | numpy.ndarray[source]¶
Takes some data and converts it into a percentile (in [0.0, 1.0]) along a specified dimension. Useful to convert a tensor into the range [0.0, 1.0] for visualization.
- Parameters:
data – input data to convert to percentile in [0,1].
dim – the dimension along which to compute the quantiles
- Returns:
percentile – torch.tensor or numpy.array (depending on the input type) with the same shape as the input with the percentile values. A percentile of 0.9 means that 90% of the input values were smaller.
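A rank-based sketch of this conversion in NumPy follows. It takes the percentile to be the fraction of values along the axis that are strictly smaller, which matches the 0.9 example above but is still an assumption about the normalisation. The name `percentile_sketch` is hypothetical, and ties are broken arbitrarily.

```python
import numpy as np


def percentile_sketch(data, axis: int):
    """Hypothetical sketch: fraction of values along `axis` that are smaller."""
    data = np.asarray(data)
    # Applying argsort twice yields the rank of each element (0 = smallest).
    ranks = np.argsort(np.argsort(data, axis=axis), axis=axis)
    return ranks / data.shape[axis]
```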
- get_z_score(x: torch.Tensor, dim: int) torch.Tensor[source]¶
Standardize vector by removing the mean and scaling to unit variance
- Parameters:
x – torch.Tensor
dim – the dimension along which to compute the mean and std
- Returns:
The z-score, i.e. z = (x - mean) / std
- inverse_one_hot(image_in, bg_label: int = -1, dim: int = -3, threshold: float = 0.1)[source]¶
Takes a float tensor and computes the argmax and max_value along the specified dimension. Returns an integer tensor of the same shape as the input tensor but with the dim removed. If the max_value is less than the threshold, the bg_label is assigned.
Note
It can take an image of size \((C, W, H)\) and generate an integer mask of size \((W, H)\). This operation can be thought as the inverse of the one-hot operation which takes an integer tensor of size (n) and returns a float tensor with an extra dimension, for example (n, num_classes).
- Parameters:
image_in – any float tensor
bg_label – integer, the value assigned to the entries whose max_value is smaller than the threshold
dim – int, the dimension along which to compute the max. For images this is usually the channel dimension, i.e. -3.
threshold – float, the value of the threshold. Values smaller than this are assigned to the background
- Returns:
out – An integer mask with the same size as the input tensor but with the dim removed.
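The operation can be sketched in NumPy, used here as a stand-in for torch. The name `inverse_one_hot_sketch` is hypothetical. The defaults mirror the signature above.

```python
import numpy as np


def inverse_one_hot_sketch(image_in, bg_label: int = -1, dim: int = -3, threshold: float = 0.1):
    """Hypothetical sketch: argmax along `dim`, with a background label
    wherever the maximum value falls below the threshold."""
    image_in = np.asarray(image_in, dtype=float)
    max_value = image_in.max(axis=dim)   # largest value along the channel axis
    labels = image_in.argmax(axis=dim)   # index of the winning channel
    return np.where(max_value < threshold, bg_label, labels)
```

For an input of shape (C, W, H) the result has shape (W, H), with each pixel holding either the index of its strongest channel or `bg_label`.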