Getting Started
===============

What is Tissue Mosaic?
------------------------

*Tissue Mosaic* is a Python library for the analysis of biological tissue and
cellular micro-environments based on self-supervised learning.
It is built on `PyTorch <https://pytorch.org/>`_,
`PyTorch Lightning <https://www.pytorchlightning.ai/>`_,
`Pyro <https://pyro.ai/>`_ and
`AnnData <https://anndata.readthedocs.io/en/latest/>`_.

Spatially resolved transcriptomic technologies (such as 
`SlideSeq <https://pubmed.ncbi.nlm.nih.gov/30923225/>`_,
`MerFish <https://www.sciencedirect.com/science/article/abs/pii/S0076687916001324>`_,
`SmFish <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6101419/>`_,
`BaristaSeq <https://academic.oup.com/nar/article/46/4/e22/4668654>`_,
`ExSeq <https://pubmed.ncbi.nlm.nih.gov/33509999/>`_,
`STARMap <https://pubmed.ncbi.nlm.nih.gov/29930089/>`_
and others) allow measuring gene expression with spatial resolution. 
Deconvolution methods and/or marker-gene analysis can then be used to assign
a discrete cell type (such as macrophage, B cell, ...) to each cell.

This type of data can be conveniently organized into AnnData objects, which are data structures
designed specifically for transcriptomic data.
Each AnnData object contains a list of all the cells in a tissue together with, at a minimum:

1. the gene expression profile 

2. the cell-type label

3. the spatial coordinates (either in 2D or 3D)
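
As a minimal sketch (not taken from the *Tissue Mosaic* code base), an AnnData object holding these
three pieces of information could be assembled as follows. The column name ``cell_type``, the key
``spatial`` and all the values are assumptions made for illustration; your data may use different names.

.. code-block:: python

    import anndata as ad
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n_cells, n_genes = 100, 20

    # 1. gene expression profile: a cells-by-genes count matrix (synthetic)
    counts = rng.poisson(lam=1.0, size=(n_cells, n_genes)).astype(np.float32)

    # 2. cell-type label: one categorical entry per cell (hypothetical column name)
    obs = pd.DataFrame(
        {"cell_type": pd.Categorical(rng.choice(["Macrophage", "B-Cell"], size=n_cells))},
        index=[f"cell_{i}" for i in range(n_cells)],
    )
    var = pd.DataFrame(index=[f"gene_{j}" for j in range(n_genes)])

    # 3. spatial coordinates: one (x, y) pair per cell (hypothetical key name)
    xy = rng.uniform(0.0, 1000.0, size=(n_cells, 2))

    adata = ad.AnnData(X=counts, obs=obs, var=var, obsm={"spatial": xy})
    print(adata)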

This rich data can unlock interesting scientific discoveries, but it is difficult to analyze.
This is where *Tissue Mosaic* comes in.

**In short, tissues are converted into images and cropped into overlapping patches.
Semantic features are associated with each patch via self-supervised learning (SSL).
The learned features are then used in downstream tasks (such as differential gene expression analysis).**

What's appealing about this approach is that it is *unbiased*, meaning that the researcher does not need to know
*a priori* which features are important. Given enough data and a sufficiently large neural network, this approach
should be able to extract biologically relevant features that are useful for solving downstream tasks.

Negative results are also interesting because
they suggest that the task at hand *cannot* be solved based on
cellular co-arrangement alone (i.e. cell-type labels and spatial coordinates).
In that case, more information (for example histopathology imaging) might be necessary to define
the tissue micro-environments.


.. _Typical workflow:

Typical workflow
----------------

A typical workflow consists of 3 steps:

1. Multiple AnnData objects (corresponding to multiple tissues, possibly under a diverse set of conditions)
   are converted to (sparse) images. These images are cropped into overlapping patches of a characteristic
   length and are fed into an SSL framework.
   Importantly, in this step the model has no access to the gene expression profile. 
   It only uses the cell-type labels together with their spatial coordinates to create a multi-channel image
   (in which each channel encodes the density of a specific cell-type). Therefore, the model can only leverage the 
   cellular co-arrangement as a learning signal.
   See `notebook1 <https://github.com/broadinstitute/TissueMosaic/blob/main/notebooks/notebook1.ipynb>`_.

2. Once a model is trained, any (new or old) AnnData object can be processed.
   As described above, the AnnData object is transformed into a sparse image and cropped into
   overlapping patches. Semantic features are associated with each patch and then transferred
   to the cells belonging to the patch. Ultimately, each cell acquires a new set of annotations
   describing the local micro-environment of that cell.
   This step can be repeated multiple times (once for each trained model) to compare
   the quality of the features generated by different SSL models and/or different patch sizes.
   See `notebook2 <https://github.com/broadinstitute/TissueMosaic/blob/main/notebooks/notebook2.ipynb>`_.

3. Finally, we evaluate the quality of the features.
   To this end, we use the SSL annotations to predict the gene expression profile
   conditioned on the cell type. We compare against multiple baselines to show that the SSL features are biologically
   informative.
   See `notebook3 <https://github.com/broadinstitute/TissueMosaic/blob/main/notebooks/notebook3.ipynb>`_.
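
The first step above can be sketched with NumPy alone. The snippet below is an illustration under
simplifying assumptions, not the implementation used by *Tissue Mosaic*: the function names
(``cells_to_image``, ``crop_patches``), the pixel size and the regular grid of crops are all
hypothetical, and the data is synthetic. Each channel simply counts the cells of one type per pixel.

.. code-block:: python

    import numpy as np

    def cells_to_image(xy, labels, n_types, pixel_size, shape):
        """Return a (n_types, H, W) array counting the cells of each type per pixel."""
        image = np.zeros((n_types, *shape), dtype=np.float32)
        rows = np.clip((xy[:, 1] / pixel_size).astype(int), 0, shape[0] - 1)
        cols = np.clip((xy[:, 0] / pixel_size).astype(int), 0, shape[1] - 1)
        # np.add.at accumulates correctly when several cells fall in the same pixel
        np.add.at(image, (labels, rows, cols), 1.0)
        return image

    def crop_patches(image, patch_size, stride):
        """Return square patches (overlapping when stride < patch_size) and their top-left corners."""
        _, height, width = image.shape
        corners = [(r, c)
                   for r in range(0, height - patch_size + 1, stride)
                   for c in range(0, width - patch_size + 1, stride)]
        patches = np.stack([image[:, r:r + patch_size, c:c + patch_size] for r, c in corners])
        return patches, corners

    rng = np.random.default_rng(0)
    xy = rng.uniform(0.0, 640.0, size=(500, 2))   # synthetic (x, y) coordinates
    labels = rng.integers(0, 3, size=500)         # synthetic integer cell-type codes
    image = cells_to_image(xy, labels, n_types=3, pixel_size=10.0, shape=(64, 64))
    patches, corners = crop_patches(image, patch_size=32, stride=16)
    print(image.shape, patches.shape)

Note that the gene expression profile never enters this computation; only the labels and the coordinates do.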

Why image-based self-supervised learning?
-----------------------------------------
Spatial transcriptomic data is a type of tabular data and could be analyzed without converting it to images.
However, image-based approaches offer three remarkable advantages:

1. We can leverage state-of-the-art approaches which are continuously developed by the larger ML community.

2. By changing the patch size, we can easily obtain information about the cellular
   environment at different spatial scales, from local (a few cells) to global (thousands of cells).

3. In this approach it is trivial to combine cell-typing information with other imaging modalities
   such as histopathology. The images corresponding to cell-typing and histopathology can be simply
   concatenated before feeding them to the algorithm.
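
The concatenation mentioned in the last point can be sketched as follows. This only illustrates the idea
on synthetic arrays, assuming that both images have already been registered to the same pixel grid;
it is not a feature currently exposed by *Tissue Mosaic*, and the variable names are hypothetical.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic stand-ins on the same pixel grid: 9 cell-type channels and an RGB histology image.
    cell_type_image = rng.poisson(0.05, size=(9, 128, 128)).astype(np.float32)
    histology_image = rng.uniform(0.0, 1.0, size=(3, 128, 128)).astype(np.float32)

    # Channel-wise concatenation requires identical height and width.
    assert cell_type_image.shape[1:] == histology_image.shape[1:]
    combined = np.concatenate([cell_type_image, histology_image], axis=0)
    print(combined.shape)  # (12, 128, 128)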

Installation
------------
First, you need Python 3.9 and PyTorch (with CUDA support).
If you run the following command from your terminal, it should print ``True``:

.. code-block::

    python -c 'import torch; print(torch.cuda.is_available())'

Next, install the most recent version of Pyro (not yet available using pip):

.. code-block::

    git clone https://github.com/pyro-ppl/pyro.git
    cd pyro
    pip install .


Finally, install *Tissue Mosaic* and its dependencies:

.. code-block::

    git clone https://github.com/broadinstitute/TissueMosaic.git
    cd TissueMosaic
    pip install -r requirements.txt
    pip install .


Docker Image
------------

A GPU-enabled docker image is available from the Google Container Registry (GCR) as:

``us.gcr.io/broad-dsde-methods/tissuemosaic:latest``

Older versions are available at the same location, for example as

``us.gcr.io/broad-dsde-methods/tissuemosaic:0.0.5``

How to run
----------
There are 3 ways to run the code:

You can run the notebooks sequentially.
Each notebook demonstrates one step of the typical workflow described in `Typical workflow`_:

- `notebook1 <https://github.com/broadinstitute/TissueMosaic/blob/main/notebooks/notebook1.ipynb>`_.

- `notebook2 <https://github.com/broadinstitute/TissueMosaic/blob/main/notebooks/notebook2.ipynb>`_.

- `notebook3 <https://github.com/broadinstitute/TissueMosaic/blob/main/notebooks/notebook3.ipynb>`_.

Or you can run the code locally from the command line.
First, download the example data (first published in `Dissecting Mammalian Spermatogenesis Using Spatial Transcriptomics \
by Chen et al. <https://pubmed.ncbi.nlm.nih.gov/34731600/>`_) and untar it into the ``testis_anndata`` directory.

.. code-block::

    gsutil -m cp gs://ld-data-bucket/tissue-mosaic/slideseq_testis_anndata_h5ad.tar.gz ./
    mkdir -p ./testis_anndata
    tar -xzf slideseq_testis_anndata_h5ad.tar.gz -C ./testis_anndata

Next, navigate to the "TissueMosaic/run" directory and train the model (this will take about 6 hours on an NVIDIA P100):

.. code-block::

    cd TissueMosaic/run
    python main_1_train_ssl.py --config config_barlow_ssl.yaml --data_folder testis_anndata

    # or alternatively
    # python main_1_train_ssl.py --config config_dino_ssl.yaml --data_folder testis_anndata --gpus 2
    # python main_1_train_ssl.py --config config_simclr_ssl.yaml --data_folder testis_anndata --gpus 2
    # python main_1_train_ssl.py --config config_vae_ssl.yaml --data_folder testis_anndata --gpus 2

Next, extract the features (this takes only a few minutes):

.. code-block::

    python main_2_featurize.py \
        --anndata_in adata_0_raw.h5ad \
        --anndata_out adata_0_annotated.h5ad \
        --ckpt_in ckpt_barlow.ckpt \
        --feature_key barlow \
        --n_patches 500 \
        --ncv_k 10 25 100
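
Conceptually, this step transfers patch-level features to the cells inside each patch. One simple way
to do this, shown here only as a sketch, is to average the features of all the patches that cover a cell.
The actual rule used by ``main_2_featurize.py`` may differ, and the function name
``transfer_features_to_cells`` and the toy inputs are hypothetical.

.. code-block:: python

    import numpy as np

    def transfer_features_to_cells(cell_rc, corners, patch_size, patch_features):
        """Average the features of all patches that contain each cell.

        cell_rc: (n_cells, 2) integer pixel (row, col) of each cell
        corners: (n_patches, 2) top-left (row, col) of each patch
        patch_features: (n_patches, n_features)
        Cells not covered by any patch receive NaN features.
        """
        rows, cols = cell_rc[:, 0:1], cell_rc[:, 1:2]                # (n_cells, 1)
        top, left = corners[:, 0][None, :], corners[:, 1][None, :]   # (1, n_patches)
        inside = ((rows >= top) & (rows < top + patch_size)
                  & (cols >= left) & (cols < left + patch_size))     # (n_cells, n_patches)
        n_covering = inside.sum(axis=1, keepdims=True)
        summed = inside.astype(np.float64) @ patch_features
        return np.where(n_covering > 0, summed / np.maximum(n_covering, 1), np.nan)

    corners = np.array([[0, 0], [0, 16], [16, 0], [16, 16]])   # four overlapping 32-pixel patches
    patch_features = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
    cell_rc = np.array([[5, 5], [20, 20], [60, 60]])
    cell_features = transfer_features_to_cells(cell_rc, corners, 32, patch_features)
    print(cell_features)

In this toy example the first cell lies in a single patch, the second in all four patches (so it
receives their mean) and the third in none.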

Finally, evaluate the features based on their ability to predict the gene expression profile.

.. code-block::

    python main_3_genex.py --anndata_in XXX --l1 0.1 --n_pca 9 --XXX # DOUBLE CHECK
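
The idea behind this evaluation can be illustrated on synthetic data. The sketch below is not the
procedure implemented in ``main_3_genex.py``: it uses a closed-form ridge (L2) regression for simplicity,
and the helper ``ridge_fit``, the data and the hyper-parameters are all assumptions. For each cell type,
a model that uses the features is compared with a baseline that only knows the cell type.

.. code-block:: python

    import numpy as np

    def ridge_fit(x, y, alpha):
        """Closed-form ridge regression with an unpenalized intercept."""
        x_mean, y_mean = x.mean(axis=0), y.mean(axis=0)
        xc, yc = x - x_mean, y - y_mean
        weights = np.linalg.solve(xc.T @ xc + alpha * np.eye(x.shape[1]), xc.T @ yc)
        return weights, y_mean - x_mean @ weights

    rng = np.random.default_rng(0)
    n_cells, n_features, n_genes, n_types = 600, 5, 10, 3
    features = rng.normal(size=(n_cells, n_features))    # stand-in for SSL features
    cell_type = rng.integers(0, n_types, size=n_cells)
    # Synthetic expression: depends on the cell type and, linearly, on the features.
    true_w = rng.normal(size=(n_types, n_features, n_genes))
    offsets = rng.normal(size=(n_types, n_genes))
    expression = (offsets[cell_type]
                  + np.einsum("cf,cfg->cg", features, true_w[cell_type])
                  + 0.1 * rng.normal(size=(n_cells, n_genes)))

    is_train = rng.uniform(size=n_cells) < 0.8
    mse_model, mse_baseline = 0.0, 0.0
    for k in range(n_types):
        train = is_train & (cell_type == k)
        test = ~is_train & (cell_type == k)
        weights, intercept = ridge_fit(features[train], expression[train], alpha=0.1)
        pred_model = features[test] @ weights + intercept
        pred_baseline = expression[train].mean(axis=0)   # cell type only, no features
        mse_model += ((expression[test] - pred_model) ** 2).sum()
        mse_baseline += ((expression[test] - pred_baseline) ** 2).sum()

    n_test = (~is_train).sum() * n_genes
    print(f"model MSE: {mse_model / n_test:.3f}, baseline MSE: {mse_baseline / n_test:.3f}")

If the features carry no information beyond the cell type, the two errors should be comparable on held-out cells.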

Finally, it might make sense to train your model remotely on Google Cloud (or another cloud provider)
using `Terra <https://terra.bio>`_ or `cromwell <https://cromwell.readthedocs.io/en/stable/>`_
and `cromshell <https://github.com/broadinstitute/cromshell>`_.
After installing cromshell and connecting to a cromwell server,
you can submit a run as follows:

.. code-block::

    cd TissueMosaic/run
    ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_barlow_ssl.yaml

    # or alternatively
    # ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_dino_ssl.yaml
    # ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_simclr_ssl.yaml
    # ./submit_neptune_ml.sh neptune_ml.wdl --py main_1_train_ssl.py --wdl WDL_parameters.json --ml config_vae_ssl.yaml

Steps 2 and 3 can be run locally since they are much shorter (see above).

Features and Limitations
------------------------

Features:

1. We have implemented multiple SSL strategies (such as convolutional VAE, DINO, Barlow Twins, SimCLR)
   based on recent advances in image-based machine learning.

2. *Tissue Mosaic* can be used to analyze any type of localized quantitative measurement, for example spatial proteomics
   (not only mRNA count data).

Current limitations:

1. *Tissue Mosaic* works only with 2D tissue slices. No 3D support at the moment.

Future Improvements
-------------------
We hope to soon support:

1. Pairing with histopathology (i.e. dense images)

2. Extension to handle 3D images

Contributing
------------
We aspire to make *Tissue Mosaic* an easy-to-use and useful software package for the bioinformatics community.
While we test and improve *Tissue Mosaic* together with our research collaborators, your feedback is invaluable to us
and allows us to steer *Tissue Mosaic* in the direction that you find most useful in your research.
If you have an interesting idea or suggestion, please do not hesitate to reach out to us.

If you encounter a bug, please file a detailed GitHub `issue <https://github.com/broadinstitute/TissueMosaic/issues>`_
and we will get back to you as soon as possible.

Citation
--------
This software package was developed by *Sandeep Kambhampati*, *Luca D'Alessio*, and *Fedor Grab*.

..
  If you use TissueMosaic please consider citing:

  ::
    @article{YourName,
    title={Your Title},
    author={Your team},
    journal={Location},
    year={Year}
    }