pyscenic · PyPI

Features

All the functionality of the original R implementation is available and in addition:

You can leverage multi-core and multi-node clusters using dask and its distributed scheduler.

We implemented a version of the recovery of input genes that takes into account weights associated with these genes.

Regulons, i.e. the regulatory network that connects a TF with its target genes, with targets that are repressed are now also derived and used for cell enrichment analysis.

Installation

The lastest stable release of the package itself can be installed via pip install pyscenic .

Caution!

pySCENIC needs a python 3.5 or greater interpreter.

You can also install the bleeding edge (i.e. less stable) version of the package directly from the source:

git clone https://github.com/aertslab/pySCENIC.git
cd pySCENIC/
pip install .

To successfully use this pipeline you also need auxilliary datasets :

Databases ranking the whole genome of your species of interest based on regulatory features (i.e. transcription factors). Ranking databases are typically stored in the feather format.

Database

Species

Search space

# of orthologous species

hg19-500bp-upstream-10species

Homo sapiens

[TSS+500bp,TSS[

hg19-500bp-upstream-7species

Homo sapiens

[TSS+500bp,TSS[

hg19-tss-centered-10kb-10species

Homo sapiens

TSS+/-10kbp

hg19-tss-centered-10kb-7species

Homo sapiens

TSS+/-10kbp

hg19-tss-centered-5kb-10species

Homo sapiens

TSS+/-5kbp

hg19-tss-centered-5kb-7species

Homo sapiens

TSS+/-5kbp

hg38-10kb-up-and-down-tss

Homo sapiens

[TSS+10kb,TSS-10kb]

hg38-500bp-up-100bp-down-tss

Homo sapiens

[TSS+500bp,TSS-100bp]

mm9-500bp-upstream-10species

Mus musculus

[TSS+500bp,TSS[

mm9-500bp-upstream-7species

Mus musculus

[TSS+500bp,TSS[

mm9-tss-centered-10kb-10species

Mus musculus

TSS+/-10kbp

mm9-tss-centered-10kb-7species

Mus musculus

TSS+/-10kbp

mm9-tss-centered-5kb-10species

Mus musculus

TSS+/-5kbp

mm9-tss-centered-5kb-7species

Mus musculus

TSS+/-5kbp

mm10-10kb-up-and-down-tss

Mus musculus

[TSS+10kb,TSS-10kb]

mm10-500bp-up-100bp-down-tss

Mus musculus

[TSS+500bp,TSS-100bp]

dm6-5kb-upstream-full-tx

Drosophila melanogaster

[TSS+5kb,full Tx]

Tutorial

For this tutorial 3,005 single cell transcriptomes taken from the mouse brain (somatosensory cortex and hippocampal regions) are used as an example [ 4 ] . The analysis is done in a Jupyter notebook.

First we import the necessary modules and declare some constants:

import os
import glob
import pickle
import pandas as pd
import numpy as np
from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2
from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies
from pyscenic.prune import prune, prune2df
from pyscenic.aucell import aucell
import seaborn as sns
DATA_FOLDER="~/tmp"
RESOURCES_FOLDER="~/resources"
DATABASE_FOLDER = "~/databases/"
SCHEDULER="123.122.8.24:8786"
FEATHER_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.feather")
MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.mgi-m0.001-o0.0.tbl")
MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'mm_tfs.txt')
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "GSE60361_C1-3005-Expression.txt")
REGULONS_FNAME = os.path.join(DATA_FOLDER, "regulons.p")
NOMENCLATURE = "MGI"

Preliminary work

The scRNA-Seq data is downloaded from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60361 and loaded into memory:

ex_matrix = pd.read_csv(SC_EXP_FNAME, sep='\t', header=0, index_col=0)

Subsequently duplicate genes are removed:

ex_matrix = ex_matrix[~ex_matrix.index.duplicated(keep='first')]
ex_matrix.shape

(19970, 3005)

and the list of Transcription Factors (TF) for Mus musculus are read from file. The list of known TFs for Mm was prepared from TFCat (cf. notebooks section).

tf_names = load_tf_names(MM_TFS_FNAME)

Finally the ranking databases are loaded:

db_fnames = glob.glob(FEATHER_GLOB)
def name(fname):
    return os.path.basename(fname).split(".")[0]
dbs = [RankingDatabase(fname=fname, name=name(fname), nomenclature=NOMENCLATURE) for fname in db_fnames]
[FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-500bp-upstream-7species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-500bp-upstream-10species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species",nomenclature=MGI)]
Phase I: Inference of co-expression modules
In the initial phase of the pySCENIC pipeline the single cell expression profiles are used to infer
co-expression modules from.
Run GENIE3 or GRNBoost from arboretum to infer co-expression modules
The arboretum package is used for this phase of the pipeline. For this notebook only a sample of 1,000 cells is used
for the co-expression module inference is used.
N_SAMPLES = ex_matrix.shape[1] # Full dataset
adjacencies = grnboost2(expression_data=ex_matrix.T.sample(n=N_SAMPLES, replace=False),
                    tf_names=tf_names, verbose=True)
Derive potential regulons from these co-expression modules
Regulons are derived from adjacencies based on three methods.
The first method to create the TF-modules is to select the best targets for each transcription factor:
Targets with weight > 0.001
Targets with weight > 0.005
The second method is to select the top targets for a given TF:
Top 50 targets (targets with highest weight)
The alternative way to create the TF-modules is to select the best regulators for each gene (this is actually how GENIE3 internally works). Then, these targets can be assigned back to each TF to form the TF-modules. In this way we will create three more gene-sets:
Targets for which the TF is within its top 5 regulators
Targets for which the TF is within its top 10 regulators
Targets for which the TF is within its top 50 regulators
A distinction is made between modules which contain targets that are being activated and genes that are being repressed. Relationship between TF and its target, i.e. activator or repressor, is derived using the original expression profiles. The Pearson product-moment correlation coefficient is used to derive this information.
In addition, the transcription factor is added to the module and modules that have less than 20 genes are removed.
modules = list(modules_from_adjacencies(adjacencies, ex_matrix.T, nomenclature=NOMENCLATURE))
Phase II: Prune modules for targets with cis regulatory footprints (aka RcisTarget)
# Calculate a list of enriched motifs and the corresponding target genes for all modules.
df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)
# Create regulons from this table of enriched motifs.
regulons = df2regulons(df, NOMENCLATURE)
# Save these regulons to disk in binary "pickled" format.
with open(REGULONS_FNAME, "wb") as f:
    pickle.dump(regulons, f)
If running on a single multi-core machine, the following code snippet exploits all cores and provides you
a progress bar:
from dask.diagnostics import ProgressBar
with ProgressBar():
        df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME, client_or_address="dask_multiprocessing")
Directly calculating regulons without the intermediate dataframe of enriched features is also possible:
regulons = prune(dbs, modules, MOTIF_ANNOTATIONS_FNAME)
Clusters can be leveraged in the following way:
# The clusters can be leveraged via the dask framework:
df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME, client_or_address=SCHEDULER)
# or alternatively:
regulons = prune(dbs, modules, MOTIF_ANNOTATIONS_FNAME, client_or_address=SCHEDULER)
Phase III: Cellular regulon enrichment matrix (aka AUCell)
We characterize the different cells in a single-cell transcriptomics experiment via the enrichment of the previously discovered
regulons. Enrichment of a regulon is measured as the Area Under the recovery Curve (AUC) of the genes that define this regulon.
auc_mtx = aucell(ex_matrix.T, regulons, num_workers=4)
sns.clustermap(auc_mtx, figsize=(8,8))
Command Line Interface
A command line version of the tool is included. This tool is available after proper installation of the package via pip.
{ ~ }  » pyscenic                                            ~
usage: pySCENIC [-h] {grnboost,ctx,aucell} ...
Single-CEll regulatory Network Inference and Clustering
positional arguments:
  {grnboost,ctx,aucell}
                        sub-command help
    grnboost            Derive co-expression modules from expression matrix.
    ctx                 Find enriched motifs for a gene signature and
                        optionally prune targets from this signature based on
                        cis-regulatory cues.
    aucell              Find enrichment of regulons across single cells.
optional arguments:
  -h, --help            show this help message and exit
Arguments can be read from file using a @args.txt construct.
Website
For more information, please visit LCB and SCENIC.
License
GNU General Public License v3
Acknowledgments
We are grateful to all providers of TF-annotated position weight matrices, in particular Martha Bulyk (UNIPROBE), Wyeth Wasserman and Albin Sandelin (JASPAR), BioBase (TRANSFAC), Scot Wolfe and Michael Brodsky (FlyFactorSurvey) and Timothy Hughes (cisBP).
References
[1]
Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat Meth 14, 1083–1086 (2017).
[2]
Rocklin, M. Dask: parallel computation with blocked algorithms and task scheduling. conference.scipy.org
[3]
Huynh-Thu, V. A. et al. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5, (2010).
[4]
Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
            Download files
            Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
              Source Distribution