===========
Quick start
===========
Import bibench like any Python module::
>>> import bibench
If the import fails, make sure that BiBench is on your PYTHONPATH.
BiBench contains many submodules, but to save time you can use the
``bibench.all`` module, which imports the most useful functions into
one place::
>>> import bibench.all as bb
++++++++++++++++++++++
Synthetic data example
++++++++++++++++++++++
First we need a dataset on which to run our biclustering
algorithms. BiBench can generate synthetic data for testing
purposes. Here we make a dataset with upregulated biclusters::
>>> data, expected = bb.make_const_data(bicluster_signals=[5,5,5], shuffle=True)
The list of hidden biclusters, ``expected``, will be useful later to
see how well an algorithm has done. The dataset is a standard
numpy.ndarray, so we can do all the usual things like examine its
shape::
>>> data.shape
(300, 50)
BiBench supports visualizing data and biclusters in various ways. We
can also view a heatmap of the data::
>>> bb.heatmap(data)
.. image:: static/heatmap1.png
We can also sort the dataset so that the first expected bicluster is
contiguous::
>>> bb.heatmap(data, expected[0], local=False)
.. image:: static/heatmap2.png
Now we want to cluster the dataset using BiMax. But BiMax requires
that the data be binarized. We use QUBIC's method of discretization
to make this dataset binary::
>>> bin_data = bb.qubic_binarize_up(data)
Now we can cluster the dataset and see how well BiMax did::
>>> found = bb.bimax(bin_data)
>>> bb.jaccard_list(expected, found)
ListScore(relevance=0.055462621590093929, recovery=1.0)
It looks like BiMax found all the biclusters, but also found lots of
extra ones that are hurting its relevance score. The ``filter``
function is useful in this case; here filter out small biclusters::
>>> filtered = bb.filter(found, minrows=5, mincols=5)
>>> bb.jaccard_list(expected, filtered)
ListScore(relevance=1.0, recovery=1.0)
Note that ``filter`` also removes duplicate biclusters.
++++++++++++++++++++++++++++
Bioconductor data example
++++++++++++++++++++++++++++
Now let's try clustering a real microarray dataset. If rpy2, R, and
Bioconductor are installed, we have access to all of the functionality
of Bioconductor, including lots of datasets. A list of all experiment
data packages is available `here
`_. (It is
also possible to get a list interactively. See rpy2's `documentation
for interactive work
`_)
Let's install the ALL data, which consists of 128 microarrays from
patients with acute lymphoblastic leukemia (ALL). We first need to get
the bioclite command, which requires an internet connection::
>>> bioclite = bibench.rutil.get_bioclite()
>>> bioclite('ALL')
Now that the data is installed, we can get it::
>>> data = bb.get_bioc_data('ALL')
Gene expression data files have labelled rows (genes/probes) and
columns (samples). We can access them as members of our data. Here we
have Affymetrix probes for our row labels::
>>> data.genes[0:3]
['1000_at', '1001_at', '1002_f_at']
>>> data.samples[0:3]
['01005', '01010', '03002']
This time we use QUBIC for clustering::
>>> found = bb.qubic(data)
Since this is a real dataset, the true biclusters are
unknown. However, we can do `Gene Ontology
`_ enrichment analysis on each
bicluster. Since this dataset was obtained from Bioconductor, we can
find which annotation package to use by simply examing the
``annotation`` attribute::
>>> data.annotation
'hgu95av2'
We check the biclusters for enriched Gene Ontology terms, using
the datasets probes as the gene universe::
>>> enriched = [bb.enrichment(f, data.annotation, data.genes) for f in found]
>>> enriched[0]
It is easy to get a GO id's full annotation. For instance, suppose we
wanted to get `GO:0048870
`_'s
full annotation::
>>> bb.goid_annot('GO:0048870')
GoAnnot(goid='GO:0048870',
term='cell motility',
ontology='BP',
synonym=None,
secondary=None,
definition='Any process involved in the controlled self-propelled movement of a cell that results in translocation of the cell from one place to another.')
Then using ``bb.goid_annot`` we can get the first bicluster's actual
enriched terms::
>>> [annot.term for annot in [bb.goid_annot(e.goid) for e in enriched[0]]]
['cardiovascular system development',
'blood vessel development',
'extracellular matrix organization',
'muscle tissue development',
'cell adhesion',
'collagen fibril organization',
'muscle organ development',
'cell motility',
'muscle contraction',
'angiogenesis']
Finally, we export the first six biclusters in the `BicOverlapper
`_ format, to allow visualization
using the BicOverlapper visualization tool::
>>> bb.write_bicoverlapper([found[0:6]], './all-data-qubic-biclusters.txt', data.genes, data.samples)
After importing the result and choosing the Overlapper view, this is the result:
.. image:: static/bicoverlapper.png
Clearly these biclusters do not overlap much.
+++++++++++++++++++
GDS data example
+++++++++++++++++++
The `Gene Expression Omnibus `_ is a
handy resource for expression data. BiBench provides an interface to
the `geometadb
`_
package, so we can query GEO metadata to find the appropriate
dataset. The first time this functionality is used, the SQLlite
database is automatically downloaded and stored in
``$HOME/.bibench/GEOmetadb.sqlite``.
Suppose we want to find a curated (GDS) dataset that was generated
from an Affymetrix chip::
>>> bb.geo_query("select gds from gds join gpl on gds.gpl=gds.gpl where gpl.manufacturer='Affymetrix'")
>>> len(result[0])
2721
There are 2721 such GDS datasets. The result of our query is an rpy2.robjects.vectors.DataFrame, with
one column for each argument to ``select``::
>>> type(result)
rpy2.robjects.vectors.DataFrame
We examine the first five datasets::
>>> result[0][0:5]
['GDS5', 'GDS6', 'GDS10', 'GDS12', 'GDS15']
BiBench can automatically download and import GDS datasets. Here we
get GDS3715::
>>> data = bb.get_gds_data(3715)
The dataset is cached in ``$HOME/.bibench/gds``, so future calls to
``bb.get_gds_data(3715)`` will not download the whole dataset again.
Cluster the dataset using CPB::
>>> found = bb.cpb(data, 10)
BiBench automatically maps a GDS dataset's GPL platform to the
appropriate Bioconducter annotation database, if possible. Let's make
sure it worked::
>>> data.annotation
'hgu95av2'
It did indeed work. So it is easy to perform a Gene Ontology
enrichment analysis::
>>> enriched = [bb.enrichment(f, data.annotation, data.genes) for f in found]