
CPB: Correlated Patterns Biclustering

General information

CPB is a novel two-step, Pearson-correlation-based biclustering approach for mining genes that are co-regulated with a given reference gene, in order to discover genes that function in a common biological process. In the first step, the algorithm identifies subsets of genes with high correlation, reducing false negatives with a nonparametric filtering scheme. In the second step, biclusters from multiple datasets are used to extract and rank gene correlation information.
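As a rough illustration of the building block that the description above relies on, the following Python sketch computes the Pearson correlation coefficient (PCC) between a reference row and every other row of a small matrix. It is not the CPB algorithm: it correlates over all columns, whereas a bicluster restricts the comparison to a subset of columns, and it applies no filtering. The function names and the default threshold of 0.9 are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant row has no defined correlation; treated as 0 in this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_rows(matrix, reference_row, threshold=0.9):
    """Indices of rows whose PCC with the reference row is >= threshold."""
    ref = matrix[reference_row]
    return [
        i
        for i, row in enumerate(matrix)
        if i != reference_row and pearson(ref, row) >= threshold
    ]


if __name__ == "__main__":
    expression = [
        [1, 2, 3, 4],  # reference row
        [2, 4, 6, 8],  # perfectly correlated with the reference
        [4, 3, 2, 1],  # perfectly anti-correlated
        [1, 1, 1, 1],  # constant row
    ]
    print(correlated_rows(expression, reference_row=0))
```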


Latest release: cpb-11-4-2011

Synthetic Dataset Example: data_examples

40 real datasets with their query results for several probes: real_datasets


Although the CPB and correlation combination codes are written in C and C++ and have no other dependencies, we also provide Python wrappers that make them easier to use. The Python code has the following dependencies:


For installation of CPB:

For the installation of the correlation combination:


A sample usage of CPB with the python wrapper:

[user cpb_source]$ python DF=data.txt BF=found.txt NB=500 PCC=0.9 MO=0.25

This will create a file "found.txt", in which the biclusters are written in the format:
<row indices>
<column indices>
separated by empty lines.
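To make the output format concrete, here is a minimal Python sketch of a reader for such a file. It assumes that each bicluster occupies two lines (row indices, then column indices), that the indices are whitespace-separated integers, and that empty lines separate consecutive biclusters. None of these details are confirmed beyond the description above, and parse_biclusters is a hypothetical helper, not part of the CPB distribution.

```python
def parse_biclusters(text):
    """Parse bicluster blocks separated by empty lines.

    Assumption: each block has exactly two lines, the first holding
    whitespace-separated row indices and the second column indices.
    """
    biclusters = []
    block = []
    # The extra "" guarantees that the last block is flushed.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)
        elif block:
            if len(block) != 2:
                raise ValueError(
                    f"expected 2 lines per bicluster, got {len(block)}"
                )
            rows = [int(tok) for tok in block[0].split()]
            cols = [int(tok) for tok in block[1].split()]
            biclusters.append((rows, cols))
            block = []
    return biclusters
```

The contents of "found.txt" can be read with open("found.txt").read() and passed to parse_biclusters; each returned item is a (rows, cols) pair of integer lists.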

For a description of the parameters of CPB, run the command:
[user cpb_source]$ python

Examples of 4 different bicluster models can be found here. Each example dataset has 1000 rows and 200 columns with 2 biclusters embedded, without any noise or overlap. The expected biclusters of each data matrix are listed in expected.txt files under the corresponding folders.

Note that this script creates a temporary folder under "/tmp", which is removed when the job is completed. The location of this folder can be changed by setting the TMPDIR environment variable:

[user cpb_source]$ export TMPDIR=<the_path_to_temporary_directory>
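For readers curious how a wrapper script can honor TMPDIR, the sketch below shows one common Python pattern: create a scratch directory under TMPDIR (falling back to "/tmp") and remove it once the job finishes. This is only an illustration of the behavior described above, not the wrapper's actual code; run_in_scratch_dir and the "cpb_" prefix are hypothetical names.

```python
import os
import tempfile


def run_in_scratch_dir(job, prefix="cpb_"):
    """Run job(path) inside a scratch directory that is deleted afterwards.

    The directory is created under $TMPDIR if that variable is set,
    otherwise under /tmp (assumed default for this sketch).
    """
    base = os.environ.get("TMPDIR") or "/tmp"
    with tempfile.TemporaryDirectory(prefix=prefix, dir=base) as scratch:
        return job(scratch)
```

Because the directory is managed by a context manager, it is removed even if the job raises an exception.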

For the usage of correlation calculation:

[user correlation]$ correlation
Usage: correlation <inputfile> <#rows> <#cols> <reference row> <#biclusters>

To combine the results of several datasets, use

[user cpb_source]$ python
  reference_row : the integer index of the reference row, where the first row has index 0.
  input_folder : the folder from which the datasets and their results will be read. See the real dataset examples for the expected folder format.
  output_file : the file to which the output should be written.

For the usage example of correlation combining code,

These will save the combined results in the given txt files. Note that the combining script expects a folder in exactly the same format as the real dataset examples: for instance, a subfolder named "GDS987" must contain a file named "GDS987.soft" as well as the output biclusters found for the reference row 10722, saved in the file "10722_found.txt".
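Before running the combination step, it can help to check that the input folder matches this layout. The following Python sketch does so under the assumption that every subfolder <name> must contain both "<name>.soft" and "<reference_row>_found.txt". The helper check_input_folder is hypothetical and only mirrors the layout described above.

```python
from pathlib import Path


def check_input_folder(input_folder, reference_row):
    """Return a list of problems found in the assumed folder layout.

    Assumed layout: <input_folder>/<name>/<name>.soft and
    <input_folder>/<name>/<reference_row>_found.txt for every subfolder.
    """
    problems = []
    for sub in sorted(Path(input_folder).iterdir()):
        if not sub.is_dir():
            continue  # plain files at the top level are ignored
        soft = sub / f"{sub.name}.soft"
        found = sub / f"{reference_row}_found.txt"
        if not soft.is_file():
            problems.append(f"missing {soft}")
        if not found.is_file():
            problems.append(f"missing {found}")
    return problems
```

An empty list means that every dataset subfolder has both files for the given reference row.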