CPB is a novel two-step Pearson correlation based biclustering approach to mine genes that are co-regulated with a given reference gene in order to discover genes that function in a common biological process. In the first step, the algorithm identifies subsets of genes with high correlation, reducing false negatives with a nonparametric filtering scheme. In the second step, biclusters from multiple datasets are used to extract and rank gene correlation information.
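The correlation criterion at the core of the first step can be illustrated with a short Python sketch. This is not the CPB implementation: the function names, the plain list-of-lists matrix, and the simple threshold test are assumptions made for illustration, and CPB's search procedure and nonparametric filtering are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant row; treated as 0 in this sketch
    return cov / math.sqrt(var_x * var_y)

def correlated_rows(matrix, reference_row, columns, threshold=0.9):
    """Indices of rows whose correlation with the reference row, measured
    only over the given columns, is at least `threshold` (hypothetical helper)."""
    ref = [matrix[reference_row][c] for c in columns]
    selected = []
    for i, row in enumerate(matrix):
        values = [row[c] for c in columns]
        if pearson(ref, values) >= threshold:
            selected.append(i)
    return selected
```

Restricting the comparison to a subset of columns is what distinguishes a bicluster from an ordinary cluster: a gene may correlate strongly with the reference gene under some conditions only.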
Latest release: cpb-11-4-2011
Synthetic dataset example: data_examples
40 real datasets with their query results for several probes: real_datasets
Although the CPB and correlation combination codes are written in C and C++ and have no other dependencies, we also provide Python wrappers that make them easier to use. These Python scripts have the following dependencies:
For installation of CPB:
[user cpb]$ make
gcc -c -o cpb.o cpb.c -O2
gcc -c ../util/str2arr.c -O2
gcc -c ../util/read_matrix.c ../util/str2arr.c -O2
gcc -c cpb_fitness.c -O2
gcc -o cpb cpb.o str2arr.o read_matrix.o cpb_fitness.o -lm -D_WRITE_CLUSTER0
gcc init_bicluster.c -o init_bicluster -O2
For the installation of the correlation combination:
[user correlation]$ make
g++ correlation.cpp -o correlation -Wall -O2 -O2
A sample usage of CPB with the python wrapper:
[user cpb_source]$ python run_cpb.py DF=data.txt BF=found.txt NB=500 PCC=0.9 MO=0.25
<row indices> <column indices> separated by empty lines
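A listing in this form can be read back with a few lines of Python. The sketch below assumes that each bicluster consists of one line of whitespace-separated row indices followed by one line of column indices, and that biclusters are separated by empty lines. That reading of the format, and the function name parse_biclusters, are assumptions rather than something the CPB sources confirm.

```python
def parse_biclusters(text):
    """Parse biclusters from text in the assumed layout: a line of row
    indices, a line of column indices, then an empty separator line."""
    biclusters = []
    block = []
    # The extra "" guarantees the final block is flushed even when the
    # text does not end with an empty line.
    for line in text.splitlines() + [""]:
        stripped = line.strip()
        if stripped:
            block.append([int(token) for token in stripped.split()])
            continue
        if block:
            if len(block) != 2:
                raise ValueError("expected 2 lines per bicluster, got %d" % len(block))
            biclusters.append((block[0], block[1]))
            block = []
    return biclusters
```

For example, parse_biclusters(open("found.txt").read()) would return a list of (rows, columns) pairs if found.txt follows the assumed layout.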
[user cpb_source]$ python run_cpb.py
Examples of 4 different bicluster models can be found here. Each example dataset has 1000 rows and 200 columns with 2 biclusters embedded, without any noise or overlap. The expected biclusters of each data matrix are listed in expected.txt files under the corresponding folders.
Note that this script creates a temporary folder under "/tmp", which is removed when the job is completed. The location can be changed by setting the TMPDIR environment variable:
[user cpb_source]$ export TMPDIR=<the_path_to_temporary_directory>
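To check which temporary directory a Python process will pick up, the standard tempfile module can be queried directly. The sketch below assumes that the wrapper relies on Python's usual TMPDIR lookup, which the text above suggests but does not state. The helper name resolve_temp_dir is hypothetical.

```python
import os
import tempfile

def resolve_temp_dir(override=None):
    """Return the directory Python's tempfile module would use.

    If `override` is given, it is exported as TMPDIR first, mirroring the
    shell `export` command.
    """
    if override is not None:
        os.environ["TMPDIR"] = override
    tempfile.tempdir = None  # clear the cached value so TMPDIR is read again
    return tempfile.gettempdir()
```

Note that tempfile silently falls back to other locations if the directory named by TMPDIR does not exist or is not writable, so it is worth creating the directory before running the script.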
For the usage of correlation calculation:
[user correlation]$ correlation
Usage: correlation <inputfile> <#rows> <#cols> <reference row> <#biclusters>
To combine the results of several datasets, use run_correlation.py.
[user cpb_source]$ python run_correlation.py
reference_row : is the integer index of reference row where the first row is 0th row.
input_folder : from which the datasets and their results will be read. See real dataset examples for the example of format of the folder
output_file: where the output file should be written.
An example of using the correlation combining code:
[user cpb_source]$ python run_correlation.py 10722 <path_to_real_datasets_folder> 10722.txt
[user cpb_source]$ python run_correlation.py 11244 <path_to_real_datasets_folder> 11244.txt
[user cpb_source]$ python run_correlation.py 1273 <path_to_real_datasets_folder> 1273.txt
[user cpb_source]$ python run_correlation.py 14102 <path_to_real_datasets_folder> 14102.txt
[user cpb_source]$ python run_correlation.py 4057 <path_to_real_datasets_folder> 4057.txt
[user cpb_source]$ python run_correlation.py 7868 <path_to_real_datasets_folder> 7868.txt
These commands save the combined results in the given .txt files. Note that run_correlation.py expects the input folder to follow exactly this layout: for example, a subfolder named "GDS987" must contain a file named "GDS987.soft" as well as the biclusters found for reference row 10722, saved in the file 10722_found.txt.
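Before combining results, the folder layout can be verified with a small script. The sketch below only checks the two files named above for each dataset subfolder. The function name check_dataset_folder is hypothetical, and run_correlation.py may impose further requirements that are not covered here.

```python
import os

def check_dataset_folder(input_folder, reference_row):
    """Return the relative paths of missing files. An empty list means every
    subfolder has <name>.soft and <reference_row>_found.txt."""
    problems = []
    for name in sorted(os.listdir(input_folder)):
        subfolder = os.path.join(input_folder, name)
        if not os.path.isdir(subfolder):
            continue  # ignore stray files at the top level
        required = (name + ".soft", "%d_found.txt" % reference_row)
        for filename in required:
            if not os.path.isfile(os.path.join(subfolder, filename)):
                problems.append(os.path.join(name, filename))
    return problems
```

Calling check_dataset_folder with the path to the real datasets folder and 10722 would list every dataset for which the .soft file or the 10722_found.txt file is missing.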