CPB is a novel two-step Pearson correlation based biclustering approach to mine genes that are co-regulated with a given reference gene in order to discover genes that function in a common biological process. In the first step, the algorithm identifies subsets of genes with high correlation, reducing false negatives with a nonparametric filtering scheme. In the second step, biclusters from multiple datasets are used to extract and rank gene correlation information.
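The idea of selecting genes by their Pearson correlation with a reference gene can be sketched as follows. This is only an illustration and not the CPB implementation: the function names, the `threshold` parameter, the use of the signed (not absolute) coefficient, and the handling of constant rows are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant row; treated as 0 here (assumption).
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rows_correlated_with(data, reference_row, cols, threshold):
    """Indices of rows whose correlation with the reference row, measured
    over the column subset `cols`, is at least `threshold`."""
    ref = [data[reference_row][c] for c in cols]
    selected = []
    for i, row in enumerate(data):
        values = [row[c] for c in cols]
        if pearson(ref, values) >= threshold:
            selected.append(i)
    return selected
```

Restricting the comparison to a column subset is what makes this a bicluster-style criterion: a gene may correlate strongly with the reference gene under some conditions and not under others.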
Latest release: cpb-11-4-2011
Synthetic Dataset Example: data_examples
40 real datasets with their query results for several probes: real_datasets
Although the CPB and correlation combination codes are written in C and C++ and have no other dependencies, we also provide Python wrappers that make them easier to use. These Python scripts have the following dependencies:
For installation of CPB:
[user cpb]$ make
gcc -c -o cpb.o cpb.c -O2
gcc -c ../util/str2arr.c -O2
gcc -c ../util/read_matrix.c ../util/str2arr.c -O2
gcc -c cpb_fitness.c -O2
gcc -o cpb cpb.o str2arr.o read_matrix.o cpb_fitness.o -lm -D_WRITE_CLUSTER0
gcc init_bicluster.c -o init_bicluster -O2
For the installation of the correlation combination:
[user correlation]$ make
g++ correlation.cpp -o correlation -Wall -O2 -O2
A sample usage of CPB with the python wrapper:
[user cpb_source]$ python run_cpb.py DF=data.txt BF=found.txt NB=500 PCC=0.9 MO=0.25
<row indices>
<col indices>
separated by empty lines
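A file in this layout can be read back with a few lines of Python. The sketch below assumes that the indices on each line are whitespace-separated integers, which the description above does not state explicitly; `parse_biclusters` is a hypothetical helper name and is not part of the CPB distribution.

```python
def parse_biclusters(text):
    """Parse biclusters stored as a row-index line followed by a
    column-index line, with empty lines between biclusters.
    Assumes whitespace-separated integer indices."""
    biclusters = []
    block = []
    # The extra "" at the end flushes the last block if the text
    # does not end with an empty line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line:
            block.append([int(tok) for tok in line.split()])
        elif block:
            if len(block) != 2:
                raise ValueError("expected 2 lines per bicluster, got %d" % len(block))
            biclusters.append((block[0], block[1]))
            block = []
    return biclusters
```

Each returned item is a `(rows, cols)` pair, so a call such as `parse_biclusters(open("found.txt").read())` would give one pair per bicluster.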
[user cpb_source]$ python run_cpb.py
Examples of 4 different bicluster models can be found here. Each example dataset has 1000 rows and 200 columns with 2 biclusters embedded, without any noise or overlap. The expected biclusters of each data matrix are listed in expected.txt files under the corresponding folders.
Note that this script creates a temporary folder under "/tmp", which is removed when the job is completed. The parent directory can be changed by setting the TMPDIR environment variable:
[user cpb_source]$ export TMPDIR=<the_path_to_temporary_directory>
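For readers who want the same behaviour in their own scripts, the pattern of a scratch folder that honours TMPDIR and is removed afterwards can be sketched with the Python standard library. How run_cpb.py implements this internally is not documented here, so the code below is an assumption about the general technique; `run_in_scratch_dir` and the `cpb_` prefix are hypothetical names.

```python
import os
import shutil
import tempfile

def run_in_scratch_dir(job):
    """Create a scratch folder, call job(path), and always remove the folder."""
    # If TMPDIR is unset, dir=None lets tempfile fall back to its
    # default location (typically /tmp on Linux).
    scratch = tempfile.mkdtemp(prefix="cpb_", dir=os.environ.get("TMPDIR"))
    try:
        return job(scratch)
    finally:
        # Clean up even if the job raised an exception.
        shutil.rmtree(scratch, ignore_errors=True)
```

The `try`/`finally` keeps the temporary location clean when a run fails partway through.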
For the usage of correlation calculation:
[user correlation]$ correlation
Usage: correlation <inputfile> <#rows> <#cols> <reference row> <#biclusters>
To combine the results of several datasets, use run_correlation.py.
[user cpb_source]$ python run_correlation.py
reference_row: the integer index of the reference row, counting from 0 (the first row is row 0).
input_folder: the folder from which the datasets and their results will be read. See the real dataset examples for the expected folder format.
output_file: the path where the output file should be written.
A usage example of the correlation combining code:
[user cpb_source]$ python run_correlation.py 10722 <path_to_real_datasets_folder> 10722.txt
[user cpb_source]$ python run_correlation.py 11244 <path_to_real_datasets_folder> 11244.txt
[user cpb_source]$ python run_correlation.py 1273 <path_to_real_datasets_folder> 1273.txt
[user cpb_source]$ python run_correlation.py 14102 <path_to_real_datasets_folder> 14102.txt
[user cpb_source]$ python run_correlation.py 4057 <path_to_real_datasets_folder> 4057.txt
[user cpb_source]$ python run_correlation.py 7868 <path_to_real_datasets_folder> 7868.txt
These commands save the combined results in the given txt files. Note that run_correlation.py expects the input folder to follow exactly this layout: for example, a subfolder named "GDS987" must contain a file named "GDS987.soft" as well as the output biclusters found for the reference row, which for reference row 10722 are saved in the file 10722_found.txt.
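Before running the combination step, the folder layout described above can be checked with a short script. This is a sketch based only on that description: it assumes every subfolder of the input folder is a dataset and that the `<reference_row>_found.txt` file sits inside the dataset subfolder. `missing_inputs` is a hypothetical helper and is not shipped with CPB.

```python
import os

def missing_inputs(input_folder, reference_row):
    """Return the paths of expected files that are absent.
    For each subfolder <name>, expects <name>/<name>.soft and
    <name>/<reference_row>_found.txt."""
    missing = []
    for name in sorted(os.listdir(input_folder)):
        sub = os.path.join(input_folder, name)
        if not os.path.isdir(sub):
            continue  # ignore stray files at the top level
        for fname in (name + ".soft", "%d_found.txt" % reference_row):
            path = os.path.join(sub, fname)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

An empty result means every dataset subfolder has both files for the given reference row.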