TDAlab @ GA Tech.

Software

CPB: Correlated Patterns Biclustering

General information

CPB is a novel two-step biclustering approach based on Pearson correlation. It mines genes that are co-regulated with a given reference gene in order to discover genes that function in a common biological process. In the first step, the algorithm identifies subsets of genes with high correlation, reducing false negatives with a nonparametric filtering scheme. In the second step, biclusters from multiple datasets are used to extract and rank gene correlation information.
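
To make the correlation criterion concrete, the following minimal Python 3 sketch computes the Pearson correlation between a reference row and every row of a small matrix, restricted to a chosen subset of columns, and keeps the rows that reach a threshold. It is an illustration only and not the CPB implementation: the function names pearson and correlated_rows, the toy matrix, and the use of a simple ">= threshold" test are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant sequence; this sketch returns 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlated_rows(matrix, reference_row, cols, threshold=0.9):
    """Indices of rows whose correlation with the reference row, over `cols`, is >= threshold."""
    ref = [matrix[reference_row][c] for c in cols]
    selected = []
    for i, row in enumerate(matrix):
        values = [row[c] for c in cols]
        if pearson(ref, values) >= threshold:
            selected.append(i)
    return selected

# Toy data (hypothetical): row 1 follows row 0 on columns 0-2, row 2 does not.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 4.0, 6.0, 1.0],
    [3.0, 1.0, 2.0, 5.0],
]
rows = correlated_rows(data, reference_row=0, cols=[0, 1, 2])
print(rows)
```

The reference row always selects itself, since its correlation with itself is 1 whenever it is not constant over the chosen columns.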

Download

Latest release: cpb-11-4-2011

Synthetic dataset example: data_examples

40 real datasets with their query results for several probes: real_datasets

Dependencies

The CPB and correlation combination codes are written in C and C++ and have no other dependencies. We also provide Python wrappers that make them easier to use. The Python wrappers have the following dependencies:

Python
NumPy library for Python

Installation

For installation of CPB:

Run the make command in the cpb subfolder to compile.

[user cpb]$ make
gcc -c -o cpb.o cpb.c -O2
gcc -c ../util/str2arr.c -O2
gcc -c ../util/read_matrix.c ../util/str2arr.c -O2
gcc -c cpb_fitness.c -O2
gcc -o cpb cpb.o str2arr.o read_matrix.o cpb_fitness.o -lm -D_WRITE_CLUSTER0
gcc init_bicluster.c -o init_bicluster -O2

Add these two executables (init_bicluster and cpb) to a directory listed in $PATH.

For installation of the correlation combination code:

Go to the correlation directory and run make.
[user correlation]$ make
g++ correlation.cpp -o correlation -Wall -O2 -O2
Add the correlation executable to a directory listed in $PATH. This executable calculates correlation in a single dataset.

Usage

A sample run of CPB with the Python wrapper:

[user cpb_source]$ python run_cpb.py DF=data.txt BF=found.txt NB=500 PCC=0.9 MO=0.25

This will create a file "found.txt", in which the biclusters are in the format:

<row indices>
<column indices>
Biclusters are separated by empty lines.
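
The format above can be read back with a few lines of Python 3. The sketch below is not part of the CPB distribution. It assumes that the indices on each line are whitespace-separated integers, which the description above does not state, and the function name parse_biclusters is hypothetical.

```python
def parse_biclusters(text):
    """Parse bicluster text into a list of (row_indices, col_indices) tuples.

    Assumption: each bicluster is two lines of whitespace-separated integers
    (rows first, then columns), and biclusters are separated by empty lines.
    """
    biclusters = []
    block = []
    # The extra "" at the end flushes the last block if the text has no trailing blank line.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append([int(token) for token in line.split()])
        elif block:
            if len(block) != 2:
                raise ValueError("expected 2 lines per bicluster, got %d" % len(block))
            biclusters.append((block[0], block[1]))
            block = []
    return biclusters

# Hypothetical file content with two biclusters.
sample = "0 3 7\n1 2\n\n4 5\n0 2 9\n"
found = parse_biclusters(sample)
print(len(found), found[0])
```

To read an actual output file, pass its content, for example parse_biclusters(open("found.txt").read()). If the real files use a different delimiter, adjust the line.split() call accordingly.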

For a description of the parameters of CPB, run the command without arguments:

[user cpb_source]$ python run_cpb.py

Examples of 4 different bicluster models can be found here. Each example dataset has 1000 rows and 200 columns with 2 biclusters embedded, without any noise or overlap. The expected biclusters of each data matrix are listed in expected.txt files under the corresponding folders.

Note that this script creates a temporary folder under "/tmp", which is removed when the job is completed. To use a different location, set the TMPDIR environment variable:

[user cpb_source]$ export TMPDIR=<the_path_to_temporary_directory>
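
The behavior described above (a scratch folder under /tmp unless TMPDIR is set, removed when the job ends) can be sketched in Python 3 as follows. This is an assumption about how such a wrapper might be written, not the actual code of run_cpb.py; the name scratch_dir and the "cpb_" prefix are hypothetical.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def scratch_dir(prefix="cpb_"):
    """Create a scratch folder under $TMPDIR (or /tmp) and remove it afterwards."""
    base = os.environ.get("TMPDIR", "/tmp")
    path = tempfile.mkdtemp(prefix=prefix, dir=base)
    try:
        yield path
    finally:
        # Remove the folder even if the job inside the "with" block failed.
        shutil.rmtree(path, ignore_errors=True)

if __name__ == "__main__":
    with scratch_dir() as path:
        print("working in", path)
```

The folder given in TMPDIR must already exist; tempfile.mkdtemp does not create missing parent directories.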

To run the correlation calculation:

[user correlation]$ correlation
Usage: correlation <inputfile> <#rows> <#cols> <reference row> <#biclusters>

inputfile: the output bicluster file generated by cpb.
#rows: the number of rows in the dataset.
#cols: the number of columns in the dataset.
reference row: the index of the reference row, where the first row has index 0.
#biclusters: the number of biclusters contained in the output bicluster file.
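
When the correlation executable is called from a script, the argument order and the 0-based reference row are easy to get wrong. The Python 3 sketch below assembles the command line in the order given by the usage message and rejects a reference row outside the dataset. The helper name build_correlation_command is hypothetical, and the example values (1000 rows, 200 columns, 500 biclusters) are only illustrative.

```python
def build_correlation_command(bicluster_file, n_rows, n_cols, reference_row, n_biclusters):
    """Return the argument list for the correlation executable.

    Order follows the usage message:
    correlation <inputfile> <#rows> <#cols> <reference row> <#biclusters>
    """
    # The reference row is 0-based, so valid values are 0 .. n_rows - 1.
    if not 0 <= reference_row < n_rows:
        raise ValueError("reference row %d is outside 0..%d" % (reference_row, n_rows - 1))
    return [
        "correlation",
        str(bicluster_file),
        str(n_rows),
        str(n_cols),
        str(reference_row),
        str(n_biclusters),
    ]

cmd = build_correlation_command("found.txt", 1000, 200, 0, 500)
print(" ".join(cmd))
# To execute it once the executable is on $PATH:
#   import subprocess
#   subprocess.run(cmd, check=True)
```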

To combine the results of several datasets, use run_correlation.py.

[user cpb_source]$ python run_correlation.py
reference_row: the integer index of the reference row, where the first row has index 0.
input_folder: the folder from which the datasets and their results are read. See the real dataset examples for the expected folder format.
output_file: the file to which the output is written.

For a usage example of the correlation combining code:

Download the real dataset examples.

Extract the folder, then run the commands:

[user cpb_source]$ python run_correlation.py 10722 <path_to_real_datasets_folder> 10722.txt
[user cpb_source]$ python run_correlation.py 11244 <path_to_real_datasets_folder> 11244.txt
[user cpb_source]$ python run_correlation.py 1273 <path_to_real_datasets_folder> 1273.txt
[user cpb_source]$ python run_correlation.py 14102 <path_to_real_datasets_folder> 14102.txt
[user cpb_source]$ python run_correlation.py 4057 <path_to_real_datasets_folder> 4057.txt
[user cpb_source]$ python run_correlation.py 7868 <path_to_real_datasets_folder> 7868.txt

These commands save the combined results in the given txt files. Note that run_correlation.py expects the input folder to follow exactly this format: each subfolder, for example "GDS987", must contain a file with the same name and the .soft extension ("GDS987.soft"), together with the output biclusters found for the reference row, for example 10722_found.txt for reference row 10722.
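
Before running run_correlation.py on a new folder, the layout described above can be checked with a short Python 3 script. The sketch below is not part of the distribution. It assumes that every subfolder of the input folder is a dataset and that the two files named above are the only required ones; the function name check_dataset_folder and the second dataset name in the demo are hypothetical.

```python
import os
import tempfile

def check_dataset_folder(input_folder, reference_row):
    """Return a list of layout problems; an empty list means the layout looks right."""
    problems = []
    for name in sorted(os.listdir(input_folder)):
        subfolder = os.path.join(input_folder, name)
        if not os.path.isdir(subfolder):
            continue  # only subfolders are treated as datasets
        required_files = ("%s.soft" % name, "%d_found.txt" % reference_row)
        for required in required_files:
            if not os.path.isfile(os.path.join(subfolder, required)):
                problems.append("%s: missing %s" % (name, required))
    return problems

# Demo on a throwaway directory: "GDS987" is complete, "GDS_demo" lacks its bicluster file.
layout = {
    "GDS987": ["GDS987.soft", "10722_found.txt"],
    "GDS_demo": ["GDS_demo.soft"],
}
with tempfile.TemporaryDirectory() as root:
    for name, files in layout.items():
        os.makedirs(os.path.join(root, name))
        for filename in files:
            open(os.path.join(root, name, filename), "w").close()
    problems = check_dataset_folder(root, 10722)
print(problems)
```

On a real folder, call check_dataset_folder with the path to the extracted real datasets folder and the reference row that will be passed to run_correlation.py.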