Bootstrap sample for each tree

GoogleCodeExporter commented 9 years ago

Hi Abhishek,

I am trying to extract the exact bootstrap sample used in each tree.
The return value of inbag tells me which samples where in bag for a certain 
tree. However it does not tell me how often each sample was selected.
Is there a way to find this out?

I would need this to be able to reproduce the gini impurity value of each node 
of a tree.

Thank you for your answer.
Johannes

Original issue reported on code.google.com by johannes...@gmail.com on 11 May 2012 at 10:57

GoogleCodeExporter commented 9 years ago

Hi Johannes

i think the package does not tell what examples were inbag, just in general 
that a particular example is inbagged this many number of times overall.

if this information was available if would be very memory intensive to note 
down that a particular set of examples were used at a particular node in the 
tree. maybe you could writeline everything to a file as it is constructed. 
would that help?

also there might be a bit of issue if you are using regression RF, the thing is 
that a new dataset is created from bagging and the original example list is 
sort of lost to the tree creation. but i know the exact place to do this for 
classification

Original comment by abhirana on 11 May 2012 at 8:08

GoogleCodeExporter commented 9 years ago

Hi,

thank you for your answer. I guess I was able to get the bootstrap samples.

To obtain the samples which are in bag you simply have to enable the option in 
the parameters.

rfopt.keep_inbag = true; %keep track if a sample is in-bag for a tree

The classifier then returns a matrix with the size of number_of_samples x 
number_of_trees. However this is a binary matrix not giving the number of times 
a sample was selected.

I digged into the code of the classRF.cpp and found that the variable jin holds 
the inbag values. As I was to lazy to introduce a new variable I simply changed 
the assignment in lines 378 and 414 from

jin[k] = 1;

to

jin[k] = jin[k] + 1; //get number of times selected for the bootstrap sample

This now returns the number of times a sample is selected in each bootstrap 
sample. At first sight it seems to work and not to interfere with any other 
calculations as jin is only checked for zero values.

I will now check if I can reproduce the gini impurity values.

Johannes

Original comment by johannes...@gmail.com on 14 May 2012 at 1:24

GoogleCodeExporter commented 9 years ago

Cool

Lol I totally forgot that option

Yup changing jin is correct.  The logic won't be affected as all what is 
checked is whether it's 0or not 0

Original comment by abhirana on 14 May 2012 at 8:03

GoogleCodeExporter commented 9 years ago

hi,

I managed to reconstruct the Gini Impurity (GI) of every single node. Therefore 
I used the bootstrap sample, the best feature and the threshold. With the GI 
the decrease in GI is calculated. However my results do not match the outputs 
from the RF.

Do you have any idea how Breiman calculates the decrease? Right now I am 
looking into the Fortran code but haven't understood it yet. I guess 
findbestsplit is the right method to look at.

Here is how I calculate the GI and the decrease in GI. 
The GI you can see in image 1. L is the number of total classes and p_l^w is 
the frequency of samples of class l in the node w (p_l^w = n_l_w / n_w).
The decrease in GI is simply the difference of the GI of node w, with its two 
child nodes w1, w2 weighted by the samples of the child  nodes.
dec(w) = i(w) - n_w1/n_w * i(w1) - n_w2/n_w * i(w2) 

Do you see any errors I miss?

Original comment by johannes...@gmail.com on 27 May 2012 at 8:51

Attachments:

gini_impurity.PNG

GoogleCodeExporter commented 9 years ago

hey johannes
the below is how breiman calculates the decrease. i ignored the weights of the 
classes in the equation btw.

parentNum =0, parentDen =0
for i=0:(num_class-1)
   parentNum += tclasspop[i] * tclasspop[i] (make sure these are doubles else you will have overflow)
   parentDen += tclasspop[i]
end

Sort feature k and do the following for the examples according to the sort order

rrn = parentNum;
rrd = parentDen;
rln = 0
rld = 0;
wl[0..(num_class-1)] = 0;
wr[0..(num_class-1)] = tclasspop[0..(num_class-1)]; //put the class population 
in wr

crit0 = parentNum/parentDen;  

//basically 
//crit0 = sum(classpopulation_{i=1:num_class}^2) / 
sum(classpopulation_{i=1:num_class})

for example_num = start to end
    counts = number of times example was oobed
    rld = rld + counts
    rrd = rrd - counts
    rln = rln + counts * (2 * wl[y_label[example_num]] + counts );
    rrn = rrn + counts * (-2 * wr[y_label[example_num]] + counts );
    crit = (rln / (GLOBAL_DATA_TYPE) rld) + (rrn / (GLOBAL_DATA_TYPE) rrd);
    wl[y_label[example_num]] = wl[y_label[example_num]] + counts;
    wr[y_label[example_num]] = wr[y_label[example_num]] - counts;
end
critMax = highest crit

pick the feature with highest crit
and then the decision of the split is
decsplit = critMax - crit0;  

decomposing crit and ignoring the counts for the time being. if class_i is the 
class for the current example then the crit value is
crit = rln/rld + rrn/rrd
     = (2 * num_nodes_of_class_i_on_left)/ num_nodes_on_left  -  (2*num_nodes_of_class_i_on_right) / num_nodes_on_right 

maybe expanding dec(w) should give a similar answer as how breiman calculates 
it?

Original comment by abhirana on 28 May 2012 at 4:03

GoogleCodeExporter commented 9 years ago

Hey abhishek,

thank you for your fast answer. It helped me figuring out some details on 
Breiman's code.
However what I still do not fully understand is the part

for example_num = start to end
   ....
end

Especially the value of the variable 'counts' in your code is unclear to me. 
Could you please explain me in more detail what you mean by "number of times 
example was oobed"? Which example is meant here? And over which examples do I 
have to iterate?

In the code of Breiman there is first the iteration over the mtry samples and 
then the iteration over nsp. Is it correct that the loop over ndstart to ndend 
- 1 is over the possible thresholds for the split?

do mt = 1, mtry
    ...
    do nsp = ndstart, ndend-1
        ...
    end
end

I haven't yet understood the whole impurity criterion as used in the code of 
Breiman. What is the theoretical concept of it? I always thought the Gini 
importance is based on the mean decrease of Gini impurity?

Once again thank you for your help.
Johannes

Original comment by johannes...@gmail.com on 3 Jun 2012 at 10:12

GoogleCodeExporter commented 9 years ago

"for example_num = start to end
   ....
end

Especially the value of the variable 'counts' in your code is unclear to me. 
Could you please explain me in more detail what you mean by "number of times 
example was oobed"? Which example is meant here? And over which examples do I 
have to iterate?"

my code and randomforest-R code uses two different version of creating the 
dataset for individual trees when it comes to regression and classification.
RF uses bagging so roughly only 63% of the original dataset will be used in 
training in the new dataset for each tree, and within that 63% examples, many 
will be chosen multiple time, because we are using bagging with replacement.

if you examine the classification code, what is done for each tree is that a 
new dataset is not created and we keep track of how many times an example was 
in the bag. i had a typo there where i said, oobed, i mean inbagged. whereas 
for regression code, a new dataset is created and what examples in the new 
dataset correspond to the original dataset is somewhat lost.

now for the tree example
function create_tree(all examples dataset)
   % this is strictly not a function, its more about how the function recursively searches the middle point
   set initial start_range to 1 and end_range to N(number of examples
   while some condition (like number of nodes in tree < some value)  
     middle_point, indx = findbestsplit(start_range, end_range)
     %reshape the input array according to the indx sent
     % due to the complicated nature of representing the indx array i am not putting it below

     next iteration search
     middle_point_1 = findbestsplit(start_range, middle_point)
     middle_point_2 = findbestsplit(middle_point+1, end_range)

     in the next to next iteration
     middle_point_11 = findbestsplit(start_range, middle_point_1)
     middle_point_12 = findbestsplit(middle_point_1+1, middle_point)
     middle_point_21 = findbestsplit(middle_point+1, middle_point_2)
     middle_point_22 = findbestsplit(middle_point_2+1, end_range)

    and so on

function middle_point, indx = findbestsplit(start_range, end_range)
    for mt = 1, mtry
      % this is probably the most important part
      % sort the feature mt for the examples
      % so that the indices of the examples and the feature values (in ascending order) are obtained
      for nsp = ndstart, ndend-1
         here it calculates the gini index in a sorted order.
      end
    end
    % the middle point is where the gini index was maximum
    % indx is the sort of the examples for the feature whose gini index was maximum
    % note thtat the indx is then used to reshape the original array so that
    % all examples left of the middle point in subsequent iteration are together and all examples right of the middle point are together

breiman's classicification code has a real nice shortcut of using a presorted 
array. it helps it in scaling RF as O(n) rather than as O(n log n). this though 
has complicated the whole classification code and you wont see sorting done at 
all within the code (except initally). a better way would be to look at the 
regression code where you will see the sorting being done and it is a bit more 
easier to read. except for how gini/deviance is calculated actually regression 
and classification code do the same thing (but do it differently, presorting 
vs. sorting on the fly).

counts = the number of times a particular examples is inbagged and as the gini 
wont change if all the examples (same example but inbagged multiple times) are 
the same so the increment/decrement of counts works 

"In the code of Breiman there is first the iteration over the mtry samples and 
then the iteration over nsp. Is it correct that the loop over ndstart to ndend 
- 1 is over the possible thresholds for the split?

do mt = 1, mtry
    ...
    do nsp = ndstart, ndend-1
        ...
    end
end
"

ndstart-ndend-1 dynamically change as the tree progresses. yeh i guess in 
conjunction with the sorted order it helps in calculating all possible splits.

it may just be the fact that the sorted order of values helps in calculating 
the max (i am not sure its the max gini, my gini-related knowledge is limited) 
http://mathworld.wolfram.com/GiniCoefficient.html (the 2i-n-1 formula) 
in a nice closed formed iterative way. 

i think it is possible to derive what is being done, ie finding the middle 
point for sorted values and that has maximum gini, and relate to the gini coeff 
but it will take me a few more days.

Original comment by abhirana on 6 Jun 2012 at 7:29

GoogleCodeExporter commented 9 years ago

Hi to all! 3 columns inputs as double. Years,a vector and precipitations. How 
can create one script m file that it calls for N=5000 repetitions random as 
bootstrap method for the three columns? I have one csv file called year with 
these 3 columns as double.

Original comment by emotion....@gmail.com on 16 Sep 2013 at 10:18

zhangjun001 / randomforest-matlab

Bootstrap sample for each tree #38