yaqiz01 / pir

GNU General Public License v3.0
2 stars 0 forks source link

Outer Loop Counterchain copy #13

Closed yaqiz01 closed 7 years ago

yaqiz01 commented 8 years ago

CounterChain Copying Rule

Case 1: Copy ancestor CounterChain if it's used in descendant Inner Pipe Example:

MetaPipelA (0 until 10) { i =>
   MetaPipeB(0 until 5) { j =>
       PipeD(0 until 5) {  k => // CU1
            ... = i * k
       }
   }
   PipeC(0 until 5) { h => ... } // CU2
}

In this case A.cchain and B.cchain, are both copied at CU1 from CU2 even though only i is used at PipeD. D.cchain, B.cchain.copy and A.cchain.copy are chained together in order to drive A.cchain.copy at CU1

Case 2: Copy Sram writer's CounterChain for write addr calculation

Case 2.1. If writer's CounterChain is the innermost Pipeline's CounterChain, route enable of the copied CounterChain to reader of the Sram Example:

MetaPipe A { i =>
   MetaPipe B { j =>
       Pipe C { k => // CU1
           sram(k) = ...
       }
       Pipe D {} // CU2
   }
}
Pipe E { h => // CU3
   ... = sram(h)
}

PipeE needs value of k, so C.cchain is copied at Pipe E and enable of C.cchain is routed to E

Case 2.2. If copied CounterChain is from OuterController O that is sram writer's ancestor, copy the descendant cchain of O in branch that reaches the Innermost writer of the sram Example:

MetaPipe A { i =>
   MetaPipe B { j =>
       Pipe C { k => // CU1
           sram(i) = ...
       }
       Pipe D {} // CU2
   }
}
Pipe E { h => // CU3
   ... = sram(h)
}

In this case i is needed at PipeE, so C.cchain, B.cchain, and A.cchain are copied at PipeE. C.cchain.copy, B.cchain.copy and A.cchain.copy are chained together to drive A.cchain.copy at PipeE. @raghup17

yaqiz01 commented 8 years ago

Case 3.1:

MetaPipe A { i =>
   MetaPipe B { j =>
       Pipe C { k => // CU1
           sram1(i) = i
       }
       Pipe D { g => // CU2
           sram2(i) = i
       } // CU2
   }
}
Pipe E { h => // CU3
   ... = sram1(h)
   ... = sram2(h)
}

In this case i is used for write address for both sram1 and sram2, since for writer address counter chain copying, the copied counter chain has to represent the state of writer. According to case 1, A.cchain would be copied at both Pipe C and Pipe D. Does that mean we need to make two distinct copy of A.cchain at PipeE, one chained with C.cchain.copy and the other chained with D.cchain.copy at CU3? If it's pure data flow, there's no guarantee that i (A.cchain.copy) at PipeC and PipeD would be the same.

@dkoeplin, you don't need to worry about the other additional copies of counter chain that you don't generate in spatial. But if we need to make two copies of A.cchain at PipeE, it's probably better to do it more explicitly since it's quite nasty for me to make another copy of A.cchain and swap the usage of the other copy in write address stages, sram write port, etc. @raghup17

raghup17 commented 8 years ago

In this particular example, Metapipe B is the only stage in Metapipe A. This indicates that 'i' will be the same across Pipe C and Pipe D. So, technically, the following should suffice for this example:

  1. Copying only 'i' (Metapipe A's counter) into CU3
  2. Routing the token ('done' signal) from MetaPipe B to CU3 to increment 'i'.

Summarized (in ASCII art) below:

 +---------------+        +------------------+
 |    Pipe C     |        |    Pipe E        |
 |               |        |                  |
 |               |        |   +-------+      |
 |               | data   |   |sram1  |      |
 |    C k       i+-------->   |       |      |
 |    | |        |        |   +-------+      |
 |    B j        |        |                  |
 |    | |        |        |   +-------+      |
 |    A i        |        |   |sram2  |      |
 +---------------+        |   |       |      |
 +---------------+ data   |   +-------+      |
 |    Pipe D     +-------->                  |
 |               ||       |   A i            |
 |               ||       +----^-------------+
 |              i++            |
 |    D g        |             |
 |    | |        |    control  |
 |    B j+---------------------+
 |    | |        |
 |    A i        |
 +---------------+

I haven't given much thought into (if and) how this can be generalized into a rule.

However, I see your point; with a slightly modified example: Case 3.2

MetaPipe A { i =>
       Pipe C { k => // CU1
           sram1(i) = i
       }
       Pipe D { g => // CU2
           sram2(i) = i
       } // CU2
}
Pipe E { h => // CU3
   ... = sram1(h)  // Needs A -- C counter chain
   ... = sram2(h)  // Needs A -- D counter chain
}

In this example, 'i' is definitely different at Pipe C and Pipe D. Currently, we would have to create two counter chains (A--C and A--D) like you pointed out.

yaqiz01 commented 8 years ago

Wow how did you draw that graph?

Even with original example, there's no guarantee that i in Pipe C and Pipe D will be the same because in Pipe C, k - j - i Pipe D, g - j - i k and g can have different range and Pipe C and Pipe D can have different number of stages/single iteration latency If Pipe C and Pipe D are tile load, there's difference in offchip load latencies as well. At the moment I'm just throwing exception on this. Have to think how to express this in PIR more explicitly

raghup17 commented 8 years ago
  1. http://asciiflow.com/ :-)
  2. Agreed, Pipe C and Pipe D could finish at different times. However, to begin the each iteration of Metapipe A, we would still send a token when Metapipe B is done, which would set the init values of tokens in Pipe C and Pipe D. So in the current scheme, it seems like Pipe C or Pipe D cannot run ahead with different values of i. Doing so implies that we are executing two separate instances of Metapipe B. No?
yaqiz01 commented 8 years ago

If it's last iteration of j before incrementing i in both C and D, the token guarantees k and g starts at the same time, but when they saturate at different time, j would saturate at different time, which would increment i at different time, is this possible?

dkoeplin commented 8 years ago

Actually that's a decent point. Is the saturation of the C's copy of A's counterchain depending only on it's local copies of chains B and C? Or does it also receive a control signal from D to regulate when local A increments?

yaqiz01 commented 8 years ago

So that's the first case, when copied outer controller counter chain is used for local computation, such as read address or simply using the value, it's only locally chained, which doesn't depends on whether the original outer controller is enabled. The reason is that this local copy has to represent the reader's state. For example,

MetaPipe A { i =>
   Pipe B { j =>
       // use i
   } 
   Pipe C { k =>
       // use i
   }
}

In this case i is used at B and C, but since it's MetaPipeline , value of i is pipelined in B and C. So we can't just route original i's enable from C to B

yaqiz01 commented 8 years ago

Wait, for Case 2.2,

MetaPipe A { i =>
   MetaPipe B { j =>
       Pipe C { k => // CU1
           sram(i) = ...
       }
       Pipe D {} // CU2
   }
}
Pipe E { h => // CU3
   ... = sram(h)
}

Is there a reason that I can't just make a copy of i at E and route i's enable to E without j and k? since we only care about the writer's state

raghup17 commented 8 years ago

In the current scheme, parent counter copies are updated based on local copies of its children.

Think I didn't consider all cases above. To clarify:

Case 2.2: Enable signal of 'i' is currently high only for a single cycle when 'j' saturates, as it will be chained with 'j'. However, if you meant that we can route the token corresponding to Metapipe A and Metapipe B into Pipe E, we have to consider the 'write enable' signal. Specifically, depending on what is being written to sram in Pipe C, it may or may not be okay to use Metapipe A's token to drive 'i'. to E. Short reason: 'write enable' signal.

When a counter saturates, it currently wraps around to the min value (0) and stops. Consider the two cases: 2.2.1:

       Pipe C { k => // CU1
           sram(i) = i * k
       }

Here, the write data depends on 'k' (common case IMO). If the write enable signal is high for as long as Metapipe A is running, because 'k' wraps around on saturation, we may end up writing garbage to sram(i).

Case 2.2.2

       Pipe C { k => // CU1
           sram(i) = i
       }

Here, writes to sram are loop invariant in Pipe C. In such cases, it is fine to not duplicate 'k's counter in E.

Case 3.1: I agree that copies of 'i' and 'j' update at different times in Pipe C and Pipe D. In particular, copies of 'i' and 'j' in Pipe C are incremented before copies in Pipe D.

Specifically, when 'j' saturates in Pipe C, it increments its local 'i', but does not begin execution. When 'j' saturates in Pipe D later, it increments its local copy of 'i', but does not begin execution. At this point, we send a token that begins the execution of Metapipe B. Pipe C and Pipe D begin execution again with the same value of 'i'.

Hypothetically, say we duplicate the two chains (i-j-k and i-j-g) in Pipe E. This would work, no doubt. It's just that when 'j' and 'g' are enabled in E, the value of 'i' in both duplicated chains will be the same always.

Case 3.2: Currently, we would definitely have to duplicate both chains (i-k and i-g) in Pipe E.

Have I missed anything?

yaqiz01 commented 8 years ago

1 .

Case 3.1

Specifically, when 'j' saturates in Pipe C, it increments its local 'i', but does not begin execution. When 'j' saturates in Pipe D later, it increments its local copy of 'i', but does not begin execution. At this point, we send a token that begins the execution of Metapipe B. Pipe C and Pipe D begin execution again with the same value of 'i'.

Even though during execution of Pipe C and Pipe D i would be the same, since write enable is on all the time, won't we write incorrect value when PipeC and PipeD are not sync on i and when not executing if we use one of their copy in Pipe E only?

2 .

I'm a bit lost on Case 2.2.1 and Case 2.2.2. For Pipe C in both cases, it's using parent counter i for local computation, so it would look like k - j.copy - i.copy inside Pipe C in both cases. So one way in Pipe E for remote write is do k.copy - j.copy - i.copy, would this be different than have only i.copy and sync the i.copy in Pipe C directly with i.copy in Pipe E. When I said routing the enable, I meant route whatever that's connected to the enable of i.copy inside PipeC, which is done of j.copy in PipeC to Pipe E directly