Allow caller to specify value for padding shorter of Zip sequences

GoogleCodeExporter commented 9 years ago

I think it would be good to allow the caller to control what is being used
to pad the shorter collection if desired, rather than relying on
default(TResult).  Especially for numerics, where 0 may very well have a
meaning for the caller, other than "not there".

Since this would change the method signature, new method overrides would
probably be required.  These methods would probably have no need for a zip
strategy argument either.

Here is the method signature I would envision:
private static IEnumerable<TResult> ZipImpl<TFirst, TSecond, TResult>(
    IEnumerable<TFirst> first, 
    IEnumerable<TSecond> second, 
    Func<TFirst, TSecond, TResult> resultSelector,
    TFirst padFirst,
    TSecond padSecond)

Original issue reported on code.google.com by CAmmerman@gmail.com on 16 Feb 2009 at 3:05

GoogleCodeExporter commented 9 years ago

"overrides" in my original note should be "override".  Didn't mean to imply 
existing
methods should be replaced, but rather just a new one should be added.

Original comment by CAmmerman@gmail.com on 16 Feb 2009 at 3:54

GoogleCodeExporter commented 9 years ago

I can't see how you get away without the overload taking a strategy - you still 
need 
to differentiate between Pad, Truncate and Fail. How would you make the above 
signature handle truncation and failure?

Original comment by jonathan.skeet on 16 Feb 2009 at 7:37

GoogleCodeExporter commented 9 years ago

My thought was that adding the pad argument to the existing methods would
overcomplicate the signature because it's only necessary for one of the three
strategies.  The signature I specified above would be a separate overload from 
the
ones already there.

Original comment by CAmmerman@gmail.com on 16 Feb 2009 at 1:14

GoogleCodeExporter commented 9 years ago

When I originally submitted the Zip implementation, I intentionally left a 
comment
that referenced a similar discussion regarding zip in Python 3000 as food for
thought. Since the comment has been lost in translation, :) I'll re-iterate it 
here:

[Python-3000] have zip() raise exception for sequences of different lengths
http://mail.python.org/pipermail/python-3000/2006-August/003338.html

The discussion is very interesting and worth reading up on before settling here.

The current use of an enumeration (ImbalancedZipStartegy) to express the 
required
behavior seems awkward because it's not something you want to be able to vary
externally, which is usually why you would make a parameter or argument. Plus, 
it's a
little cumbersome to write and read back in code and not very LINQ-ish in 
spirit. The
enumeration could certainly be used internally for sharing the implementation 
logic
but the public API should use distinct and clear operator names that imply the
strategy. For example:

- EquiZip:
  Sequences must be equal in length otherwise throw error

- Zip: 
  Truncates to shorter of the two sequences

- ZipLongest: 
  Uses longer of the two sequences while padding shorter of the two

Original comment by azizatif on 16 Feb 2009 at 5:19

GoogleCodeExporter commented 9 years ago

Original comment by azizatif on 16 Feb 2009 at 11:05

Changed title: Allow caller to specify value for padding shorter of Zip sequences

GoogleCodeExporter commented 9 years ago

I agree with Aziz that using the strategy enum seems like it might be more 
cumbersome
with little added value for the API consumer... This doesn't mean we wouldn't 
want to
expose it or use it internally, just that simpler methods might be warranted as 
well.

I think we could conceivably condense all behavior into two methods both named 
Zip...

One method would either truncate or throw, if the sequences aren't of the same 
length:
void Zip(this IEnumerable<T1> src, IEnumerable<T2> other, Boolean throwIfUneven)

The other would take generators for each sequence in case one should be 
shorter, and
if the generator is null, use the "default" for that type:
void Zip(this IEnumerable<T1> src, IEnumerable<T2> other, Func<T1> srcPadGen,
Func<T2> otherPadGen)

Original comment by CAmmerman@gmail.com on 17 Feb 2009 at 12:35

GoogleCodeExporter commented 9 years ago

I removed the comment about Python on the grounds that I'd implemented the 
optional 
functionality, and referring to Python probably wouldn't actually help our 
users :)

Personally I still think the enumeration is the simplest way to go - using a 
boolean 
for throw or not feels like the kind of thing that reduces readability: when 
looking 
at the code in 6 months time, you'd be pretty much *required* to check what 
that 
parameter was.

I can see the benefit in the padding generators in that it allows a non-default 
padding, but that's all - and I suspect you'd rarely want to *actually* express 
both, 
because you'd probably know which is going to be longer.

Don't forget that in the current setup the client doesn't *have* to specify the 
enum 
at all - it defaults to Truncate.

Original comment by jonathan.skeet on 17 Feb 2009 at 6:27

GoogleCodeExporter commented 9 years ago

Fair point on readability for the boolean argument.

It's anecdotal, but in my personal use I far more often want to pad the short 
side
than truncate the long side.  I use the generators all the time.  I have no 
qualms
about making truncate the default behavior, but I think the generator versions 
would
also get a good amount of use.

Especially when working through the IEnumerable interface rather than directly 
with
List, Array, etc.,  it's not at all uncommon to not know which side is longer 
before
zipping.  In fact, needing to know ahead of time I think sort of defeats the 
purpose
of the deferred execution.

If we want to support padding generators, we'd pretty much have to require both 
to be
provided (at least filled in with a null to indicate default), because the way
generics works would prevent us from specifying one overload for each side.  
(They'd
conflict with each other when the element types are the same.)

Original comment by CAmmerman@gmail.com on 17 Feb 2009 at 2:34

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

> referring to Python probably wouldn't actually help our users :)

I was more thinking of MoreLINQ developers and maintainers until we polish 
(docs) for
1.0. :) Meanwhile, is the MoreLINQ audience Mort or Elvis? ;)

> Personally I still think the enumeration is the simplest way to go

We risk going on forever with personal preferences and subjectivity. The right
solution may be simple, but we may not be there yet. :)

The argument is no longer whether we should provide the functionality or not, 
but
rather how to expose it and using what signature(s). There's already agreement 
that a
Boolean definitely reduces readability. The next thing that should be avoided 
is an
explosion of overloads to do and mean everything. As for the 
ImbalancedZipStrategy,
I've pointed out before that making the strategy an argument of the operator 
via an
enumeration (or even polymorphic type) would be interesting if the problem that
needed solving was that the strategy is a variable of the run-time. I think we 
can
also agree that's not the case here. If that's not convincing by itself, then 
there
is a very subtle yet major idea behind extension methods and LINQ that the
enumeration works against. If the enumerable source type wants to provide an
optimization for an operator, it can do so by providing a method by the same 
name and
signature as the operator in question. If we use an enumeration that comes from
MoreLINQ then the other type has to take in a dependency that won't be looked 
upon
lightly. If we take the approach of embodying the strategy in the name 
(assuming this
is a compile-time decision), then we have simple signatures with Zip, EquiZip 
and
ZipLongest being distinct and clear names. Now, if I have a type called
SuperDuperList<T> that wants to provide an optimization for Zip, EquiZip and
ZipLongest (or any one of the three) then I can do so using simply base and 
generic
types. Right now, with the enumeration, I have no choice but to support all
strategies and take a hit on MoreLINQ! Think about it. This is how
Enumerable.Contains works. When using Contains on variant typed as List<T> or
IList<T>, the LINQ extension method is not used! If one wants to force use of 
the
LINQ's Contains implementation then one has to hop over AsEnumerable first. So 
to
stay objective, we need to see which approach doesn't take us to far away from 
how
LINQ operators should be designed, taking built-in ones as guidance.

Original comment by azizatif on 17 Feb 2009 at 3:39

GoogleCodeExporter commented 9 years ago

I don't want to resurrect any painful discussions here, but unless you envisage 
more 
than the three strategies already specified, I prefer the idea of the strategy 
being 
inferred by the method name.

  - Zip
  - PaddedZip

I don't think the truncated version is required because if you lazily evaluate 
the 
results of Zip and PaddedZip you can just do PaddedZip.Take or Zip.Take to get 
the 
shorter of the two.

Original comment by jeffma...@gmail.com on 10 Mar 2009 at 10:05

GoogleCodeExporter commented 9 years ago

@jeffmagic: You've got my vote on strategy being part of the method name. ;)

Original comment by azizatif on 10 Mar 2009 at 11:05

GoogleCodeExporter commented 9 years ago

I've taken sub-issue about distinct method names instead of a strategy 
enumeration 
and created issue #24 out of it.

Original comment by azizatif on 7 Apr 2009 at 7:38

GoogleCodeExporter commented 9 years ago

Great to see some progress on MoreLinq again. I have implemented a patch for 
this and pushed it to my Bitbucket fork (it seems I do no longer have commit 
access to this repo and I don't know how GoogleCode supports pull requests). 
Code is here 
https://bitbucket.org/johannesrudolph/morelinq/commits/b607e78b404bd86ff83a7f9ae
85c511397b06a7a

Please pull it in the trunk if it's acceptable.

Original comment by jojo.rudolph@googlemail.com on 30 Jun 2013 at 9:37

Changed state: Started

GoogleCodeExporter commented 9 years ago

@jojo.rudolph: While I can imagine rare cases (though I have yet to come across 
it) where the last value could be handy, it's better to keep the simpler 
overload where the default is given instead of the one accepting providers. 
When case for providers builds stronger, it can always be added back in a 
future version.

Defaults can be implemented as shown below but it could become verbose so 
providing defaults can be a bit more palatable, especially in the midst of long 
query.

var ns = MoreEnumerable.GenerateByIndex(n => (double?) n * 0.5);

var xy = 
    ns.Take(5).ZipLongest(ns.Take(10), 
                          (x, y) => new 
                          { X = x ?? double.NaN, 
                            Y = y ?? double.NaN });

foreach (var e in xy)
    Console.WriteLine(e.ToString());

// prints:
// { X = 0, Y = 0 }
// { X = 0.5, Y = 0.5 }
// { X = 1, Y = 1 }
// { X = 1.5, Y = 1.5 }
// { X = 2, Y = 2 }
// { X = NaN, Y = 2.5 }
// { X = NaN, Y = 3 }
// { X = NaN, Y = 3.5 }
// { X = NaN, Y = 4 }
// { X = NaN, Y = 4.5 }

Original comment by azizatif on 1 Jul 2013 at 9:53

wforney / morelinq

Allow caller to specify value for padding shorter of Zip sequences #6