yeonjuan / parse-git-diff

parse git diff
MIT License
28 stars 3 forks source link

feat: support generator #28

Open madeindjs opened 10 months ago

madeindjs commented 10 months ago

It might improve the performances to use a Generator when reading big files.

I just added an overide function which allow this

export default function parseGitDiff(
  diff: Generator<string, any, unknown>,
  options?: GitDiffOptions
): Generator<AnyFileChange, any, unknown>;
yeonjuan commented 10 months ago

Hi @madeindjs Thanks for the contribution. I'm wondering if it slows down a lot when using large files as input, have you had any cases like that?

madeindjs commented 10 months ago

So the thing is I tried with a large diff (25Mo) and the performances are still fine. I just tried to experimente some stuff to improve the performances: The execution time was identical. But it surely improve memory consumption. It's might also be cool if user wants to iterate on each changes without storing them in memory.

It's up to you to decide if this PR is good to have or not.

BTW, I tried to go further and use AsyncGenerator when reading the file (here) like this

async function* getGenerator() {
  const stream = createInterface({ input: createReadStream(FILE) });
  for await (const line of stream) yield line;
}

for await (const change of parseGitDiffAsync(getGenerator())) console.log(change);

...but my conslusion was the execution time become worse (aroud 2 times slower, probably because it requires that I add async everywhere). But it can still be intersting for large file which doesn't fit in memory.

yeonjuan commented 10 months ago

But it surely improve memory consumption. It's might also be cool if user wants to iterate on each changes without storing them in memory.

@madeindjs Agree! Thanks for the explanation. What do you think about the supporting streams instead of using generators? After seeing the above example, I realized that it would be more usable if we support streams.

madeindjs commented 10 months ago

Yes I agree, I'll update the PR with the support of stream instead of generator.

My only concern is I duplicated parse-git-diff.ts as parse-git-diff-async.ts to be able to introduce AsyncGenerator. I didn't find a way to factorize things because everything need to be async.

madeindjs commented 10 months ago

@yeonjuan , I added the support of Interface from Node.js readline package. It allow a sexy syntax like:

import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';
import parseGitDiffAsync from './parse-git-diff-async.js';

const FILE = '/home/alexandre/Downloads/v3_2_0...v3_3_0.diff';
const stream = createInterface({ input: createReadStream(FILE) });
for await (const change of parseGitDiffAsync(stream)) i++;

The cool thing is readline might allow user to introduce CLI tools which allow pipe operator from the shell like git diff | script

Regarding performances , I noted thoses performances using my 25mb diff

method time
parseGitDiff(diff: string) 0.30s
parseGitDiff(diff: Generator) 0.25s
parseGitDiffAsync(diff: AsyncGenerator) 0.47s

I let you take a look at the PR and comment. As I said, I'm not that happy of the code because I copy/pasted many stuff. There is surely a way to factorize it better.

yeonjuan commented 10 months ago

Hi @madeindjs Thanks for the work! I agree that the code needs to be cleaned up, and I'm thinking about separating the public interfaces. In the end, I'd like to have something like this.

import parseGitDiff, { parseGitDiffs, parseGitDiffsGenerator, parseGitDiffsStream } from 'parse-git-diff';

parseGitDiff; // deprecated

parseGitDiffs; // synchronized version
parseGitDiffsGenerator; // for supporting generator
parseGitDiffsStream; // for supporting stream

And it would be nice to change the output format for consistency.

This might take a little longer. What do you think about what it would be like if I worked on this PR base?

madeindjs commented 10 months ago

Yes I agree that the ouput needs to be changed, but I didn't do it since it's a breaking change for the API.

I personally prefer the overides way because it looks better in term of developper experience. But it's your librairy, do as you like you prefer, don't worry :wink:

Of course, I'm fine that you work on this PR.