ycm-core / ycmd

A code-completion & code-comprehension server
https://ycm-core.github.io/ycmd/
GNU General Public License v3.0

Language-model Based Completer #1583

Closed: xloem closed this issue 2 years ago

xloem commented 3 years ago

Hey, using new models such as EleutherAI's GPT-J or NovelAI's Genji, or APIs such as AI21's or OpenAI's, a completer could sometimes complete an entire function body.

I might be able to help implement this a little, although I have a serious cognitive disability so it's a weak gamble.

@bilucodota will tabnine be opening a pull request for the ycm changes? Is my link in the expandable details below what you mean about registering a general completer?

Notes on implementing a completer

- ycm will find general completers if they are added to the [general completer store](https://github.com/ycm-core/ycmd/blob/master/ycmd/completers/general/general_completer_store.py#L24) (see the sketch after these notes)
- ycm has [official documentation for writing new completers](http://ycm-core.github.io/YouCompleteMe/#writing-new-semantic-completers); I haven't read it myself
- if needed, [ycmd's protocol documentation](https://ycm-core.github.io/ycmd/) can clarify the existing ycmd interface

Notes on using code prediction models

- genji is based on gpt-j, which is presently being [merged into the mainstream transformers library](https://github.com/huggingface/transformers/pull/13022). There may be a new form of genji after the merge. Additionally, there are links in that thread towards approaches for lower-end systems. Genji itself has a split form that can work on lower-end systems.
- I have a [simple hack towards a unified interface](https://github.com/xloem/codesynth) for some of the code prediction models out there. I'm new to machine learning and don't know where to find existing similar work.
- the community that made genji recommends https://fast.ai/ for learning machine learning; I have not gone through the course myself.
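To make the registration idea concrete, here is a minimal sketch of a general completer backed by a language model. It assumes ycmd's `Completer` base class and the `responses.BuildCompletionData` helper roughly as they exist in the tree; the `lm_backend` module, its `generate_completions` function, and the way the prompt is rebuilt from the request are placeholders that would need checking against the real APIs.

```python
# Hypothetical sketch of a ycmd general completer backed by a language model.
# Completer and BuildCompletionData are assumed from ycmd itself; lm_backend
# is an invented placeholder for whatever local model or remote API actually
# produces the continuations.
from ycmd.completers.completer import Completer
from ycmd import responses

import lm_backend  # hypothetical backend module


class LanguageModelCompleter( Completer ):
  def SupportedFiletypes( self ):
    # General completers are not tied to a particular filetype.
    return set()

  def ComputeCandidatesInner( self, request_data ):
    # Rebuild the text preceding the cursor from the buffer contents sent by
    # the client (1-based line_num/column_num, per the ycmd protocol).
    filepath = request_data[ 'filepath' ]
    contents = request_data[ 'file_data' ][ filepath ][ 'contents' ]
    line = request_data[ 'line_num' ]
    column = request_data[ 'column_num' ]
    lines = contents.splitlines( keepends = True )
    current = lines[ line - 1 ] if line - 1 < len( lines ) else ''
    prompt = ''.join( lines[ : line - 1 ] ) + current[ : column - 1 ]

    # Offer each generated continuation as a completion candidate.
    suggestions = lm_backend.generate_completions( prompt, max_candidates = 5 )
    return [ responses.BuildCompletionData( insertion_text = s )
             for s in suggestions ]
```

Wiring it in would then presumably amount to instantiating this class inside the general completer store next to the existing general completers.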
bstaletic commented 3 years ago

We've discussed this idea and aren't completely against it, though I personally do have some concerns.

  1. How is the AI being fed the data to learn from? Just the repo the user is editing? The whole github/gitlab/whatever?
  2. If the former, is that enough data to actually be useful? That's what some tools try, and I've read that people have conflicting experiences with those.
  3. If the latter, then we have the Microsoft Copilot problem - scanning GPL3 code and occasionally regurgitating it even in proprietary codebases.
     3.1. Saying that the user just needs to be vigilant isn't just infeasible, it also conflicts with my own ethics.
  4. I also have concerns regarding fitting this kind of thing well within my own mental model of how things should work, but that's getting ahead of ourselves.
  5. Resource consumption could be a problem. Today, YCM can be used on fairly weak machines. Again, might be getting ahead of ourselves.
xloem commented 3 years ago

We've discussed this idea and aren't completely against it, though I personally do have some concerns.

  1. How is the AI being fed the data to learn from? Just the repo the user is editing? The whole github/gitlab/whatever?

What I've linked above are large neural networks for completing streams of text that have been trained once on software. The context fed to the model is taken only from the single file the user is editing, and the model produces a completion that follows it. The network does not presently learn from the user's data.
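As a rough illustration of that flow (a sketch only: the checkpoint name is a small publicly available stand-in for Genji/GPT-J, the sampling settings are arbitrary, and a real completer would load the model once rather than per request):

```python
# Sketch: generate a continuation of the text preceding the cursor with a
# causal language model from the transformers library. gpt-neo-125M is just a
# small stand-in checkpoint; Genji or GPT-J would be loaded the same way once
# supported upstream.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)


def complete(text_before_cursor, max_new_tokens=48):
    # The prompt is the file contents up to the cursor; the user's data is
    # only used for this single generation, never for training.
    inputs = tokenizer(text_before_cursor, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.2,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens so only the newly generated text remains.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```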

  2. If the former, is that enough data to actually be useful? That's what some tools try, and I've read that people have conflicting experiences with those.

I'm curious to know what you've read! This is my first exposure to this.

It can be quite useful. For example, you can start coding a basic class, and the model can get the picture and start filling out boilerplate functions on its own.

  3. If the latter, then we have the Microsoft Copilot problem - scanning GPL3 code and occasionally regurgitating it even in proprietary codebases. 3.1. Saying that the user just needs to be vigilant isn't just infeasible, it also conflicts with my own ethics.

If this is a concern, the clearest solution could be to make sure the model is only trained on appropriately licensed code. I believe the details of this concern are still developing, and it looks likely the technology will improve before the concern is resolved.

What's important to me here is making sure that everyone has access to modern tools. Right now people who use visual studio have a lot more power than people who don't, and that worries me.

  4. I also have concerns regarding fitting this kind of thing well within my own mental model of how things should work, but that's getting ahead of ourselves.
  5. Resource consumption could be a problem. Today, YCM can be used on fairly weak machines. Again, might be getting ahead of ourselves.

Yeah. I tend to run a single local RPC server, or unfortunately sometimes connect to a remote one. But this technology is improving regularly.

bstaletic commented 3 years ago

What I've linked above are large neural networks for completing streams of text that have been trained once on software.

Hmm... That sounds like Copilot.

I'm curious to know what you've read! This is my first exposure to this.

I've recently read that TabNine, which is trained on the same repo it's used in (or at least that used to be the case, if I remember correctly), doesn't do a very good job.

I did try TabNine back when it first came out, but I'll refrain from discussing that specific tool publicly.

If this is a concern...

To be completely honest, I am concerned with that, because I don't like the idea of "code laundry" through AI. "It happens rarely" is not satisfactory in my opinion.

I am speaking only from my point of view and don't actually know if the other maintainer shares this concern.

... the clearest solution could be to make sure the model is only trained on appropriately licensed code.

Agreed. Training on BSD licensed code should be fine. Other permissive licenses too. Caveat: I'm not a lawyer. Bigger caveat: code laundry is currently a gray area.

I believe the details of this concern are still developing, and it looks likely the technology will improve before the concern is resolved.

Mhm... that's what worries me more than fancy features VS has.


All that said, I wouldn't mind trying AI powered YCM. I do expect it to break some of my habits.

puremourning commented 3 years ago

Remember that YCM only completes words, not whole sections of code.

xloem commented 3 years ago

If this is a concern...

To be completely honest, I am concerned with that. Because I don't like the idea of "code laundry" through AI. "It happens rarely" is not satisfactory in my opinion.

What do you mean by "code laundry"? I did a quick web search for the phrase but didn't get good results. Sounds like the license-filtering solution can be made workable.

I believe the details of this concern are still developing, and it looks likely the technology will improve before the concern is resolved.

Mhm... that's what worries me more than fancy features VS has.

Noting also that we're pretty close to complete general-purpose automatic coding on the nonfree side of things.

Regarding words and usefulness, I guess it comes down to development effort and creativity.

Thanks for thinking on all these things some.

bstaletic commented 3 years ago

"code laundry"

It's like money laundering, but for source code. You can't take GPLv3 source and paste it into your proprietary codebase; that's a license violation. But throw AI at it, let the AI suggest a verbatim regurgitation of the GPLv3 code it had been trained on, and suddenly you're in a gray area.

bilucodota commented 3 years ago

Hi guys, Amir from the Tabnine engineering team here. Happy to jump on this thread.

Tabnine provides Vim support through a fork of YCM, which is not optimal, to say the least.

Hoping to get Tabnine to plug naturally into YCM, I was looking for an API to register as a completion provider (e.g. registerCompletionItemProvider in the VS Code API), but could not find a way to register a general completer.

Any tips on how to do it?

bilucodota commented 3 years ago

@xloem I'm willing to issue a PR if this is acceptable. Let me know what you think.

xloem commented 3 years ago

[I had edited my top post to tag bilucodota and draw attention to the general completer store link. Not sure whether the link solves the design issue.]

Upstream projects almost always love PRs that contribute reusable and respectful improvements. Making software by contributing upstream is itself a form of community contribution.

I think that when businesses act to grow free software communities by contributing to them, that is both necessary and really wonderful, and I personally believe it is the intention of the GPL. Supporting a software community also means protecting the project, its culture, and its other workers together, of course - like the competitor who buys your stock so as not to become a monopoly.

It's not my project, I just opened this issue.

bilucodota commented 3 years ago

Thanks @xloem. I know it's not your project :). Was referring to both of you guys.

@bstaletic what are your thoughts on the above? Another option is to plug Tabnine in using an LSP adapter. Would love to get your feedback.

bilucodota commented 3 years ago

Opened PR https://github.com/ycm-core/ycmd/pull/1588

xloem commented 2 years ago

Hey @puremourning, any comments on closing this? Just because it's stale, I infer?

puremourning commented 2 years ago

It's not clear to me how this can be integrated in a way that is maintainable or viable long term. While I'm grateful for the PR, as mentioned it effectively opens up internal APIs, and I'm not going to support that. There was another attempt at making a completer using the undocumented TabNine API, but that's also not tenable long term.

There seems to be little interest in this, and it's a ton of work and maintenance forever. If TabNine had an LSP interface, maybe we could integrate with it, or maybe there are still options, but I don't see any likelihood of this being progressed imminently without significant community investment and ongoing support for the project.

I'm also looking to close off all ycmd issues and feature requests that are unlikely to be worked on imminently, and ultimately disable Issues on this repo, as it's often used as a way to request YouCompleteMe changes by the back door.

xloem commented 2 years ago

Couldn't a small completer be made that would require little maintenance and just run a public code-completion model on the user's local system, using existing libraries?

xloem commented 2 years ago

I hear you though. Thanks for the explanation.

xloem commented 2 years ago

To further relate: I am definitely interested in implementing and maintaining this, but am terrified of the politics that could hurt me like they have in the past. It makes it very hard to psychologically pursue or discuss. Others could have analogous issues.

puremourning commented 2 years ago

require little maintenance

The problem is that I won't use this. I have no way to reliably regression test it, and I am just one person with very limited free time. If some champion from the community is going to take over and own this forever, then I'm listening, but right now it looks like something that will be expensive for me to maintain and offer limited practical benefit.

ran a public code-completion model

What's that mean?

using existing libraries

Which libraries? Are there public, supported, stable APIs for this stuff?

to further relate, I am definitely interested in this but am terrified of the politics that could hurt me

I think we can hopefully put any politics to the side and just be practical here. I have limited time, and this seems like something that's going to be hard to maintain for me... happy to be proved wrong...

xloem commented 2 years ago

ran a public code-completion model

What's that mean?

A number of neural-network transformer models have been made public for others to use freely. There are also public services that run these remotely.

using existing libraries

Which libraries? are there public, supported stable APIs for this stuff?

Some are supported, some are proof of concept. The supported one I'm used to is https://github.com/huggingface/transformers.

There are also stable APIs for remote services such as OpenAI or AI21.
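As an example of the remote path (a sketch only; the endpoint shape follows OpenAI's documented completions API, the model id is a placeholder, and AI21's API differs in detail):

```python
# Sketch: getting a completion from a remote service instead of a local model.
# The model id is a placeholder; error handling and rate limiting are omitted.
import os
import requests


def remote_complete(prompt, max_tokens=48):
    response = requests.post(
        "https://api.openai.com/v1/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "code-cushman-001",  # placeholder model id
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.2,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]
```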

Proof-of-concept systems, such as https://github.com/VHellendoorn/Code-LMs, which has public code generation models small enough to run without a GPU, are often ported to systems such as huggingface and uploaded to their content hubs.

to further relate, I am definitely interested in this but am terrified of the politics that could hurt me

I think we can hopefully put any politics to the side and just be practical here. I have limited time, and this seems like something that's going to be hard to maintain for me... happy to be proved wrong...

If I were able to set up some backend code that accessed trained models, could you implement the UI glue using the existing interfaces? If so, do you have license requirements regarding the model training data? There are some existing recommendations that short generations are fine to relicense so long as verbatim copied snippets are not present in the output.
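One hypothetical shape for that kind of check, purely as illustration (the snippet corpus, the whitespace tokenisation, and the n-gram window are invented here, and none of this is a legal standard):

```python
# Hypothetical post-filter: flag a generated completion if any run of `window`
# consecutive whitespace-separated tokens also appears in a corpus of known
# licensed snippets. Illustrative only, not a legal test.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def looks_verbatim(completion, known_snippets, window=8):
    completion_grams = ngrams(completion.split(), window)
    return any(completion_grams & ngrams(snippet.split(), window)
               for snippet in known_snippets)


# Usage: drop candidates where looks_verbatim(candidate, corpus) is True.
```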

puremourning commented 2 years ago

There is almost no likelihood of this being implemented, so closing.

xloem commented 2 years ago

Hi, I'm sorry, did we resolve my question via a different channel or did you choose to ignore me? Am I off base?

Do you recommend that people use other libraries or systems for language-model based completion unless they have a complete pull request ready?

puremourning commented 2 years ago

This is out of scope for YCM at present.

xloem commented 2 years ago

Thanks, I posted my draft PR before I read your reply.

Are you aware of where this concept might fit into the open source IDE ecosystem?