yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License
773 stars 110 forks

Add Solidity (ethereum smart contract language) #50

Open ghost opened 2 years ago

ghost commented 2 years ago

I would love to see Solidity added as a language, as it usually gets detected as JavaScript, Dart, Lua or other languages.

Solidity has a set of reserved keywords that differs markedly from those of other languages, which should enable the training to perform extremely well on it.

yoeo commented 2 years ago

Hello @vbersier,

Indeed, it would be a good idea to add Solidity, as Ethereum, smart contracts, and NFTs are everywhere lately. However, there are currently not enough Solidity example files on GitHub to feed Guesslang (~50k files required): https://github.com/search?q=language%3ASolidity&type=repositories

I propose that we wait for the number of Solidity projects on GitHub to grow before adding this language.

ghost commented 2 years ago

Hi @yoeo

There are thousands of source code examples available from etherscan.io and bscscan.com. I wonder if it would be possible to somehow scrape them with their API?

https://docs.etherscan.io/api-endpoints/contracts#get-contract-source-code-for-verified-contract-source-codes
https://docs.bscscan.com/api-endpoints/contracts#get-contract-source-code-for-verified-contract-source-codes

In total, there are already 13k files with an open source license.

GitHub search seems glitchy, as the number of code results changes with every refresh, from 700 to 63k results.

Finally, I think this particular language will require a smaller training set than most other languages because, as I explained, its reserved keywords are highly distinctive.
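To make the keyword argument a bit more concrete, here is a rough sketch of how distinctive a few Solidity tokens are compared to a JavaScript snippet. This is only a toy heuristic, not how Guesslang classifies code, and the `SOLIDITY_HINTS` set and the `solidity_hint_ratio` helper are hypothetical names with a hand-picked token list, not an official keyword list.

```python
import re

# Assumption: a small, hand-picked set of tokens that are common in Solidity
# and rare elsewhere. This is NOT the official reserved keyword list.
SOLIDITY_HINTS = {
    "pragma", "contract", "mapping", "payable", "uint256",
    "msg", "emit", "modifier", "external", "address",
}


def solidity_hint_ratio(source: str) -> float:
    """Return the fraction of identifier-like tokens found in SOLIDITY_HINTS."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in SOLIDITY_HINTS)
    return hits / len(tokens)


SOLIDITY_SAMPLE = """
pragma solidity ^0.8.0;
contract Wallet {
    mapping(address => uint256) public balances;
    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }
}
"""

JAVASCRIPT_SAMPLE = """
function deposit(wallet, value) {
    wallet.balance += value;
    return wallet;
}
"""

if __name__ == "__main__":
    print("Solidity sample:", solidity_hint_ratio(SOLIDITY_SAMPLE))
    print("JavaScript sample:", solidity_hint_ratio(JAVASCRIPT_SAMPLE))
```

A trained model would of course pick up on far more than a fixed token list, but the gap between the two ratios gives a feel for why the vocabulary alone may already carry a strong signal.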

yoeo commented 2 years ago

> GitHub search seems glitchy

You're right, now I see that the GitHub search results are not stable for Solidity. Depending on how many files I can actually retrieve from GitHub, I could perhaps add Solidity to the next batch of supported languages.

> I wonder if it would be possible to somehow scrape them with their API?

Currently, the dataset is generated by this script (https://github.com/yoeo/guesslangtools/). All the source code is retrieved from GitHub, but it should be possible to add other sources, including etherscan.io and bscscan.com. Of course, any contribution toward this addition is warmly welcome.
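For anyone who wants to try this, a minimal sketch of fetching verified contract source through the Etherscan endpoint linked above might look like the following. It is not part of guesslangtools. The helper names (`build_source_url`, `extract_sources`, `fetch_contract_sources`) are hypothetical, and the base URL, query parameters, and response fields (`status`, `result`, `SourceCode`) are assumptions based on the linked documentation that should be checked against the current API. License filtering and rate limiting are left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL and parameters as described in the Etherscan docs
# linked in this thread; verify them against the current API before use.
ETHERSCAN_API = "https://api.etherscan.io/api"


def build_source_url(address: str, api_key: str, base_url: str = ETHERSCAN_API) -> str:
    """Build the 'get contract source code' request URL for one address."""
    query = urlencode({
        "module": "contract",
        "action": "getsourcecode",
        "address": address,
        "apikey": api_key,
    })
    return f"{base_url}?{query}"


def extract_sources(payload: dict) -> list[str]:
    """Pull non-empty source strings out of a decoded API response."""
    if payload.get("status") != "1":
        return []  # error response or unverified contract
    results = payload.get("result")
    if not isinstance(results, list):
        return []
    return [
        item["SourceCode"]
        for item in results
        if isinstance(item, dict) and item.get("SourceCode")
    ]


def fetch_contract_sources(address: str, api_key: str, base_url: str = ETHERSCAN_API) -> list[str]:
    """Download and decode the verified source code for one contract address."""
    with urlopen(build_source_url(address, api_key, base_url), timeout=30) as response:
        payload = json.load(response)
    return extract_sources(payload)
```

Passing a different `base_url` is meant to allow pointing the same code at bscscan.com, assuming its API mirrors Etherscan's as the two documentation pages suggest. Multi-file contracts may come back as a JSON string inside `SourceCode`, so a real integration would probably need an extra unpacking step.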