In the era of big data, the Internet contains numerous duplicate code snippets, and it is especially valuable to reuse them when building new software projects. In this paper, we present KG4Py, a toolkit for generating a knowledge graph of Python files in GitHub repositories and conducting semantic search over that knowledge graph. In KG4Py, we remove all duplicate files from a collection of 317K Python files and statically analyze the remaining files using a concrete syntax tree (CST) to build a code knowledge graph of Python functions. We integrate a pre-trained model with an unsupervised model to produce a new model, and combine this new model with the code knowledge graph to search for code snippets using natural language descriptions. The experimental results show that KG4Py achieves good performance in both the construction of the code knowledge graph and the semantic search of code snippets.
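The deduplication and static-analysis steps described above can be illustrated with a minimal sketch. This is not the KG4Py implementation. It assumes exact-match deduplication by content hash, swaps in Python's standard-library `ast` module for a concrete syntax tree, and uses a hypothetical triple schema (`defines`, `has_parameter`, `has_docstring`, `calls`) chosen only for illustration.

```python
import ast
import hashlib


def dedupe_sources(sources):
    """Keep the first file seen for each distinct content hash.

    Assumption: only byte-identical files count as duplicates.
    """
    seen = set()
    unique = {}
    for path, text in sources.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = text
    return unique


def function_triples(path, text):
    """Return (subject, relation, object) triples for the functions in one file.

    Uses the stdlib abstract syntax tree as a stand-in for a CST;
    the relation names are hypothetical.
    """
    triples = []
    tree = ast.parse(text)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            func_id = f"{path}::{node.name}"
            triples.append((path, "defines", func_id))
            for arg in node.args.args:
                triples.append((func_id, "has_parameter", arg.arg))
            doc = ast.get_docstring(node)
            if doc:
                triples.append((func_id, "has_docstring", doc))
            # Record direct calls to plain names found in the function body.
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    triples.append((func_id, "calls", child.func.id))
    return triples


def build_graph(sources):
    """Deduplicate the files, then collect triples from every remaining file."""
    triples = []
    for path, text in dedupe_sources(sources).items():
        triples.extend(function_triples(path, text))
    return triples
```

Given a mapping from file paths to source text, `build_graph` returns a flat list of triples that could be loaded into a graph store. The docstring triples are the natural place to attach a natural-language search model, though how KG4Py does this is not specified in the abstract.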
               