Using machine learning, researchers have developed an algorithm that can identify the author of a piece of code or a program.
Its accuracy for a group of 600 programmers was 83%.
It could go a long way in cyber security.
Natural Language Processing is one of the most challenging fields for data scientists because the data is so unstructured. Finding hidden patterns among all the unwanted noise is where data scientists can really earn their big bucks.
There has been some progress in this field recently in identifying the author of a piece of literature, but can this be applied to programmers and coders as well? Do these people really have a “digital fingerprint”, as they show in movies? That’s what researchers have been working on: recognising an author’s pattern in this collection of numbers and text.
A couple of researchers from Drexel University and George Washington University have revealed that code, just like literature, can be analysed to pinpoint its author. They have also published their paper online.
How does this work?
Firstly, the algorithm identifies features present in samples of code. The researchers then narrowed these down to only the features that helped them distinguish between individual developers, which cut out the unwanted ones.
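To make the two steps above concrete, here is a minimal sketch in Python. The toy samples, the feature names and the "gap between author averages" selection rule are all assumptions made up for illustration; the article does not say which features or which selection criterion the researchers actually used.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus of (author, code sample) pairs, invented for illustration.
SAMPLES = [
    ("alice", "for (int i = 0; i < n; i++) { total += i; }"),
    ("alice", "for (int j = 0; j < m; j++) { sum += j; }"),
    ("bob", "int i = 0;\nwhile (i < n)\n{\n\ttotal += i;\n\ti++;\n}"),
    ("bob", "int k = 0;\nwhile (k < m)\n{\n\tsum += k;\n\tk++;\n}"),
]


def extract_features(code):
    """Step 1: turn one code sample into a dict of simple stylistic counts."""
    words = Counter(re.findall(r"[A-Za-z_]+", code))
    return {
        "kw_for": words["for"],
        "kw_while": words["while"],
        "kw_int": words["int"],
        "tabs": code.count("\t"),
        "newlines": code.count("\n"),
        "semicolons": code.count(";"),
    }


def select_features(samples, min_gap=0.5):
    """Step 2: keep only features whose per-author averages differ.

    A feature that looks the same for every author cannot help tell
    them apart, so it is dropped. The min_gap threshold is an assumed,
    simplified stand-in for a proper feature-selection criterion.
    """
    per_author = defaultdict(list)
    for author, code in samples:
        per_author[author].append(extract_features(code))

    names = sorted(extract_features(samples[0][1]))
    kept = []
    for name in names:
        means = [
            sum(feats[name] for feats in rows) / len(rows)
            for rows in per_author.values()
        ]
        if max(means) - min(means) > min_gap:
            kept.append(name)
    return kept


selected = select_features(SAMPLES)
print(selected)
```

In this toy data both authors use `int` and semicolons equally often, so those two features are discarded, while loop keywords and whitespace habits survive the cut.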
The researchers then created an ‘Abstract Syntax Tree’, which was used to recognise the code’s underlying structure. As with any other ML algorithm, this requires a set of training data. In the research paper, the researchers revealed that it’s possible to identify the developer using only their compiled binary code.
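For a feel of what an abstract syntax tree exposes, here is a small sketch using Python's built-in `ast` module. This is only a stand-in: the article does not say which parser or language tooling the researchers used, and the two snippets and the `ast_features` helper are invented for illustration.

```python
import ast
from collections import Counter


def ast_features(source):
    """Count AST node types and measure the tree's maximum depth."""
    tree = ast.parse(source)
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

    def depth(node):
        # Depth of a node is 1 plus the depth of its deepest child.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    return node_counts, depth(tree)


# Two hypothetical snippets that compute the same sum in different styles.
loop_style = "total = 0\nfor x in range(10):\n    total += x\n"
functional_style = "total = sum(x for x in range(10))\n"

loop_counts, loop_depth = ast_features(loop_style)
func_counts, func_depth = ast_features(functional_style)

print("loop style:", loop_counts["For"], "For nodes, depth", loop_depth)
print("functional style:", func_counts["GeneratorExp"], "GeneratorExp nodes, depth", func_depth)
```

Both snippets do the same job, yet their trees contain different node types. Structural habits like these survive renaming variables or reformatting whitespace, which is why tree-based features are harder to disguise than surface ones.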
To train their algorithm, the researchers picked sample code from Google’s annual Code Jam competition. The results were quite impressive, with an accuracy of 96% for 100 coders, but the accuracy dropped to 83% when the number of programmers was increased to 600.
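The train-then-measure loop behind accuracy figures like these can be sketched as follows. Everything here is an assumption for illustration: the feature vectors are made up, and the nearest-centroid classifier is a deliberately simple stand-in, since the article does not name the model the researchers trained.

```python
import math
from collections import defaultdict

# Hypothetical feature vectors per sample, e.g.
# [loops per file, tabs per line, average tree depth]. Values are invented.
TRAIN = [
    ("alice", [3.0, 0.0, 5.0]),
    ("alice", [2.8, 0.1, 5.2]),
    ("bob", [1.0, 0.9, 8.0]),
    ("bob", [1.2, 1.0, 7.6]),
]
TEST = [
    ("alice", [3.1, 0.0, 4.9]),
    ("bob", [0.9, 0.8, 8.1]),
]


def train_centroids(samples):
    """Average each author's training vectors into one 'style centroid'."""
    grouped = defaultdict(list)
    for author, vec in samples:
        grouped[author].append(vec)
    return {
        author: [sum(column) / len(vecs) for column in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Attribute a sample to the author whose centroid is closest."""
    return min(centroids, key=lambda author: math.dist(centroids[author], vec))


def accuracy(centroids, samples):
    """Fraction of held-out samples attributed to the correct author."""
    correct = sum(predict(centroids, vec) == author for author, vec in samples)
    return correct / len(samples)


centroids = train_centroids(TRAIN)
print(f"accuracy: {accuracy(centroids, TEST):.0%}")
```

With only two very different authors the toy model gets every held-out sample right. As more authors are added, their styles are more likely to sit close together, which is one plausible reason accuracy falls as the pool of programmers grows.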
One notable finding was that it was easier for the algorithm to identify experienced coders than newcomers. This is most likely because more training data is available for them, and because each coder develops their own individual pattern over time.
So let’s summarise:
This might be the beginning of a whole new world for cyber security, plagiarism checks, or even tackling identity theft. It is one algorithm that could make our world a safer, or at least better, place to live in.