This Drexel researcher can identify you based on how you write code

Aylin Caliskan-Islam spent the summer at the U.S. Army Research Lab developing a method of code “de-anonymization.” She calls her early results a breakthrough.

A Drexel researcher says code syntax contains identifying details. (Photo by Flickr user Yuri Samoilov, used under a Creative Commons license)

Could a developer contribute to a software project anonymously, wipe her fingerprints off the code and leave no trace?
Drexel researcher Aylin Caliskan-Islam is one step closer to creating a way to do so.

Aylin Caliskan-Islam. (Courtesy photo)


Anonymization is “a serious concern for people who want to contribute to open source projects anonymously,” Caliskan-Islam said, pointing to how researchers have attempted to unmask the creator of Bitcoin and how developers work on large-scale privacy-focused open source projects like Tor.
The first step in covering a developer’s tracks, though, is figuring out whether someone could identify a developer by analyzing their code. Caliskan-Islam, a native of Turkey and a Ph.D. student in the Drexel lab that has developed software to anonymize authors, spent the summer at the U.S. Army Research Lab in Washington, D.C., developing a method to do just that. (She’s the first international Ph.D. student the Army has hired as a summer intern for its Open Campus research initiative, she said.)
Working with 250 examples of source code pulled from the international Google Code Jam competition, she was able to identify authors with 95 percent accuracy, as detailed in a recent academic paper. Given how small each piece of source code was (an average of 70 lines), she called it a breakthrough.
Her approach, which uses machine learning, involves what is essentially a close read of the source code. She looks at things like the words used, the spacing and bracketing and, most importantly, the structure or syntax (see graphic below for a breakdown of that kind of analysis). Together, those things make up a developer’s coding style.

Here’s how Aylin Caliskan-Islam parses code to figure out who wrote it. (Courtesy image)
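
To make the three kinds of clues described above more concrete, here is a rough sketch of how they might be measured. It is not Caliskan-Islam's actual method or feature set: it assumes the code being analyzed is Python and leans on Python's built-in `ast` module, and the function name `style_features` and the specific measurements are made up for illustration.

```python
import ast
from collections import Counter


def style_features(source: str) -> dict:
    """Return a small dictionary of style measurements for one Python source string.

    Hypothetical helper for illustration only, not the researcher's feature set.
    """
    lines = source.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    tree = ast.parse(source)

    # Syntax: how often each kind of syntax-tree node appears.
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total_nodes = sum(node_counts.values())

    # Words: the variable and function names that appear in the code.
    names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]

    features = {
        # Spacing: tabs versus spaces, blank lines, line length.
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in nonblank) / max(len(nonblank), 1),
        "blank_line_ratio": (len(lines) - len(nonblank)) / max(len(lines), 1),
        "avg_line_length": sum(len(ln) for ln in nonblank) / max(len(nonblank), 1),
        "avg_name_length": sum(len(n) for n in names) / max(len(names), 1),
    }
    # Store each node type as a share of all nodes so short and long files compare fairly.
    for node_type, count in node_counts.items():
        features["node:" + node_type] = count / total_nodes
    return features
```

Run on code from several known authors, measurements like these could be handed to an off-the-shelf machine learning classifier, which would then guess who wrote an unlabeled sample. That classification step is left out of this sketch.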


Beyond leading to the development of an anonymization tool, possible applications include identifying cyber criminals and verifying claims of plagiarism. Caliskan-Islam said she’s not sure how the Army, which funded the project, will put her work to use.
Next up, Caliskan-Islam wants to focus on how to identify developers who have contributed to a project with many authors, such as an open source software project.

Companies: Drexel University