The feds classify marijuana as one of the world’s most dangerous drugs. That’s hotly debated, as shown by the 42,000 public comments left on the recent Drug Enforcement Administration proposal to downgrade marijuana’s classification from Schedule I to Schedule III.

But there’s a tricky aspect to those comments. At least 10,000 of them were “repeated entries,” meaning it’s possible they were spam.  

Federal law requires government agencies to allow public comment on all proposed rule changes. Being able to submit online has made the rulemaking process more accessible, but it also opens the door to bots and bad actors. Spam is abundant on the internet, but in these cases, being able to tell what’s real and what’s not could have a lasting impact.

For Anshu Sharma, a software developer at Arlington’s Cydecor and a PhD candidate at William and Mary, detecting bots in public comments is personal. He often submits comments on regulations.gov, the main clearinghouse site for public feedback, and wants it to be a more honest process. 

“Bots [are] creating the illusion of public participation,” Sharma said, “and we can help reduce that.”

On a scorching Saturday in July, he set out to help solve the problem with other developers and technologists at Civic Tech DC’s first hackathon.

Mike Deeb, one of the heads of the volunteer-based nonprofit, stressed it wasn’t a competition. He believes these problems need to be tackled by more than technologists, including people who know communities, nonprofit leaders and others. The group also hosts project nights and showcases of different technologies people are working on to improve citizen participation and government services and to boost transparency.

The goal by the end of the day was for some people to have code that’ll “move the needle forward,” Deeb said.   

‘Make it easier for regulators to understand what the public has said’

Mike Deeb, co-organizer of Civic Tech DC, introduces the hackathon (Kaela Roeder/Technical.ly)

The roughly 50 Civic Tech DC volunteers spent their day glued to their laptops in an office building near Eastern Market, writing code to figure out how to make federal comment data more accessible.

Comments on regulations.gov are a mix of PDFs, images and unorganized spreadsheets, Fred Trotter explained. He’s one of the leaders behind Mirrulations, a project that stores and pulls data from the site.

Trotter believes the government has good intentions, and officials are required to read every comment. But it’s a mess. 

The sheer amount of data is a hurdle: It could take about 3 years to download the 25 million-plus comments and related documents on the site, because the API (application programming interface, the way code can access the site) only allows 1,000 documents to be downloaded per hour.
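The estimate can be checked with quick arithmetic. The sketch below uses only the two figures cited above and assumes a single API key downloading around the clock; the function name is illustrative, not part of any Mirrulations code.

```python
# Back-of-the-envelope check of the download estimate.
# The two figures come from the article. A single API key running
# 24 hours a day, 365 days a year is an assumption.
TOTAL_DOCUMENTS = 25_000_000  # "25 million-plus" comments and documents
DOCS_PER_HOUR = 1_000         # hourly API limit cited in the article


def download_years(total_docs: int, docs_per_hour: int) -> float:
    """Return the years needed to download total_docs at the hourly limit."""
    hours = total_docs / docs_per_hour
    return hours / 24 / 365


print(f"{download_years(TOTAL_DOCUMENTS, DOCS_PER_HOUR):.2f} years")  # 2.85 years
```

That works out to 25,000 hours, or a little under 3 years of nonstop downloading.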

Mirrulations head Ben Coleman, a professor at Pennsylvania’s Moravian University, came up with a workaround: He had his students donate their API keys, and they were able to get all of the data within a 6-hour window. The next step was to make the information readily available. It’s expensive to host and distribute terabytes of data, Coleman explained, so he partnered with the AWS Open Data Initiative to handle it without the heavy price tag.

Now that the data is obtained and available, the goal is to make it more accessible. 

Ben Coleman highlights how the Mirrulations dataset works at Civic Tech DC’s hackathon (Kaela Roeder/Technical.ly)

“There’s a gap,” Coleman told Technical.ly. “What can we do to make it so that those people who don’t want (and don’t need) to be working at that level can still be working with this data?”

Coleman and Trotter came to the hackathon with hopes of creating a collection of tools that policymakers, researchers and journalists can use to find trends, such as when certain groups agree or disagree and how certain groups comment over time.

“What this project should do is make tools that will make it easier for regulators to understand what the public has said,” said Trotter, who also works on data systems for the Centers for Medicare & Medicaid Services, “and it will also allow the public to hold the regulators accountable to what they actually said.”

Spotting trends among millions of comments

During the hackathon, teams also built technology to scrape data from agencies and spot trends related to citizen commentary about proposed US regulations. Comments for specific rules can even number in the millions, like for the net neutrality vote in 2017. 

Web development consultant Misha Vinokur spent the 8-hour day figuring out a way to scrape comments about Securities and Exchange Commission (SEC) rulemaking. That agency’s comments are not available on regulations.gov, where most commentary about federal rules is housed, but he found a way to pull the data using the programming language PHP (Python, another language, was blocked, he explained). It wasn’t easy for him.

“People who deal with technology every single day have a hard time. Imagine the layman, someone who isn’t,” Vinokur said. “Civic tech is very important, just helping people understand something that’s already complex and simplifying it, but even we’re having a hard time doing that.”

About 50 people attended Civic Tech DC’s first hackathon. (Kaela Roeder/Technical.ly)

Sharma, the Arlington software developer, was on the team “Can of Spam,” which decided to tackle the 10,000 repeated marijuana regulation comments, along with other proposed rules that drew duplicate comments.

Using standard data science tools like Python and Jupyter Notebook, they decided to pinpoint timestamps of comments in addition to flagging similar text. That’s because groups and associations often send out templates for people to submit their thoughts, explained software engineer Dean Eby. Those 10,000 comments could be legitimate.
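A minimal sketch of that idea, not the team’s actual code: group comments whose text matches after light cleanup, then look at how tightly each group’s timestamps cluster. The function names and the sample comments are invented for illustration, and exact matching after cleanup is a simplifying assumption that would miss templates people have partly reworded.

```python
import re
from collections import defaultdict
from datetime import datetime


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace so trivial edits still match."""
    return " ".join(re.sub(r"[^a-z0-9\s]", "", text.lower()).split())


def repeated_groups(comments, min_size=2):
    """Group (timestamp, text) pairs by normalized text.

    Returns one dict per repeated text with the number of copies and the
    seconds between the first and last copy, largest group first.
    """
    groups = defaultdict(list)
    for posted_at, text in comments:
        groups[normalize(text)].append(posted_at)

    report = []
    for text, times in groups.items():
        if len(times) >= min_size:
            span = (max(times) - min(times)).total_seconds()
            report.append({"text": text, "count": len(times), "span_seconds": span})
    return sorted(report, key=lambda group: group["count"], reverse=True)


# Invented sample data, not real regulations.gov comments.
sample = [
    (datetime(2024, 7, 1, 12, 0, 0), "Please reschedule marijuana."),
    (datetime(2024, 7, 1, 12, 0, 2), "please  reschedule marijuana"),
    (datetime(2024, 7, 1, 12, 0, 4), "Please reschedule marijuana!"),
    (datetime(2024, 7, 3, 9, 30, 0), "I oppose this change for my own reasons."),
]

for group in repeated_groups(sample):
    print(group["count"], "copies within", group["span_seconds"], "seconds")
```

Identical comments arriving seconds apart may point to automation, while the same text trickling in over days is more consistent with a template campaign. Neither pattern is proof on its own.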

Participants in Civic Tech DC’s first hackathon. (Kaela Roeder/Technical.ly)

At the end of the hackathon, it was still unclear whether those comments were legitimate, Eby said. He acknowledged during the team’s presentation that the group “bit off a little bit more than we can chew,” but said this is just the first step.

“Essentially, we want to make it easier for researchers to verify the legitimacy of coordinated campaigns,” Eby told Technical.ly.

Shin, an intern at the Aerospace Industries Association, wanted to attend the hackathon to take part in something that helps people. 

“You often can lose that connection to direct impact,” Shin said. “Contributing towards this project, having that tangible impact, I think, is really important.”