GitHub’s Anti-Spam System Is Struggling Against Persistent Abuse #191078

xlionjuan · 2026-03-30T05:46:53Z

xlionjuan
Mar 30, 2026

🏷️ Discussion Type

Product Feedback

💬 Feature/Topic Area

Other

Body

Summary

In recent days, GitHub has once again been flooded with large-scale Chinese spam. This is not a new problem, nor a rare occurrence—it is a recurring pattern that has been visible for years without any clearly effective resolution. Microsoft’s WSL repository is simply one of the latest high-profile examples.

The following discussions document the issue in detail and highlight the scale and persistence of the problem:

microsoft/WSL#40028
microsoft/WSL#21802

Request

GitHub and Microsoft need to acknowledge the seriousness of this issue—not in abstract terms, but in terms of concrete impact and accountability.

The ongoing presence of large-scale spam repositories calls into question whether current moderation and abuse-prevention mechanisms are functioning at an acceptable level. When such content remains widespread and long-lived, it is difficult to interpret this as anything other than a systemic failure to contain known abuse patterns.

More critically, GitHub is not just a hosting platform—it is widely used as a data source for training AI systems. Allowing large volumes of low-quality or malicious content to persist creates a foreseeable risk: contamination of training datasets. This concern is not hypothetical; it directly relates to cases already raised regarding OpenAI’s Codex.

openai/codex#11966

Framed this way, the issue extends beyond spam itself. It raises a broader question: to what extent are GitHub and Microsoft willing to take responsibility for the downstream consequences of the data they host and distribute at scale?

Using AI-assisted analysis, I have identified a significant number of spam repositories, with some cases traceable back to as early as 2023. The longevity of these repositories strongly suggests that this is not merely a detection problem, but a prioritization problem.

I've submitted the following reports to GitHub.

https://github.com/xlionjuan/gh-spam/blob/main/github-spam-report-2026-03-30.md
https://github.com/xlionjuan/gh-spam/blob/main/github-spam-report-2026-03-30-historical.md

At this point, continued inaction—or responses that fail to materially reduce the scale of the problem—will increasingly be seen not as oversight, but as tacit acceptance.

2026-03-30T05:47:29Z

github-actions[bot]
Bot Mar 30, 2026

💬 Your Product Feedback Has Been Submitted 🎉

Thank you for taking the time to share your insights with us! Your feedback is invaluable as we build a better GitHub experience for all our users.

Here's what you can expect moving forward ⏩

Your input will be carefully reviewed and cataloged by members of our product teams.
- Due to the high volume of submissions, we may not always be able to provide individual responses.
- Rest assured, your feedback will help chart our course for product improvements.
Other users may engage with your post, sharing their own perspectives or experiences.
GitHub staff may reach out for further clarification or insight.
- We may 'Answer' your discussion if there is a current solution, workaround, or roadmap/changelog post related to the feedback.

Where to look to see what's shipping 👀

Read the Changelog for real-time updates on the latest GitHub features, enhancements, and calls for feedback.
Explore our Product Roadmap, which details upcoming major releases and initiatives.

What you can do in the meantime 💻

Upvote and comment on other user feedback Discussions that resonate with you.
Add more information at any point! Useful details include: use cases, relevant labels, desired outcomes, and any accompanying screenshots.

As a member of the GitHub community, your participation is essential. While we can't promise that every suggestion will be implemented, we want to emphasize that your feedback is instrumental in guiding our decisions and priorities.

Thank you once again for your contribution to making GitHub even better! We're grateful for your ongoing support and collaboration in shaping the future of our platform. ⭐

0 replies

54145a · 2026-03-30T10:54:47Z

54145a
Mar 30, 2026

See my reply here: microsoft/WSL#40028 (comment)

0 replies

xlionjuan · 2026-03-30T15:48:54Z

xlionjuan
Mar 30, 2026
Author

It is increasing very quickly

https://github.com/search?q=%22%EF%BC%92%EF%BC%90%EF%BC%92%EF%BC%96%E7%AC%AC%E4%B8%80%22+OR+%22%E7%94%B5%E5%AD%90pg%22+OR+%22ty444%22&type=issues&s=created&o=desc
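For readers who cannot read the percent-encoded query at a glance, the search URL above can be decoded with a short Python sketch. It uses only the standard library; the function and variable names are illustrative and not part of the original post.

```python
from urllib.parse import urlsplit, parse_qs

# Search URL copied verbatim from the comment above.
SEARCH_URL = (
    "https://github.com/search?q=%22%EF%BC%92%EF%BC%90%EF%BC%92%EF%BC%96"
    "%E7%AC%AC%E4%B8%80%22+OR+%22%E7%94%B5%E5%AD%90pg%22+OR+%22ty444%22"
    "&type=issues&s=created&o=desc"
)


def decode_search(url: str) -> dict[str, str]:
    """Return the percent-decoded query parameters of a search URL."""
    params = parse_qs(urlsplit(url).query)
    # Each parameter appears once in this URL, so keep the first value.
    return {key: values[0] for key, values in params.items()}


decoded = decode_search(SEARCH_URL)
print(decoded["q"])     # the three quoted search phrases joined by OR
print(decoded["type"])  # the result type being searched
```

The decoded query is three quoted phrases joined by `OR`, restricted to issues and sorted by creation date, newest first. The first phrase is written with fullwidth digits rather than ASCII digits.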

6 replies

xlionjuan Mar 31, 2026
Author

974k

OregonBanner Apr 6, 2026

OMG It's still going

xlionjuan Apr 6, 2026
Author

Yes, it is still going, but the total count no longer exceeds 100k.

GitHub is definitely taking action to ban and remove them, but it has not been able to stop them.

areezmuhammed Apr 6, 2026

Are you sure?

xlionjuan Apr 6, 2026
Author

Are you sure

Why don't you open the search page I provided?

k2alzhang · 2026-03-31T10:35:01Z

k2alzhang
Mar 31, 2026

GitHub has an idea for getting out of this spam issue, but you are not going to change it.

0 replies

xlionjuan · 2026-03-31T21:06:05Z

xlionjuan
Mar 31, 2026
Author

Nearly 500 repos, containing 80k issues, have now been reported. That is crazy.

https://github.com/xlionjuan/gh-spam/blob/main/github-spam-report-2026-04-01.md
https://github.com/xlionjuan/gh-spam/blob/main/github-spam-report-2026-04-01-2.md

0 replies

davex-ai · 2026-04-05T06:14:16Z

davex-ai
Apr 5, 2026

Hey @xlionjuan
This is a serious critique of how platform integrity intersects with the AI supply chain. You’ve framed the issue effectively: it's no longer just a moderation nuisance, but a data-integrity risk for the models being built on that very code.
The fact that you’ve documented cases dating back to 2023 suggests a breakdown in automated purging or a "tolerance threshold" that is set far too high for a platform of this scale. When spam persists for years, it implies that GitHub's detection signals—which usually catch low-effort bots—are being successfully bypassed by these specific patterns, or that the reputation system is failing to weigh these repositories correctly.
The connection to OpenAI’s Codex and dataset contamination is the most pressing point. If "bad data" is being systematically ingested, the downstream cost of cleaning that data shifts from the host (GitHub/Microsoft) to the developers and researchers, which is a major accountability gap. [1]
Since you've already submitted the formal reports, how do you want to escalate this? We could:

Analyze the specific patterns in the repositories you found to see if there are common "fingerprints" (e.g., specific API usage or account creation signatures) that GitHub's current filters might be missing.
Draft a public-facing summary or "Open Letter" style post for technical communities (like Hacker News or Reddit) to bring more visibility to the historical longevity of these accounts.
Investigate if these spam repos are being used for "SEO poisoning" or malware distribution, which might force a faster response from Microsoft’s security teams.

3 replies

xlionjuan Apr 6, 2026
Author

Analyze the specific patterns in the repositories you found to see if there are common "fingerprints" (e.g., specific API usage or account creation signatures) that GitHub's current filters might be missing.

I can only analyze the content or titles of the spam issues. I'm not the site operator, so I can't analyze detailed information about the spam accounts (IPs, time patterns, API request fingerprints, registration emails, etc.). GitHub should definitely be taking action on this.

Draft a public-facing summary or "Open Letter" style post for technical communities (like Hacker News or Reddit) to bring more visibility to the historical longevity of these accounts.

If you, or someone you know, can help with this, that would be great.

Investigate if these spam repos are being used for "SEO poisoning" or malware distribution, which might force a faster response from Microsoft’s security teams.

It is already happening, whether it is SEO poisoning or model poisoning.

GitHub has a track record of being used as a platform for spreading malware and spam emails, but it has not been able to deal with this properly.

davex-ai Apr 8, 2026

It sounds like you're diving into a high-stakes discussion about platform integrity and the AI supply chain. The core issue here is that GitHub’s "tolerance threshold" for spam isn't just a nuisance; it’s actively polluting the datasets used to train models like Codex, leading to a massive accountability gap.

Since you're limited to analyzing public content (titles/issues) and have confirmed that SEO/model poisoning is already happening, the ball is in the community's court to escalate.
Here is a breakdown of how you can move this forward based on your feedback:

1. The "Open Letter" / Public Exposure

Since you're looking for help with visibility, the most effective next step is a data-backed post for sites like Hacker News (YC), Lobsters, or the r/MachineLearning subreddit.

The Hook: Focus on "Dataset Contamination." This isn't just about spam; it's about how GitHub (owned by Microsoft) is hosting "trash" that Microsoft (via OpenAI) then ingests.
The Evidence: Use the specific cases dating back to 2023 to prove that GitHub’s automated purging is failing over a long time horizon.

2. Forensic Pattern Analysis (Public Data)

Even without internal logs (IPs/Emails), we can analyze the public fingerprints:

Temporal Patterns: Do these spam issues appear in bursts across unrelated repos?
Content Templates: Are they using specific "lorem ipsum" variations or character-stuffing to bypass keyword filters?
Repo Targets: Are they targeting high-reputation repos to "piggyback" on their SEO/Dataset weight?
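
The content-template and temporal-pattern questions above can be explored with public data only. Here is a minimal sketch, assuming issue titles and creation timestamps have already been collected somehow. The sample records are invented for illustration (they only reuse the search terms quoted earlier in this thread), and the helper names are hypothetical.

```python
import re
import unicodedata
from collections import Counter
from datetime import datetime

# Hypothetical sample records (title, created_at); not real data.
SAMPLE_ISSUES = [
    ("２０２６第一 ty444", "2026-03-30T05:00:10Z"),
    ("2026第一 TY444", "2026-03-30T05:00:40Z"),
    ("Fix crash on startup", "2026-03-30T09:15:00Z"),
    ("２０２６第一　ty444", "2026-03-30T05:01:05Z"),
]


def normalize_title(title: str) -> str:
    """Fold fullwidth/compatibility characters, case, and whitespace."""
    folded = unicodedata.normalize("NFKC", title).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def template_counts(issues):
    """Count how many titles collapse to the same normalized form."""
    return Counter(normalize_title(title) for title, _ in issues)


def burst_counts(issues, template):
    """Count issues matching a normalized title per minute of creation."""
    per_minute = Counter()
    for title, created_at in issues:
        if normalize_title(title) == template:
            ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
            per_minute[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return per_minute


templates = template_counts(SAMPLE_ISSUES)
top_template, top_count = templates.most_common(1)[0]
bursts = burst_counts(SAMPLE_ISSUES, top_template)
print(top_template, top_count, dict(bursts))
```

NFKC normalization maps fullwidth digits and ideographic spaces to their ASCII equivalents, so titles that differ only by that kind of character-stuffing fall into the same bucket. Exact matching after normalization is a deliberately simple stand-in; real spam templates would likely need fuzzier comparison.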

3. Highlighting the Security Risk

You confirmed this is already being used for SEO poisoning. To get Microsoft’s security teams to move faster, you should document if these spam links lead to:

Drive-by downloads or fake "tooling" scripts.
Phishing pages mimicking GitHub login screens.

PS: Pls upvote if this helps

xlionjuan Apr 16, 2026
Author

Looks like it has stopped.

54145a · 2026-04-10T14:07:00Z

54145a
Apr 10, 2026

Using xlionjuan's search term, it shows that around 90k Chinese spam issues currently exist.

Crawlers value GitHub highly, especially as there are fewer and fewer crawlable websites.

1 reply

xlionjuan Apr 10, 2026
Author

Yeah, it has been around 100k for a few days.

Without a doubt, GitHub is deleting them very quickly.

But GitHub is still not able to stop them, which is really, really funny 🤣
