Prepared by: Lighthouse Labs <arnold@lighthouse.cx>
Thesis
Can frontier AI models act as a diligent screening tool for grant programs?
Exploration
- Set up a frontier AI model as an impartial evaluator.
- Populate prompt context with known best practices:
- Weighted rubrics
- ENS alignment (best guess)
- Hard red flags
- Ask it to:
- Provide clear evals and feedback
- Highlight application discrepancies/gaps for stewards to formally address
- Provide funding recommendation
- Backtest on historical SPP applications
The prompt should eventually encode shared values distilled from the community and experts.
The final prompt used: https://hackmd.io/a1o98BBHRI2pNmIHtQOFvg
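To make this concrete, here is a minimal sketch of how a weighted rubric, hard red flags, and a funding recommendation could be encoded. The criteria names, weights, and thresholds below are hypothetical and are not the values used in the actual prompt linked above.

```python
# Hypothetical rubric: criteria, weights, and thresholds are illustrative only,
# not the values encoded in the actual SPP evaluation prompt.
RUBRIC_WEIGHTS = {
    "budget_transparency":    0.25,
    "ens_specific_value":     0.25,
    "milestones_and_kpis":    0.20,
    "team_capacity_evidence": 0.20,
    "open_source_compliance": 0.10,
}

HARD_RED_FLAGS = {
    "no_budget_breakdown",
    "unresolved_conflict_of_interest",
    "closed_source_deliverables",
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into a single weighted score."""
    return sum(RUBRIC_WEIGHTS[c] * scores.get(c, 0.0) for c in RUBRIC_WEIGHTS)

def recommendation(scores: dict[str, float], flags: set[str]) -> str:
    """Map a weighted score plus any hard red flags to a funding call."""
    if flags & HARD_RED_FLAGS:
        return "Do not fund"
    total = weighted_score(scores)
    if total >= 7.5:
        return "Strong fund"
    if total >= 5.0:
        return "Fund with additional guardrails"
    return "Do not fund"
```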
Experiment
- Lighthouse Labs ran the prompt over 10 applications from SPP2.
- Models used:
- Claude Code / Opus 4.6 / Sonnet 4.5
- Extended thinking (sometimes)
- Variation A
- Pure prompt
- Variation B
- Pure prompt with additional instructions:
- "Please also investigate disclosed GitHub repositories and analyse contributor activity such as number of contributors and commit frequency."
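The runs themselves were done interactively in Claude Code. As a sketch of how the same loop could be scripted against the Anthropic Python SDK, assuming the prompt is saved locally and applications live in a directory of markdown files (the model identifier and file layout are placeholders, not the exact setup used):

```python
# Sketch of the evaluation loop; the model identifier, prompt file, and
# application directory are placeholder assumptions.
from pathlib import Path
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EVAL_PROMPT = Path("spp_eval_prompt.md").read_text()  # the HackMD prompt, saved locally

VARIATION_B_SUFFIX = (
    "\n\nPlease also investigate disclosed GitHub repositories and analyse "
    "contributor activity such as number of contributors and commit frequency."
)

def evaluate(application_text: str, variation: str = "A") -> str:
    """Run one SPP application through the evaluator prompt."""
    prompt = EVAL_PROMPT + (VARIATION_B_SUFFIX if variation == "B" else "")
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model identifier
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"{prompt}\n\n--- APPLICATION ---\n{application_text}",
        }],
    )
    return response.content[0].text

for app_file in sorted(Path("applications").glob("*.md")):
    print(f"=== {app_file.name} ===")
    print(evaluate(app_file.read_text(), variation="B"))
```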
Interim results
- Only 20% of the projects sampled produced a clear "Strong fund" signal.
- 50% of projects required additional guardrails under the evaluation rubric, meaning the AI believed applicants needed to make clearer commitments.
- 20% were a clear "Do not fund."
We then asked the AI to aggregate its evaluations and produce a report of the most prominent problems observed across all applications. The following is verbatim from the AI.
The key takeaway is that the model can produce reasonably meaningful and consistent evaluations.
Problems discovered (teams anonymised)
No budget breakdown (universal)
Every application failed this. Ranges from percentage-category estimates to single dollar figures with no supporting structure. For six- and seven-figure asks, this is the most basic accountability mechanism missing from the entire cohort. The DAO cannot distinguish appropriate pricing from padding without line items.
KPIs that aren’t KPIs (near-universal)
Most proposals list measurement categories rather than targets — “user engagement,” “community satisfaction,” “accuracy.” A real KPI can fail. Almost none of these can. The rare exceptions committed to specific numbers against specific timelines and were verifiably better applications as a result.
Funded scope defined as “TBD after research”
Several applications included aspirational deliverables where the team genuinely doesn’t know what they’ll build yet — and asked the DAO to fund the discovery phase as if it were a committed deliverable. The correct approach is either to exclude undefined scope or stage the funding with a milestone gate before the larger commitment releases.
ENS as one client among several
The most common structural issue in the lower half of the cohort: teams that build the same product for multiple DAOs and submitted their standard client pitch with “ENS” substituted in. Nothing in the scope is specifically valuable to ENS as opposed to any other DAO that would pay $300K. The program’s minimum grant level attracts this behavior.
No quarterly milestone structure
Most proposals define scope as feature lists rather than temporal delivery plans. With no quarterly gates tied to specific deliverables, the DAO has no mechanism to pause or redirect streams that underperform — which matters because SPP payments stream continuously regardless of delivery status.
GitHub evidence doesn’t match claimed team size
A recurring pattern: claimed engineering headcounts that don’t match visible commit activity even after accounting for non-technical roles. The applications where repo evidence clearly validated team size claims were the same applications that scored highest overall.
Scope overlap without differentiation
Multiple applicants proposed notification systems, delegate dashboards, and financial transparency tooling in the same cohort with no acknowledgment of the overlap and no explanation of why the DAO should fund parallel approaches. The combined ask across overlapping scope exceeded $1M.
Open-source compliance as afterthought
The program requires all directly-funded work to be MIT-licensed. Several applications either didn’t address this at submission, addressed it only after steward prompting, or committed to open-sourcing “at end of program” rather than throughout. The requirement appears to not be well understood at submission time.
COI disclosures that don’t resolve the conflict
Disclosure is necessary but not sufficient. The appropriate response to a structural conflict of interest is a commitment that removes it — not a promise to manage it through recusal.
Solutions to Common SPP Application Problems (Opus 4.6 generated)
1. Budget breakdown → Standardized budget template
Require applicants to submit a structured budget form alongside their proposal — headcount by role, infrastructure/tooling costs, travel/events, and overhead. Publish a reference format so there’s no ambiguity about what’s expected. Cap any single line item above 40% of the total without additional justification. This shifts the burden of proof from the DAO asking questions post-submission to applicants justifying costs upfront.
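One way to make such a template machine-checkable is to treat the budget as structured data and flag any line item over the cap. The categories and figures below are hypothetical, not a prescribed SPP format.

```python
# Hypothetical structured budget record; categories and figures are
# illustrative, not a prescribed SPP format.
BUDGET = {
    "engineering (3 FTE)":      210_000,
    "design (0.5 FTE)":          35_000,
    "infrastructure & tooling":  25_000,
    "travel & events":           10_000,
    "overhead":                  20_000,
}

def oversized_line_items(budget: dict[str, int], cap: float = 0.40) -> list[str]:
    """Return line items exceeding the cap share of the total budget,
    which would require additional written justification."""
    total = sum(budget.values())
    return [item for item, cost in budget.items() if cost / total > cap]

print(oversized_line_items(BUDGET))  # -> ['engineering (3 FTE)'] at 70% of total
```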
2. KPIs that aren’t KPIs → Require numeric targets with a baseline
Add a mandatory section: “Current state / Baseline” and “Target by end of term.” Any KPI without a number and a measurement method gets flagged at eligibility review. Stewards could maintain a public KPI bank of accepted metrics from prior seasons so applicants have examples to work from — and the DAO builds a growing dataset of governance across cohorts.
3. Undefined scope → Staged funding with milestone gates
Split grants into tranches: an initial 40% release on approval, a second 30% release after a Q2 milestone review, and the final 30% after a Q4 delivery report. This doesn’t punish teams for not knowing everything upfront — it just means genuinely experimental work gets funded at the research scale until it proves itself, rather than at the full scale. Stewards would need clear authority to delay a tranche without triggering a full governance vote each time.
4. Generic DAO tooling repackaged as ENS work → ENS-specific technical investigation requirement
Require applicants to include an “ENS Technical Context” section demonstrating they’ve engaged with ENS’s actual contracts, standards, and roadmap. This could be as simple as: “Which ENS contracts does your work interact with? What specific ENS limitations or gaps does it address? How does your scope relate to ENSv2?” Applications that can’t answer these concretely are revealing that they haven’t done the homework. Stewards could evaluate this section as a threshold gate before scoring the rest.
5. No milestone structure → Quarterly commitment tables as a required format
Make the quarterly delivery table mandatory — Q1 through Q4, specific deliverables per quarter, verification method for each. NameHash Labs did this voluntarily and it was the clearest accountability structure in the cohort. If it’s required, there’s no reason for other applicants not to do it. Combine this with the staged funding above so milestone tables have actual consequences rather than being aspirational.
6. Team size vs. GitHub mismatch → Pre-funding verification call
Before final approval, stewards hold a 30-minute technical verification call with the engineering lead. Not an interview — a structured review of the team’s existing repos, commit history, and architecture decisions. This surfaces the mismatch between claimed and verifiable capacity before funds flow, and gives genuine teams an opportunity to explain gaps (offshore contributors, private repos, etc.) on the record.
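As an illustration of the data such a review could draw on, the sketch below pulls contributor count and recent commit frequency from the public GitHub REST API. The repository names are placeholders, and a real review would also look at commit content and architecture, not just counts.

```python
# Illustrative repo-evidence check using the public GitHub REST API; the
# owner/repo values passed in are placeholders for a disclosed repository.
import requests

GITHUB_API = "https://api.github.com"

def contributor_activity(owner: str, repo: str, token: str | None = None) -> dict:
    """Return contributor count (first page, up to 100) and commit totals
    for the last 52 weeks."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}

    contributors = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/contributors",
        headers=headers, params={"per_page": 100}, timeout=30,
    ).json()

    # Weekly commit totals for the last year; GitHub may return 202 while it
    # computes these stats, so a production script would retry on empty data.
    activity = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/stats/commit_activity",
        headers=headers, timeout=30,
    ).json()

    weekly = [week["total"] for week in activity] if isinstance(activity, list) else []
    return {
        "contributors": len(contributors),
        "commits_last_year": sum(weekly),
        "avg_commits_per_week": round(sum(weekly) / 52, 1) if weekly else 0.0,
    }

# Example with a hypothetical repository:
# print(contributor_activity("example-org", "example-repo"))
```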
7. Scope overlap → Explicit differentiation requirement and cohort-level visibility
Publish the full applicant list and scope summaries early in the window so teams can see what else is in the cohort. Require a “How does this differ from X” section when scope overlaps are obvious — stewards can flag this at eligibility review. Longer term, the DAO could maintain a live map of funded work across SPP seasons so new applicants can identify genuine gaps rather than re-funding covered territory.
8. Open-source compliance as afterthought → Checklist at eligibility gate
Add open-source compliance to the eligibility confirmation checklist with specific questions: Is the work MIT-licensed from day one or at end of term? Is there a public repository URL? Will proprietary components be separated from grant-funded components? Applications that can’t answer these get a compliance note before the vote, not after. The steward’s eligibility confirmation email could include a standard reminder with the exact program language.
9. COI disclosure without resolution → Structural resolution required, not just disclosure
The application should require a COI section that not only discloses conflicts but proposes a structural resolution — and stewards should evaluate whether the proposed resolution is adequate before endorsing. “We will recuse from relevant discussions” is not a resolution. “We will not execute scope item X under this grant; the DAO will separately procure that work” is. The distinction is between managing a conflict and removing it.
Meta-solution: Publish evaluation rubrics before the application window opens
The single highest-leverage change. If applicants know they will be scored on budget transparency, ENS-specific technical investigation, quarterly milestone structure, and KPI specificity before they write their proposals, the average quality of submissions rises substantially. Rubrics don’t need to be the exact internal scoring weights — just the criteria and what good looks like. Several applicants in this cohort would have written materially different proposals with this information available upfront.
Risks
- Knowing the prompt ahead of time could allow applicants to game the evaluation.
- Perhaps the ideals should be communicated publicly, with the exact methodology kept secret.
- Anyone who knows the prompt cannot be a service provider.
- Allow up to two resubmissions for applications that do not meet the acceptance threshold.
- AIs are agreeable even when multiple red flags are raised, so weak applications may still score well.
- This is an R&D problem and can be improved over time.
Conclusion
- This experiment demonstrates that AI can provide a consistent scoring, ranking, and evaluation methodology for screening applications.
- High-quality, actionable feedback can be provided to every applicant, allowing for fast iteration.
- Evaluation with actionable feedback provides intervention points for voters to align on, reducing cognitive burden.
- Screening applications for eligibility can be automated, ensuring that applicants meet the standard defined by the DAO.
- Perhaps most importantly: the uncomfortable observations get surfaced without social cost (i.e. “the AI said what everyone is thinking”).
