Foundation model research, applied ML systems, computer vision, ML infrastructure, and AI-native product engineering qualify under IRC Section 41. Section 174 amortization makes documenting the credit more important than ever, particularly for teams burning meaningful spend on training compute.
Most AI and ML companies that qualify do not think of their engineering work as research. But if your team is engineering novel model architectures under accuracy uncertainty, building distributed training infrastructure where the parallelism strategy is not obvious, designing custom retrieval architectures where standard RAG falls below the accuracy bar, or running ablation studies to evaluate alternative attention or fusion mechanisms, the R&D credit likely applies right now.
Since 2022, Section 174 requires capitalization and amortization of research and experimental expenditures over 5 years for domestic R&D and 15 years for foreign R&D. For AI companies, this applies to engineer wages, ML researcher wages, and qualifying training compute, which is exactly the spend profile that defines the company. A team running heavy pretraining or fine-tuning workloads sees current-year taxable income increase materially even when cash burn is unchanged. The R&D credit partially offsets this impact by directly reducing tax liability on the same expenses. AI companies that previously skipped the credit because the work felt like ordinary engineering now have a different calculus: the documentation is required for Section 174 anyway, and the credit captures real value from work that would otherwise just create a deferred deduction.
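To make the timing effect concrete, here is a minimal sketch of the amortization arithmetic under the rules as described above. It assumes the statutory midpoint convention (amortization begins at the midpoint of the tax year in which the cost is incurred), which the paragraph above does not spell out. The $10,000,000 spend figure and the function name `amortization_schedule` are hypothetical, and the sketch ignores every other tax adjustment.

```python
from fractions import Fraction


def amortization_schedule(spend, years):
    """Return the deduction for each tax year under straight-line
    amortization with a midpoint convention (an assumption of this sketch).

    The first and last tax years each receive half a year of amortization,
    so the schedule spans years + 1 tax years.
    """
    annual = Fraction(spend) / years
    half = annual / 2
    return [half] + [annual] * (years - 1) + [half]


# Hypothetical example: $10,000,000 of research spend in a single year.
spend = 10_000_000
domestic = amortization_schedule(spend, 5)   # 5-year domestic schedule
foreign = amortization_schedule(spend, 15)   # 15-year foreign schedule

first_year_domestic = domestic[0]            # deductible in the year of spend
deferred_domestic = spend - first_year_domestic  # pushed into later years

print(f"Year-1 domestic deduction: ${float(first_year_domestic):,.0f}")
print(f"Deferred to later years:   ${float(deferred_domestic):,.0f}")
print(f"Year-1 foreign deduction:  ${float(foreign[0]):,.0f}")
```

Under these assumptions, only $1,000,000 of the $10,000,000 of domestic spend is deductible in the year the cash goes out, and the remaining $9,000,000 is deferred. That deferral is the gap between cash burn and taxable income described above.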
To satisfy the permitted purpose prong of the four-part test, the work must aim to develop or improve the functionality, performance, reliability, or quality of a process, technique, formula, or software component. AI and ML companies meet this test by engineering novel model architectures, more efficient training infrastructure, more accurate retrieval systems, more capable inference serving, or more robust evaluation frameworks. Experimental failure counts. Failed pretraining runs, abandoned architecture variants, ablation studies that showed no improvement, and fine-tuning approaches that did not meet the accuracy bar can all generate qualifying research expenses.
A foundation model team engineers a custom sparse attention pattern intended to reduce training compute at fixed quality. Their first variant fails to converge on a reasoning benchmark. A second variant converges but underperforms baseline on long-context tasks. A third variant combining grouped query attention with a custom routing scheme matches baseline at materially lower compute. All three variants qualify because the intent throughout was to improve technical performance under engineering uncertainty.
This prong is met by any AI or ML team developing a technically better architecture, training pipeline, retrieval system, or evaluation framework. ML researchers, ML engineers, infrastructure engineers, and applied scientists all perform work that satisfies this test as part of their standard scope.
To satisfy the technological in nature prong, the work must rely on principles of computer science, mathematics, engineering, or physical science. AI and ML engineering is inherently grounded in these disciplines: machine learning theory, optimization, linear algebra, probability and statistics, distributed systems, computer architecture, and information theory all satisfy this prong. Business decisions about pricing, packaging, customer segmentation, and go-to-market strategy do not qualify, but the engineering underlying the model architecture, training infrastructure, and inference systems does.
A computer vision team designs a custom multimodal fusion architecture drawing on representation learning theory, attention mechanism research, and statistical learning principles. An ML infrastructure team engineers a novel model parallelism strategy drawing on distributed systems theory and high-performance computing research. Both draw on recognized scientific and mathematical foundations and satisfy the technological prong directly.
The threshold is low for AI and ML engineering because the scientific foundation is inherent to the discipline. Machine learning, optimization, distributed systems, and information retrieval all rest on established computer science and mathematical principles.
To satisfy the elimination of uncertainty prong, the work must aim to eliminate uncertainty about the capability or method of achieving a technical result. AI and ML development is dense with this kind of uncertainty: whether a candidate architecture will train at all, whether a fine-tuning approach will reach the required accuracy, whether a custom retrieval system can resolve domain-specific queries with citation-level fidelity, whether a parallelism strategy can train a model that exceeds standard memory bounds, or whether an evaluation framework will catch the failure modes that matter for the user.
An applied ML team building a custom recommender system does not know at the start whether their two-tower retrieval architecture combined with a learned re-ranker can hit the relevance and latency targets at production scale. The team systematically evaluates multiple candidate architectures across production cohorts, runs A/B tests against the existing system, and validates against the defined performance criteria. The uncertainty about both the technical method and the achievable accuracy is eliminated through the experimental process.
AI and ML engineering frequently involves uncertainty about both capability (can this architecture meet the accuracy bar at all) and method (which architecture will hold under deployment constraints). Either form of uncertainty qualifies.
To satisfy the process of experimentation prong, the work must involve a process of evaluating alternatives to eliminate technical uncertainty. This does not require a formal lab or a dedicated research team. In AI and ML development, the experimental process is typically the engineering workflow itself: designing candidate architectures, running ablation studies, evaluating results against defined benchmarks, and iterating on the architecture, training pipeline, or evaluation framework. Contemporaneous documentation of this process is the foundation of a defensible R&D credit study.
An ML infrastructure team evaluates four different parallelism strategies before deploying their distributed training stack. Each strategy is benchmarked against representative model sizes with defined throughput and memory targets. Results are documented in architecture decision records and benchmark spreadsheets. The systematic evaluation of alternatives is the process of experimentation. The documentation of that process is what makes the credit defensible under examination.
Architecture decision records, training run logs, ablation study spreadsheets, evaluation results across candidate architectures, and benchmark comparisons against defined accuracy or performance criteria all serve as evidence of a process of experimentation under IRC Section 41.
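One way to keep this kind of record contemporaneous is to log each candidate evaluation as it happens. The sketch below is illustrative only: the `ExperimentRecord` fields, the strategy names, the run date, and the throughput and memory figures are all hypothetical, and nothing here is a format prescribed by IRC Section 41 or the IRS.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class ExperimentRecord:
    """One evaluated alternative (all field names are hypothetical)."""
    run_date: str            # when the run happened, ISO format
    candidate: str           # which alternative was evaluated
    uncertainty: str         # the technical question the run addresses
    throughput: float        # measured tokens/sec (illustrative unit)
    peak_memory_gb: float    # measured peak memory per device
    met_targets: bool = False


def evaluate(record, min_throughput, max_memory_gb):
    """Mark a record as passing only if it meets both defined targets."""
    record.met_targets = (
        record.throughput >= min_throughput
        and record.peak_memory_gb <= max_memory_gb
    )
    return record


# Made-up measurements for four candidate parallelism strategies.
measurements = [
    ("data-parallel", 41_000.0, 92.0),
    ("tensor-parallel", 55_000.0, 78.0),
    ("pipeline-parallel", 48_000.0, 61.0),
    ("hybrid", 63_000.0, 70.0),
]

records = [
    evaluate(
        ExperimentRecord(
            run_date=date(2024, 3, 1).isoformat(),
            candidate=name,
            uncertainty="Can this strategy hit throughput within memory limits?",
            throughput=tput,
            peak_memory_gb=mem,
        ),
        min_throughput=50_000.0,
        max_memory_gb=80.0,
    )
    for name, tput, mem in measurements
]

# Failed candidates stay in the log: they are evidence of experimentation.
for r in records:
    print(json.dumps(asdict(r)))
```

The design choice worth noting is that every candidate is recorded with the uncertainty it addressed and the targets it was measured against, whether or not it passed.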
For the full four-part test explanation with examples across industries, see the main R&D Tax Credit page.
Each sub-sector below includes the qualifying activities, the typical expense breakdown, and the primary exclusion.
An 18-person foundation model research lab identified a research goal: reduce pretraining compute at fixed quality by engineering a novel sparse-and-grouped attention pattern. Their research and engineering team spent 14 months running the program, evaluating six alternative attention mechanisms (full attention baseline, two grouped query attention variants, two sparse attention variants, and a custom hybrid sparse-grouped pattern) across a defined benchmark suite of 12 reasoning, code, and long-context evaluation tasks. The team trained working implementations of each at multiple model sizes and ran ablation studies measuring compute, convergence behavior, and benchmark quality. Three of the six variants failed to converge on at least one reasoning benchmark. Two converged but underperformed baseline. The hybrid variant matched baseline at materially lower compute and was selected for further scaling work.
The team had no internal R&D classification for the work. They considered it "scaling research." But the documented technical uncertainty about whether any of the candidate attention mechanisms would meet the quality bar at reduced compute, the systematic evaluation across the 12-benchmark suite, and the failed training run logs all met the criteria for qualifying research expenses under IRC Section 41. Compute spend in pretraining and ablation runs was material under the Section 174 lens.
A $14,000,000 ARR consumer marketplace identified a structural problem during scaling: their legacy collaborative filtering recommender was hitting a relevance ceiling and breaching latency budgets at peak traffic. Their nine-person ML team spent nine months designing a custom two-tower retrieval architecture combined with a learned re-ranker, evaluating four alternative ranking architectures (neural collaborative filtering baseline, two-tower with shared embeddings, two-tower with item-tower transformer, and a hybrid two-stage retrieval and re-rank model) across six production cohorts via custom A/B test infrastructure. The team built bespoke feature engineering pipelines for marketplace-specific signals (seller velocity, price elasticity, inventory depth) and shipped working implementations of three of the four architectures to limited cohorts before final selection.
The team framed the work in product terms (relevance, conversion, latency) and never thought of it as research. aecre's technical interview process identified the qualifying experimental structure across the nine-month program. The custom A/B test infrastructure built specifically for the architecture comparison, the systematic evaluation of alternatives across the six production cohorts, and the bespoke feature engineering work all met the criteria for qualifying research expenses under IRC Section 41.
A 12-person ML team at a clinical decision support company identified a research goal: engineer a custom vision-language architecture for radiology report generation from imaging studies, fusing image features with structured patient metadata. Their team spent 12 months on the program, evaluating five candidate architectures (an off-the-shelf vision encoder baseline with frozen weights, two custom encoder pretraining strategies, a multimodal fusion variant with cross-attention, and a hybrid pretrained-and-domain-adapted approach) against a held-out radiologist-labeled cohort of 4,200 studies spanning eight imaging modalities. The team designed a custom encoder pretraining stage on internal imaging data because off-the-shelf encoders underperformed on domain-specific findings. Multiple training runs failed to converge before the team landed on a workable fusion strategy. The radiologist-labeled benchmark cohort served as the defined evaluation set.
The team described the work as "improving model accuracy." aecre's technical interview process identified the qualifying experimental structure across the 12-month program. The systematic evaluation of five candidate architectures against the 4,200-study evaluation cohort, the failed training run logs, the custom encoder pretraining engineering, and the radiologist agreement studies all met the criteria for qualifying research expenses under IRC Section 41. Compute spend in pretraining and fine-tuning runs was material under the Section 174 lens.
A 16-engineer ML platform team at an AI infrastructure company identified a research goal: build training infrastructure to support models exceeding standard FSDP memory bounds at the team's target throughput. Their team spent 10 months engineering a novel parallelism strategy combining tensor parallelism, pipeline parallelism, and a custom expert parallelism scheme for MoE workloads. They benchmarked throughput and memory across seven candidate parallelism configurations on representative model sizes spanning three orders of magnitude in parameter count. The team built custom CUDA kernels for two specific bottlenecks where standard libraries left throughput on the table, and ran ablation studies on each bottleneck independently to attribute the throughput gain. Three of the seven parallelism configurations failed to converge or hit out-of-memory errors at the target model size. The hybrid configuration combining all three parallelism dimensions met the throughput target.
The team described the work as "scaling the training stack." aecre's technical interview process identified the qualifying experimental structure across the 10-month program. The systematic evaluation of seven parallelism configurations across representative model sizes, the kernel performance comparison against baseline libraries, the ablation studies, and the failed configuration logs all met the criteria for qualifying research expenses under IRC Section 41. Training compute spend on benchmark runs was material under the Section 174 lens.
An 11-person engineering team at a legaltech AI product company identified a product gap during enterprise pilots: their off-the-shelf RAG pipeline using a third-party language model for generation was hitting a citation accuracy ceiling around 60% on their internal benchmark, far below the bar their attorney users would tolerate. Hallucinated citations were unacceptable. Their team spent 11 months engineering a custom multi-stage retrieval architecture (hybrid sparse-dense retrieval, query decomposition, citation-aware re-ranking) and built a custom evaluation framework that scored citation accuracy and answer faithfulness against an attorney-labeled benchmark cohort of 1,400 question-answer-citation triples. They evaluated four candidate retrieval strategies against the cohort. Multiple architectures shipped to limited internal pilots before final selection.
The qualifying engineering work was the retrieval architecture, the custom evaluation framework, and the citation-aware re-ranking, not the use of the third-party language model for generation. The systematic evaluation of four candidate retrieval strategies against the attorney-labeled 1,400-triple cohort, the custom evaluation framework engineering, and the documented technical uncertainty about whether any candidate architecture would meet the citation accuracy bar all met the criteria for qualifying research expenses under IRC Section 41. aecre's technical interview process drew the line between the qualifying engineering layer and the third-party model layer in the documentation file.
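As a rough illustration of what scoring citation accuracy against labeled triples can look like, the sketch below computes the share of questions where the system cited at least one source and every cited source appears in the labeled set. This is an assumed, simplified metric, not the team's actual framework. The `LabeledExample` type, the scoring rule, and the placeholder data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """A labeled question-answer-citation triple (hypothetical schema)."""
    question: str
    answer: str
    citations: frozenset  # citation identifiers accepted by the labeler


def citation_accuracy(examples, predictions):
    """Fraction of examples where the system cited at least one source
    and every cited source is in the labeled set.

    `predictions` maps each question to the set of citations the system
    produced. A missing or empty prediction counts as incorrect.
    """
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        cited = predictions.get(ex.question, set())
        if cited and cited <= ex.citations:
            correct += 1
    return correct / len(examples)


# Placeholder benchmark and system outputs, for illustration only.
benchmark = [
    LabeledExample("Q1", "A1", frozenset({"source-a", "source-b"})),
    LabeledExample("Q2", "A2", frozenset({"source-c"})),
    LabeledExample("Q3", "A3", frozenset({"source-d"})),
    LabeledExample("Q4", "A4", frozenset({"source-e"})),
]
predictions = {
    "Q1": {"source-a"},              # every cited source is labeled
    "Q2": {"source-c", "source-x"},  # includes an unlabeled citation
    "Q3": {"source-d"},              # every cited source is labeled
    # Q4 has no prediction and counts as incorrect
}

score = citation_accuracy(benchmark, predictions)
print(f"Citation accuracy: {score:.0%}")
```

Running the same scoring function over each candidate retrieval strategy gives a like-for-like comparison against a defined accuracy bar, and a strict rule such as this one treats a single unlabeled citation as a failure for that question.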
Most AI and ML pass-through entities (S-Corps, partnerships, LLCs) see the full credit benefit at individual rates. Nearly 40 states stack additional credits on top of the federal credit. The federal number is the floor.
The feasibility conversation takes 30 minutes. We assess your qualifying activities, estimate credit value, and tell you plainly whether a study makes sense for your company. No commitment, no cost.
Book a Free Assessment
We respond within one business day. Partner-led from first conversation through filing.