R&D Tax Credit for AI and ML Companies

Qualification

How the IRS Four-Part Test Applies to AI and ML Work

The IRS uses a four-part test to determine whether work qualifies for the R&D tax credit under IRC Section 41, and every qualifying activity must pass all four parts. AI and ML engineering meets the test when the team is resolving genuine technical uncertainty through systematic experimentation. Foundation model research, applied ML, computer vision, and ML infrastructure work can all satisfy it.

Most AI and ML companies that qualify do not think of their engineering work as research. But if your team is engineering novel model architectures under accuracy uncertainty, building distributed training infrastructure where the parallelism strategy is not obvious, designing custom retrieval architectures where standard RAG falls below the accuracy bar, or running ablation studies to evaluate alternative attention or fusion mechanisms, the R&D credit likely applies right now.

Important: Section 174 Hits AI Companies Hardest. The Credit Is the Offset.

Since 2022, Section 174 has required capitalization and amortization of research and experimental expenditures over 5 years for domestic R&D and 15 years for foreign R&D. For AI companies, this applies to engineer wages, ML researcher wages, and qualifying training compute, which is exactly the spend profile that defines the company. A team running heavy pretraining or fine-tuning workloads sees current-year taxable income increase materially even when cash burn is unchanged. The R&D credit partially offsets this impact by directly reducing tax liability on the same expenses. AI companies that previously skipped the credit because the work felt like ordinary engineering now have a different calculus: the documentation is required for Section 174 anyway, and the credit captures real value from work that would otherwise just create a deferred deduction.
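
The gap between cash spent and the deduction allowed is easiest to see with a small calculation. The following is a rough Python sketch of the 5-year and 15-year schedules described above. It assumes straight-line amortization that starts at the midpoint of the first tax year (a mid-year convention that the text above does not spell out), and the function name `amortization_schedule` and the 1,000,000 spend figure are purely illustrative.

```python
def amortization_schedule(spend: float, years: int) -> list[float]:
    """Straight-line amortization, assuming a mid-year start (an assumption).

    The first and last tax years each receive half of an annual deduction,
    so the schedule spans years + 1 tax years in total.
    """
    annual = spend / years
    return [annual / 2] + [annual] * (years - 1) + [annual / 2]


# Illustrative figure only: 1,000,000 of research spend in a single year.
spend = 1_000_000

domestic = amortization_schedule(spend, 5)    # 5-year domestic schedule
foreign = amortization_schedule(spend, 15)    # 15-year foreign schedule

# Portion of the cash spend that is not deductible in the first year.
first_year_addback = spend - domestic[0]

for year, deduction in enumerate(domestic, start=1):
    print(f"domestic, tax year {year}: {deduction:,.0f}")
print(f"still in first-year taxable income: {first_year_addback:,.0f}")
```

Under these assumptions, 1,000,000 of domestic research spend produces a first-year deduction of 100,000, so 900,000 of cash that has already left the company still counts toward taxable income that year.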

01. Permitted Purpose

The work must aim to develop or improve the functionality, performance, reliability, or quality of a process, technique, formula, or software component. AI and ML companies meet this test through engineering novel model architectures, more efficient training infrastructure, more accurate retrieval systems, more capable inference serving, or more robust evaluation frameworks. Experimental failure counts. Failed pretraining runs, abandoned architecture variants, ablation studies that showed no improvement, and fine-tuning approaches that did not meet the accuracy bar all contribute qualifying research expenses.

Industry Example

A foundation model team engineers a custom sparse attention pattern intended to reduce training compute at fixed quality. Their first variant fails to converge on a reasoning benchmark. A second variant converges but underperforms baseline on long-context tasks. A third variant combining grouped query attention with a custom routing scheme matches baseline at materially lower compute. All three variants qualify because the intent throughout was to improve technical performance under engineering uncertainty.

This prong is met by any AI or ML team developing a technically better architecture, training pipeline, retrieval system, or evaluation framework. ML researchers, ML engineers, infrastructure engineers, and applied scientists all perform work that satisfies this test as part of their standard scope.

02. Technological in Nature

The work must rely on principles of computer science, mathematics, engineering, or physical science. AI and ML engineering is inherently grounded in these disciplines: machine learning theory, optimization, linear algebra, probability and statistics, distributed systems, computer architecture, and information theory all satisfy this prong. Business decisions about pricing, packaging, customer segmentation, and go-to-market strategy do not qualify, but the engineering underlying the model architecture, training infrastructure, and inference systems does.

Industry Example

A computer vision team designs a custom multimodal fusion architecture drawing on representation learning theory, attention mechanism research, and statistical learning principles. An ML infrastructure team engineers a novel model parallelism strategy drawing on distributed systems theory and high-performance computing research. Both draw on recognized scientific and mathematical foundations and satisfy the technological prong directly.

The threshold is low for AI and ML engineering because the scientific foundation is inherent to the discipline. Machine learning, optimization, distributed systems, and information retrieval all rest on established computer science and mathematical principles.

03. Elimination of Uncertainty

The work must aim to eliminate uncertainty about the capability or method of achieving a technical result. AI and ML development is dense with this kind of uncertainty: whether a candidate architecture will train at all, whether a fine-tuning approach will reach the required accuracy, whether a custom retrieval system can resolve domain-specific queries with citation-level fidelity, whether a parallelism strategy can train a model that exceeds standard memory bounds, or whether an evaluation framework will catch the failure modes that matter for the user.

Industry Example

An applied ML team building a custom recommender system does not know at the start whether their two-tower retrieval architecture combined with a learned re-ranker can hit the relevance and latency targets at production scale. The team systematically evaluates multiple candidate architectures across production cohorts, runs A/B tests against the existing system, and validates against the defined performance criteria. The uncertainty about both the technical method and the achievable accuracy is eliminated through the experimental process.

AI and ML engineering frequently involves uncertainty about both capability (can this architecture meet the accuracy bar at all) and method (which architecture will hold under deployment constraints). Either form of uncertainty qualifies.

04. Process of Experimentation

The work must involve a process of evaluating alternatives to eliminate technical uncertainty. This does not require a formal lab or a dedicated research team. In AI and ML development, the experimental process is typically the engineering workflow itself: designing candidate architectures, running ablation studies, evaluating results against defined benchmarks, and iterating on the architecture, training pipeline, or evaluation framework. Contemporaneous documentation of this process is the foundation of a defensible R&D credit study.

Industry Example

An ML infrastructure team evaluates four different parallelism strategies before deploying their distributed training stack. Each strategy is benchmarked against representative model sizes with defined throughput and memory targets. Results are documented in architecture decision records and benchmark spreadsheets. The systematic evaluation of alternatives is the process of experimentation. The documentation of that process is what makes the credit defensible under examination.

Architecture decision records, training run logs, ablation study spreadsheets, evaluation results across candidate architectures, and benchmark comparisons against defined accuracy or performance criteria all serve as evidence of a process of experimentation under IRC Section 41.
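
A benchmark record of this kind can be kept in a simple structured form. The following is a minimal Python sketch of one possible layout, not a required format. The `BenchmarkResult` fields, the `meets_targets` helper, the candidate names, and every number are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkResult:
    """One row of a benchmark log for a candidate approach (hypothetical schema)."""
    candidate: str
    hypothesis: str                   # what the team expected to learn
    throughput: Optional[float]       # None when the run failed outright
    peak_memory_gb: Optional[float]   # None when the run failed outright
    outcome_note: str                 # short written conclusion, kept with the data


def meets_targets(result: BenchmarkResult,
                  min_throughput: float,
                  max_memory_gb: float) -> bool:
    """A failed run (no measurements) never meets the targets but stays in the log."""
    if result.throughput is None or result.peak_memory_gb is None:
        return False
    return (result.throughput >= min_throughput
            and result.peak_memory_gb <= max_memory_gb)


# Placeholder entries; failed and rejected candidates are recorded, not discarded.
log = [
    BenchmarkResult("strategy_a", "baseline sharding is sufficient",
                    800.0, 70.0, "below throughput target"),
    BenchmarkResult("strategy_b", "pipeline stages reduce memory",
                    None, None, "out of memory at target model size"),
    BenchmarkResult("strategy_c", "hybrid scheme meets both targets",
                    1200.0, 75.0, "met both targets"),
]

selected = [r.candidate for r in log if meets_targets(r, 1000.0, 80.0)]
rejected = [r.candidate for r in log if not meets_targets(r, 1000.0, 80.0)]
print("selected:", selected)
print("rejected:", rejected)
```

Keeping the hypothesis and the outcome note next to the measurements is the point of the design: the log then shows which alternatives were considered and why each was kept or dropped, including the failures.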

For the full four-part test explanation with examples across industries, see the main R&D Tax Credit page.

Activity Analysis

What Qualifies as R&D for AI and Machine Learning Companies

The credit rewards genuine technical development work under uncertainty. Routine application of off-the-shelf models or standard prompt engineering does not qualify, and overclaiming creates serious audit risk. The qualification standard is defined by the IRS research credit guidelines.

Pretraining and post-training engineering on novel transformer architectures, attention mechanism research, mixture-of-experts engineering, custom tokenizer development, scaling-laws experimentation, and RLHF, DPO, and GRPO training pipeline engineering. Qualifying work includes ablation studies across candidate architectures, evaluation harness engineering against benchmark suites, and systematic data mixture and curriculum research under technical uncertainty. ML researcher and ML engineer wages allocated to architecture and experimentation hours are the primary qualifying research expense. Standard fine-tuning of off-the-shelf models against off-the-shelf datasets without architectural engineering is excluded.

Engineering of custom recommender systems, ranking and retrieval models, fraud and anomaly detection architectures, time-series forecasting under technical uncertainty, custom feature engineering pipelines, and novel A/B test infrastructure for ML models. Qualifying work includes the systematic evaluation of two-tower retrieval architectures, learned re-rankers, candidate generation models, and feature pipelines under defined accuracy and latency targets. Standard use of AutoML or off-the-shelf forecasting libraries without architectural engineering is excluded. ML engineer and applied scientist wages are the primary qualifying expense.

Custom vision architecture engineering for domain tasks including medical imaging, robotics perception, document understanding, satellite and remote sensing imagery, and industrial inspection. Qualifying work includes multimodal fusion architecture, custom training data pipelines under technical uncertainty, novel evaluation methodology for domain-specific vision tasks, and custom encoder pretraining on domain-specific data. Standard application of pretrained vision models without architectural engineering or domain-specific evaluation work is excluded.

Custom distributed training infrastructure including FSDP and DeepSpeed configurations, novel parallelism strategies (tensor, pipeline, expert), GPU efficiency and kernel engineering, and training reliability infrastructure under technical uncertainty about throughput, memory, or convergence. Qualifying work includes systematic benchmarking of parallelism strategies on representative model sizes, custom CUDA kernel development for identified bottlenecks, and engineering of training reliability layers under multi-node failure conditions. Standard application of out-of-the-box distributed training libraries without engineering is excluded.

Custom inference serving infrastructure including vLLM-style optimizations, custom KV cache architecture, novel batching and continuous batching strategies, quantization and distillation engineering under accuracy uncertainty, and custom inference engines for latency-constrained deployment. Qualifying work includes the systematic evaluation of inference serving architectures against defined throughput, latency, and quality targets. Standard deployment of vendor-provided inference servers without engineering optimization or architectural innovation is excluded.

Engineering of custom retrieval-augmented generation architectures, custom evaluation frameworks, vector database and retrieval engineering, agent orchestration frameworks, and prompt-pipeline architecture under technical uncertainty about accuracy or correctness. The qualifying engineering is the layer between the user and the underlying language model, when that layer involves systematic evaluation of alternatives against defined accuracy benchmarks. Pure API wrapper products that pass user input to a third-party language model without custom retrieval, evaluation, or orchestration engineering are excluded.

Fine-tuning, instruction tuning, parameter-efficient fine-tuning (LoRA, QLoRA, adapter tuning), and continued pretraining engineering work that involves systematic experimentation across architectures, data mixtures, or training configurations under uncertainty about whether the resulting model will meet the accuracy bar. Qualifying work includes custom training data engineering, novel evaluation methodology for the fine-tuned model, and ablation studies across candidate fine-tuning approaches. Standard fine-tuning following published recipes against off-the-shelf datasets without methodology innovation or evaluation engineering is excluded.

Products that pass user input to a third-party language model (OpenAI, Anthropic, Google, or others) and return the response without custom retrieval, evaluation, or orchestration engineering do not qualify. The qualifying engineering work happens at the model provider, not at the wrapper company. The same logic applies to standard LangChain or LlamaIndex assemblies that follow published patterns without architectural engineering. Custom retrieval architectures, custom evaluation frameworks, agent orchestration engineering, and prompt-pipeline architecture under technical uncertainty all qualify even when the underlying model is third-party. The line is the engineering layer between the user and the model.

Standard application of pretrained vision models, off-the-shelf language models, or commercial AI services without architectural engineering or evaluation methodology innovation does not qualify. Calling a third-party API, deploying a Hugging Face model with default configuration, or applying a published recipe against a default dataset does not involve technical uncertainty about engineering capability. The same engineering team performing this work may also do qualifying architecture or evaluation work elsewhere. Only the qualifying work counts toward QREs.

Iterative prompt tuning against a third-party language model, prompt template library development, and standard prompt-engineering workflows do not qualify under IRC Section 41. The credit applies to engineering work involving technical uncertainty about how to achieve a technological result, not to copywriting or messaging optimization that happens to involve a language model. Custom prompt-pipeline architecture engineered under technical uncertainty about correctness (with systematic evaluation against defined accuracy benchmarks) may qualify when the engineering uncertainty is genuine.

Standard data labeling, annotation, and curation work performed against established taxonomies and annotation guidelines does not qualify. The labeling work itself is operational, not research. Custom labeling methodology development under technical uncertainty (active learning frameworks, novel annotation schemes for ambiguous domains, custom inter-annotator agreement methodology) may qualify when the engineering work involves systematic evaluation of alternatives. The distinction is between executing labeling work (excluded) and engineering the labeling methodology under uncertainty (potentially qualifying).

Under IRC Section 41, research performed outside the United States does not generate qualified research expenses. Engineering and ML researcher wages paid to offshore employees and contractors are excluded regardless of the technical merit of the work. Only wages paid to U.S.-based employees and payments to U.S.-based contractors (at 65%) qualify. Many AI companies have a hybrid structure with U.S. senior researchers and offshore implementation teams. The U.S. spend is what generates the credit.

Qualifies Under Specific Conditions

Training compute and GPU costs: Cloud compute, GPU costs, and managed inference services qualify when consumed in qualifying engineering work (model training, ablation studies, hyperparameter sweeps, retrieval evaluation runs, custom inference benchmarking) and when those costs are specifically allocable to the qualifying activity. Routine production inference serving paying customers is excluded. The allocation methodology is documented in the engagement and is critical given Section 174 implications.

AI-native vertical product engineering: Companies building AI-native products in vertical domains (legaltech, healthtech, fintech, salestech, and others) qualify when the engineering layer between the user and the underlying language model involves genuine technical uncertainty. Custom retrieval architecture, custom evaluation frameworks against domain-specific accuracy benchmarks, and agent orchestration engineering all qualify even when the foundation model is third-party. Pure wrapper products without custom engineering at this layer do not qualify.

Outside contractor and consultant engagements: Qualifying U.S.-based contractor work is included at 65% of amounts paid when the hiring company retains substantial rights to the work product and bears the financial risk of the engagement. Standard outsourced ML or engineering work where the contractor retains IP rights or where the client does not bear research risk does not qualify.
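
The three expense categories above combine into a single qualified research expense (QRE) total. The following is a hedged Python sketch of that arithmetic. The compute tags, the per-employee `qualifying_fraction`, the class and function names, and all dollar figures are assumptions made for illustration; only the 65% contractor rate and the exclusion of production inference come from the text.

```python
from dataclasses import dataclass

CONTRACT_RESEARCH_RATE = 0.65  # share of eligible U.S. contractor payments counted

# Hypothetical job tags for compute consumed in qualifying engineering work.
QUALIFYING_COMPUTE_TAGS = {
    "training", "ablation", "hyperparameter_sweep",
    "retrieval_eval", "inference_benchmark",
}


@dataclass
class Employee:
    role: str
    us_wages: float
    qualifying_fraction: float  # share of time on qualifying activities, 0 to 1


@dataclass
class ComputeJob:
    tag: str
    cost: float


def wage_qre(employees: list[Employee]) -> float:
    return sum(e.us_wages * e.qualifying_fraction for e in employees)


def compute_qre(jobs: list[ComputeJob]) -> float:
    # Jobs with any other tag (e.g. production inference) are left out.
    return sum(j.cost for j in jobs if j.tag in QUALIFYING_COMPUTE_TAGS)


def contract_qre(us_contractor_payments: float) -> float:
    return CONTRACT_RESEARCH_RATE * us_contractor_payments


# Placeholder inputs, not real data.
employees = [
    Employee("ML researcher", 200_000, 0.8),
    Employee("Infrastructure engineer", 180_000, 0.5),
]
jobs = [
    ComputeJob("training", 40_000),
    ComputeJob("ablation", 10_000),
    ComputeJob("production_inference", 30_000),  # excluded
]
contractor_payments = 100_000

total_qre = wage_qre(employees) + compute_qre(jobs) + contract_qre(contractor_payments)
print(f"total QRE: {total_qre:,.0f}")
```

With these placeholder inputs the total is about 365,000: 250,000 of wages, 50,000 of compute, and 65,000 of contractor spend. The 30,000 of production inference never enters the total, which is why tagging compute jobs by purpose at the time they run matters for the allocation.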

Qualifying Activities

Where AI and ML R&D Credits Apply Across Sub-Sectors

Foundation model engineering, applied ML, computer vision, ML infrastructure, and AI-native product engineering all qualify when the work involves genuine technical uncertainty and a systematic evaluation of alternatives.

Each sub-sector below includes the qualifying activities, the typical expense breakdown, and the primary exclusion.

Foundation Model and LLM Engineering: Pretraining, Post-Training, Architecture Research

Pretraining engineering on novel transformer architectures, including attention mechanism research, mixture-of-experts engineering, custom positional encoding schemes, and architecture variants developed under technical uncertainty about convergence, scaling behavior, and benchmark performance
Post-training pipeline engineering including RLHF, DPO, GRPO, and other preference optimization methodologies, custom reward model architecture, and instruction tuning research developed under uncertainty about alignment behavior and capability retention
Custom tokenizer development, novel data mixture and curriculum research, and domain-specific pretraining corpus engineering developed under technical uncertainty about downstream model behavior
Evaluation harness engineering, benchmark suite development, and custom evaluation methodology research for foundation model capability assessment, including novel safety, alignment, and capability benchmarks built under uncertainty about measurement validity
Scaling-laws experimentation, ablation studies across architectural variants, and custom training infrastructure for very-large-model training developed under technical uncertainty about throughput, stability, and convergence

Does not qualify: Standard fine-tuning of off-the-shelf models against off-the-shelf datasets without architectural engineering, routine application of published training recipes, and ordinary engineering work on already-published model architectures without research investigation.

Applied ML and Production ML Systems: Recommenders, Ranking, Detection, Forecasting

Custom recommender system engineering including two-tower retrieval architecture, learned re-ranker development, candidate generation models, and multi-objective ranking systems developed under technical uncertainty about relevance, latency, and exploration-exploitation tradeoffs
Fraud detection, anomaly detection, and risk scoring model engineering including novel signal aggregation, custom feature engineering pipelines, and adversarially-robust ML systems developed under uncertainty about accuracy and adversarial behavior
Time-series forecasting and demand prediction architecture engineering, including custom sequence models, hierarchical forecasting frameworks, and probabilistic forecasting systems developed under technical uncertainty about accuracy and calibration
Custom A/B test infrastructure for ML models, including counterfactual evaluation frameworks, contextual bandit infrastructure, and offline evaluation methodology developed under uncertainty about decision quality
Production ML platform engineering including custom feature stores, model monitoring and drift detection systems, and ML observability tooling developed under technical uncertainty about correctness at scale

Does not qualify: Standard application of AutoML or off-the-shelf forecasting libraries without architectural engineering, routine model retraining on existing pipelines, and standard analytics work without methodology innovation.

Computer Vision and Multimodal: Medical Imaging, Document Understanding, Robotics, Satellite, Industrial

Custom vision architecture engineering for domain-specific perception tasks, including medical imaging segmentation and classification, radiology workflow automation, document understanding and OCR for complex layouts, and industrial inspection systems developed under technical uncertainty about domain accuracy
Multimodal fusion architecture engineering, including vision-language model engineering for domain tasks, image-text alignment research, and custom cross-modal attention systems developed under uncertainty about fusion strategy and downstream task performance
Custom training data pipeline engineering for vision tasks, including domain-specific encoder pretraining, novel data augmentation methodology, and synthetic data generation pipelines developed under uncertainty about generalization to real-world inputs
Robotics and autonomous systems perception engineering, including sensor fusion architecture, lidar and point cloud processing models, and sim-to-real transfer methodology developed under technical uncertainty about real-world deployment
Geospatial and satellite imagery ML engineering, including multi-spectral analysis, change-detection architecture, and large-area inference pipelines developed under uncertainty about accuracy and computational tractability

Does not qualify: Standard application of pretrained vision models against off-the-shelf datasets, routine OCR work using commercial APIs without novel methodology, and standard image classification without domain-specific architectural or evaluation engineering.

ML Infrastructure and Tooling: Distributed Training, Inference Serving, Vector DBs, MLOps

Distributed training infrastructure engineering, including custom FSDP and DeepSpeed configurations, novel parallelism strategies (tensor, pipeline, expert, sequence), and training reliability layers developed under technical uncertainty about throughput, memory, and stability
GPU efficiency and kernel engineering, including custom CUDA kernel development for identified bottlenecks, mixed-precision training infrastructure, and quantization-aware training systems developed under uncertainty about throughput and accuracy preservation
Custom inference serving infrastructure including vLLM-style optimizations, novel KV cache architecture, continuous batching engines, and speculative decoding systems developed under technical uncertainty about throughput, latency, and quality
Vector database and retrieval infrastructure engineering, including custom indexing strategies, novel approximate nearest neighbor algorithms, and hybrid sparse-dense retrieval architecture developed under uncertainty about accuracy and recall
MLOps platform and ML observability engineering, including custom training orchestration systems, model registry and lineage tracking architecture, and evaluation framework infrastructure developed under technical uncertainty about correctness at scale

Does not qualify: Standard application of out-of-the-box distributed training libraries without engineering, routine deployment of vendor-provided inference servers, and standard MLOps configuration using commercial platforms without architectural innovation.

AI-Native Product Engineering: RAG, Evaluation Frameworks, Agent Orchestration (Partial Eligibility)

Custom retrieval-augmented generation architecture engineering, including hybrid sparse-dense retrieval systems, query decomposition methodology, citation-aware re-ranking, and multi-stage retrieval pipelines developed under technical uncertainty about domain accuracy
Custom evaluation framework engineering for domain-specific accuracy benchmarks, including novel evaluation methodology, citation accuracy and faithfulness scoring, and domain expert agreement studies developed under uncertainty about measurement validity
Agent orchestration framework engineering, including multi-step reasoning architecture, tool use and function calling infrastructure, and stateful agent runtime developed under technical uncertainty about correctness and reliability
Custom prompt-pipeline architecture engineered under technical uncertainty about correctness, including systematic evaluation against defined accuracy benchmarks, novel pipeline composition methodology, and reliability infrastructure developed under uncertainty about output quality
Domain-specific fine-tuning under technical uncertainty, including custom instruction tuning data engineering, parameter-efficient adaptation research (LoRA, QLoRA, adapters), and continued pretraining for domain corpora developed under uncertainty about accuracy improvements

Does not qualify (the line): Pure API wrapper products that pass user input to a third-party model and return the response without custom retrieval, evaluation, or orchestration engineering. Standard LangChain or LlamaIndex assemblies that follow published patterns. Routine prompt engineering work without systematic evaluation against defined accuracy benchmarks. The qualifying engineering is the layer between the user and the model, when that layer involves genuine technical uncertainty.

Illustrative Scenarios

R&D Tax Credit Examples for AI and Machine Learning Companies

These are engineering teams that qualified without knowing they were doing R&D. The following scenarios illustrate how qualifying activities appear in real AI and ML company settings. Activity patterns and qualifying expense structures are drawn from typical engagement experience.

Scenario 1: Foundation Model Pretraining Team

When the Standard Attention Mechanism Could Not Reduce Training Compute Without Collapsing Quality on Reasoning Benchmarks

An 18-person foundation model research lab identified a research goal: reduce pretraining compute at fixed quality by engineering a novel sparse-and-grouped attention pattern. Their research and engineering team spent 14 months running the program, evaluating six alternative attention mechanisms (full attention baseline, two grouped query attention variants, two sparse attention variants, and a custom hybrid sparse-grouped pattern) across a defined benchmark suite of 12 reasoning, code, and long-context evaluation tasks. The team trained working implementations of each at multiple model sizes and ran ablation studies measuring compute, convergence behavior, and benchmark quality. Three of the six variants failed to converge on at least one reasoning benchmark. Two converged but underperformed baseline. The hybrid variant matched baseline at materially lower compute and was selected for further scaling work.

The team had no internal R&D classification for the work. They considered it "scaling research." But the documented technical uncertainty about whether any of the candidate attention mechanisms would meet the quality bar at reduced compute, the systematic evaluation across the 12-benchmark suite, and the failed training run logs all met the criteria for qualifying research expenses under IRC Section 41. Compute spend on pretraining and ablation runs was also material to the Section 174 capitalization analysis.

Qualifying Expenses

ML researcher, ML engineer, and infrastructure engineer wages allocated to architecture research and ablation hours across the 14-month program. Training compute costs allocated to pretraining runs and ablation studies on the 12-benchmark evaluation suite. Fees for an outside research consultant, included at 65%, with the lab retaining rights to the work product.

Key Documentation Signal

The architecture comparison spreadsheet showing compute, convergence behavior, and benchmark quality across all six candidate attention mechanisms against the 12-benchmark evaluation suite. Training run logs from the three failed variants. Ablation study writeups. This record demonstrated systematic evaluation of alternatives with measured outcomes, including the qualifying failed runs, not iterative tuning of a single chosen architecture.

Scenario 2: Applied ML Recommender Team

When the Legacy Collaborative Filtering System Could Not Hold Relevance and Latency Targets at Marketplace Scale

A $14,000,000 ARR consumer marketplace identified a structural problem during scaling: their legacy collaborative filtering recommender was hitting a relevance ceiling and breaching latency budgets at peak traffic. Their nine-person ML team spent nine months designing a custom two-tower retrieval architecture combined with a learned re-ranker, evaluating four alternative ranking architectures (neural collaborative filtering baseline, two-tower with shared embeddings, two-tower with item-tower transformer, and a hybrid two-stage retrieval and re-rank model) across six production cohorts via custom A/B test infrastructure. The team built bespoke feature engineering pipelines for marketplace-specific signals (seller velocity, price elasticity, inventory depth) and shipped working implementations of three of the four architectures to limited cohorts before final selection.

The team framed the work in product terms (relevance, conversion, latency) and never connected it to research framing. aecre's technical interview process identified the qualifying experimental structure across the nine-month program. The custom A/B test infrastructure built specifically for the architecture comparison, the systematic evaluation of alternatives across the six production cohorts, and the bespoke feature engineering work all met the criteria for qualifying research expenses under IRC Section 41.

Qualifying Expenses

ML engineer, applied scientist, and platform engineer wages allocated to architecture and feature engineering hours across the nine-month program. Training compute and inference compute costs allocated to ranking architecture experimentation and the A/B test cohort runs. Fees for an outside data science consultant, included at 65%, with the company retaining rights to the work product.

Key Documentation Signal

The A/B test result spreadsheet comparing relevance, latency, and conversion across the four candidate ranking architectures across the six production cohorts. Feature engineering decision log. Re-ranker architecture iteration notes. This record demonstrated systematic, alternative-driven evaluation rather than iterative tuning of a single chosen ranking model.

Scenario 3: Computer Vision and Multimodal Team

When Off-the-Shelf Vision Encoders Could Not Hold Diagnostic Accuracy on Radiology Findings That Required Domain Pretraining

A 12-person ML team at a clinical decision support company identified a research goal: engineer a custom vision-language architecture for radiology report generation from imaging studies, fusing image features with structured patient metadata. Their team spent 12 months on the program, evaluating five candidate architectures (an off-the-shelf vision encoder baseline with frozen weights, two custom encoder pretraining strategies, a multimodal fusion variant with cross-attention, and a hybrid pretrained-and-domain-adapted approach) against a held-out radiologist-labeled cohort of 4,200 studies spanning eight imaging modalities. The team designed a custom encoder pretraining stage on internal imaging data because off-the-shelf encoders underperformed on domain-specific findings. Multiple training runs failed to converge before the team landed on a workable fusion strategy. The radiologist-labeled benchmark cohort served as the defined evaluation set.

The team described the work as "improving model accuracy." aecre's technical interview process identified the qualifying experimental structure across the 12-month program. The systematic evaluation of five candidate architectures against the 4,200-study evaluation cohort, the failed training run logs, the custom encoder pretraining engineering, and the radiologist agreement studies all met the criteria for qualifying research expenses under IRC Section 41. Compute spend in pretraining and fine-tuning runs was material under the Section 174 lens.

Qualifying Expenses

ML engineer, applied scientist, and computer vision engineer wages allocated to architecture and pretraining hours across the 12-month program. Training compute costs allocated to encoder pretraining and architecture ablation studies. Outside radiologist annotation contractor fees included at 65%, with the company retaining rights to the labeled benchmark cohort.

Key Documentation Signal

Evaluation results across the five candidate architectures against the 4,200-study radiologist-labeled benchmark. Failed training run postmortems. Radiologist agreement studies on the labeled cohort. This record demonstrated systematic evaluation of alternatives with measured outcomes including the qualifying failed runs, not iterative refinement of a single chosen architecture.

Scenario 4: ML Infrastructure and Distributed Training Team

When Standard FSDP Could Not Hold Memory Bounds for the Target Model Class Without a Novel Parallelism Strategy

A 16-engineer ML platform team at an AI infrastructure company identified a research goal: build training infrastructure to support models exceeding standard FSDP memory bounds at the team's target throughput. Their team spent 10 months engineering a novel parallelism strategy combining tensor parallelism, pipeline parallelism, and a custom expert parallelism scheme for MoE workloads. They benchmarked throughput and memory across seven candidate parallelism configurations on representative model sizes spanning three orders of magnitude in parameter count. The team built custom CUDA kernels for two specific bottlenecks where standard libraries left throughput on the table, and ran ablation studies on each bottleneck independently to attribute the throughput gain. Three of the seven parallelism configurations failed to converge or hit out-of-memory errors at the target model size. The hybrid configuration combining all three parallelism dimensions met the throughput target.

The team described the work as "scaling the training stack." aecre's technical interview process identified the qualifying experimental structure across the 10-month program. The systematic evaluation of seven parallelism configurations across representative model sizes, the kernel performance comparison against baseline libraries, the ablation studies, and the failed configuration logs all met the criteria for qualifying research expenses under IRC Section 41. Training compute spend on benchmark runs was material under the Section 174 lens.

Qualifying Expenses

ML platform engineer, infrastructure engineer, and CUDA kernel engineer wages allocated to parallelism and kernel engineering hours across the 10-month program. Training compute costs allocated to the parallelism configuration benchmarks across representative model sizes. Outside high-performance computing consultant fees included at 65%, with the company retaining rights to the results.

Key Documentation Signal

Throughput and memory benchmark across all seven candidate parallelism strategies on representative model sizes. Kernel performance comparison vs baseline libraries. Failed parallelism configuration logs. Ablation studies attributing the throughput gain to each engineering change. This record demonstrated systematic evaluation of alternatives with measured outcomes including the qualifying failures, not iterative tuning of a single chosen configuration.

Scenario 5: AI-Native Product Engineering Team (Legaltech)

When Off-the-Shelf RAG Patterns Could Not Reach the Citation-Accuracy Bar That Attorney Users Would Tolerate

An 11-person engineering team at a legaltech AI product company identified a product gap during enterprise pilots: their off-the-shelf RAG pipeline using a third-party language model for generation was hitting a citation accuracy ceiling around 60% on their internal benchmark, far below the bar their attorney users would tolerate. Hallucinated citations were unacceptable. Their team spent 11 months engineering a custom multi-stage retrieval architecture (hybrid sparse-dense retrieval, query decomposition, citation-aware re-ranking) and built a custom evaluation framework that scored citation accuracy and answer faithfulness against an attorney-labeled benchmark cohort of 1,400 question-answer-citation triples. They evaluated four candidate retrieval strategies against the cohort. Multiple architectures shipped to limited internal pilots before final selection.

The qualifying engineering work was the retrieval architecture, the custom evaluation framework, and the citation-aware re-ranking, not the use of the third-party language model for generation. The systematic evaluation of four candidate retrieval strategies against the attorney-labeled 1,400-triple cohort, the custom evaluation framework engineering, and the documented technical uncertainty about whether any candidate architecture would meet the citation accuracy bar all met the criteria for qualifying research expenses under IRC Section 41. aecre's technical interview process drew the line between the qualifying engineering layer and the third-party model layer in the documentation file.

Qualifying Expenses

ML engineer, applied scientist, and platform engineer wages allocated to retrieval architecture, evaluation framework, and re-ranking engineering hours across the 11-month program. Compute costs allocated to retrieval indexing, embedding generation for the evaluation cohort, and the candidate retrieval strategy benchmark runs. Outside legal annotation contractor fees included at 65%, with the company retaining rights to the labeled benchmark cohort.

Key Documentation Signal

Retrieval accuracy and end-to-end task accuracy across the four candidate retrieval architectures against the 1,400-triple attorney-labeled evaluation cohort. Custom evaluation framework specification. Citation accuracy comparison versus baseline RAG. This record demonstrated systematic, alternative-driven evaluation of the qualifying engineering layer (retrieval, evaluation, re-ranking), drawing the line between custom engineering work that qualifies and the third-party generation layer that does not.
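To make the shape of such a comparison concrete, here is a minimal sketch of scoring candidate retrieval strategies against a labeled benchmark. It assumes citation accuracy means an exact match between the predicted and attorney-labeled citation, which is a simplification and not necessarily how the team's framework scored it. The strategy names, question IDs, and citation strings are placeholders.

```python
def citation_accuracy(predictions, labels):
    """Fraction of benchmark questions whose predicted citation exactly matches the label."""
    if not labels:
        raise ValueError("benchmark is empty")
    hits = sum(1 for qid, cite in labels.items() if predictions.get(qid) == cite)
    return hits / len(labels)


# Placeholder benchmark: question ID -> labeled citation.
labels = {"q1": "Case A", "q2": "Case B", "q3": "Statute C", "q4": "Case D"}

# Placeholder outputs from two hypothetical candidate strategies.
candidates = {
    "baseline_rag": {"q1": "Case A", "q2": "Case X", "q3": "Statute C", "q4": "Case Y"},
    "hybrid_rerank": {"q1": "Case A", "q2": "Case B", "q3": "Statute C", "q4": "Case Y"},
}

scores = {name: citation_accuracy(preds, labels) for name, preds in candidates.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

Keeping every candidate's score, including the ones that lost, is what turns a metric like this into evidence of alternative-driven evaluation.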

Common Questions

R&D Tax Credit Questions for AI and ML Companies

Do AI and ML companies qualify for the R&D tax credit?

Yes. AI and machine learning engineering qualifies for the R&D tax credit under IRC Section 41 when the team is resolving genuine technical uncertainty through systematic experimentation. Foundation model research, applied ML systems, computer vision, ML infrastructure, and AI-native product engineering all routinely meet the four-part test. Most AI companies underclaim because the engineering work feels like ordinary product development rather than research.

We use OpenAI, Anthropic, or other AI APIs in our product. Does that affect our R&D credit eligibility?

It depends on what your team is engineering on top of those APIs. Pure API wrapper products that pass user input to a third-party language model (the OpenAI API, Anthropic's API, Google's, or others) and return the response do not qualify, because the qualifying engineering work happens at the model provider, not at your company. Custom retrieval architecture engineering, evaluation framework engineering, agent orchestration engineering, fine-tuning under technical uncertainty, and prompt-pipeline architecture all qualify even when the underlying language model is third-party. The qualifying work is the engineering layer between the user and the model, when that layer involves genuine technical uncertainty about whether the chosen approach will meet the accuracy or performance bar. The line is drawn around the engineering, not around the choice of foundation model.

What kinds of AI and ML development qualify as R&D?

Work qualifies when it involves technical uncertainty, systematic experimentation, and a permitted purpose related to developing or improving software functionality, performance, or reliability. Foundation model and LLM engineering, applied ML systems including custom recommenders and ranking models, computer vision and multimodal architecture engineering for domain tasks, distributed training and inference infrastructure, custom RAG architectures and evaluation frameworks, and fine-tuning under technical uncertainty all qualify. Standard application of off-the-shelf models without architectural engineering, basic prompt engineering, and routine data labeling do not. See the full activity breakdown above for the complete view by sub-sector.

How does Section 174 amortization affect AI training costs and the credit value?

Section 174 requires capitalization and amortization of research and experimental expenditures over 5 years for domestic R&D and 15 years for foreign R&D. This applies to most AI engineering costs including engineer and ML researcher wages and qualifying training compute, and significantly increases current-year taxable income for companies with heavy training spend. The R&D credit partially offsets this impact by directly reducing tax liability on the same expenses. For AI companies in compute-heavy years, the credit has become more valuable post-Section 174, not less. Documentation of which engineering activities qualify under Section 174 dovetails directly with R&D credit substantiation: the same engineering analysis supports both.
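A short worked example shows why amortization raises current-year taxable income. The sketch below assumes straight-line amortization that begins at the midpoint of the first tax year, so the first year carries only half of a full year's deduction and the schedule runs one tax year longer than the stated period. The $1,000,000 spend figure is hypothetical.

```python
def amortization_schedule(amount, years):
    """Straight-line amortization starting at the midpoint of the first year.

    Returns one deduction per tax year. The half-year start spreads the
    deductions over years + 1 tax years.
    """
    full_year = amount / years
    return [full_year / 2] + [full_year] * (years - 1) + [full_year / 2]


# Hypothetical $1,000,000 of research spend.
domestic = amortization_schedule(1_000_000, 5)
foreign = amortization_schedule(1_000_000, 15)

print(f"Domestic first-year deduction: ${domestic[0]:,.0f}")
print(f"Foreign first-year deduction:  ${foreign[0]:,.0f}")
```

Under these assumptions, only a tenth of the domestic spend is deductible in the year the cash goes out, which is the gap the credit partially offsets.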

Do GPU costs and cloud compute qualify as R&D expenses?

Cloud compute used in qualifying research activities is generally treated as a qualifying supply expense. Training runs, ablation studies, hyperparameter sweeps, and inference work performed in support of qualifying R&D activities all support QRE inclusion. The compute must be consumed in qualifying work, not in routine production inference for paying customers. Documentation of the connection between compute spend and qualifying engineering activities is critical, particularly given Section 174 implications. The allocation methodology is documented in every engagement, and compute is one of the highest-leverage QRE categories for compute-heavy AI companies.
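One way to picture the allocation is a split of compute spend by project tag. The sketch below is only an illustration of that idea: the job records, project names, GPU-hour counts, and hourly rates are all hypothetical, and a real allocation would start from billing exports and job metadata rather than a hand-written list.

```python
from collections import defaultdict


def allocate_compute(jobs, qualifying_projects):
    """Split compute spend into qualifying and non-qualifying totals by project tag."""
    totals = defaultdict(float)
    for job in jobs:
        bucket = "qualifying" if job["project"] in qualifying_projects else "non_qualifying"
        totals[bucket] += job["gpu_hours"] * job["rate_per_gpu_hour"]
    return dict(totals)


# Hypothetical job records for illustration only.
jobs = [
    {"project": "encoder-pretraining", "gpu_hours": 1000, "rate_per_gpu_hour": 2.0},
    {"project": "ablation-study", "gpu_hours": 500, "rate_per_gpu_hour": 2.0},
    {"project": "prod-inference", "gpu_hours": 6000, "rate_per_gpu_hour": 1.5},
]
split = allocate_compute(jobs, {"encoder-pretraining", "ablation-study"})
print(split)
```

The routine production inference lands in the non-qualifying bucket, matching the distinction drawn above.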

Our team uses open-source models like Llama. Does fine-tuning or applying them qualify?

It depends on the engineering work involved. Standard fine-tuning of an open-source model like Llama against an off-the-shelf dataset using established methodology generally does not qualify. Fine-tuning that involves systematic experimentation across architectures, custom training data engineering, novel evaluation methodology, or technical uncertainty about whether the fine-tuned model will meet the accuracy bar does qualify. The same logic applies to retrieval-augmented generation, custom evaluation frameworks, and downstream engineering built on top of open-source foundation models. The open-source nature of the underlying model is not the deciding factor. The presence of genuine engineering uncertainty and systematic evaluation of alternatives is.

How is the R&D credit calculated for AI companies?

The credit is calculated on qualifying research expenses (QREs), which include domestic engineer and ML researcher wages, qualifying supplies, training and research compute consumed in qualifying work, and 65% of qualifying U.S. contractor payments. There are two methods: the Regular Credit Method (20% of QREs above a base amount) and the Alternative Simplified Credit (14% of QREs above 50% of the prior three-year average). For newer AI companies without a long QRE history, the ASC method often produces a higher credit. aecre runs both calculations and recommends the better outcome.
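The two formulas can be sketched in a few lines. This is a simplified illustration of the percentages stated above, not a full computation: the regular method's base amount is taken as a given input because deriving it is outside the scope of this sketch, the ASC function assumes the company has QREs in each of the three prior years, and all dollar figures are hypothetical.

```python
def asc_credit(current_qre, prior_three_years):
    """Alternative Simplified Credit: 14% of QREs above 50% of the prior three-year average."""
    if len(prior_three_years) != 3:
        raise ValueError("expected QREs for exactly three prior years")
    base = 0.5 * sum(prior_three_years) / 3
    return 0.14 * max(current_qre - base, 0.0)


def regular_credit(current_qre, base_amount):
    """Regular Credit Method: 20% of QREs above a base amount supplied by the caller."""
    return 0.20 * max(current_qre - base_amount, 0.0)


# Hypothetical figures for illustration only.
current = 1_000_000
asc = asc_credit(current, [600_000, 700_000, 800_000])
regular = regular_credit(current, base_amount=800_000)

print(f"ASC credit:     ${asc:,.0f}")
print(f"Regular credit: ${regular:,.0f}")
print("Higher method:", "ASC" if asc > regular else "Regular")
```

With these made-up inputs the ASC comes out ahead, but the comparison depends entirely on the company's actual QRE history and base amount.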

Can AI companies claim both federal and state R&D credits?

Yes. Nearly 40 states offer R&D tax credits that stack on top of the federal credit. The state credit is calculated separately and uses the same qualifying expense base in most states. AI companies operating across multiple states may qualify for credits in each state where qualifying engineering or research work occurs. The combined federal and state credit significantly increases the total benefit.

What does the aecre process look like for an AI or ML company?

The process starts with a 30-minute feasibility assessment at no cost. If the assessment confirms qualifying activities, we conduct a technical interview with your ML engineers and research leads to identify and document QREs. The interview is conducted by people who can read training run logs, ablation studies, and architecture decision records. We work from your existing artifacts: training logs, evaluation results, architecture decision records, design documents, and engineering memoranda. The documentation package is built to withstand IRS scrutiny and includes a technical narrative, permitted purpose statements, and contemporaneous project documentation. The entire process is partner-led and does not require your team to produce documentation from scratch. Note: AI-native product companies that overlap with multi-tenant SaaS engineering may also benefit from reviewing the SaaS niche page for the multi-tenant and integration angle.
