ArcellAI Memo

by the ArcellAI founding team. Learn more at arcellai.com or view the pitch deck.


Overview

ArcellAI is the autonomous data-engineering layer that makes AI-for-science actually work in production.

There's no fundamental reason R&D teams should waste 80% of their time on data plumbing. "AI deployments fail 75–90% of the time in specialized domains because biological data is just hard" is a lie we've been telling ourselves to avoid admitting the real problem: nobody built the infrastructure layer.

The data-engineering bottleneck has existed for decades. Foundation models exist. Vertical AI labs build beautiful molecule generators. But between the two sits a vast, unaddressed chasm: fragmented multi-modal datasets, zero provenance tracking, irreproducible pipelines, and manual ETL hell that strangles every ambitious R&D application before it reaches production.

The bioinformatics consulting services market is projected to grow four-fold from its 2024 size to over $250 billion by 2035, a clear signal that adequate tooling to automate these tasks does not exist.

Agents changed the equation overnight. Suddenly the thing that required PhD-level bioinformaticians writing thousands of lines of brittle Python became a natural language conversation with an autonomous system that understands scientific context, manages provenance graphs, orchestrates hardware integrations, and generates reproducible workflows end-to-end.

We're building the future where deep-tech R&D moves at software speed.

End result: complete dissolution of the bioinformatics consulting industry. When any researcher can describe an experimental objective and receive a fully automated, provenance-tracked, hardware-integrated pipeline thirty minutes later—complete with feature engineering, statistical validation, and interactive dashboards—we've won.


Real example: Drug sensitivity prediction with VC-GPT

Instead of abstract promises, here's a concrete workflow we run end-to-end using VC-GPT, our flagship agentic product. This example leverages PyTDC (Therapeutics Data Commons), a standard for machine learning in therapeutics, to benchmark drug response predictions.

  • Objective: "Evaluate a drug sensitivity prediction pipeline across multiple datasets (e.g., GDSC, CTRP) with proper splits, metrics, and error bars."
  • Data Sources: VC-GPT autonomously ingests data from public repositories (TDC), enterprise assays (SQL/Snowflake), and instrument telemetry (lab robotics APIs).
  • Execution:
    • Planner: Decomposes the request into steps: data loading → schema harmonization → feature engineering (molecular fingerprints) → model training → evaluation.
    • Executor: Writes and runs code to execute these steps in a secure sandbox, resolving dependencies and handling errors.
    • Semantic Layer: Tracks lineage of every transformation. It knows that "IC50" from one source must be log-transformed to match "LN_IC50" from another.
  • Result: A fully reproducible pipeline, versioned in our Provenance Graph, with interactive dashboards comparing model performance. What took days of manual data wrangling now takes minutes.
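
For concreteness, below is a minimal sketch of the kind of pipeline this run produces, written against TDC's public DrugRes loader. The Morgan-fingerprint featurization, random-forest model, and three-seed error bars are illustrative stand-ins chosen for brevity, not our production components; a real run would also use the cell-line expression features, and column names should be checked against your installed PyTDC version.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from tdc.multi_pred import DrugRes  # public TDC drug-response loader


def fingerprint(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """Feature engineering step: Morgan fingerprint of the drug's SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))


data = DrugRes(name="GDSC1")
scores = []
for seed in (1, 2, 3):  # repeated splits supply the error bars
    split = data.get_split(seed=seed)
    train, test = split["train"], split["test"]
    X_tr = np.stack([fingerprint(s) for s in train["Drug"]])
    X_te = np.stack([fingerprint(s) for s in test["Drug"]])
    model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=seed)
    model.fit(X_tr, train["Y"])  # "Y" holds the (log-scaled) drug response
    r, _ = pearsonr(test["Y"], model.predict(X_te))
    scores.append(r)

print(f"Pearson r = {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```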

PyTDC: The Standard for Therapeutics ML

PyTDC provides the benchmark datasets and tasks that power this workflow. By integrating PyTDC's rigorous standards with our agentic orchestration, we ensure that every result is not just fast, but scientifically valid. Read our ICML '25 paper on the underlying technology.


What the platform delivers

  • Agentic data engineering: Automates ingestion → cleaning/transformation → lineage → orchestration across your R&D stack.
  • Context-aware reasoning: Domain-specific intelligence via context engineering and safe tool use.
  • Provenance and reproducibility: Versioned datasets, transformation lineage, and auditable workflows captured in a provenance graph.
  • Self-driving semantic layer: Defines and centralizes research metrics, experimental KPIs, and statistical calculations for consistency and governance.
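
To make the last bullet concrete, here is a hypothetical sketch of what a governed metric definition in the semantic layer might look like. `Metric`, `register`, and the ontology ID are illustrative, not our shipped API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Metric:
    """A centrally governed research metric (hypothetical shape)."""
    name: str
    unit: str
    ontology_term: str              # link into a domain ontology
    expression: str                 # canonical formula applied by the executor
    aliases: tuple = field(default_factory=tuple)

REGISTRY: dict[str, Metric] = {}

def register(metric: Metric) -> None:
    """Every team resolves the same name or alias to one governed definition."""
    REGISTRY[metric.name] = metric
    for alias in metric.aliases:    # e.g. "IC50" and "LN_IC50" resolve here
        REGISTRY[alias] = metric

register(Metric(
    name="ln_ic50",
    unit="ln(uM)",
    ontology_term="BAO:0000190",    # assumed ontology ID, for illustration
    expression="log(ic50_um)",
    aliases=("IC50", "LN_IC50"),
))
```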

How it works

  • Planner → Executor → Critic architecture composes multi-step workflows from standardized tools and integrations.
  • Every run strengthens your provenance graph and semantic layer, improving future plans and enabling consistent reporting across teams.
  • Model-agnostic: Invoke classical stats or foundation models as steps within governed, reproducible pipelines.

Product & technical breakthroughs

We provide the missing infrastructure layer for AI-driven science. Our platform combines agentic reasoning with rigorous data engineering to solve the "garbage in, garbage out" problem in specialized R&D domains.

  • Semantic Data Layer with Provenance Graph: Based on our ICML 2025 publication, this is a living knowledge graph that tracks every transformation from raw instrument output to final analysis. It encodes domain-specific ontologies, allowing agents to reason about scientific meaning (e.g., "this column is a gene expression value") rather than just schema types.
  • Planner-Executor-Critic Architecture: A robust agentic loop that plans multi-step scientific workflows, executes code in secure sandboxes, and critiques outputs against domain constraints. This prevents the hallucination and fragility common in general-purpose LLM coding tools.
  • Automated Provenance Tracking: Every step of ingestion, transformation, and modeling is automatically versioned. This creates an audit trail essential for regulatory compliance and scientific reproducibility, without requiring manual logging by scientists; a minimal sketch of the mechanism follows this list.
  • Hardware & API Integration: We don't just process static files. ArcellAI integrates with lab automation hardware, enterprise data lakes (Snowflake, Databricks), and public repositories (PyTDC, ChEMBL) to unify the fragmented R&D data ecosystem.
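
To ground the provenance-tracking bullet above, here is a minimal, hypothetical sketch of the mechanism: wrap each transformation so that running it appends a versioned lineage node automatically. The in-memory list and all names are illustrative; the production engine persists to the Provenance Graph.

```python
import hashlib
import json
import math
import time
from functools import wraps

GRAPH = []  # in production: a persistent provenance graph store

def _digest(obj) -> str:
    """Content hash used to version inputs and outputs."""
    return hashlib.sha256(json.dumps(obj, default=str).encode()).hexdigest()[:12]

def tracked(step):
    """Record every call as a lineage node, with no manual logging required."""
    @wraps(step)
    def wrapper(data, **params):
        out = step(data, **params)
        GRAPH.append({
            "step": step.__name__,
            "params": params,
            "input": _digest(data),
            "output": _digest(out),
            "ts": time.time(),
        })
        return out
    return wrapper

@tracked
def log_transform(values):
    # e.g. harmonizing a raw "IC50" column with a source that reports "LN_IC50"
    return [math.log(v) for v in values]

harmonized = log_transform([0.5, 2.0, 8.0])
print(GRAPH[-1])  # the lineage node recorded for this transformation
```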

Market opportunity

We are capitalizing on a massive structural gap in the R&D market, where data complexity is outpacing human capacity.

  • TAM: $500B (AI in R&D)
  • SAM: $250B (AI in physical sciences and engineering)
  • Beachhead: $25B (AI in biotechnology)
  • Secondary Beachhead: $80B (Healthcare Big Data analytics)

The market signal is unmistakable: the bioinformatics consulting sector is projected to quadruple by 2035. This explosion is a symptom of failure—companies are hiring armies of consultants to do manual data plumbing because adequate software doesn't exist. ArcellAI replaces this manual service layer with scalable, autonomous software.


Our actual moat

Competitors can copy features, but they cannot replicate the proprietary knowledge graph of transformations, ontologies, and hardware quirks built by our system over thousands of runs.

  • The Provenance Moat: Every pipeline run strengthens our graph, locks in users, and raises switching costs. The more you use ArcellAI, the more it "knows" your specific experimental context, making leaving costlier than staying.
  • The Integration Moat: Deep ties into enterprise hardware and lab systems create high switching costs. We don't just read CSVs; we integrate with the machines that generate them.
  • The Data Flywheel: More data → smarter agents → better pipelines → more users → more data. We are building the definitive dataset of how science is done.

Use cases

BioAI: Drug sensitivity prediction (PyTDC), single-cell analysis, assay harmonization, lab robotics data, multi-omic integration.

Physical AI / Manufacturing: Sensor fusion, yield analysis, root-cause studies with governed experimentation.

AI4Science: Materials and chemistry pipelines with reproducible metrics and versioned datasets.

Clinical R&D: Automated normalization across clinical formats, pipeline traceability for regulatory submissions, reproducible analytics for clinical trials, and harmonization of multi-site data for biostats teams. Target users: clinical research informatics teams at academic medical centers, CROs managing trial data operations, med-tech companies integrating diagnostics/imaging/wearables hardware, and biopharma clinical development groups preparing data for FDA submissions.

AI4Engineering / Robotics: Closed-loop experiment logs, control data, and adaptive pipeline orchestration.


Product: VC-GPT (Virtual Cells analytics agent)

  • Built on research presented at ICML '25, with domain-specific reasoning via in-context learning and standardized tools/APIs.
  • Autonomous, reproducible workflows and integrations with the bioengineering hardware stack.

Technical Stack

Our architecture is built to handle the rigorous demands of scientific R&D, from massive datasets to reproducible compute.

Foundation Models

We leverage state-of-the-art LLMs for reasoning and code generation, fine-tuned on scientific code and literature. Our system is model-agnostic, allowing us to swap in the best model for the task (e.g., Claude 3.5 Sonnet for complex planning, GPT-4o for data synthesis).
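
In practice, model-agnostic routing can be as small as a task-to-model table. The sketch below is illustrative: the model identifiers come from the text above, and `call_llm` is a placeholder for whichever provider SDK is wired in, not an ArcellAI API.

```python
TASK_MODEL = {
    "planning": "claude-3-5-sonnet",  # complex multi-step decomposition
    "synthesis": "gpt-4o",            # faster data-synthesis subtasks
}

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: wire this to your provider SDK of choice."""
    raise NotImplementedError

def complete(task: str, prompt: str) -> str:
    # Swapping models is a config change, not a code change.
    return call_llm(TASK_MODEL[task], prompt)
```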

Agent Framework

Our proprietary Planner-Executor-Critic framework orchestrates multi-step workflows.

  • Planner: Decomposes high-level scientific goals into executable steps.
  • Executor: Runs code in isolated sandboxes with access to domain-specific tools.
  • Critic: Validates outputs against scientific constraints (e.g., statistical significance, schema compliance).
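
The control flow is easier to see in code than in prose. Below is a minimal, hypothetical rendering of the loop with each role stubbed by toy logic; in production the planner is LLM-driven and execution happens in an isolated sandbox.

```python
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    code: str  # generated code, normally run inside a sandbox

def plan(goal: str) -> list[Step]:
    # Toy decomposition; the real planner derives steps from the goal.
    return [Step("load data", "data = load()"),
            Step("evaluate model", "report = evaluate(data)")]

def execute(step: Step) -> dict:
    # Stand-in for sandboxed execution of step.code.
    return {"step": step.description, "ok": True}

def critique(step: Step, result: dict) -> str | None:
    # Return a revision note when a domain constraint fails, else None
    # (e.g. schema mismatch, statistics below the significance threshold).
    return None if result["ok"] else "constraint violated; revise step"

def run(goal: str, max_revisions: int = 3) -> list[dict]:
    results = []
    for step in plan(goal):
        for _ in range(max_revisions):
            result = execute(step)
            note = critique(step, result)
            if note is None:  # passed all domain checks
                break
            step.code += f"\n# revision hint: {note}"
        results.append(result)
    return results

print(run("evaluate a drug sensitivity pipeline"))
```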

Data Infrastructure

  • Semantic Layer: A knowledge graph that maps raw data schemas to scientific ontologies.
  • Provenance Engine: Automatically tracks data lineage, transformation logic, and execution metadata for every run.
  • Hybrid Storage: Optimized for high-throughput biological data (Parquet/Arrow) and complex metadata queries (Graph/Relational).
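
A small sketch of the hybrid-storage split, assuming pyarrow for the columnar side; the table fields, values, and path are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# High-throughput measurements: columnar, compressed, scan-friendly.
table = pa.table({
    "cell_line": ["A375", "HT29"],
    "drug": ["dabrafenib", "lapatinib"],
    "ln_ic50": [-2.31, 0.87],
})
pq.write_table(table, "assay_batch_0001.parquet")

# Metadata and lineage pointers: small records suited to graph/relational stores.
run_metadata = {
    "artifact": "assay_batch_0001.parquet",
    "schema_version": 3,
    "provenance_node": "ingest/gdsc/2025-06-01",
}
```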

Scientific Computing

We integrate deeply with the PyData stack (pandas, numpy, scikit-learn) and domain-specific libraries (PyTDC, Scanpy, RDKit). All computation occurs in containerized environments to ensure reproducibility across different infrastructure.

Deployment

ArcellAI can be deployed as a managed SaaS or within a customer's VPC for maximum security. We support major cloud providers (AWS, GCP, Azure) and integrate with on-premise HPC clusters for large-scale jobs.


Development Resources & Requirements

To build and extend the ArcellAI platform, we rely on a modern stack and a set of key resources.

Core Requirements

  • Python 3.10+: The backbone of our backend and agent logic.
  • Docker/Kubernetes: For containerization and orchestration of agent sandboxes.
  • PostgreSQL + pgvector: For relational data and vector embeddings.
  • Next.js / React: For our conversational interface and dashboards.

Key Libraries

  • LangChain / LangGraph: For agent orchestration and state management; a toy wiring sketch follows this list.
  • PyTDC: Our standard for therapeutics machine learning tasks.
  • FastAPI: High-performance API framework for our backend services.
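
To show how these libraries fit together, here is a toy wiring of the Planner → Executor → Critic loop onto LangGraph's StateGraph. Node bodies are stubs; only the graph construction reflects real LangGraph usage, and the API should be verified against your installed version.

```python
from typing import TypedDict
from langgraph.graph import END, StateGraph

class PipelineState(TypedDict):
    goal: str
    plan: str
    result: str
    approved: bool

def planner(state: PipelineState) -> dict:
    return {"plan": f"steps for: {state['goal']}"}  # stub

def executor(state: PipelineState) -> dict:
    return {"result": f"ran: {state['plan']}"}      # stub

def critic(state: PipelineState) -> dict:
    return {"approved": True}  # the real critic checks domain constraints

builder = StateGraph(PipelineState)
builder.add_node("planner", planner)
builder.add_node("executor", executor)
builder.add_node("critic", critic)
builder.set_entry_point("planner")
builder.add_edge("planner", "executor")
builder.add_edge("executor", "critic")
builder.add_conditional_edges(
    "critic",
    lambda s: END if s["approved"] else "planner",  # loop back on rejection
)

graph = builder.compile()
print(graph.invoke({"goal": "benchmark drug sensitivity",
                    "plan": "", "result": "", "approved": False}))
```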

Access & Documentation

  • API Documentation: Comprehensive guides for integrating ArcellAI into your existing LIMS or ELN.

  • SDK: Python and TypeScript SDKs for programmatic access to agent workflows; a hypothetical usage sketch follows this list.

  • Community: Join our Discord for support and to connect with other computational biologists and engineers.

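For a feel of the SDK ergonomics, here is a purely hypothetical usage sketch: the `arcellai` package name, `Client`, and every method below are illustrative placeholders, not a published API.

```python
from arcellai import Client  # hypothetical package and import

client = Client(api_key="...")  # credential elided

run = client.workflows.run(
    objective="Benchmark drug sensitivity prediction on GDSC1 with error bars",
    sources=["tdc://DrugRes/GDSC1", "snowflake://assays.prod"],  # illustrative URIs
)

print(run.status)          # streamed as the planner/executor/critic iterate
print(run.provenance.url)  # link into the provenance graph for this run
```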


Why now

Agentic systems unlock the last mile of AI for science and engineering by automating the data-engineering bottleneck with governance built-in. As organizations move from pilots to production, the winners will have an agentic data layer with provenance, semantics, and reproducibility at the core.


Team

We are an all-MIT CS founding team with deep expertise in AI research and large-scale data infrastructure.

  • Alex Velez-Arce (Founder & CEO/CTO): AI research & strategy at FAANG+, MIT CS, Harvard BioAI. Formerly SWE at Pinterest and AI researcher at Harvard. Built data & ML products accounting for $100M+ in revenue.
  • Jesus Caraballo (Founding Engineer): AI research at Harvard Medical School, backend engineering, MIT Computer Science. Open-sourced virtual cells AI platform with 30K+ MAU.

Our work is peer-reviewed and published at top AI venues including ICML and NeurIPS.


Contact

To pilot ArcellAI for your R&D workflows—biotech, manufacturing, robotics, or materials—reach out at kaela@arcell.ai.