Machine Learning Data Engineer - Systems & Retrieval
Company: Zyphra
Location: Palo Alto
Posted on: February 15, 2026
Job Description:
Zyphra is an artificial intelligence company based in Palo Alto,
California.

The Role: As a Machine Learning Data Engineer - Systems & Retrieval,
you will build and optimize the data infrastructure that fuels our
machine learning systems. This includes designing high-performance
pipelines for collecting, transforming, indexing, and serving
massive, heterogeneous datasets, from raw web-scale data to
enterprise document corpora. You’ll play a central role in
architecting retrieval systems for LLMs and enabling scalable
training and inference with clean, accessible, and secure data.
You’ll have an impact across both research and product teams by
shaping the data foundation on which intelligent systems are trained
and from which they retrieve and reason.

You’ll work across:
- Designing and implementing distributed data ingestion and
  transformation pipelines
- Building retrieval and indexing systems that support RAG and other
  LLM-based methods
- Mining and organizing large unstructured datasets, in both research
  and production environments
- Collaborating with ML engineers, systems engineers, and DevOps to
  scale pipelines and observability
- Ensuring compliance and access control in data handling, with
  security and auditability in mind

Requirements:
- Strong software engineering background with fluency in Python
- Experience designing, building, and maintaining data pipelines in
  production environments
- Deep understanding of data structures, storage formats, and
  distributed data systems
- Familiarity with indexing and retrieval techniques for large-scale
  document corpora
- Understanding of database systems (SQL and NoSQL), their internals,
  and performance characteristics
- Strong attention to security, access controls, and compliance best
  practices (e.g., GDPR, SOC 2)
- Excellent debugging, observability, and logging practices to
  support reliability at scale
- Strong communication skills and experience collaborating across ML,
  infra, and product teams

Bonus Skill Set:
- Experience building or maintaining LLM-integrated retrieval systems
  (e.g., RAG pipelines)
- Academic or industry background in data mining, search,
  recommendation systems, or the IR literature
- Experience with large-scale ETL systems and tools like Apache Beam,
  Spark, or similar
- Familiarity with vector databases (e.g., FAISS, Weaviate, Pinecone)
  and embedding-based retrieval
- Understanding of data validation and quality assurance in machine
  learning workflows
- Experience working on cross-functional infra and MLOps teams
- Knowledge of how data infrastructure supports training pipelines,
  inference serving, and feedback loops
- Comfort working across raw, unstructured data, structured
  databases, and model-ready formats

Why Work at Zyphra:
- Our research methodology is to take grounded, methodical steps
  toward ambitious goals. Deep research and engineering excellence
  are equally valued
- We strongly value new and crazy ideas and are very willing to bet
  big on them
- We move as quickly as we can; we aim to keep the bar to impact as
  low as possible
- We all enjoy what we do and love discussing AI

Benefits and Perks:
- Comprehensive medical, dental, vision, and FSA plans
- Competitive compensation and 401(k)
- Relocation and immigration support on a case-by-case basis
- On-site meals prepared by a dedicated culinary team; Thursday Happy
  Hours
- In-person team in Palo Alto, CA, with a collaborative, high-energy
  environment

If you’re excited by the challenge of high-scale, high-performance
data engineering in the context of cutting-edge AI, you’ll thrive in
this role. Apply Today!
Keywords: Zyphra, Ceres, Machine Learning Data Engineer - Systems & Retrieval, IT / Software / Systems, Palo Alto, California