The Role
We're seeking a Data Engineer to transform large-scale geospatial datasets into structured, reliable, and accessible formats that power Mach9's ML and product pipelines. You'll work with high-volume data sources - laser scan point clouds, imagery, and a long tail of geospatial formats - and own the systems that get them ingested, standardized, stored, and made available for training, perception, and production use in a consistent and efficient way.
This role sits at the front of everything we do: our models are only as good as the data feeding them, and you'll be the one making that data trustworthy at scale.
Responsibilities
- Develop and maintain scalable, reproducible workflows for ingesting and processing large volumes of point cloud, imagery, and geospatial data.
- Convert datasets from various sensor providers into Mach9's standardized internal formats.
- Build CI/CD pipelines and automated checks that guarantee the correctness and consistency of data pipelines, including regression detection on dataset processing.
- Optimize processing performance, query speed, and storage efficiency across large geospatial datasets.
- Work closely with the customer success team to efficiently resolve issues and unblock customer projects.
- Build and maintain an agentic harness for automated dataset triage and code patching that automatically proposes or applies fixes and escalates when human judgment is needed.
- Work closely with ML and product teams to make data readily usable for training, inference, and visualization.
- Work closely with customers and data-provider partners to facilitate data integration (with occasional travel).
- Puzzle-hunting: work with data formats that have sparse or missing documentation.
Requirements
- Strong software development, problem-solving, and debugging skills, with hands-on experience building production systems in Python.
- Solid foundation in distributed systems and parallel computing.
- Comfort operating with ambiguity - able to dig into undocumented or messy data formats, reverse-engineer how they work, and make steady progress without a clear spec.
- Experience building agentic systems and setting up agent harnesses - orchestrating LLM-driven workflows for triage, debugging, or automated code patching.
- Strong communication and collaboration skills, with the ability to work across ML, product, and customer-facing teams.
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.
Bonus qualifications
- Understanding of geospatial data formats (e.g., LAS/LAZ, COPC, E57, GeoTIFF, Shapefiles) and tooling (e.g., GDAL, PDAL, untwine, laz-perf).
- Expertise designing and managing data schemas and storage systems for geospatial data (e.g., Postgres/PostGIS, AWS S3).
- Experience with large-scale data processing frameworks and cloud platforms (e.g., Spark, AWS Batch).
- Familiarity with coordinate reference systems and transforms (CRS, WKT, pyproj, affine transforms).
- Experience building data versioning, lineage, or artifact-tracking systems.
- Experience operating data pipelines that feed ML training and inference.
- Familiarity with C++.