Posted on: October 28, 2024
About Us
Valo Health is a technology company that integrates human-centric data and AI-powered technology to accelerate the creation of life-changing drugs. Valo was created with the belief that the drug discovery and development process can and should be faster and less expensive, with a higher success rate. We use models early to fail less often as we reinvent drug discovery and development from the ground up. Disease doesn't wait, so neither can we. We are a multi-disciplinary team of experts in science, technology, and pharmaceuticals united in our mission to achieve better drugs for patients, faster.
Valo is committed to hiring diverse talent, prioritizing growth and development, fostering an inclusive environment, and bringing together a group of different experiences, backgrounds, and voices to work together. We achieve the widest-ranging impact when we leverage our broad backgrounds and perspectives. Valo's machine learning and AI capabilities are built on high-quality, high-density human-centric data from multiple sources: that's where you come in!
About the Role
As a Staff / Senior Staff Data Engineer, you will join the data engineering core in the Translational Data Sciences group, working with data scientists and engineers building powerful computational tools and answering critical scientific questions about patients, diseases, and drug development. In this role, you will lead the development, road mapping, and execution of complex initiatives to transform real-world data (e.g., electronic medical records, biomarkers and biomedical imaging, and text notes) into analysis-ready data products for internal teams. To do so, you will partner with a diverse set of scientists, engineers, and domain experts across traditional industry boundaries. Primary downstream use cases of these data are longitudinal deep learning models of patient trajectories, and knowledge graph integration for target identification, statistical genetics, and multi-omics modeling.
What You'll Do
- Build, maintain, and extend data transformation pipelines and systems to ingest and harmonize third-party EHR data into Valo's data ecosystems.
- Define Valo's EHR data models and pipelines (Spark, SQL) in a centralized data ecosystem and semi-isolated cloud environments.
- Collaborate closely with data providers and in-house data users to integrate third-party EHR data with Valo's standardized data.
- Maintain and extend data integration (standardization & harmonization) & data quality processes to improve quality, reliability, and FAIRness.
- Ensure conceptual accuracy and generalizability of data: do standardized derived features represent clinical concepts in repeatable ways?
- Simplify access, transformation, and use of data for data scientists, while promoting consistent data usage patterns including version management, shared ontologies & data dictionaries.
- Support internal data users through direct assistance and by composing demos, how-tos, and reference documentation.
- Provide technical leadership within the translational data engineering team, advising colleagues on data transformations and database design while encouraging best practices.
- Participate in the creation and maintenance of technical documentation.
What You Bring
- Bachelor's degree + 8 (staff) / 10 (senior staff) years of experience, MS + 6/8 YOE, or PhD + 5/7 YOE in Computer Science, Information Systems, or Data Science.
- 5+ years of experience in a technical role (SWE / DE), focusing on data ingestion, streaming technologies, troubleshooting data pipelines (e.g., Prefect, Airflow), and implementing CI/CD practices.
- Production programming experience in Python & SQL; familiarity with cloud compute and big data tools (e.g., Spark).
- 3+ years of experience gathering requirements and understanding the goals of customers and data users, including demonstrated experience in scoping projects, determining timelines, and delivering end-to-end projects.
- Technical project management experience (scoping, defining milestones & timelines) is a plus.
- Experience with EHR/EMR data and medical coding ontologies (e.g., ICD, ATC, LOINC, SNOMED) is preferred.
Nice to Have
- Experience with sparse longitudinal records (e.g., customer/log data with historical ontologies).
- Familiarity with data engineering best practices and testing methodologies (data provenance, collaborative development using source control management (git), code versioning, reproducibility, etc.).