Job Summary:
TSMC Arizona is a leading semiconductor manufacturing company, offering an opportunity to work at the most advanced fab in the United States. As a Senior Data Engineer in the AI Data Curation track, you will design and maintain scalable data pipelines, ensuring that the data for AI models is high-quality and aligned with ethical standards.
Responsibilities:
• Design and implement data pipelines for processing, cleaning, and curating large datasets used in model training and fine-tuning.
• Automate data cleaning processes (e.g., removing noise, duplicates, irrelevant content) and ensure datasets are appropriately labeled and structured.
• Collaborate with model teams to ensure data aligns with model requirements and performance goals.
• Assess and mitigate bias in datasets, ensuring that models are trained on diverse and representative data.
• Manage data storage and retrieval strategies, ensuring scalability and data consistency across different environments.
• Conduct regular audits to ensure data integrity, privacy, and security compliance.
Qualifications:
Required:
• Bachelor's degree in Computer Science, Data Science, or a related field.
• 5+ years of experience in data engineering, data wrangling, or data curation, particularly in machine learning or AI-driven environments.
• Strong proficiency in Python (Pandas, NumPy) and SQL for data manipulation and querying.
• Familiarity with cloud-based data storage (AWS S3, Google Cloud Storage, etc.) and distributed systems for managing large datasets.
• Experience with data annotation tools and platforms for manual or semi-automated labeling.
• Experience with NLP data formats, such as JSONL, text, or embeddings, and an understanding of tokenization.
• Experience managing data pipelines with tools like Apache Kafka, Apache Airflow, or similar ETL tools.
• Strong knowledge of AI ethics, data privacy, and compliance standards (GDPR, CCPA, etc.).
• Candidates must be willing and able to work on-site at our Phoenix, Arizona facility.
• Communication
• Computer proficiency
• Presentation skills
• Listening
• Teamwork
Preferred:
• Experience with vector databases and indexing for LLMs (e.g., FAISS, Pinecone).
Company:
Established in 1987, TSMC is the world's first dedicated semiconductor foundry. The company is headquartered in Hsinchu, Taiwan, and has more than 10,000 employees. The company is currently Late Stage.