Lamhot Siagian — AI/ML Engineer • Evaluation & Quality Engineering
📍 Sunnyvale, CA (Remote/US) • Email: lamhotsiagian2025@gmail.com •
LinkedIn • GitHub
About
I build reliable AI systems, from local RAG and LLM pipelines to evaluation frameworks for ranking, retrieval, and generative quality (accuracy, robustness, hallucination, bias/fairness, safety). My background spans software quality engineering, recommender/search evaluation, and production ML workflows.
Core Expertise
- LLM/RAG: LangChain, FAISS, offline semantic search, grounding, prompt orchestration
- Evaluation: precision/recall, Top‑K ranking metrics (NDCG, MAP, MRR), faithfulness/groundedness, human rubrics, LLM‑as‑judge
- Responsible AI: bias/fairness audits (Fairlearn, AIF360), safety & alignment checks
- Data/ML: end‑to‑end pipelines, feature engineering, model training, regression/classification, computer vision
- Production: CI/CD, regression testing, monitoring, experiment tracking, stakeholder decision support
Technical Skills
- Programming: Python, Java, JavaScript, TypeScript
- AI / ML: TensorFlow, Keras, PyTorch, scikit‑learn, NLTK, OpenCV, Seaborn, LangChain, Jupyter
- Testing & Evaluation: DeepEval, LangChain Evaluation, Selenium, Fairlearn, Hugging Face Evaluate, BLEU, METEOR, ROUGE, BERTScore
- DevOps / Cloud: Docker, Kubernetes, Jenkins, GitLab CI, SonarQube, AWS, GCP, RIO
- Data Platforms: SQL, Snowflake, MongoDB, Redis, Elasticsearch, Spark, Hadoop
Experience
ML Researcher / AI Engineer — University of the Cumberlands (Remote, USA)
Apr 2024 – Present
- Built a local RAG QA system using LangChain + FAISS + local LLMs for 100% offline PDF handbook Q&A.
- Developed an end‑to‑end regression pipeline (house price prediction) with synthetic data generation, feature processing, training, and evaluation.
- Implemented a Neural Network classifier (Titanic) in TensorFlow/Keras; evaluated with ROC‑AUC and classification metrics.
- Created a computer‑vision mood‑detection pipeline with OpenCV + MobileNetV2 for emotion recognition.
- Designed a Snowflake data engineering project for customer data management (schema → load → query → analytics).
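The retrieval step behind the local RAG project can be sketched in a few lines. This is an illustrative toy only: bag‑of‑words vectors stand in for real sentence embeddings, and a brute‑force cosine search stands in for FAISS; all data and names here are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k document chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Employees accrue vacation days monthly.",
    "The handbook covers remote work policy.",
    "Expense reports are due by month end.",
]
print(retrieve("What is the remote work policy?", chunks, k=1))
```

In the actual project, the retrieved chunks are passed to a local LLM as grounding context; this sketch shows only the ranking-by-similarity idea.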
AI Engineer — HP (Palo Alto, CA)
Nov 2024 – May 2025
- Built, deployed, and operated production‑grade LLM/RAG workflows using Python, SQL, and Snowflake.
- Designed evaluation frameworks for AI tools: accuracy, robustness, hallucination, bias, safety, alignment.
- Developed quantitative + qualitative metrics (precision/recall, faithfulness/groundedness, human rubrics, LLM‑as‑judge) to compare model/prompt/RAG variants.
- Led human‑in‑the‑loop processes: annotation guidelines, scoring rubrics, and inter‑annotator agreement (IAA) via Cohen’s Kappa / Krippendorff’s Alpha.
- Built automated evaluation pipelines for regression tests, A/B comparisons, and continuous monitoring.
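The inter‑annotator agreement measure mentioned above can be illustrated with a small self‑contained Cohen’s kappa computation (toy labels; in practice a library routine such as scikit‑learn’s `cohen_kappa_score` would be used):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(a) == len(b) and a, "need two equal-length, non-empty label lists"
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled the same.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected (chance) agreement from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

ann1 = ["good", "good", "bad", "good", "bad", "bad"]
ann2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(ann1, ann2), 3))
```

Kappa near 0 means agreement no better than chance; values above roughly 0.6 are commonly read as substantial agreement.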
Software Engineer — Apple (Sunnyvale, CA)
Nov 2021 – Nov 2024
- Built evaluation frameworks for large‑scale recommendation & ranking systems (relevance, personalization, diversity, fairness, cold‑start).
- Defined and computed offline metrics: Precision@K, Recall@K, NDCG, MAP, MRR, coverage, novelty, diversity, popularity bias.
- Executed online experiments (A/B, canary), applying statistical testing and effect‑size analysis across engagement metrics.
- Evaluated ML models/embeddings and analyzed failure modes (distribution shift, feature importance, similarity metrics).
- Visualized and monitored quality trends (Tableau/Looker/Grafana) with experiment tracking (e.g., MLflow/W&B).
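The offline ranking metrics listed above (NDCG, MRR) can be sketched in plain Python; this is a minimal illustration with made‑up relevance grades, not production evaluation code:

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the best possible (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

def mrr(rels_per_query):
    """Mean reciprocal rank: 1/position of the first relevant hit, averaged."""
    total = 0.0
    for rels in rels_per_query:
        for i, r in enumerate(rels):
            if r:
                total += 1 / (i + 1)
                break
    return total / len(rels_per_query)

# Graded relevance for one ranked list, binary relevance for two queries.
print(round(ndcg_at_k([3, 2, 0, 1], k=4), 4))
print(mrr([[0, 1, 0], [1, 0, 0]]))
```

For a ranking of graded relevances [3, 2, 0, 1] the NDCG@4 is just under 1.0, reflecting the single swapped pair versus the ideal order.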
Jul 2020 – Jun 2021
- Built data‑driven evaluation and test frameworks for AI‑assisted insights in a continuous glucose monitoring (CGM) mobile application.
- Validated ML + rules‑based algorithms on time‑series sensor data (anomalies, drift, edge cases).
- Developed offline metrics + datasets for CGM events (hypo/hyperglycemia, time‑in‑range, rate‑of‑change).
- Collaborated with ML/clinical/regulatory stakeholders to support verification/validation (V&V) and post‑market monitoring.
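One of the CGM metrics above, time‑in‑range, reduces to a simple fraction. The sketch below uses the commonly cited 70–180 mg/dL target band and invented readings; the real event definitions and thresholds were governed by clinical and regulatory requirements:

```python
def time_in_range(readings_mg_dl, low=70, high=180):
    """Fraction of glucose readings inside the target band [low, high] mg/dL."""
    if not readings_mg_dl:
        return 0.0
    in_range = sum(low <= g <= high for g in readings_mg_dl)
    return in_range / len(readings_mg_dl)

# Toy series: one hypoglycemic (65) and two hyperglycemic (185, 200) readings.
readings = [65, 90, 120, 150, 185, 200, 110, 95]
print(round(time_in_range(readings), 3))
```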
Nov 2016 – Sep 2019
- Developed and tested ML APIs for e‑commerce features, validating data quality, outputs, and edge cases.
- Evaluated recommendation systems using Precision@K, Recall@K, NDCG for relevance/personalization/fair exposure.
- Tested chatbot/NLP systems for intent accuracy, fallback behavior, and hallucination risks.
Projects
- Local RAG QA System — Offline question answering over PDFs with LangChain + FAISS + local LLMs.
- House Price Prediction — End‑to‑end regression pipeline (NumPy/Pandas/scikit‑learn).
- Titanic Survival Classifier — Neural net classifier (TensorFlow/Keras) with ROC‑AUC evaluation.
- Mood Detection (Computer Vision) — OpenCV + MobileNetV2 inference pipeline.
- Customer Data Platform on Snowflake — End‑to‑end DE project (schema → load → analytics).
(See my GitHub profile for pinned repos and implementation details.)
Publications
- The Importance of Test‑Case Design in AI Evaluation: A Study of RAG Recommendation Systems with Top‑K Ranking Metrics. https://doi.org/10.13140/RG.2.2.12132.64647
- Auditing Fairness of ChatGPT‑5.2 for Indonesian Hate‑Speech Detection Using Fairlearn and AIF360. https://doi.org/10.13140/RG.2.2.28496.98569
- Evaluating GPT‑5.2 on Batak Toba Language Accuracy and Multi‑Metric Analysis with BLEU, METEOR, ROUGE, BERTScore, and COMET. https://doi.org/10.13140/RG.2.2.10015.83369
- Benchmarking Hallucination Evaluation for RAG Under an Abstention Policy: A Controlled 30‑Query Study with RAGAS, DeepEval, and LLM‑as‑Judge. https://doi.org/10.13140/RG.2.2.23948.78726
- Benchmark Framework Evaluation Using Hugging Face Evaluate, scikit‑learn Metrics, and TorchMetrics: A Case Study on CNN and Vision Transformer Models for Pattern Recognition. https://doi.org/10.13140/RG.2.2.35661.70882
- E‑book: End‑to‑End API Testing (2024), Leanpub
- Book: Software Test Automation (2018), ISBN: 978‑602‑475‑707‑6
Education
- PhD in Artificial Intelligence — University of the Cumberlands (Remote, USA), Apr 2024 – Dec 2026 (expected)
- MS in Computer Science — Maharishi International University (Aug 2019 – Aug 2021)
- BS in Information Systems — Bina Nusantara University (Aug 2016 – Aug 2018)
- AS in Informatics Engineering — IT‑Del (Sep 2012 – Sep 2015)