Lamhot Siagian — AI/ML Engineer • Evaluation & Quality Engineering
📍 Sunnyvale, CA (Remote/US) • Email: lamhotsiagian2025@gmail.com •
LinkedIn • GitHub
About
I build reliable AI systems, from local RAG and LLM pipelines to evaluation frameworks for ranking, retrieval, and generative quality (accuracy, robustness, hallucination, bias/fairness, safety). My background spans software quality engineering, recommender/search evaluation, and production ML workflows.
Core Expertise
- LLM/RAG: LangChain, FAISS, offline semantic search, grounding, prompt orchestration
- Evaluation: precision/recall, Top‑K ranking metrics (NDCG, MAP, MRR), faithfulness/groundedness, human rubrics, LLM‑as‑judge
- Responsible AI: bias/fairness audits (Fairlearn, AIF360), safety & alignment checks
- Data/ML: end‑to‑end pipelines, feature engineering, model training, regression/classification, computer vision
- Production: CI/CD, regression testing, monitoring, experiment tracking, stakeholder decision support
Technical Skills
- Programming: Python, Java, JavaScript, TypeScript
- AI / ML: TensorFlow, Keras, PyTorch, scikit‑learn, NLTK, OpenCV, Seaborn, LangChain, Jupyter
- Testing & Evaluation: DeepEval, LangChain Evaluation, Selenium, Fairlearn, Hugging Face Evaluate, BLEU, METEOR, ROUGE, BERTScore
- DevOps / Cloud: Docker, Kubernetes, Jenkins, GitLab CI, SonarQube, AWS, GCP, RIO
- Data Platforms: SQL, Snowflake, MongoDB, Redis, Elasticsearch, Spark, Hadoop
Experience
ML Researcher / AI Engineer — University of the Cumberlands (Remote, USA)
Apr 2024 – Present
- Built a local RAG QA system using LangChain + FAISS + local LLMs for 100% offline PDF handbook Q&A.
- Developed an end‑to‑end regression pipeline (house price prediction) with synthetic data generation, feature processing, training, and evaluation.
- Implemented a Neural Network classifier (Titanic) in TensorFlow/Keras; evaluated with ROC‑AUC and classification metrics.
- Created a computer‑vision mood‑detection pipeline with OpenCV + MobileNetV2 for emotion recognition.
- Designed a Snowflake data engineering project for customer data management (schema → load → query → analytics).
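The retrieval step behind the local RAG project can be sketched in a few lines. This is an illustrative toy only: bag‑of‑words vectors stand in for real sentence embeddings, and a brute‑force cosine search stands in for FAISS; all data and names here are made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k document chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Employees accrue vacation days monthly.",
    "The handbook covers remote work policy.",
    "Expense reports are due by month end.",
]
print(retrieve("What is the remote work policy?", chunks, k=1))
```

In the actual project, the retrieved chunks are passed to a local LLM as grounding context; this sketch shows only the ranking-by-similarity idea.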
AI Engineer — HP (Palo Alto, CA)
Nov 2024 – May 2025
- Built, deployed, and operated production‑grade LLM/RAG workflows using Python, SQL, and Snowflake.
- Designed evaluation frameworks for AI tools: accuracy, robustness, hallucination, bias, safety, alignment.
- Developed quantitative + qualitative metrics (precision/recall, faithfulness/groundedness, human rubrics, LLM‑as‑judge) to compare model/prompt/RAG variants.
- Led human‑in‑the‑loop processes: annotation guidelines, scoring rubrics, and inter‑annotator agreement (IAA) via Cohen’s Kappa / Krippendorff’s Alpha.
- Built automated evaluation pipelines for regression tests, A/B comparisons, and continuous monitoring.
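The inter‑annotator agreement measure mentioned above can be illustrated with a small self‑contained Cohen’s kappa computation (toy labels; in practice a library routine such as scikit‑learn’s `cohen_kappa_score` would be used):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(a) == len(b) and a, "need two equal-length, non-empty label lists"
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled the same.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected (chance) agreement from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

ann1 = ["good", "good", "bad", "good", "bad", "bad"]
ann2 = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(ann1, ann2), 3))
```

Kappa near 0 means agreement no better than chance; values above roughly 0.6 are commonly read as substantial agreement.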
Software Engineer — Apple (Sunnyvale, CA)
Nov 2021 – Nov 2024
- Built evaluation frameworks for large‑scale recommendation & ranking systems (relevance, personalization, diversity, fairness, cold‑start).
- Defined and computed offline metrics: Precision@K, Recall@K, NDCG, MAP, MRR, coverage, novelty, diversity, popularity bias.
- Executed online experiments (A/B, canary), applying statistical testing and effect‑size analysis across engagement metrics.
- Evaluated ML models/embeddings and analyzed failure modes (distribution shift, feature importance, similarity metrics).
- Visualized and monitored quality trends (Tableau/Looker/Grafana) with experiment tracking (e.g., MLflow/W&B).
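The offline ranking metrics listed above (NDCG, MRR) can be sketched in plain Python; this is a minimal illustration with made‑up relevance grades, not production evaluation code:

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the best possible (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

def mrr(rels_per_query):
    """Mean reciprocal rank: 1/position of the first relevant hit, averaged."""
    total = 0.0
    for rels in rels_per_query:
        for i, r in enumerate(rels):
            if r:
                total += 1 / (i + 1)
                break
    return total / len(rels_per_query)

# Graded relevance for one ranked list, binary relevance for two queries.
print(round(ndcg_at_k([3, 2, 0, 1], k=4), 4))
print(mrr([[0, 1, 0], [1, 0, 0]]))
```

For a ranking of graded relevances [3, 2, 0, 1] the NDCG@4 is just under 1.0, reflecting the single swapped pair versus the ideal order.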
Jul 2020 – Jun 2021
- Built data‑driven evaluation and test frameworks for AI‑assisted insights in a continuous glucose monitoring (CGM) mobile application.
- Validated ML + rules‑based algorithms on time‑series sensor data (anomalies, drift, edge cases).
- Developed offline metrics + datasets for CGM events (hypo/hyperglycemia, time‑in‑range, rate‑of‑change).
- Collaborated with ML/clinical/regulatory stakeholders to support verification/validation (V&V) and post‑market monitoring.
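One of the CGM metrics above, time‑in‑range, reduces to a simple fraction. The sketch below uses the commonly cited 70–180 mg/dL target band and invented readings; the real event definitions and thresholds were governed by clinical and regulatory requirements:

```python
def time_in_range(readings_mg_dl, low=70, high=180):
    """Fraction of glucose readings inside the target band [low, high] mg/dL."""
    if not readings_mg_dl:
        return 0.0
    in_range = sum(low <= g <= high for g in readings_mg_dl)
    return in_range / len(readings_mg_dl)

# Toy series: one hypoglycemic (65) and two hyperglycemic (185, 200) readings.
readings = [65, 90, 120, 150, 185, 200, 110, 95]
print(round(time_in_range(readings), 3))
```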
Nov 2016 – Sep 2019
- Developed and tested ML APIs for e‑commerce features, validating data quality, outputs, and edge cases.
- Evaluated recommendation systems using Precision@K, Recall@K, NDCG for relevance/personalization/fair exposure.
- Tested chatbot/NLP systems for intent accuracy, fallback behavior, and hallucination risks.
Projects
- Local RAG QA System — Offline question answering over PDFs with LangChain + FAISS + local LLMs.
- House Price Prediction — End‑to‑end regression pipeline (NumPy/Pandas/scikit‑learn).
- Titanic Survival Classifier — Neural net classifier (TensorFlow/Keras) with ROC‑AUC evaluation.
- Mood Detection (Computer Vision) — OpenCV + MobileNetV2 inference pipeline.
- Customer Data Platform on Snowflake — End‑to‑end DE project (schema → load → analytics).
(See my GitHub profile for pinned repos and implementation details.)
Publications
- The Importance of Test‑Case Design in AI Evaluation: A Study of RAG Recommendation Systems with Top‑K Ranking Metrics. https://doi.org/10.13140/RG.2.2.12132.64647
- Auditing Fairness of ChatGPT‑5.2 for Indonesian Hate‑Speech Detection Using Fairlearn and AIF360. https://doi.org/10.13140/RG.2.2.28496.98569
- Evaluating GPT‑5.2 on Batak Toba Language Accuracy and Multi‑Metric Analysis with BLEU, METEOR, ROUGE, BERTScore, and COMET. https://doi.org/10.13140/RG.2.2.10015.83369
- Benchmarking Hallucination Evaluation for RAG Under an Abstention Policy: A Controlled 30‑Query Study with RAGAS, DeepEval, and LLM‑as‑Judge. https://doi.org/10.13140/RG.2.2.23948.78726
- Benchmark Framework Evaluation Using Hugging Face Evaluate, scikit‑learn Metrics, and TorchMetrics: A Case Study on CNN and Vision Transformer Models for Pattern Recognition. https://doi.org/10.13140/RG.2.2.35661.70882
- E‑book: End‑to‑End API Testing (2024), Leanpub
- Book: Software Test Automation (2018), ISBN: 978‑602‑475‑707‑6
Education
- PhD in Artificial Intelligence — University of the Cumberlands (Remote, USA), Apr 2024 – Dec 2026 (expected)
- MS in Computer Science — Maharishi International University (Aug 2019 – Aug 2021)
- BS in Information Systems — Bina Nusantara University (Aug 2016 – Aug 2018)
- AS in Informatics Engineering — IT‑Del (Sep 2012 – Sep 2015)