During my PhD at the Photogrammetric Computer Vision Lab, I studied self-supervised learning on 3D volumetric data, diffusion-based generative models, and vision-language reasoning. That research now shapes my work at Path Robotics, where I design deep learning systems for robotic perception—including multimodal backbones for generalized robotics, diffusion-based 3D asset generation for simulation, and the MLOps infrastructure that ties it all together.
Research Interest: I study how machines see, generate, and reason about visual content. My work sits at the intersection of generative modeling and visual understanding—building systems that can synthesize realistic imagery under precise control and enabling multimodal models to interpret complex visual scenes.
We introduce LLaVA-LE, a vision-language model for lunar surface and subsurface characterization. We curate LUCID, a new dataset of 96k panchromatic images with scientific captions and 81k QA pairs from NASA missions. Fine-tuned with a two-stage curriculum, LLaVA-LE achieves a 3.3x gain over base LLaVA, with reasoning scores exceeding the judge's own reference.
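A two-stage curriculum of this kind is typically a sequence of training phases, each with its own data and trainable components. The sketch below is a hypothetical outline (stage names, data splits, and trainable-module lists are illustrative assumptions, not the paper's exact recipe):

```python
# Hypothetical two-stage fine-tuning curriculum (all names are assumptions):
# stage 1 aligns the vision projector on scientific captions,
# stage 2 instruction-tunes on QA pairs with more components unfrozen.
CURRICULUM = [
    {"stage": "caption_alignment", "data": "LUCID captions (96k)",
     "trainable": ["projector"], "epochs": 1},
    {"stage": "instruction_tuning", "data": "LUCID QA pairs (81k)",
     "trainable": ["projector", "llm"], "epochs": 2},
]

def run_curriculum(train_fn, model):
    """Run each stage in order; train_fn trains `model` on one stage spec."""
    for stage in CURRICULUM:
        model = train_fn(model, stage)
    return model
```

The key design point is ordering: cheap alignment of the vision-language interface first, then the more expensive instruction tuning on reasoning-heavy QA data.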
SeamCam-based camouflage image generation vs. SOTA.
We introduce SeamCam, a camouflage evaluation metric that quantifies how detectable an animal is from visual evidence. SeamCam achieves 78.82% agreement with human judgments, outperforming state-of-the-art by ~25%. We further use SeamCam as a preference signal for DPO fine-tuning of diffusion-based inpainting models for camouflage generation, and introduce CamFG-1.5k, a curated benchmark of 1,521 high-resolution images for unbiased evaluation.
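Using a metric as a preference signal for DPO amounts to ranking pairs of generated images (here, by SeamCam score) and optimizing the standard DPO objective on those pairs. A minimal numpy sketch of the per-pair DPO loss (the function name and the beta value are illustrative; the loss itself is the standard DPO formulation):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l     : policy log-likelihood of the preferred / rejected sample
    ref_logp_w / ref_logp_l : frozen reference-model log-likelihoods of the same samples
    beta                : strength of the KL-style regularization toward the reference
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)
```

In this setup, the "preferred" sample of each pair would be the inpainted image SeamCam rates as better camouflaged; the loss pushes the diffusion policy to assign it relatively more likelihood than the reference model does.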
Sketch-to-image generation with adjustable detail.
We introduce KnobGen, a dual-pathway framework that bridges the gap between novice sketches and expert-level image generation. An adjustable module dynamically balances fine-grained detail against high-level control, producing high-quality results from sketches at any skill level.
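At its simplest, an adjustable dual-pathway balance can be read as a user-controlled interpolation between a coarse (high-level) conditioning pathway and a fine-grained (detail) pathway. The sketch below is a hypothetical illustration of that idea only, not KnobGen's actual module:

```python
import numpy as np

def knob_blend(coarse_feat, fine_feat, knob):
    """Hypothetical dual-pathway blend: a user-set knob in [0, 1] interpolates
    between high-level guidance (knob=0) and fine-grained sketch detail
    (knob=1) before the blended feature conditions the generator."""
    assert 0.0 <= knob <= 1.0, "knob must lie in [0, 1]"
    return (1.0 - knob) * coarse_feat + knob * fine_feat
```

A low knob setting lets a rough novice sketch act only as loose guidance, while a high setting preserves the fine strokes of an expert drawing.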
3D medical image segmentation results.
SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation
Shehan Perera*, Pouyan Navard*, Alper Yilmaz (* equal contribution)
CVPR 2024, DEF-AI-MIA Workshop
Project Page / CVF / Code
SegFormer3D redefines 3D medical image segmentation with a lightweight hierarchical Transformer that rivals state-of-the-art models. By blending multi-scale volumetric attention with an all-MLP decoder, we achieve competitive accuracy with 33x fewer parameters and 13x lower compute.
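The compute savings of such hierarchical Transformers come largely from reducing the cost of self-attention over flattened volumetric tokens. A minimal numpy sketch of sequence-reduced attention (the pooling-based reduction here is illustrative, assuming a reduction factor r, and is not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_self_attention(x, Wq, Wk, Wv, r):
    """Self-attention with sequence reduction: keys and values are pooled by a
    factor r, so attention costs O(N * N/r) instead of O(N^2) for N tokens."""
    N, C = x.shape
    kv = x.reshape(N // r, r, C).mean(axis=1)   # pool keys/values along the sequence
    q, k, v = x @ Wq, kv @ Wk, kv @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))        # (N, N/r) attention map
    return attn @ v                             # one value per original token

C, N, r = 16, 64, 4                             # channels, flattened voxels, reduction
x = rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))
out = efficient_self_attention(x, Wq, Wk, Wv, r)
```

Stacking such blocks at multiple resolutions yields the multi-scale volumetric features, which an all-MLP decoder then fuses without any heavy convolutional head.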