2026-05-23

There Are Cameras in Every Room of My House and My AI Still Can't See

I put cameras in every room to give my AI eyes. Five Pi nodes, an RTX 5090, and a persistent memory layer later — here's what the observation stack looks like.

Project: Physical-World AI Awareness Pipeline

Hardware cost: ~$500

Nodes deployed: 5

Problem

AI systems have zero physical-world awareness. Every interaction is text-based. No observation of behavior, environment, or embodied context. The models are smart enough. The input layer is missing.

Implementation

Five-node sensor network using Raspberry Pi Zero 2W boards with ArduCam IMX708 12MP wide-angle cameras and WM8960 audio HATs. Custom Python daemon handles motion-triggered and audio-triggered recording. Data ships to a 13TB NAS. Inference on RTX 5090.
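A minimal sketch of how motion-triggered and audio-triggered recording could be grouped into clips. This is not the actual daemon: the `TriggeredRecorder` class, its `update` method, and the `post_roll_s` setting are hypothetical names, and the post-roll behavior is an assumption. Time is passed in explicitly so the logic can be run without hardware.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TriggeredRecorder:
    """Groups trigger samples into clips, with a post-roll after the last trigger."""
    post_roll_s: float = 5.0            # assumed: keep recording this long after the last trigger
    _start: Optional[float] = None      # start time of the clip in progress, if any
    _last_trigger: float = 0.0          # time of the most recent trigger
    clips: List[Tuple[float, float]] = field(default_factory=list)

    def update(self, now: float, motion: bool, audio: bool) -> bool:
        """Feed one sample; returns True while a clip is being recorded."""
        if motion or audio:
            if self._start is None:
                self._start = now       # either trigger opens a clip
            self._last_trigger = now
        elif self._start is not None and now - self._last_trigger >= self.post_roll_s:
            # Quiet for long enough: close the clip at the end of the post-roll.
            self.clips.append((self._start, self._last_trigger + self.post_roll_s))
            self._start = None
        return self._start is not None


# Usage: motion at t=1, audio at t=2, then silence.
rec = TriggeredRecorder(post_roll_s=2.0)
for t, motion, audio in [(0, False, False), (1, True, False), (2, False, True),
                         (3, False, False), (4, False, False)]:
    rec.update(t, motion, audio)
print(rec.clips)
```

Treating motion and audio as interchangeable triggers keeps one clip per burst of activity instead of one per sensor, which is one plausible way to keep the volume shipped to the NAS down.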

Camera selection was non-trivial — started with OwlSight 64MP (ov64a40), hit thermal and driver issues, migrated to IMX708. Stable at 1280x720@15fps main stream, 320x240 motion detection stream.
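The post does not say how motion is detected on the 320x240 stream. One common approach is frame differencing on grayscale frames, sketched here under that assumption. The function names and both thresholds (`pixel_delta`, `min_fraction`) are hypothetical, and the frames are synthetic bytes rather than real camera output.

```python
LORES_W, LORES_H = 320, 240   # motion-detection stream size from the post


def motion_fraction(prev: bytes, curr: bytes, pixel_delta: int = 25) -> float:
    """Fraction of pixels whose grayscale value changed by more than pixel_delta."""
    if len(prev) != len(curr) or not curr:
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_delta)
    return changed / len(curr)


def is_motion(prev: bytes, curr: bytes,
              pixel_delta: int = 25, min_fraction: float = 0.01) -> bool:
    """True when enough of the frame changed to count as motion (assumed 1%)."""
    return motion_fraction(prev, curr, pixel_delta) >= min_fraction


# Usage: a flat gray background, then the same frame with a bright 40x40 square.
background = bytes([40]) * (LORES_W * LORES_H)
frame = bytearray(background)
for y in range(100, 140):
    row = y * LORES_W
    frame[row + 100: row + 140] = bytes([200]) * 40
moved = bytes(frame)

print(is_motion(background, background))  # no change between identical frames
print(is_motion(background, moved))       # 1600 of 76800 pixels changed
```

Running detection on the small stream is what makes this affordable on a Pi Zero 2W class board: each comparison touches 76,800 pixels instead of the 921,600 in a 1280x720 frame.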

Architecture: The 3-Layer Stack

| Layer | Function | Status |
|-------|----------|--------|
| 1. Observation | Physical-world data capture (cameras, mics, sensors) | Building now |
| 2. Memory | Persistent cross-session storage with encoding gate | TrueMemory (shipped) |
| 3. Reasoning | LLM inference | Solved by frontier models |

Layer 3 gets billions in investment. Layers 1 and 2 are where the actual bottleneck lives.

Key finding

Raw sensor data is cheap to capture. The pipeline from capture to LLM-usable context is where the engineering difficulty concentrates. Research on the memory layer: arXiv:2605.04897.
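To make the gap concrete, here is a rough sketch of the last step only: squeezing already-described observations into a fixed-size text context. Everything in it is hypothetical, including the `Observation` record, the `salience` score (assumed to come from some upstream model), and the character budget. It is not how TrueMemory or the pipeline in this post works.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    """Hypothetical record produced by upstream inference on a clip."""
    timestamp: str    # ISO 8601 string, so lexical order matches time order
    node: str         # which sensor node produced it
    summary: str      # short text description of the clip
    salience: float   # 0..1 score, assumed to come from an upstream model


def format_line(obs: Observation) -> str:
    return f"[{obs.timestamp}] {obs.node}: {obs.summary}"


def build_context(observations: Iterable[Observation], max_chars: int = 400) -> str:
    """Keep the most salient observations that fit the budget, in time order."""
    chosen: List[Observation] = []
    used = 0
    for obs in sorted(observations, key=lambda o: o.salience, reverse=True):
        line_len = len(format_line(obs)) + 1   # +1 for the newline
        if used + line_len <= max_chars:
            chosen.append(obs)
            used += line_len
    chosen.sort(key=lambda o: o.timestamp)
    return "\n".join(format_line(o) for o in chosen)


# Usage: three observations, a budget that only fits two of them.
day = [
    Observation("2026-05-23T08:00:00", "kitchen", "made coffee", 0.9),
    Observation("2026-05-23T08:05:00", "hallway", "walked past", 0.2),
    Observation("2026-05-23T09:30:00", "office", "sat down at desk", 0.7),
]
print(build_context(day, max_chars=100))
```

The hard parts are everything this sketch takes as given: producing the summaries and deciding what counts as salient.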


Josh Adler — joshadler.com
