I put cameras in every room to give my AI eyes. Five Pi nodes, an RTX 5090, and a persistent memory layer later — here's what the observation stack looks like.
Hardware cost: ~$500 Nodes deployed: 5
AI systems have zero physical-world awareness. Every interaction is text-based. No observation of behavior, environment, or embodied context. The models are smart enough. The input layer is missing.
Five-node sensor network using Raspberry Pi Zero 2W boards with ArduCam IMX708 12MP wide-angle cameras and WM8960 audio HATs. Custom Python daemon handles motion-triggered and audio-triggered recording. Data ships to a 13TB NAS. Inference on RTX 5090.
Camera selection was non-trivial — started with OwlSight 64MP (ov64a40), hit thermal and driver issues, migrated to IMX708. Stable at 1280x720@15fps main stream, 320x240 motion detection stream.
| Layer | Function | Status | |-------|----------|--------| | 1. Observation | Physical-world data capture (cameras, mics, sensors) | Building now | | 2. Memory | Persistent cross-session storage with encoding gate | TrueMemory — shipped | | 3. Reasoning | LLM inference | Solved by frontier models |
Layer 3 gets billions in investment. Layers 1 and 2 are where the actual bottleneck lives.
Raw sensor data is cheap to capture. The pipeline from capture to LLM-usable context is where the engineering difficulty concentrates. Research on the memory layer: arXiv:2605.04897.
Josh Adler — joshadler.com