# 🚀 Hugging Face Spaces Deployment - Troubleshooting Guide

## ✅ **Your Local Fix Applied**

Great news! The core issue has been resolved locally. The downloaded model doesn't contain `actor_critic` weights, but the code assumed it did, which caused a `NoneType` error when clicking to start the game.

**Fixed**: The app now detects when `actor_critic` weights are missing and falls back to human control mode instead of crashing.

## 🔍 **Potential HF Spaces Issues & Solutions**

### **Issue 1: Model Download Timeouts** ⏰

**Symptoms:**
- "Model loading timed out" message
- App shows loading forever
- Click doesn't start the game

**Root Cause:** The network on HF Spaces can be slower, so the 5-minute timeout may not be enough.

**Solution:**
```python
# In app.py, update the timeout in _load_model_from_url_async():
success = await asyncio.wait_for(future, timeout=900.0)  # 15 minutes instead of 5
```

### **Issue 2: Memory Limitations** 💾

**Symptoms:**
- App crashes during model loading
- "Out of memory" errors in logs
- Models load but inference fails

**Root Cause:** HF Spaces free tier has only 16GB RAM.

**Quick Fix:** Force CPU-only mode
```python
# Add at the top of app.py (safest: before torch is imported)
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Force CPU mode for HF Spaces
```

**Better Solution:** Add memory management
```python
# Add memory cleanup after model loading
import gc
import torch

gc.collect()  # Release unreferenced Python objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Release cached GPU memory held by PyTorch
```

### **Issue 3: WebSocket Connection Failures** 🔌

**Symptoms:**
- "Connection Error" or "Disconnected" status
- Click works but no response
- Frequent reconnections

**Root Cause:** HF Spaces proxy/domain restrictions.

**Solution:** Update the WebSocket connection code in the HTML template:
```javascript
// Replace the connectWebSocket function in app.py HTML
function connectWebSocket() {
    // Spaces apps are served from *.hf.space (embedded in huggingface.co pages)
    const host = window.location.hostname;
    const isHFSpaces = host.endsWith('.hf.space') || host.includes('huggingface.co');
    const protocol = window.location.protocol === 'https:' ?
        'wss:' : 'ws:';
    const wsUrl = `${protocol}//${window.location.host}/ws`;

    ws = new WebSocket(wsUrl);

    // Longer timeout for HF Spaces
    const timeout = isHFSpaces ? 30000 : 10000;
    const connectTimer = setTimeout(() => {
        if (ws.readyState !== WebSocket.OPEN) {
            ws.close();
            setTimeout(connectWebSocket, 5000); // Retry after 5s
        }
    }, timeout);

    ws.onopen = function(event) {
        clearTimeout(connectTimer);
        statusEl.textContent = 'Connected';
        statusEl.style.color = '#00ff00';

        // Re-send start if user already clicked
        if (gameStarted && !gamePlaying) {
            ws.send(JSON.stringify({ type: 'start' }));
        }
    };
}
```

### **Issue 4: Actor-Critic Model Missing** 🧠

**Already Fixed!** ✅ The app now handles this gracefully:
- Detects missing `actor_critic` weights
- Falls back to human control mode
- Shows proper warning messages
- Game still works (user can control manually)

### **Issue 5: Dockerfile Optimization** 🐳

**Update your Dockerfile for HF Spaces:**
```dockerfile
# Add these optimizations
# Note: SHM_SIZE is a plain environment variable; it does not resize /dev/shm by itself
ENV SHM_SIZE=2g
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
ENV OMP_NUM_THREADS=4

# Add health check (requires curl to be installed in the image)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s \
    CMD curl --fail http://localhost:7860/health || exit 1
```

## 🚀 **Quick Deployment Checklist**

### **Before Deploying:**

1. ✅ **Test locally with conda**: `conda activate diamond && python run_web_demo.py`
2. ✅ **Verify the fix works**: Click should now work (even without `actor_critic` weights)
3. ✅ **Check model download**: Test internet connectivity for the HF model URL

### **For HF Spaces Deployment:**

1. **Update timeout values:**
   ```python
   # In app.py line ~153
   success = await asyncio.wait_for(future, timeout=900.0)  # 15 min
   ```

2. **Add health check endpoint:**
   ```python
   @app.get("/health")
   async def health_check():
       return {
           "status": "healthy",
           "models_ready": game_engine.models_ready,
           "actor_critic_loaded": game_engine.actor_critic_loaded
       }
   ```

3. **Force CPU mode for free tier:**
   ```python
   # Add at app.py startup (safest: before torch is imported)
   import os
   os.environ["CUDA_VISIBLE_DEVICES"] = ""
   ```

4. **Update Dockerfile** with the optimizations above

5. **Test WebSocket connection** - add the improved connection handling

## 🔧 **Debugging on HF Spaces**

### **Check Logs:**

1. Go to your Space page on Hugging Face
2. Click the "Logs" tab
3. Look for these messages:
   - ✅ `"Actor-critic model exists but has no trained weights - using dummy mode!"`
   - ✅ `"WebPlayEnv set to human control mode"`
   - ❌ `"Model loading timed out"`
   - ❌ `"WebSocket error"`

### **Test Health Endpoint:**

- Visit: `https://your-space.hf.space/health`
- Should return JSON with status info

### **Browser Console:**

- Open Developer Tools (F12)
- Check for WebSocket connection errors
- Look for JavaScript errors during click

## 🎯 **Expected Behavior After Fixes**

1. **App loads** → Shows loading progress bar
2. **Models initialize** → Either loads `actor_critic` OR shows "no trained weights"
3. **User clicks game area** → Game starts immediately (no hanging)
4. **If `actor_critic` missing** → User gets manual control (still playable!)
5. **If `actor_critic` loaded** → AI takes control automatically

## 🆘 **If Issues Persist**

**Quick Diagnostic:**
```python
# Add this test endpoint to app.py
@app.get("/debug")
async def debug_info():
    return {
        "models_ready": game_engine.models_ready,
        "actor_critic_loaded": game_engine.actor_critic_loaded,
        "loading_status": game_engine.loading_status,
        "game_started": game_engine.game_started,
        "obs_shape": str(game_engine.obs.shape) if game_engine.obs is not None else "None",
        "connected_clients": len(connected_clients),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0
    }
```

Visit the `/debug` endpoint to see the current state.

**Most Common Issue:** If clicking still doesn't work on HF Spaces, it's usually the WebSocket connection. Update the connection handling as described in Issue 3.

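
**Optional readiness check:** Instead of refreshing `/health` by hand, a small client-side script can poll it until the models finish loading. The following is a minimal sketch, not part of the app: the helper names `fetch_json` and `wait_until_ready` are hypothetical, and it assumes the `/health` endpoint returns the `models_ready` field shown in this guide. Call it as `wait_until_ready("https://your-space.hf.space")`, substituting your Space's URL; it returns the last health payload, or `None` if the models never became ready.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=10.0):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(base_url, fetch=fetch_json, attempts=30, delay=10.0,
                     sleep=time.sleep):
    """Poll <base_url>/health until it reports models_ready.

    Returns the health payload on success, or None if the models did not
    become ready within the given number of attempts.
    """
    url = base_url.rstrip("/") + "/health"
    for _ in range(attempts):
        try:
            payload = fetch(url)
        except (OSError, ValueError):
            # Space still booting, proxy error, or a non-JSON body: retry.
            payload = None
        if payload and payload.get("models_ready"):
            return payload
        sleep(delay)
    return None
```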
The core model/clicking issue is now fixed - the remaining items are deployment optimizations for HF Spaces' specific environment! 🎉
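
For reference, the fallback behind that fix can be sketched roughly as follows. This is an illustration under assumptions, not the actual code in `app.py`: it assumes the checkpoint is a flat dictionary whose keys are prefixed by submodule name (as PyTorch `state_dict()` output typically is), and the `actor_critic.` prefix and both function names are hypothetical.

```python
ACTOR_CRITIC_PREFIX = "actor_critic."  # hypothetical key prefix


def has_actor_critic_weights(state_dict):
    """Return True if any checkpoint key belongs to the actor-critic module."""
    return any(key.startswith(ACTOR_CRITIC_PREFIX) for key in state_dict)


def choose_control_mode(state_dict):
    """Pick "ai" when actor-critic weights are present, otherwise "human"."""
    if state_dict and has_actor_critic_weights(state_dict):
        return "ai"
    # No trained policy available: let the player drive instead of crashing.
    return "human"
```

The point of the design is that a missing or empty checkpoint is treated as a normal case that selects human control, rather than as an error discovered later when the first click arrives.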