EquiTrail — Disaster Recovery Plan¶
Owner: Nossie Consultancy B.V. · Last updated: 2026-06-01
Review this document every quarter and after every significant infrastructure change.
1. Scope & Objectives¶
Systems covered¶
| System | Criticality | Notes |
|---|---|---|
| Firebase (Auth + Firestore + Functions) | CRITICAL | User accounts, rides, horses, settings |
| Oracle Cloud routing server (GraphHopper) | HIGH | Navigation — app works without it (graceful degradation) |
| Website (equitrail.horse, Plesk) | MEDIUM | Marketing/PRO page — app unaffected |
| Discord bot (Oracle server) | LOW | Community — app unaffected |
| Android keystore (equitrail.jks) | CRITICAL | Loss = permanent inability to update Play Store app |
| Play Console service account | HIGH | Loss = no automated uploads (manual still works) |
| Git repository (Azure DevOps) | HIGH | Source of truth for all code |
| Oracle Cloud backup server | MEDIUM | Secondary store for secrets + docs |
Recovery Objectives¶
| System | RTO (Recovery Time) | RPO (Recovery Point) |
|---|---|---|
| Firebase Firestore | 4 hours | 24 hours (daily export) |
| Routing server | 2 hours | No data loss (stateless) |
| Android keystore | 1 hour | No data loss (backed up to Oracle + 2 offline copies) |
| Website | 30 minutes | No data loss (static files in git + Oracle backup) |
| Discord bot | 4 hours | No data loss (code in git, routes on server) |
| Git repository | 2 hours | No data loss (Azure DevOps + local clone) |
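The RPO column can be checked mechanically once the timestamp of the newest backup is known. The function below is a minimal sketch, not part of the existing tooling: the name `rpo_met` and its arguments are hypothetical, and it assumes both timestamps are available as Unix epoch seconds.

```shell
# Succeed if the newest backup is no older than the RPO.
# Arguments: backup time and current time (Unix epoch seconds), RPO in hours.
# Hypothetical helper; not one of the project's scripts.
rpo_met() {
  backup_epoch=$1
  now_epoch=$2
  rpo_hours=$3
  age_seconds=$(( now_epoch - backup_epoch ))
  [ "$age_seconds" -le $(( rpo_hours * 3600 )) ]
}
```

For the Firestore row this could be called as `rpo_met "$last_export_epoch" "$(date +%s)" 24`, where `last_export_epoch` is whatever timestamp the export check produces.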
2. Critical Asset Inventory¶
2a. Irreplaceable Assets (MUST protect — loss = catastrophic)¶
| Asset | Location | Backup |
|---|---|---|
| android/equitrail.jks | workdir + Oracle backup | ALSO keep offline encrypted copy |
| android/key.properties | workdir + Oracle backup | Contains keystore passwords |
| android/play_service_account.json | workdir + Oracle backup | Regeneratable via Google Cloud |
| android/app/google-services.json | workdir + Oracle backup | Re-downloadable from Firebase console |
| ios/Runner/GoogleService-Info.plist | workdir + Oracle backup | Re-downloadable from Firebase console |
| credentials.env | workdir + Oracle backup | Master credentials file |
| Firebase user data (Firestore) | Firebase cloud | Daily export to Cloud Storage |
WARNING: Losing android/equitrail.jks without a backup means you CANNOT update the Play Store app ever again. You would need to publish a brand-new app with a new package name, losing all install history and reviews. This file must have 3 copies: (1) workdir, (2) Oracle backup server, (3) offline encrypted (USB/1Password).
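One way to confirm that the three copies are really the same file is to compare their SHA-256 digests. This is a sketch under assumptions: `sha256_of` and `copies_match` are hypothetical helper names, and the copies must first be made reachable as local paths (for example by copying the Oracle backup to /tmp/ and mounting the USB stick).

```shell
# Print the SHA-256 digest of a file, using whichever tool is installed
# (sha256sum on most Linux systems, shasum on macOS).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# Succeed only if every listed copy exists and matches the first one.
copies_match() {
  reference=$(sha256_of "$1" 2>/dev/null)
  for copy in "$@"; do
    [ -f "$copy" ] || { echo "missing: $copy" >&2; return 1; }
    [ "$(sha256_of "$copy")" = "$reference" ] || { echo "differs: $copy" >&2; return 1; }
  done
}
```

Example call, with placeholder paths for the restored and offline copies: `copies_match android/equitrail.jks /tmp/restore/equitrail.jks /Volumes/OFFLINE/equitrail.jks`.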
2b. Recoverable Assets¶
| Asset | Recovery Method | Time |
|---|---|---|
| Source code | Clone from Azure DevOps | 5 min |
| Website HTML | Pull from git | 5 min |
| Discord bot code | Pull from git | 5 min |
| GraphHopper server | Re-provision Oracle Cloud instance + ansible/manual setup | 2 hours |
| Firebase project config | Re-download from Firebase console | 15 min |
3. Failure Scenarios & Runbooks¶
Scenario A: Firebase Outage (Google-side)¶
Symptoms: App login fails, rides not syncing, Firestore queries time out.
Impact: HIGH — users cannot log in, cloud sync broken. Local GPS tracking still works.
Runbook:
1. Check https://status.firebase.google.com — confirm it is a Google-side outage.
2. Post in Discord #announcements: "We are aware of a temporary cloud sync issue. Local tracking continues to work. No data is lost."
3. No code change needed — app is designed with offline-first Hive storage.
4. Monitor Firebase status page.
5. When restored: CloudSyncService will auto-sync on next app open.
6. If > 6 hours: post update in Discord.
Prevention: None beyond Google's own SLAs. App architecture (offline-first Hive) already mitigates impact.
Scenario B: Firebase Project Accidentally Deleted¶
Symptoms: Auth returns 404, Firestore returns 404 on all collections.
Impact: CATASTROPHIC — all user data lost if no Firestore export exists.
Runbook:
1. Immediately open the Google Cloud console and check whether the Firebase project is still inside the 30-day project deletion window.
2. If within 30 days: request project restoration via Google Cloud Support.
3. Restore from latest Firestore export (see Backup Plan → Section 4).
4. Restore google-services.json and GoogleService-Info.plist from Oracle backup.
5. Re-deploy Firebase Functions if any were lost.
6. Push emergency app update disabling Firebase features until restored.
7. Notify users via Discord + push notification.
Prevention:
- Enable Firebase project delete protection (Console → Project Settings → delete protection).
- Daily Firestore export to Cloud Storage bucket in europe-west4.
- Only nossiej@gmail.com has Owner role — no other Owner accounts.
Scenario C: Routing Server Down (GraphHopper on Oracle Cloud)¶
Symptoms: Navigation tab shows error, route planning fails. GPS tracking still works.
Impact: MEDIUM — navigation feature unavailable. All tracking/history/social unaffected.
Runbook:
1. SSH to server: ssh equitrail@100.126.14.49
- If SSH fails → Tailscale issue: check Tailscale status on Mac (tailscale status)
- If Tailscale OK but SSH fails → Oracle Cloud may have rebooted, check OCI console
2. Check GraphHopper: curl http://localhost:8989/health
3. Restart if needed: sudo systemctl restart graphhopper
4. Check logs: sudo journalctl -u graphhopper -n 100
5. If disk full: df -h → clean old GH logs: sudo journalctl --vacuum-time=7d
6. If OOM: free -m → increase swap or restart
7. If instance terminated: re-provision (see Appendix A)
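Steps 2 to 5 can be folded into a small triage helper that runs on the routing server. This is a sketch, not an existing script: the function names, the `HEALTH_URL` and `DISK_LIMIT_PCT` variables, and the 95% disk threshold are assumptions. It only prints a recommended action and leaves the restart or cleanup to the operator.

```shell
# Health endpoint taken from step 2; override HEALTH_URL if it differs.
HEALTH_URL="${HEALTH_URL:-http://localhost:8989/health}"

# Succeed if GraphHopper answers the health check within 5 seconds.
graphhopper_healthy() {
  curl -fsS --max-time 5 "$HEALTH_URL" >/dev/null 2>&1
}

# Print the used percentage of the root filesystem as a bare number.
root_disk_used_pct() {
  df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Print one of: healthy, disk-full, restart.
triage() {
  if graphhopper_healthy; then
    echo "healthy"
    return 0
  fi
  used=$(root_disk_used_pct)
  if [ "$used" -ge "${DISK_LIMIT_PCT:-95}" ]; then
    echo "disk-full"   # run: sudo journalctl --vacuum-time=7d
  else
    echo "restart"     # run: sudo systemctl restart graphhopper
  fi
}
```

Keeping the helper read-only means it is safe to run at any time; memory pressure (step 6) is not covered and still needs a manual `free -m`.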
Fallback (in app): RoutingService shows an error toast and navigation continues in degraded mode (GPS track only, no turn-by-turn). No code change needed.
Future mitigation: Add secondary OSRM fallback in routing_service.dart (#101).
Scenario D: Android Keystore Lost¶
Symptoms: flutter build appbundle --release fails to sign, or keystore file missing.
Impact: CATASTROPHIC — cannot publish updates to Play Store without this exact keystore.
Runbook:
1. Check Oracle backup: ssh equitrail@100.126.14.49 "ls ~/equitrail-backup/latest/android/"
2. Restore: scp equitrail@100.126.14.49:~/equitrail-backup/latest/android/equitrail.jks android/
3. Restore passwords from credentials.env backup.
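4. Verify: keytool -list -v -keystore android/equitrail.jks (keytool prompts for the store password; use the value from android/key.properties and never paste it into documents or shell history).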
5. Test build: flutter build appbundle --release
If keystore cannot be recovered:
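1. Contact Google Play Support. They cannot recover the keystore file itself, but if the app is enrolled in Play App Signing they can reset a lost upload key, in which case steps 2 to 4 are not needed.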
2. Must publish new app with new package name (e.g., com.nossie.equitrail2).
3. Notify all users to migrate.
4. This is a months-long effort — PREVENT THIS AT ALL COSTS.
Prevention:
- Oracle backup runs on every session exit (auto hook).
- Manual offline copy: export android/equitrail.jks to password manager (1Password) or encrypted USB.
- Run bash scripts/verify_backup.sh monthly to confirm backup integrity.
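The contents of scripts/verify_backup.sh are not shown in this document. As an illustration of the kind of check such a script might perform, the sketch below confirms that every file from section 2a is present and non-empty under a backup root. The function name, the `REQUIRED_FILES` variable, and the assumption that the backup mirrors the workdir layout are all hypothetical.

```shell
# Files listed in section 2a, relative to the backup root (layout assumed).
REQUIRED_FILES="android/equitrail.jks android/key.properties android/play_service_account.json android/app/google-services.json ios/Runner/GoogleService-Info.plist credentials.env"

# Print each required file that is missing or empty.
# Returns non-zero if at least one file was reported.
missing_backup_files() {
  backup_root=$1
  status=0
  for rel in $REQUIRED_FILES; do
    if [ ! -s "$backup_root/$rel" ]; then
      echo "$rel"
      status=1
    fi
  done
  return $status
}
```

On the backup server this could be run as `missing_backup_files ~/equitrail-backup/latest`. An empty result only shows that the files exist, not that they are valid, so the keytool check from the runbook is still needed.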
Scenario E: Azure DevOps Repository Unavailable¶
Symptoms: git push / git pull fails to dev.azure.com.
Impact: MEDIUM — code still locally available, no immediate app outage.
Runbook:
1. Check https://status.dev.azure.com — confirm Microsoft-side outage.
2. Continue working locally — all commits are safe in local .git/.
3. When restored: git push origin main
4. If Azure DevOps project deleted: restore from local clone (full history intact).
5. Long-term: push mirror to GitHub as secondary remote (see Appendix B).
Scenario F: Website Down (Plesk / equitrail.horse)¶
Symptoms: equitrail.horse returns error or blank page.
Impact: LOW — app functions fully without website. PRO purchase page affected.
Runbook:
1. Check if it's DNS: curl -I http://136.144.178.202 (direct IP)
2. If IP works but domain doesn't: DNS issue — check domain registrar DNS config.
3. If IP also fails: Plesk server issue — check Plesk control panel.
4. If blank page: 0-byte file issue — re-deploy: bash scripts/deploy_website.sh
5. Emergency: website static files are in website/ in git — can be hosted anywhere in minutes.
Nuclear fallback: Deploy website/ to Netlify/Vercel as temporary hosting in < 10 minutes.
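The DNS-versus-server decision in steps 1 to 4 can be sketched as a single function. This is an illustration only: `site_status` is a hypothetical name, the 10-second timeout is an assumption, and the domain check uses HTTPS while the direct-IP check uses plain HTTP as in step 1.

```shell
# Print one of: ok, blank-page, dns-issue, server-issue.
# Arguments: domain name and the server's direct IP address.
site_status() {
  domain=$1
  ip=$2
  if body=$(curl -fsS --max-time 10 "https://$domain/" 2>/dev/null); then
    # The domain resolves and the server answers; an empty body matches
    # the 0-byte file case from step 4.
    if [ -z "$body" ]; then echo "blank-page"; else echo "ok"; fi
  elif curl -fsS --max-time 10 -o /dev/null "http://$ip/" 2>/dev/null; then
    echo "dns-issue"      # IP works, domain does not
  else
    echo "server-issue"   # neither works: check Plesk
  fi
}
```

Example: `site_status equitrail.horse 136.144.178.202`. A result of `blank-page` points at re-running scripts/deploy_website.sh.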
Scenario G: Mac / Development Machine Lost or Stolen¶
Symptoms: Primary development machine unavailable.
Impact: HIGH for development velocity. No immediate user impact.
Runbook:
1. On any new Mac: brew install flutter git python3
2. Clone repo: git clone https://dev.azure.com/nossie/equitrail/_git/equitrail
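3. Restore secrets from Oracle backup (run all three commands from the new Mac):
   ssh equitrail@100.126.14.49 "tar czf /tmp/equitrail-secrets.tar.gz ~/equitrail-backup/latest/"
   scp equitrail@100.126.14.49:/tmp/equitrail-secrets.tar.gz ~/
   ssh equitrail@100.126.14.49 "rm /tmp/equitrail-secrets.tar.gz"
4. Run flutter pub get. The development environment is then ready.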
Time to full recovery: ~2 hours (Flutter install + restore + verify).
Scenario H: Discord Bot Down¶
Symptoms: Bot offline in Discord, onboarding not working, RouteHelper not responding.
Impact: LOW — app unaffected, community impacted.
Runbook:
1. SSH: ssh equitrail@100.126.14.49
2. Check status: systemctl --user status equitrail-bot
3. View logs: journalctl --user -u equitrail-bot -n 50
4. Restart: systemctl --user restart equitrail-bot
5. If crash loop: journalctl --user -u equitrail-bot -n 200 → fix code → deploy + restart
6. If Oracle server rebooted: loginctl enable-linger equitrail (run once to ensure auto-start)
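Steps 2 to 5 could be combined into one recovery helper that restarts the bot only when it is down and dumps the logs if the restart does not hold. This is a sketch under assumptions: `bot_recover` and `SETTLE_SECONDS` are hypothetical names, and a 5-second wait is assumed to be long enough to detect a crash loop.

```shell
# Print one of: running, restarted, still-down.
# Argument: systemd user unit name (defaults to equitrail-bot).
bot_recover() {
  unit=${1:-equitrail-bot}
  if systemctl --user is-active --quiet "$unit"; then
    echo "running"
    return 0
  fi
  systemctl --user restart "$unit"
  sleep "${SETTLE_SECONDS:-5}"   # give a crash loop time to show itself
  if systemctl --user is-active --quiet "$unit"; then
    echo "restarted"
  else
    echo "still-down"
    # Send the recent log lines to stderr for diagnosis (step 5).
    journalctl --user -u "$unit" -n 200 --no-pager >&2
    return 1
  fi
}
```

A `still-down` result means the code needs fixing before another restart is worth trying.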
4. Communication Plan¶
During an Incident¶
| Severity | Who to notify | Channel | Timing |
|---|---|---|---|
| CRITICAL (Firebase down, keystore lost) | All users | Discord #announcements + push notification | Within 1 hour |
| HIGH (routing down > 1 hour) | Discord | Discord #announcements | Within 2 hours |
| MEDIUM (website down) | — | No user notification needed | — |
| LOW (Discord bot down) | — | Post in #bot-commands | Best effort |
Message Templates¶
Firebase outage (Dutch):
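⚠️ Tijdelijke storing — De cloud-sync is momenteel niet beschikbaar door een Google-storing. Je rit-registratie werkt normaal en al je gegevens zijn veilig bewaard op je toestel. We houden je op de hoogte. 🐴
English: ⚠️ Temporary disruption: cloud sync is currently unavailable because of a Google outage. Ride recording works normally and all your data is stored safely on your device. We will keep you posted. 🐴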
Planned maintenance:
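🔧 Onderhoud gepland — Op [datum] van [tijd] tot [tijd] is de navigatiefunctie tijdelijk niet beschikbaar. GPS-tracking en geschiedenis werken normaal.
English: 🔧 Maintenance planned: on [date] from [time] to [time] the navigation feature will be temporarily unavailable. GPS tracking and history work normally.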
5. Quarterly DR Test Checklist¶
Run every quarter (next: 2026-09-01):
- Verify Oracle backup is running: ssh equitrail@100.126.14.49 "ls -la ~/equitrail-backup/"
- Test keystore restore: copy from backup to /tmp/, run keytool -list
- Verify Firestore export is scheduled and recent (Cloud Storage bucket)
- Test routing server restart: sudo systemctl restart graphhopper
- Verify website deploy: bash scripts/deploy_website.sh --file index.html
- Test Discord bot restart: systemctl --user restart equitrail-bot
- Confirm offline keystore copy is accessible (1Password / USB)
- Review this document for accuracy
Appendix A: Re-provision Routing Server¶
If Oracle Cloud instance is terminated and must be recreated:
# 1. Create new Oracle Cloud ARM instance (Always Free tier)
# Region: Netherlands Northwest (Amsterdam), region identifier eu-amsterdam-1
# OS: Ubuntu 22.04 LTS
# Shape: VM.Standard.A1.Flex (ARM, free tier)
# 2. Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
# 3. Install Java + GraphHopper
sudo apt-get install -y openjdk-17-jdk
wget https://github.com/graphhopper/graphhopper/releases/download/9.1/graphhopper-web-9.1.jar
# ... (full setup in docs/routing_server_setup.md — TODO: create this)
# 4. Re-configure UFW
sudo ufw allow from 100.64.0.0/10 # Tailscale only
sudo ufw enable
Appendix B: GitHub Mirror Setup (Recommended)¶
Add a GitHub private repo as a secondary remote to protect against Azure DevOps outage:
# One-time setup
git remote add github https://github.com/nossiej/equitrail-private.git
# Push to both on each session
git push origin main && git push github main
# Or set up automatic mirror with Azure DevOps pipeline
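As an alternative to pushing twice, git can attach two push URLs to one remote so that a single `git push origin main` updates both hosts. The sketch below uses `git remote set-url --add --push`; the function name `add_mirror_push_url` is hypothetical. Because the first explicit push URL replaces the default one, the primary URL has to be re-added before the mirror.

```shell
# Make pushes to origin go to both the existing URL and a mirror URL.
# Arguments: path to the local repository and the mirror's URL.
# Run once per clone; running it again would add duplicate entries.
add_mirror_push_url() {
  repo_dir=$1
  mirror_url=$2
  primary_url=$(git -C "$repo_dir" remote get-url origin) || return 1
  git -C "$repo_dir" remote set-url --add --push origin "$primary_url" &&
    git -C "$repo_dir" remote set-url --add --push origin "$mirror_url"
}
```

Example: `add_mirror_push_url . https://github.com/nossiej/equitrail-private.git`, after which `git remote get-url --push --all origin` should list both URLs. Fetches still come from Azure DevOps only.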
Appendix C: Firestore Daily Export Setup¶
# Run a one-off Cloud Firestore export via gcloud
gcloud firestore export gs://equitrail-backups/$(date +%Y%m%d) \
--project=equitrail \
--async
# Schedule via Cloud Scheduler (one-time setup in GCP console):
# Target: https://firestore.googleapis.com/v1/projects/equitrail/databases/(default):exportDocuments
# Schedule: 0 3 * * * with the job time zone set to Europe/Amsterdam (daily at 03:00 Amsterdam time)
# Body: {"outputUriPrefix": "gs://equitrail-backups/"}
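To confirm that exports are actually arriving, the bucket listing can be reduced to the newest export date. The sketch below is an assumption-laden helper, not existing tooling: `latest_export_date` is a hypothetical name, and it only recognises folders named YYYYMMDD as produced by the gcloud command above. Scheduled exports that use the bare bucket prefix are likely to get a timestamp-based folder name instead, which this sketch ignores.

```shell
# Read bucket entries (one gs:// path per line) on stdin and print the
# newest YYYYMMDD export folder name, or nothing if none match.
latest_export_date() {
  sed -n 's#^gs://[^/]*/\([0-9]\{8\}\)/\{0,1\}$#\1#p' | sort | tail -n 1
}
```

Typical use would be piping a bucket listing into it, for example `gcloud storage ls gs://equitrail-backups/ | latest_export_date`, and comparing the result with today's date.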
Last reviewed: 2026-06-01 | Next review due: 2026-09-01