agnes-the-ai-analyst/dev_docs/session_explore.md
Petr c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00


# Session Exploration Guide
Guide for exploring Claude Code session transcripts to identify friction points and improve the analyst experience.
## Session Data Location
**Server:** `data-broker-for-claude` (alias: `kids`)
**Path:** `/data/user_sessions/`
Sessions are collected by the systemd timer `session-collector.timer`, which runs every 30 minutes.
### Directory Structure
Sessions are organized by user:
```
/data/user_sessions/
├── petr/
│ ├── 2026-02-10_49898dbe-5045-45f5-9177-2ff10917de4a.jsonl
│ └── ...
├── martin.matejka/
│ └── ...
└── jakub.sochan/
└── ...
```
### Permission Issue (See Issue #147)
Session files are owned by `root:data-ops` with `-rw-------` permissions, making direct `scp` impossible. **Workaround:** Use `sudo cat` over SSH:
```bash
# This FAILS
scp kids:/data/user_sessions/petr/session.jsonl .
# Use this instead
ssh kids "sudo cat /data/user_sessions/petr/session.jsonl" > session.jsonl
```
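If you script downloads, a small Python wrapper can apply the same `sudo cat` workaround. This is a minimal sketch. It assumes passwordless `sudo` for `cat` on the server, as the command above implies. The names `fetch_command`, `local_name` and `fetch_session` are hypothetical. `ssh -n` keeps ssh from reading stdin, which matters when the call runs inside a loop over a file list.

```python
import shlex
import subprocess
from pathlib import Path

SESSION_ROOT = "/data/user_sessions/"


def fetch_command(host: str, remote_path: str) -> list[str]:
    """Build argv for `ssh -n HOST "sudo cat PATH"` (hypothetical helper)."""
    # -n: read stdin from /dev/null so a surrounding loop keeps its input
    return ["ssh", "-n", host, "sudo cat " + shlex.quote(remote_path)]


def local_name(remote_path: str) -> str:
    """Flatten /data/user_sessions/<user>/<file>.jsonl to <user>_<file>.jsonl."""
    return remote_path.removeprefix(SESSION_ROOT).replace("/", "_")


def fetch_session(host: str, remote_path: str, dest_dir: Path) -> Path:
    """Stream one root-owned session file from the server into dest_dir."""
    dest = dest_dir / local_name(remote_path)
    with open(dest, "wb") as out:
        # check=True raises CalledProcessError if ssh or sudo fails
        subprocess.run(fetch_command(host, remote_path), stdout=out, check=True)
    return dest
```

Example: `fetch_session("kids", "/data/user_sessions/petr/2026-02-10_49898dbe-5045-45f5-9177-2ff10917de4a.jsonl", Path("."))`.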
---
## ⚠️ CRITICAL: Common Analysis Mistakes
Based on real-world friction point analysis (Feb 2026), avoid these common pitfalls:
### 1. File mtime ≠ Session Time
**WRONG:**
```bash
# This finds files modified in last 48h, NOT sessions active in last 48h
find /data/user_sessions -name '*.jsonl' -mtime -2
```
**Problem:** File modification time is when collector WROTE the file, not when session started/ended.
**Example:**
- Session starts: 5.2.2026 15:56
- Session ends: 9.2.2026 20:50 (laptop closed for 4 days)
- File mtime: 9.2.2026 20:50
- `-mtime -2` filter: MATCHES (file modified within the last 48 hours)
- **But session actually started 6 days ago!**
**CORRECT:**
```bash
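# Download sessions first, then filter by each session's first internal timestamp
CUTOFF=$(date -u -d '48 hours ago' +%Y-%m-%dT%H:%M:%S)  # GNU date; macOS: date -u -v-48H +%Y-%m-%dT%H:%M:%S
for session in *.jsonl; do
  FIRST_TS=$(jq -r 'select(.timestamp) | .timestamp' "$session" | head -1)
  # ISO 8601 UTC strings sort lexicographically, so string comparison works here
  if [[ "$FIRST_TS" > "$CUTOFF" ]]; then
    echo "$session started $FIRST_TS"
  fi
done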
```
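The same check in Python, as a sketch. It streams each file and stops at the first event that carries a `timestamp`, so large transcripts are not read in full. It assumes timestamps are ISO 8601 strings with a trailing `Z` or an explicit offset, which is the format the Python snippets later in this guide parse. `parse_ts`, `first_timestamp` and `started_within` are hypothetical names.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def first_timestamp(path: Path) -> Optional[datetime]:
    """Return the first event timestamp in a JSONL session, or None."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank, truncated or partial lines
            ts = event.get("timestamp") if isinstance(event, dict) else None
            if ts:
                return parse_ts(ts)
    return None


def started_within(path: Path, hours: float, now: Optional[datetime] = None) -> bool:
    """True if the session's first event falls within the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    start = first_timestamp(path)
    return start is not None and start >= now - timedelta(hours=hours)
```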
### 2. Session Gaps (Laptop Closed)
**WRONG:**
```python
# Summing elapsed times without checking for gaps
total_time = sum(event['data']['elapsedTimeSeconds'] for event in bash_events)
```
**Problem:** Sessions can span DAYS with the laptop closed. Naive estimates, such as summed elapsed times or the first-to-last timestamp span, count that idle time as activity.
**Example:**
```
Session petr.hunka_2026-02-09_19c0a02f:
First event: 2026-02-05 16:06:52
Last event: 2026-02-09 15:18:50
Span: 3 days, 23 hours
Gap 1: 89.8 hours (laptop closed over weekend)
Gap 2: 4.9 hours (lunch break)
Gap 3: 5.4 hours (after work)
TOTAL ACTIVE TIME: only 4.6 minutes!
WRONG CALCULATION: 173.6 minutes (2.9 hours)
```
**CORRECT:**
```python
from datetime import datetime


# Group events into active blocks (gap >10 min = new block)
def calculate_active_time(timestamps: list[datetime]) -> float:
    """Return active seconds, ignoring gaps longer than 10 minutes."""
    if not timestamps:
        return 0.0
    timestamps = sorted(timestamps)
    blocks = []
    current_block_start = timestamps[0]
    prev_ts = timestamps[0]
    for ts in timestamps[1:]:
        gap_seconds = (ts - prev_ts).total_seconds()
        if gap_seconds > 600:  # >10 min gap: close the current block
            blocks.append((current_block_start, prev_ts))
            current_block_start = ts
        prev_ts = ts
    blocks.append((current_block_start, timestamps[-1]))
    total_active = sum((end - start).total_seconds() for start, end in blocks)
    return total_active
```
### 3. Bash Progress Elapsed Times Are Cumulative
**WRONG:**
```python
# Interpreting elapsed times as per-operation duration
for event in bash_progress_events:
    elapsed = event['data']['elapsedTimeSeconds']
    if elapsed > 60:
        print(f"This operation took {elapsed}s")  # WRONG!
```
**Problem:** `elapsedTimeSeconds` is cumulative from command start, not per-operation.
**Example:**
```
bash_progress events for single rsync command:
Event 1: elapsedTimeSeconds: 11 (11s from start)
Event 2: elapsedTimeSeconds: 12 (12s from start)
Event 3: elapsedTimeSeconds: 13 (13s from start)
...
Event 96: elapsedTimeSeconds: 266 (266s from start)
```
**These are NOT 96 separate 11-266s operations!**
**This is ONE operation with 96 progress updates over 266 seconds.**
**CORRECT:**
```python
# Group by parent command and keep the largest elapsed time per command.
# Assumes all progress events for one command share the same parentUuid.
commands = {}
for event in bash_progress_events:
    cmd_id = event.get('parentUuid') or event.get('uuid')
    elapsed = event['data']['elapsedTimeSeconds']
    commands[cmd_id] = max(commands.get(cmd_id, 0), elapsed)
# elapsedTimeSeconds counts from command start, so the largest value is the
# command's duration (a lower bound if the last update preceded completion)
for cmd_id, duration in commands.items():
    print(f"Command {cmd_id} took at least {duration}s")
```
### 4. Pre-Fix Sessions
**WRONG:** Reporting bugs that were already fixed.
**Problem:** Session file mtime doesn't tell you when session STARTED, only when it was written.
**Example:**
```
Issue #84 fixed: 2026-02-06 21:37:49
Session file: petr.hunka_2026-02-09_19c0a02f.jsonl
File mtime: 2026-02-09 20:50 (within 48h filter)
BUT: Session started 2026-02-05 15:56 (BEFORE fix!)
```
**CORRECT:**
```bash
# Always check when bug was fixed vs. when session started
git log --all --oneline --grep="issue-number" -- relevant/files
# Extract session start time
FIRST_TS=$(jq -r 'select(.timestamp) | .timestamp' session.jsonl | head -1)
# Compare: if session started before fix, ignore the bug report
```
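Raw string comparison only works while both timestamps use the same format and UTC offset. A slightly safer sketch parses both values into timezone-aware datetimes first. It assumes the session timestamp and the fix time, for example `closedAt` from `gh issue view --json closedAt`, are ISO 8601 strings with `Z` or an explicit offset. `to_utc` and `started_before_fix` are hypothetical helpers.

```python
from datetime import datetime


def to_utc(value: str) -> datetime:
    """Parse ISO 8601 (with 'Z' or an explicit offset) into an aware datetime."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {value!r}")
    return parsed


def started_before_fix(session_first_ts: str, fixed_at: str) -> bool:
    """True if the session began before the fix landed, so the bug may still appear."""
    return to_utc(session_first_ts) < to_utc(fixed_at)
```

Consider a session start of `2026-02-06T22:00:00+01:00` against a fix at `2026-02-06T21:37:49Z`. A plain string comparison puts the session after the fix. After parsing, the session correctly falls before it (21:00 UTC).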
### 5. Context-Free Bash Commands
**WRONG:**
```bash
# Reporting this as "user manually typed /tmp pattern"
pip freeze > /tmp/analyst_requirements.txt
scp /tmp/analyst_requirements.txt data-analyst:/tmp/...
```
**Problem:** You don't know if this was:
1. From our bootstrap.yaml (intended)
2. Claude improvising during troubleshooting (bug)
3. User typed manually (edge case)
**CORRECT:**
```bash
# Look at surrounding context in session
jq -r 'select(.type == "user") | .message.content' session.jsonl | tail -20
# Check if user pasted bootstrap instructions
grep -B10 -A10 "/tmp/analyst_requirements" session.jsonl
# Check our codebase - is this pattern in our scripts?
git grep "/tmp/analyst_requirements" -- scripts/ docs/
```
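Each JSONL line is one event, so `grep -B/-A` shows neighbouring events, but long tool outputs make the result noisy. The sketch below prints only the user and assistant text around the first event whose raw line contains a search string. It assumes the `message.content` shapes shown in the event examples below: either a plain string or a list of typed blocks. `text_of` and `context_around` are hypothetical helpers.

```python
import json
from typing import Iterable


def text_of(event: dict) -> str:
    """Flatten message content (string or list of blocks) to plain text."""
    content = (event.get("message") or {}).get("content", "")
    if isinstance(content, str):
        return content
    if not isinstance(content, list):
        return ""
    parts = []
    for block in content:
        if not isinstance(block, dict):
            continue
        if block.get("type") == "text":
            parts.append(block.get("text", ""))
        elif block.get("type") == "tool_use":
            parts.append(f"[{block.get('name')}] {json.dumps(block.get('input', {}))}")
    return " ".join(parts)


def context_around(lines: Iterable[str], needle: str, before: int = 5, after: int = 5) -> list[str]:
    """Return user/assistant text around the first event whose raw line contains needle."""
    raw = [line for line in lines if line.strip()]
    hit = next((i for i, line in enumerate(raw) if needle in line), None)
    if hit is None:
        return []
    out = []
    for line in raw[max(0, hit - before): hit + after + 1]:
        event = json.loads(line)
        if event.get("type") in ("user", "assistant"):
            text = text_of(event)
            if text:
                out.append(f"{event['type']}: {text}")
    return out
```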
---
## Session JSONL Format
Each session is stored as a JSONL file with one JSON object per line. Each line represents an event in the session.
### Event Types
Based on real session analysis:
1. **user** - User messages
2. **assistant** - Assistant responses (contains tool_use in content array)
3. **progress** - Progress updates (hooks, bash execution)
4. **system** - System messages (often empty)
5. **file-history-snapshot** - File tracking snapshots
6. **queue-operation** - Queue management events
7. **summary** - Session summaries
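A quick tally shows which of these types dominate a given transcript. This sketch tolerates blank or truncated lines. It labels objects without a `type` as `<none>` and unparseable lines as `<unparseable>`. `count_event_types` is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_types(lines: Iterable[str]) -> Counter:
    """Count events per top-level `type` in a JSONL session."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            counts["<unparseable>"] += 1
            continue
        event_type = event.get("type", "<none>") if isinstance(event, dict) else "<none>"
        counts[event_type] += 1
    return counts
```

Usage: `with open("session.jsonl") as f: print(count_event_types(f).most_common())`.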
### Common Event Structure
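Most events share these fields. Not every event has a `timestamp`, which is why the queries in this guide filter with `select(.timestamp)`:
```json
{
  "type": "user|assistant|progress|system|...",
  "sessionId": "uuid",
  "timestamp": "ISO 8601",
  "cwd": "/path/to/working/dir",
  "version": "2.1.29",
  "uuid": "event-uuid",
  "parentUuid": "parent-event-uuid"
}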
```
### User Event
```json
{
  "type": "user",
  "message": {
    "role": "user",
    "content": "user's message text"
  },
  "isMeta": false
}
```
### Assistant Event with Tool Use
```json
{
  "type": "assistant",
  "message": {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "response text"
      },
      {
        "type": "tool_use",
        "id": "tool-use-id",
        "name": "Bash",
        "input": {
          "command": "ls -la",
          "description": "List files"
        }
      }
    ]
  }
}
```
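Bash commands live inside `message.content[]`, not at the top level of the event. Ad-hoc scripts often get this nesting wrong. Here is a sketch that walks assistant events and yields each Bash command in order, together with its `tool_use` id. `bash_commands` is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def bash_commands(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (tool_use_id, command) for every Bash tool_use in assistant events."""
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") != "assistant":
            continue
        content = (event.get("message") or {}).get("content", [])
        if not isinstance(content, list):
            continue  # plain-text assistant message, no tool calls
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use" and block.get("name") == "Bash":
                yield block.get("id", ""), block.get("input", {}).get("command", "")
```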
### Progress Event - Bash Execution
**Important:** Exit codes are NOT present in bash_progress events! Only output and timing.
```json
{
  "type": "progress",
  "data": {
    "type": "bash_progress",
    "output": "command output",
    "fullOutput": "complete output with formatting",
    "elapsedTimeSeconds": 1.234,
    "totalLines": 10,
    "timeoutMs": 120000
  }
}
```
### Progress Event - Hook
```json
{
  "type": "progress",
  "data": {
    "type": "hook_progress",
    "hookEvent": "SessionStart",
    "hookName": "SessionStart:clear",
    "command": "${CLAUDE_PLUGIN_ROOT}/hooks-handlers/session-start.sh"
  }
}
```
### Key Insight for Friction Detection
⚠️ **Exit codes and explicit errors are NOT tracked in the current session format!**
To find friction points, look for:
- Error keywords in bash output strings
- Repeated similar commands (retry patterns)
- User questions/confusion in messages
- Long bash execution times
- Context overflow (session continuations)
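A minimal keyword scan over bash output might look like the sketch below. One caveat is an assumption, not something this guide establishes. If `fullOutput` accumulates across progress updates for the same command (consistent with the cumulative `elapsedTimeSeconds`), one error line would be counted many times. The sketch therefore collects each distinct matching line once. The trade-off is that identical error lines from separate commands are merged, and short keywords such as `error` also match inside words like `stderr`. The keyword list and `distinct_error_lines` are illustrative.

```python
import json
import re
from typing import Iterable

# Illustrative keyword list; extend it with other patterns from this guide
ERROR_RE = re.compile(r"error|failed|permission denied|command not found", re.IGNORECASE)


def distinct_error_lines(lines: Iterable[str]) -> list[str]:
    """Return each distinct bash output line that contains an error keyword."""
    seen: set[str] = set()
    found: list[str] = []
    for raw in lines:
        if not raw.strip():
            continue
        event = json.loads(raw)
        data = event.get("data") or {}
        if event.get("type") != "progress" or data.get("type") != "bash_progress":
            continue
        output = data.get("fullOutput") or data.get("output") or ""
        for text_line in output.splitlines():
            text_line = text_line.strip()
            if text_line and text_line not in seen and ERROR_RE.search(text_line):
                seen.add(text_line)
                found.append(text_line)
    return found
```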
## Practical Commands
### Browse Sessions on Server
```bash
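# List users with sessions (one directory per user)
ssh kids "ls -lh /data/user_sessions/"
# Count sessions (files live in per-user subdirectories)
ssh kids "find /data/user_sessions -name '*.jsonl' | wc -l"
# Find session files WRITTEN in the last 7 days (file mtime, not session time; see "File mtime ≠ Session Time" above)
ssh kids "find /data/user_sessions -name '*.jsonl' -mtime -7"
# List sessions for one user
ssh kids "ls /data/user_sessions/petr/"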
```
### Download Sessions for Local Analysis
```bash
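# Session files are root-only (see "Permission Issue" above), so stream them with sudo cat
# Download a specific session
ssh kids "sudo cat /data/user_sessions/petr/2026-02-10_49898dbe-5045-45f5-9177-2ff10917de4a.jsonl" > petr_2026-02-10_49898dbe-5045-45f5-9177-2ff10917de4a.jsonl
# Download all sessions for a user (ssh -n keeps the inner ssh from reading the loop's input)
mkdir -p sessions
ssh kids "ls /data/user_sessions/petr/" | while read -r f; do
  ssh -n kids "sudo cat /data/user_sessions/petr/$f" > "sessions/petr_$f"
done
# Download session files written in the last 7 days
ssh kids "find /data/user_sessions -name '*.jsonl' -mtime -7" | while read -r path; do
  ssh -n kids "sudo cat $path" > "sessions/$(basename "$(dirname "$path")")_$(basename "$path")"
done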
```
### Quick Stats on Server
```bash
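# Count events (lines) per session; files are root-only, so run under sudo
ssh kids "sudo sh -c 'wc -l /data/user_sessions/*/*.jsonl'"
# Count tool uses (-o counts every match; ' *' accepts both "key": value and compact "key":value)
ssh kids "sudo grep -rho --include='*.jsonl' '\"type\": *\"tool_use\"' /data/user_sessions/ | wc -l"
# Count tool results flagged as errors
ssh kids "sudo grep -rho --include='*.jsonl' '\"is_error\": *true' /data/user_sessions/ | wc -l"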
```
## jq Queries for Analysis
### Extract All Bash Commands
```bash
# Extract from assistant messages with tool_use
jq -r 'select(.type == "assistant") | .message.content[]? | select(.type == "tool_use" and .name == "Bash") | .input.command' session.jsonl
```
### Find Bash Output (for error keywords)
```bash
# Extract bash execution output
jq -r 'select(.type == "progress" and .data.type == "bash_progress") | .data.fullOutput' session.jsonl
```
### Find Error Keywords in Output
Since exit codes aren't tracked, search output strings:
```bash
# Search bash output for errors
jq -r 'select(.type == "progress" and .data.type == "bash_progress") | .data.fullOutput' session.jsonl | grep -i "error\|failed\|permission denied"
```
### ⚠️ Exit Code Queries Don't Work
Exit codes are NOT present in the current session format. These queries will return nothing:
```bash
# These DON'T work with current format
jq 'select(.exit_code == 127)' session.jsonl # Returns nothing
jq 'select(.data.exitCode != 0)' session.jsonl # Returns nothing
```
### Count Tool Usage
```bash
jq -r 'select(.type == "assistant") | .message.content[]? | select(.type == "tool_use") | .name' session.jsonl | sort | uniq -c
```
### Extract Error Messages
```bash
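# Assumes error results are tool_result blocks (carrying is_error) inside user events
jq -r 'select(.type == "user") | .message.content[]? | select(.type? == "tool_result" and .is_error == true) | .content | if type == "string" then . else tojson end' session.jsonl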
```
### Find Retry Patterns
```bash
# Find repeated similar commands (potential retry loops)
jq -r 'select(.type == "assistant") | .message.content[]? | select(.type == "tool_use" and .name == "Bash") | .input.command' session.jsonl | sort | uniq -c | sort -rn
```
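`sort | uniq -c` counts repeats across the whole session but discards their order, so it cannot tell a retry loop apart from a command that is legitimately reused. The sketch below finds back-to-back runs of the same command instead. The whitespace normalization and the `min_repeats=3` threshold are arbitrary choices, and `retry_runs` is a hypothetical name.

```python
from itertools import groupby
from typing import Iterable


def retry_runs(commands: Iterable[str], min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (command, run_length) for consecutive identical commands."""
    # Collapse whitespace so trivially different retries still group together
    normalized = (" ".join(cmd.split()) for cmd in commands)
    runs = []
    for cmd, group in groupby(normalized):
        count = sum(1 for _ in group)
        if count >= min_repeats:
            runs.append((cmd, count))
    return runs
```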
### Session Timeline
```bash
jq -r 'select(.timestamp) | [.timestamp, .type, (.message.role // .data.type // "")] | @tsv' session.jsonl
```
### Full Tool Use + Result
```bash
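# Pair each tool_use (assistant events) with its tool_result (user events) by id; result is null if none was logged
jq -s '[.[] | select(.type == "user") | .message.content | arrays | .[] | select(.type? == "tool_result")] as $res
  | .[] | select(.type == "assistant") | .message.content[]? | select(.type == "tool_use") | . as $use
  | {tool: .name, input: .input, result: ([$res[] | select(.tool_use_id == $use.id)] | .[0])}' session.jsonl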
```
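The same pairing in Python, as a sketch. It assumes tool results are recorded as `tool_result` blocks inside `user` events' `message.content`, each carrying `tool_use_id` and an optional `is_error` flag (the shape the Anthropic Messages API uses). Verify this against your own transcripts, because the event examples above do not include a tool-result event. `pair_tool_calls` is hypothetical.

```python
import json
from typing import Iterable


def pair_tool_calls(lines: Iterable[str]) -> list[dict]:
    """Match each tool_use block with its tool_result (None if no result was logged)."""
    uses: list[dict] = []
    results: dict[str, dict] = {}
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        content = (event.get("message") or {}).get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if event.get("type") == "assistant" and block.get("type") == "tool_use":
                uses.append(block)
            elif event.get("type") == "user" and block.get("type") == "tool_result":
                results[block.get("tool_use_id")] = block
    paired = []
    for use in uses:
        result = results.get(use.get("id"))
        paired.append({
            "tool": use.get("name"),
            "input": use.get("input"),
            "is_error": bool(result and result.get("is_error")),
            "result": result.get("content") if result else None,
        })
    return paired
```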
## grep Patterns for Friction Points
### Permission Errors
```bash
grep -i "permission denied" session.jsonl
grep -i "access denied" session.jsonl
grep -i "not permitted" session.jsonl
```
### Command Not Found
```bash
grep "command not found" session.jsonl
grep "No such file or directory" session.jsonl
```
### Authentication Issues
```bash
grep -i "authentication failed" session.jsonl
grep -i "unauthorized" session.jsonl
grep -i "forbidden" session.jsonl
```
### Timeout Issues
```bash
grep -iw "timeout" session.jsonl   # -w skips the timeoutMs field present in every bash_progress event
grep -i "timed out" session.jsonl
```
### Data Sync Issues
```bash
grep "sync_data.sh" session.jsonl
grep "rsync" session.jsonl | grep -i "error\|fail"
```
### Python/Environment Issues
```bash
grep "ModuleNotFoundError" session.jsonl
grep "ImportError" session.jsonl
grep "No module named" session.jsonl
```
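To tally these categories across many sessions without running each grep by hand, the patterns above can be collected into one table. In this sketch, the category names and regexes mirror the grep commands in this section, with the `timeout` pattern anchored at word boundaries. `FRICTION_PATTERNS` and `classify` are hypothetical. The function only inspects the text you pass in, such as bash output or tool-result content.

```python
import re

# One regex per grep section above; names are illustrative
FRICTION_PATTERNS = {
    "permission": re.compile(r"permission denied|access denied|not permitted", re.I),
    "command_not_found": re.compile(r"command not found|No such file or directory"),
    "auth": re.compile(r"authentication failed|unauthorized|forbidden", re.I),
    "timeout": re.compile(r"\btimeout\b|timed out", re.I),
    "data_sync": re.compile(r"rsync.*(error|fail)", re.I),
    "python_env": re.compile(r"ModuleNotFoundError|ImportError|No module named"),
}


def classify(text: str) -> set[str]:
    """Return every friction category whose pattern matches the text."""
    return {name for name, pattern in FRICTION_PATTERNS.items() if pattern.search(text)}
```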
## What to Look For (Friction Points)
### 1. Tool Failures
**Indicators:**
- `"is_error": true` in tool results
- Non-zero exit codes (not recorded in the current session format; infer failures from output text)
- Exceptions in tool output
**Check:**
- Which tools fail most often?
- Are there common error patterns?
- Do errors lead to retry loops?
### 2. Permission Issues
**Indicators:**
- Exit code 126 (not recorded as a field; look for the matching output text instead)
- "Permission denied" messages
- Sudoers-related errors
**Check:**
- What operations need elevated permissions?
- Are there gaps in sudoers configuration?
- Do users hit permission walls frequently?
### 3. Command Not Found
**Indicators:**
- Exit code 127 (not recorded as a field; look for the matching output text instead)
- "command not found" messages
**Check:**
- Missing system utilities?
- PATH issues?
- Typos in commands vs. actual bugs?
### 4. Retry Loops
**Indicators:**
- Same command repeated multiple times
- Error → retry → error pattern
- High tool use count for single task
**Check:**
- What triggers retries?
- Are retries effective or just wasting time?
- Could better error messages prevent retries?
### 5. Data Sync Problems
**Indicators:**
- rsync errors
- Missing files after sync
- Stale data complaints
**Check:**
- Network issues?
- Permission problems on server?
- User confusion about when to sync?
### 6. Environment Setup Issues
**Indicators:**
- Missing Python modules
- Virtual environment problems
- Dependency conflicts
**Check:**
- Are setup instructions clear?
- Do users skip setup steps?
- Are requirements.txt files accurate?
### 7. User Confusion
**Indicators:**
- Repeated questions about same topic
- Commands tried in wrong directory
- Misunderstanding of data structure
**Check:**
- Documentation gaps?
- Confusing error messages?
- Missing guidance in critical moments?
## Creating GitHub Issues
### When to Create an Issue
Create a GitHub issue when you find:
- Repeated failures across multiple sessions
- Systematic problems (not user typos)
- Missing features that would prevent friction
- Documentation gaps that cause confusion
- Security or permission model issues
### Issue Template
````markdown
## Friction Point: [Short Description]
**Source:** Session exploration - [username]-[date]-[session_id]
**Frequency:** [How often this appears in sessions]
**Impact:** [High/Medium/Low]
### Problem
[Clear description of what goes wrong]
### Evidence
```
[Relevant excerpts from session JSONL]
```
### Root Cause
[Your analysis of why this happens]
### Proposed Solution
[How to fix this - could be code, documentation, or process change]
### Related Sessions
- session1.jsonl
- session2.jsonl
````
### Labels to Use
- `user-feedback` - Always use this for friction points from sessions
- `claude-learnings` - If the issue reveals Claude-specific patterns
- `bug` - If something is broken
- `documentation` - If docs are missing or unclear
- `enhancement` - If a feature would prevent the friction
- `security` - If related to permissions or access control
- `pipeline` - If data sync or transformation related
## Example Exploration Workflow
### Step 1: Get Overview
```bash
# How many sessions do we have?
ssh kids "find /data/user_sessions -name '*.jsonl' | wc -l"
# Which session files were written most recently? (file mtime, not session time)
ssh kids "ls -t /data/user_sessions/*/*.jsonl | head -5"
# Which users have sessions, and how many each?
ssh kids "find /data/user_sessions -name '*.jsonl' | cut -d/ -f4 | sort | uniq -c"
```
### Step 2: Sample Sessions
```bash
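# Download a few recent sessions
mkdir -p ~/session-exploration
cd ~/session-exploration
# Get the 5 most recently written files; ssh -n stops the inner ssh from consuming the loop's input
ssh kids "ls -t /data/user_sessions/*/*.jsonl | head -5" | while read -r f; do
  ssh -n kids "sudo cat $f" > "$(basename "$(dirname "$f")")_$(basename "$f")"
done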
```
### Step 3: Quick Analysis
```bash
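# For each session, check:
for session in *.jsonl; do
  echo "=== $session ==="
  echo "Events: $(wc -l < "$session")"
  echo "Tool errors: $(jq -s '[.[] | .message.content[]? | select(.type? == "tool_result" and .is_error == true)] | length' "$session")"
  # Exit codes are not recorded, so count error keywords in bash output instead
  echo "Error keywords: $(jq -r 'select(.type == "progress" and .data.type == "bash_progress") | .data.fullOutput' "$session" | grep -ci 'error\|failed\|permission denied')"
  echo ""
done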
```
### Step 4: Deep Dive on Errors
```bash
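# Extract all error tool results to a file (one JSON object per line)
jq -c '.message.content[]? | select(.type? == "tool_result" and .is_error == true)' *.jsonl > all_errors.jsonl
# Analyze error patterns (first line of each error message)
jq -r '.content | if type == "string" then . else tojson end | (split("\n") | .[0])' all_errors.jsonl | sort | uniq -c | sort -rn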
```
### Step 5: Create Issues
For each distinct friction point:
1. Document the pattern
2. Collect evidence from sessions
3. Propose solution
4. Create GitHub issue with appropriate labels
## Tips
- **Start small:** Analyze 5-10 sessions first before scaling up
- **Look for patterns:** Single errors might be user mistakes, repeated errors are friction points
- **Context matters:** Read surrounding messages to understand user intent
- **Time analysis:** Check if friction points occur at specific times (e.g., after server updates)
- **User comparison:** Do all users hit the same issues or are some user-specific?
---
## Practical Example: Correct Session Analysis
Based on lessons learned from Feb 2026 friction point analysis:
### Step 1: Download Sessions with Correct Time Filtering
```bash
# Create analysis directory
mkdir -p ~/session-analysis/raw
# Download sessions whose FILENAME date matches (here 2026-02-10 and 2026-02-11).
# The filename date is not the session start, so filter by internal timestamps in Step 2.
ssh kids "find /data/user_sessions -name '*2026-02-1[01]*.jsonl' -type f" | while read -r session; do
  # Flatten /data/user_sessions/<user>/<file> to <user>_<file>
  filename=$(echo "$session" | sed 's|^/data/user_sessions/||' | tr '/' '_')
  # -n: keep ssh from reading the loop's stdin, which would swallow the remaining file list
  ssh -n kids "sudo cat $session" > ~/session-analysis/raw/"$filename"
done
```
### Step 2: Filter by Internal Timestamps
```python
#!/usr/bin/env python3
"""Filter sessions by actual start/end time, not file mtime."""
import json
from datetime import datetime, timezone, timedelta
from pathlib import Path

# Define time window
NOW = datetime.now(timezone.utc)
WINDOW_START = NOW - timedelta(hours=48)

session_dir = Path("~/session-analysis/raw").expanduser()
valid_sessions = []

for session_file in sorted(session_dir.glob("*.jsonl")):
    try:
        # Collect timestamps from every event that has one; the first or last
        # line may be an event without a timestamp
        timestamps = []
        with open(session_file, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                ts = json.loads(line).get('timestamp')
                if ts:
                    timestamps.append(datetime.fromisoformat(ts.replace('Z', '+00:00')))
        if not timestamps:
            print(f"✗ {session_file.name}: no timestamps")
            continue
        first_ts, last_ts = min(timestamps), max(timestamps)

        # Check if session STARTED within window (not just ended)
        if first_ts >= WINDOW_START:
            valid_sessions.append({
                'file': session_file.name,
                'start': first_ts,
                'end': last_ts,
                'duration': last_ts - first_ts
            })
            print(f"✓ {session_file.name}: {first_ts} -> {last_ts}")
        else:
            print(f"✗ {session_file.name}: started {first_ts} (before window)")
    except Exception as e:
        print(f"✗ {session_file.name}: error - {e}")

print(f"\nFound {len(valid_sessions)} sessions within 48h window")
```
### Step 3: Calculate Active Time (Skip Gaps)
```python
#!/usr/bin/env python3
"""Calculate real active time, excluding laptop-closed gaps."""
import json
from datetime import datetime
from pathlib import Path


def calculate_active_time(session_file, gap_threshold_minutes=10):
    """
    Calculate active time in session, excluding gaps >threshold.

    Args:
        session_file: Path to JSONL session file
        gap_threshold_minutes: Gap >this = laptop closed (default 10 min)

    Returns:
        dict with total_span, active_time, gaps (None if fewer than 2 timestamps)
    """
    timestamps = []
    with open(session_file, encoding="utf-8") as f:
        for line in f:
            try:
                event = json.loads(line)
                if ts_str := event.get('timestamp'):
                    ts = datetime.fromisoformat(ts_str.replace('Z', '+00:00'))
                    timestamps.append(ts)
            except (ValueError, AttributeError):
                pass  # skip malformed lines (JSONDecodeError is a ValueError) and bad timestamps

    if len(timestamps) < 2:
        return None
    timestamps.sort()

    # Find gaps
    gaps = []
    blocks = []
    block_start = timestamps[0]
    prev_ts = timestamps[0]
    for ts in timestamps[1:]:
        gap_seconds = (ts - prev_ts).total_seconds()
        if gap_seconds > gap_threshold_minutes * 60:
            # End current block
            blocks.append((block_start, prev_ts))
            gaps.append({
                'start': prev_ts,
                'end': ts,
                'duration_hours': gap_seconds / 3600
            })
            block_start = ts
        prev_ts = ts
    # Last block
    blocks.append((block_start, timestamps[-1]))

    # Calculate metrics
    total_span = (timestamps[-1] - timestamps[0]).total_seconds()
    active_time = sum((end - start).total_seconds() for start, end in blocks)
    gap_time = total_span - active_time

    return {
        'total_span_hours': total_span / 3600,
        'active_time_hours': active_time / 3600,
        'gap_time_hours': gap_time / 3600,
        'num_gaps': len(gaps),
        'gaps': gaps,
        'num_blocks': len(blocks)
    }


# Example usage
session = Path("~/session-analysis/raw/petr.hunka_2026-02-09_19c0a02f.jsonl").expanduser()
result = calculate_active_time(session)
if result is None:
    print("Not enough timestamped events to measure")
else:
    print(f"Total span: {result['total_span_hours']:.2f} hours")
    print(f"Active time: {result['active_time_hours']:.2f} hours")
    print(f"Gap time: {result['gap_time_hours']:.2f} hours")
    print(f"Number of gaps: {result['num_gaps']}")
    for i, gap in enumerate(result['gaps'], 1):
        print(f"  Gap {i}: {gap['duration_hours']:.1f} hours ({gap['start']} -> {gap['end']})")
```
### Step 4: Verify Bug Fix Timeline
```bash
#!/bin/bash
# Check if session contains pre-fix bugs
# Usage: check_fix.sh SESSION_FILE ISSUE_NUMBER   (e.g. ISSUE_NUMBER=84)
SESSION_FILE="$1"
ISSUE_NUMBER="$2"
# Get session start time
FIRST_TS=$(jq -r 'select(.timestamp) | .timestamp' "$SESSION_FILE" | head -1)
echo "Session started: $FIRST_TS"
# Get when issue was fixed (empty while the issue is still open)
CLOSED_AT=$(gh issue view "$ISSUE_NUMBER" --json closedAt -q '.closedAt // ""')
if [[ -z "$CLOSED_AT" ]]; then
  echo "Issue #$ISSUE_NUMBER is not closed; nothing to compare" >&2
  exit 1
fi
echo -e "\nIssue #$ISSUE_NUMBER fix timeline:"
gh issue view "$ISSUE_NUMBER" --json closedAt,title | jq -r '"Closed: " + .closedAt + " - " + .title'
# Get relevant commits from the close date to the following day
echo -e "\nRelevant commits:"
CLOSE_DATE=${CLOSED_AT%%T*}
# BSD/macOS date syntax; on GNU/Linux use: date -d "$CLOSE_DATE +1 day" +%Y-%m-%d
NEXT_DAY=$(date -v+1d -j -f "%Y-%m-%d" "$CLOSE_DATE" +%Y-%m-%d)
git log --all --since="$CLOSE_DATE" --until="$NEXT_DAY" \
  --oneline --grep="$ISSUE_NUMBER\|tmp\|requirements" -- scripts/ docs/setup/
echo -e "\nConclusion:"
# Both values are ISO 8601 UTC strings, so string comparison orders them correctly
if [[ "$FIRST_TS" < "$CLOSED_AT" ]]; then
  echo "⚠️ Session STARTED BEFORE fix - bug is expected"
else
  echo "✓ Session started AFTER fix - bug should not appear"
fi
```
### Step 5: Generate Accurate Report
```python
#!/usr/bin/env python3
"""Generate friction report with verified metrics."""
import json
from pathlib import Path
from collections import defaultdict

# Reuses calculate_active_time() from Step 3; paste it here or import it.


def analyze_friction_points(sessions_dir):
    """
    Analyze friction points with proper methodology.

    Avoids:
    - False positives from pre-fix sessions
    - Inflated time estimates from gaps
    - Misinterpreted cumulative elapsed times
    """
    friction_points = defaultdict(list)
    for session_file in Path(sessions_dir).expanduser().glob("*.jsonl"):
        # 1. Skip sessions that span more than 48h or lack enough timestamps
        active_time = calculate_active_time(session_file)
        if not active_time or active_time['total_span_hours'] > 48:
            continue  # Skip multi-day sessions or invalid

        # 2. Extract friction indicators
        with open(session_file, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                event = json.loads(line)
                # Example: Permission errors
                if event.get('type') == 'progress':
                    output = (event.get('data') or {}).get('fullOutput') or ''
                    if 'Permission denied' in output:
                        friction_points['permission_errors'].append({
                            'session': session_file.name,
                            'timestamp': event.get('timestamp'),
                            'output': output[:200]
                        })
                # Example: Slow operations (using proper elapsed time interpretation)
                if event.get('type') == 'progress' and (event.get('data') or {}).get('type') == 'bash_progress':
                    # This is cumulative, not per-operation!
                    # Track command start/end, not individual progress events
                    pass
    return friction_points


# Generate report
friction = analyze_friction_points("~/session-analysis/raw")
for category, incidents in friction.items():
    print(f"\n## {category.replace('_', ' ').title()}")
    print(f"Found {len(incidents)} incidents")
    # ... detailed reporting
```
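The `bash_progress` branch above is left as `pass`. One way to fill it in is sketched below, under the same assumption as the cumulative-elapsed example earlier (progress events for one command share a `parentUuid`). It keeps the largest `elapsedTimeSeconds` per command and flags commands over a threshold. The 60-second default and `slow_commands` are illustrative.

```python
import json
from typing import Iterable


def slow_commands(lines: Iterable[str], threshold_s: float = 60.0) -> dict[str, float]:
    """Map command id -> max cumulative elapsed seconds, for commands over threshold."""
    longest: dict[str, float] = {}
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        data = event.get("data") or {}
        if event.get("type") != "progress" or data.get("type") != "bash_progress":
            continue
        cmd_id = event.get("parentUuid") or event.get("uuid")
        elapsed = float(data.get("elapsedTimeSeconds", 0))
        # Elapsed time is cumulative, so the maximum approximates total runtime
        longest[cmd_id] = max(longest.get(cmd_id, 0.0), elapsed)
    return {cmd: secs for cmd, secs in longest.items() if secs > threshold_s}
```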
---
## Future Automation
Once we understand common patterns, we can build:
- Automated friction detection scripts
- Session quality metrics
- Alerting for critical failures
- User experience dashboards
But start with manual exploration to learn what matters.