# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

**Tokyo Livehouse Events** is a full-stack web service that automatically scrapes and aggregates event information from major Tokyo live houses. Built with React Router v7 (SSR), SQLite (better-sqlite3), and Tailwind CSS, it allows users to search, filter, and view live music events across multiple venues.

### Key Stack
- **Frontend**: React 19 + React Router v7 (SSR enabled)
- **Backend**: React Router Node server + SQLite database
- **Scraping**: Cheerio (HTML parsing) + Playwright (JS-heavy sites)
- **Styling**: Tailwind CSS v4
- **Build**: Vite + React Router build system

---

## Common Development Commands

```bash
# Install dependencies
npm install

# Development server (http://localhost:5173)
npm run dev

# Type checking
npm run typecheck

# Build for production
npm run build

# Start production server (after build)
npm start

# Run scraper (all venues)
npm run scrape

# Run scraper for specific venue
npm run scrape liquid-room

# List registered scrapers
npm run scrape -- --list
```

### Scraper CLI Details
The scraper CLI (`scripts/scrape.ts`) runs with `tsx` and supports:
- **No args**: Scrapes all registered venues
- **Venue ID arg**: Scrapes single venue (e.g., `npm run scrape meets-otsuka`)
- **`--list` flag**: Displays all registered scrapers with IDs and names

Output includes success count per venue, elapsed time, and exit code (0 = all success, 1 = any failure).
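
The dispatch logic above can be sketched as a small pure function; `parseCliArgs` and the `CliMode` type are hypothetical names, not the actual identifiers in `scripts/scrape.ts`:

```typescript
// Hedged sketch of the CLI argument dispatch in scripts/scrape.ts.
type CliMode =
  | { kind: "all" }
  | { kind: "one"; venueId: string }
  | { kind: "list" };

export function parseCliArgs(argv: string[]): CliMode {
  // argv is everything after `npm run scrape --`
  if (argv.includes("--list")) return { kind: "list" };
  const venueId = argv.find((a) => !a.startsWith("-"));
  if (venueId) return { kind: "one", venueId };
  return { kind: "all" };
}
```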

---

## Architecture Overview

### Directory Structure
```
app/
├── lib/                              # Server-only utilities
│   ├── db.server.ts                  # SQLite database layer (better-sqlite3)
│   ├── scraper-runner.server.ts      # Orchestrates scraping + Markdown generation
│   ├── playwright.server.ts          # Shared browser instance for JS-heavy sites
│   ├── markdown-writer.server.ts     # Generates events/<venue-id>.md files
│   └── venue-meta.server.ts          # Server-only scraper metadata
│
├── scrapers/                         # Scraper modules (one per venue)
│   ├── base.ts                       # Scraper interface & VenueMeta type
│   ├── index.ts                      # Registry: ALL_SCRAPERS (15 venues)
│   ├── liquid-room.ts                # Example: fetch + Cheerio
│   ├── flat-nishiogikubo.ts          # Example: Playwright (Wix site)
│   ├── warp-kichijoji.ts             # fetch + Cheerio
│   ├── pitbar-nishiogikubo.ts        # Playwright (freecalend.com)
│   └── [11 other venues]
│
├── routes/                           # React Router routes (config-mapped in routes.ts)
│   ├── index.tsx                     # Redirects to /events
│   ├── events._index.tsx             # Event list with filtering
│   ├── events.$id.tsx                # Event detail page
│   ├── events.by-date.tsx            # Calendar/date-based view
│   ├── venues.tsx                    # Venue list + scrape status
│   ├── api.scrape.ts                 # POST/GET endpoint to trigger scraping
│   └── api.scrape-status.ts          # GET endpoint for scrape job status
│
├── components/
│   ├── EventCard.tsx                 # Card view for events
│   ├── EventListRow.tsx              # Row view for events
│   └── FilterBar.tsx                 # Search/filter form
│
├── root.tsx                          # Root layout + error boundary
├── routes.ts                         # React Router route config (file-based)
└── app.css                           # Tailwind/global styles

events/                               # Auto-generated Markdown files (one per venue)
events.db                             # SQLite database (created at runtime)
scripts/
└── scrape.ts                         # CLI scraper entry point
```

### Data Flow

#### Scraping Pipeline
1. User calls `npm run scrape` (CLI) or `POST /api/scrape` (HTTP)
2. `scraper-runner.server.ts` → `runAllScrapers()` or `runScraper(venueId)`
3. For each scraper in `ALL_SCRAPERS`:
   - Execute `scraper.scrape()` → returns `EventInput[]`
   - Filter events within 35-day scraping window
   - Upsert each event to SQLite via `db.server.ts`
   - Log results to `scrape_logs` table
4. After success, generate Markdown files in `events/` directory
5. Close shared Playwright browser instance
6. Return results (CLI: pretty-printed output, HTTP: JSON with run_id)
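
Steps 3 and 4 above can be sketched as a loop over the registry; the `Scraper` and `EventInput` shapes here are simplified stand-ins, and `saveEvent` represents the upsert in `db.server.ts`:

```typescript
// Hedged sketch of the orchestration loop in scraper-runner.server.ts.
interface EventInput { venue_id: string; title: string; date: string } // ISO date
interface Scraper { id: string; scrape(): Promise<EventInput[]> }

export async function runScrapers(
  scrapers: Scraper[],
  withinWindow: (e: EventInput) => boolean,
  saveEvent: (e: EventInput) => void,
): Promise<{ saved: number; failed: string[] }> {
  let saved = 0;
  const failed: string[] = [];
  for (const s of scrapers) {
    try {
      const events = (await s.scrape()).filter(withinWindow);
      for (const e of events) saveEvent(e); // upsert in the real runner
      saved += events.length;
    } catch {
      failed.push(s.id); // recorded in scrape_logs in the real runner
    }
  }
  return { saved, failed };
}
```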

#### Web UI Data Access
1. Route loaders call `queryEvents()` or `getVenues()` from `db.server.ts`
2. Results rendered as React components
3. Filter form passes query params: `date_from`, `date_to`, `venue_id`, `keyword`
4. Pagination via `page` param (30 events per page)
5. View toggle: card vs. list (persisted in URL as `view` param)
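
The query-param handling above might be extracted like this; `parseFilters` and `EventFilters` are hypothetical names, and the defaults (page 1, card view) are assumptions:

```typescript
// Sketch of how a loader could read the filter query params listed above.
export interface EventFilters {
  date_from?: string;
  date_to?: string;
  venue_id?: string;
  keyword?: string;
  page: number;
  view: "card" | "list";
}

export function parseFilters(requestUrl: string): EventFilters {
  const params = new URL(requestUrl).searchParams;
  const get = (k: string) => params.get(k) ?? undefined;
  return {
    date_from: get("date_from"),
    date_to: get("date_to"),
    venue_id: get("venue_id"),
    keyword: get("keyword"),
    page: Math.max(1, Number(params.get("page") ?? "1") || 1),
    view: params.get("view") === "list" ? "list" : "card",
  };
}
```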

---

## Scraper Implementation Pattern

Each scraper module exports:
- `venue`: VenueMeta object (id, name, url, area)
- `scraper`: Scraper object with `scrape()` async method

### Typical Flow (Cheerio-based)
```typescript
export const scraper: Scraper = {
  venue,
  async scrape(): Promise<EventInput[]> {
    const res = await fetch(venue_schedule_url); // the venue's schedule page URL
    const html = await res.text();
    const $ = cheerio.load(html);
    const events: EventInput[] = [];

    $("selector-for-event-items").each((_, el) => {
      const title = $(el).find(".title").text().trim();
      const date = parseJapaneseDate($(el).find(".date").text());
      if (!title || !date) return; // skip rows that cannot be parsed
      // ...extract other fields (time, price, ticket URL) the same way
      events.push({ venue_id: venue.id, title, date });
    });
    return events;
  },
};
```

### Playwright-based (JS-required sites like Wix)
```typescript
export const scraper: Scraper = {
  venue,
  async scrape(): Promise<EventInput[]> {
    const browser = await getBrowser(); // Shared singleton browser
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: "domcontentloaded" });
      const events: EventInput[] = [];
      // ...navigate and extract data via locators, pushing into events
      return events;
    } finally {
      await page.close(); // Always release the page, even on failure
    }
  },
};
```

### Key Utilities
- **Date parsing**: `parseJapaneseDate(str)` handles formats like "2025年06月15日", "06/15", etc. Defaults to current year if year is omitted.
- **URL handling**: `absoluteUrl(url, base)` converts relative to absolute URLs.
- **Event deduplication**: Filter by `date + title` if needed.
- **Scraping window**: Always ~35 days from today (`SCRAPE_WINDOW_DAYS` constant).
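
A minimal sketch of what `parseJapaneseDate` might do for the formats listed above; the real helper likely handles more variants, and the function name here is suffixed to mark it as illustrative:

```typescript
// Hedged sketch: parse Japanese date strings into ISO YYYY-MM-DD, or null.
export function parseJapaneseDateSketch(str: string, now = new Date()): string | null {
  const pad = (n: number) => String(n).padStart(2, "0");
  // "2025年06月15日" or "2025/06/15"
  let m = str.match(/(\d{4})[年/](\d{1,2})[月/](\d{1,2})/);
  if (m) return `${m[1]}-${pad(+m[2])}-${pad(+m[3])}`;
  // "06/15" — year omitted, default to the current year
  m = str.match(/(\d{1,2})\/(\d{1,2})/);
  if (m) return `${now.getFullYear()}-${pad(+m[1])}-${pad(+m[2])}`;
  return null;
}
```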

---

## Database Schema

### Tables
- **venues**: id (PK), name, url, area
- **events**: id (PK), venue_id (FK), title, artist, date, start_time, open_time, ticket_url, price, image_url, description, source_url, fetched_at
  - Unique constraint: (venue_id, title, date) — prevents duplicates on re-scrape
  - Indexes on date and venue_id
- **scrape_logs**: id (PK), run_id (UUID), venue_id, venue_name, status ("running"|"ok"|"error"), events_saved, error, started_at, finished_at
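
A hedged DDL reconstruction of the `events` table from the column list above; the authoritative schema lives in `app/lib/db.server.ts`, and the column types and index names here are assumptions:

```sql
-- Illustrative only; see app/lib/db.server.ts for the real schema.
CREATE TABLE IF NOT EXISTS events (
  id INTEGER PRIMARY KEY,
  venue_id TEXT NOT NULL REFERENCES venues(id),
  title TEXT NOT NULL,
  artist TEXT,
  date TEXT NOT NULL,          -- ISO YYYY-MM-DD
  start_time TEXT,
  open_time TEXT,
  ticket_url TEXT,
  price TEXT,
  image_url TEXT,
  description TEXT,
  source_url TEXT,
  fetched_at TEXT,
  UNIQUE (venue_id, title, date)
);
CREATE INDEX IF NOT EXISTS idx_events_date ON events(date);
CREATE INDEX IF NOT EXISTS idx_events_venue ON events(venue_id);
```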

### Key Functions (lib/db.server.ts)
- `getDb()`: Singleton database connection
- `upsertVenue()`, `upsertEvent()`: Insert or replace records
- `queryEvents(params)`: Search with filters (date range, venue, keyword)
- `getEvent(id)`: Single event detail
- `getVenues()`: All venues with event counts
- Scrape log functions: `insertScrapeLog()`, `updateScrapeLog()`, `getLatestScrapeRun()`, `getScrapeRunById()`

---

## Adding a New Venue

### Step 1: Create Scraper File
Create `app/scrapers/<venue-id>.ts` implementing the `Scraper` interface (see pattern above).

### Step 2: Register in Index
Add import and entry to `app/scrapers/index.ts` → `ALL_SCRAPERS` array.

### Step 3: Update Documentation
Add row to `SCRAPE_TARGETS.md` table.

### Automated Approach (Claude Code Skill)
Run `/add-livehouse <Venue Name> <URL>` to get interactive guidance for venue addition.

---

## API Endpoints

### Scraping
- `POST /api/scrape` (form action) → Starts all scrapers in background, redirects to referrer, returns 202
- `GET /api/scrape?venue_id=<id>` → Starts single venue scraper, returns `{ run_id, status: "started" }` (202)
- `GET /api/scrape` (no params) → Starts all scrapers, same as POST

### Status
- `GET /api/scrape-status?run_id=<uuid>` → Returns scrape logs for specific run
- `GET /api/scrape-status` → Returns latest run's logs

Both return `{ running: boolean, results: ScrapeLog[] }`.

---

## React Router Specifics

### Config-based Routing
Routes are defined explicitly in `app/routes.ts` using `index()`, `route()`, and `prefix()` helpers from `@react-router/dev/routes`. File names under `routes/` do not auto-determine paths — the mapping is set in `routes.ts`. Use `params.<name>` in loaders/components to access dynamic segments (e.g., `:id` → `params.id`).
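
An illustrative `routes.ts` shape using those helpers; the actual file may differ in paths and nesting, so treat this as a sketch rather than the real mapping:

```typescript
import { type RouteConfig, index, route } from "@react-router/dev/routes";

// Illustrative shape only — see app/routes.ts for the real config.
export default [
  index("routes/index.tsx"),
  route("events", "routes/events._index.tsx"),
  route("events/by-date", "routes/events.by-date.tsx"),
  route("events/:id", "routes/events.$id.tsx"),
  route("venues", "routes/venues.tsx"),
  route("api/scrape", "routes/api.scrape.ts"),
  route("api/scrape-status", "routes/api.scrape-status.ts"),
] satisfies RouteConfig;
```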

### SSR Configuration
Enabled in `react-router.config.ts` (`ssr: true`). All routes server-render by default.

### Loaders & Actions
- Loaders: `export async function loader({ request, params }: Route.LoaderArgs)`
- Actions: `export async function action({ request }: Route.ActionArgs)`
- Data accessed via `useLoaderData<typeof loader>()` hook

### Link Prefetch
Use `<Link>` from react-router for client-side navigation (no full page reload). Its `prefetch` prop (e.g. `prefetch="intent"`) preloads route data on hover/focus.

---

## Important Implementation Details

### Scrape Window
- Fixed at 35 days from today
- Filters applied in `scraper-runner.server.ts` → `withinWindow(event, from, to)`
- Events outside this range are discarded before DB insert
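
The check can be sketched as follows; the real `withinWindow` lives in `scraper-runner.server.ts`, and the inclusive upper bound here is an assumption:

```typescript
// Hedged sketch of the 35-day scraping window check.
const SCRAPE_WINDOW_DAYS = 35;

export function withinWindowSketch(isoDate: string, today: Date): boolean {
  const from = new Date(today.getFullYear(), today.getMonth(), today.getDate());
  const to = new Date(from);
  to.setDate(to.getDate() + SCRAPE_WINDOW_DAYS);
  const d = new Date(`${isoDate}T00:00:00`); // local midnight, matching `from`
  return d >= from && d <= to;
}
```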

### Shared Playwright Browser
- Singleton instance via `getBrowser()` in `playwright.server.ts`
- Only created if a scraper calls it
- Closed after each scraping run via `closeBrowser()`
- Used by Wix sites (FLAT 西荻窪) and other JS-heavy venues

### Event Deduplication
- Unique constraint: `(venue_id, title, date)` prevents duplicates
- On conflict: all fields (except date/venue/title) are updated → re-scrape refreshes data
- No manual cleanup needed
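
An illustrative upsert matching the behavior described above; the real statement lives in `app/lib/db.server.ts` and may name different columns:

```sql
-- Illustrative only; conflict target matches the unique constraint above.
INSERT INTO events (venue_id, title, date, artist, start_time, price, source_url, fetched_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (venue_id, title, date) DO UPDATE SET
  artist     = excluded.artist,
  start_time = excluded.start_time,
  price      = excluded.price,
  source_url = excluded.source_url,
  fetched_at = excluded.fetched_at;
```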

### Markdown Generation
- Auto-generated in `events/<venue-id>.md` after successful scrape
- Table format: date | artist | title | time | price | URL
- Marked as auto-generated (warns against manual editing)
- Regenerated on each scrape (overwrites previous)

### Date Parsing
- Handles Japanese format: "2025年06月15日", "2025/06/15", "06/15" (infers current year)
- Converted to ISO format (YYYY-MM-DD) for database and API
- Display converted back to Japanese in UI (e.g., "2025/06/15(日)")
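
The display conversion can be sketched like this; the real UI formatter may differ, and the function name is suffixed to mark it as illustrative:

```typescript
// Hedged sketch: ISO "2025-06-15" → Japanese display "2025/06/15(日)".
const JP_DAYS = ["日", "月", "火", "水", "木", "金", "土"];

export function formatJapaneseDateSketch(isoDate: string): string {
  const [y, m, d] = isoDate.split("-").map(Number);
  const day = JP_DAYS[new Date(y, m - 1, d).getDay()];
  return `${y}/${String(m).padStart(2, "0")}/${String(d).padStart(2, "0")}(${day})`;
}
```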

---

## Environment & Prerequisites

- **Node.js**: 20.12+ (required for `styleText` API used in scraper CLI)
- **npm**: latest
- **Optional**: Playwright binary (auto-installed on first `npm install`)

For Docker: See `Dockerfile` (multi-stage build, production uses node:20-alpine).

---

## Type Safety

The project uses strict TypeScript with path aliases:
- `~/*` maps to `app/*` (configured in `tsconfig.json`)
- Run `npm run typecheck` to validate before commits
- React Router auto-generates types via `@react-router/dev` (`.react-router/types/`)

---

## Key Files Not to Miss

- **`app/lib/db.server.ts`**: Database schema and query interface — essential for understanding data layer
- **`app/lib/scraper-runner.server.ts`**: Orchestration logic — where scraping window and dedup happen
- **`app/scrapers/base.ts`**: Scraper interface — defines contract all venues must implement
- **`app/scrapers/liquid-room.ts`** & **`flat-nishiogikubo.ts`**: Templates for simple (Cheerio) and complex (Playwright) scrapers
- **`app/routes/events._index.tsx`**: Main event list with filters — shows how DB queries integrate with UI
- **`SCRAPE_TARGETS.md`**: Live reference of all registered venues, their status, and scraper locations

---

## Debugging

### Scraper Issues
- Check `events.db` with SQLite browser to inspect DB state
- Re-run single venue: `npm run scrape <venue-id>`
- Inspect scrape logs: `GET /api/scrape-status` endpoint in browser
- Add `console.log` calls in the scraper for debugging (visible in CLI output)

### Database Queries
- Direct SQL via `getDb().prepare(...).all()` in `lib/db.server.ts`
- WAL mode enabled → check `events.db-shm` and `events.db-wal` if DB locked
- Foreign keys enforced → cannot delete venues with events

### Playwright Issues
- Headless browser logs: Pass `{ headless: false }` to `chromium.launch()` to watch the browser
- Timeout errors: Check whether the site structure changed; adjust selectors in the scraper
- Memory leaks: Ensure `page.close()` is called (ideally in a `finally` block)

---

## Permissions & Settings

Project uses `.claude/settings.json` with pre-configured allowlists for:
- `npm run` and `npx` commands
- Local dev server access (http://localhost:5173)
- Scraper endpoint testing
- Git commits
- External domain scraping (WebFetch for venue sites)

Future instances inherit these permissions. Add new patterns as needed.