Processing Big CSV Data

Millions of rows? No worries: OpenClaw runs Python locally and gets it done for you

The annoying parts of working with CSVs

A million rows, and Excel simply gives up

Excel already chokes at 500,000 rows and crashes outright at a million. Even when the file finally opens, dragging the scroll bar lags for five seconds.

Cleaning data by hand is an even worse nightmare: dates come in three different formats, phone numbers may or may not have area codes, duplicate rows reappear after you delete them, and you never know whether to fill missing values with 0 or drop the row. You spend a week cleaning one dataset, only to discover a few columns you never touched.

OpenClaw: Python runs locally, your data never leaves your PC

Throw a CSV at OpenClaw and it starts a Python script locally right away, using pandas or polars as needed. 2 million rows? Read in a few seconds.

The key point: not a single byte of your data is uploaded to any server. Company sales data, user privacy data, financial reports: process them all without worrying about data security.

3 data-processing prompts, ready to copy and use

From summary analysis to data cleaning to multi-table merges, grab whichever you need.

Million-row sales data: monthly summary + Top 10 (golden prompt)
Read ~/data/sales_2025.csv (about 2 million rows) and:

1. Summarize total sales by month and output the monthly trend
2. Find the Top 10 products by sales, listing product name and total amount
3. Group by region and compute each region's order count and average order value
4. Export the summary to summary.csv, saved under ~/data/output/

Use pandas, and watch memory usage (specify dtype, read in chunks if needed).
This is the most common data-analysis scenario. Even 2 million rows take only seconds with pandas running locally, with no upload time or file-size limits to worry about. The Opus model is recommended here: the pandas code it generates is more solid and handles edge cases better.
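The prompt above typically produces pandas code along these lines. A minimal sketch using a tiny in-memory stand-in for the real file; the column names (order_date, product, region, amount) are assumptions, not the actual schema of sales_2025.csv:

```python
import pandas as pd

# Tiny in-memory stand-in for sales_2025.csv; a real run would use
# pd.read_csv with explicit dtypes (and chunksize for memory control).
df = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2025-01-05", "2025-01-20", "2025-02-03", "2025-02-15"]),
    "product": ["A", "B", "A", "C"],
    "region": ["North", "South", "North", "South"],
    "amount": [100.0, 250.0, 80.0, 400.0],
})

# 1. Monthly totals (the monthly trend)
monthly = df.groupby(df["order_date"].dt.to_period("M"))["amount"].sum()

# 2. Top products by total sales
top = df.groupby("product")["amount"].sum().nlargest(10)

# 3. Per-region order count and average order value
by_region = df.groupby("region")["amount"].agg(orders="count", avg_order="mean")

print(monthly)
print(top)
print(by_region)
```

The same three groupby calls scale to millions of rows unchanged; only the read step needs dtype and chunking care.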
One-stop data cleaning: dedup + format normalization + missing values (beginner friendly)
Clean the file ~/data/raw_customers.csv:

1. Remove fully duplicated rows
2. Normalize the date column to YYYY-MM-DD (the raw data mixes 2025/01/15, 01-15-2025, Jan 15 2025, and other formats)
3. Normalize phone numbers to 11 digits (strip area codes, spaces, dashes)
4. Handle missing values: fill numeric columns with the median, categorical columns with "unknown"
5. Output a cleaning report: how many rows were processed and what was done to each column

Save the cleaned result as cleaned_customers.csv.
Data cleaning looks simple but is easy to get wrong by hand. Letting the AI write a script and run it once is a hundred times faster than editing Excel column by column, and far less error-prone.
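To give a rough idea of what the generated cleaning script might look like, here is a sketch on a tiny made-up sample; the column names (signup_date, phone, age, segment) are invented for illustration:

```python
import pandas as pd

# Small stand-in for raw_customers.csv (all column names are hypothetical).
raw = pd.DataFrame({
    "signup_date": ["2025/01/15", "01-15-2025", "2025/01/15"],
    "phone": ["+86 138-0013-8000", "13800138000", "+86 138-0013-8000"],
    "age": [30, None, 30],
    "segment": ["vip", None, "vip"],
})

# 1. Drop fully duplicated rows
clean = raw.drop_duplicates().copy()

# 2. Normalize mixed date formats to YYYY-MM-DD (parsed per value)
clean["signup_date"] = clean["signup_date"].apply(
    lambda s: pd.to_datetime(s).strftime("%Y-%m-%d"))

# 3. Keep the last 11 digits of each phone (strips +86, spaces, dashes)
clean["phone"] = clean["phone"].str.replace(r"\D", "", regex=True).str[-11:]

# 4. Fill missing values: median for numeric, "unknown" for categorical
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["segment"] = clean["segment"].fillna("unknown")

print(clean)
```

A real script would also count rows before and after each step to produce the cleaning report the prompt asks for.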
Multi-file merge: joining 5 CSVs into one wide table (advanced)
The ~/data/ directory contains 5 CSV files:
- users.csv (user ID, name, signup time, region)
- orders.csv (order ID, user ID, product ID, amount, order time)
- products.csv (product ID, category, brand, unit price)
- reviews.csv (user ID, product ID, rating, review time)
- returns.csv (order ID, return reason, return time)

Please:
1. Join these 5 tables on user ID and product ID into one wide table
2. Handle one-to-many relationships (one user, many orders)
3. Add derived fields: total spend per user, purchase count, average rating, return rate
4. Export as merged_wide_table.csv
5. Output a data-quality report: join match rates and counts of unmatched records
Multi-table merges are a basic data-analysis skill, but it's easy to trip over the JOIN type in code. The AI picks the right method from the table structures, and even warns you that one-to-many relationships can inflate the row count.
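The core of such a merge, sketched with just two of the five tables and invented column names; pandas' indicator=True flag is one way to compute the match rate for the quality report:

```python
import pandas as pd

# Minimal stand-ins for users.csv and orders.csv (columns are hypothetical).
users = pd.DataFrame({"user_id": [1, 2, 3],
                      "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "user_id": [1, 1, 4],
                       "amount": [50.0, 70.0, 20.0]})

# Left-join orders onto users: one-to-many, so user 1 appears twice.
# indicator=True records match status for the data-quality report.
wide = users.merge(orders, on="user_id", how="left", indicator=True)

# Derived fields: total spend and purchase count per user
stats = (orders.groupby("user_id")["amount"]
         .agg(total_spend="sum", purchases="count")
         .reset_index())
wide = wide.merge(stats, on="user_id", how="left")

# Data-quality report: match rate and orphan orders
match_rate = (wide["_merge"] == "both").mean()
orphan_orders = orders[~orders["user_id"].isin(users["user_id"])]
print(f"match rate: {match_rate:.0%}, orphan orders: {len(orphan_orders)}")
```

Note how user 1's two orders double that user's rows in the wide table: exactly the one-to-many inflation the warning is about.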

Configuration tips for big-data jobs

Before processing large files, tweaking a few settings makes everything run more smoothly.

OpenClaw big-data config (.openclaw.yml)
# Recommended settings for big-data jobs
sandbox:
  memory_limit: 8GB          # Large CSVs need more memory
  timeout: 600               # Complex jobs can run for several minutes
  allowed_paths:
    - ~/data/                 # Data directory, read/write allowed
    - ~/output/               # Output directory

python:
  packages:                   # Pre-install common data libraries
    - pandas>=2.0
    - polars                  # A pandas alternative, up to 10x faster
    - openpyxl                # Read/write Excel files
    - pyarrow                 # Parquet format support

model: claude-opus-4         # Opus recommended for data work; higher code quality

CSV processing: OpenClaw vs ChatGPT Code Interpreter

Both can run Python, but the differences are significant.

OpenClaw
  • Runs locally with no file-size limit; a 10GB CSV is fine
  • Data is never uploaded, so privacy stays intact
  • Can read local databases and reach internal network resources directly
  • Results are saved locally and survive the end of the session
  • Install whatever Python libraries you want, no restrictions
VS
ChatGPT Code Interpreter
  • Uploads capped at roughly 500MB, so large datasets won't fit
  • Data has to be sent to OpenAI's servers, a non-starter for company data
  • Sandboxed environment; many libraries can't be installed
  • Files disappear when the session ends, so download them quickly
  • Uploads crawl on slow networks; a poor experience overall

A real-world scenario

E-commerce operations: the annual data review
At year end you need a full-year data review. Twelve months of sales data are scattered across a dozen-plus CSVs, 5+ million rows in total, and the boss wants the report by tomorrow.
The OpenClaw approach
One prompt does it all: merge the 12 months of data, summarize by product, region, and month, generate trend charts and comparison tables, and output a complete analysis report. From start to results in under 20 minutes. The data stays local the whole way, so sensitive financials can't leak.
The manual approach
Open the files one by one in Excel, which lags to death before you even start. Stitch them together with VLOOKUP, then debug the formulas you got wrong. The merge alone eats two days, and the analysis hasn't even begun.

A few practical tips

💡 For very large CSVs (several GB and up), one sentence in the prompt, "use polars instead of pandas", can make things 5-10x faster. polars also uses less memory.
🎯 Not sure what the data looks like? First ask the AI to "read the first 20 rows and give me an overview". Once you can see the column names, data types, and missing values, your processing prompt succeeds on the first try far more often.
⚠️ When a CSV contains Chinese text, remember to state the encoding in the prompt (UTF-8 / GBK). Otherwise the read may come back garbled and cost you an extra round of chat.
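Two of these tips (streaming a file too big to load at once, and passing the encoding explicitly) also work in plain pandas. A minimal sketch where an in-memory buffer stands in for a real multi-gigabyte file:

```python
import io
import pandas as pd

# Stand-in for a huge CSV on disk; real code would pass a file path.
csv_bytes = io.BytesIO(
    "region,amount\nNorth,100\nSouth,200\nNorth,50\n".encode("utf-8"))

totals = None
# chunksize streams the file instead of loading it all at once;
# encoding matters for non-UTF-8 files (e.g. GBK-encoded Chinese CSVs).
for chunk in pd.read_csv(csv_bytes, chunksize=2, encoding="utf-8"):
    part = chunk.groupby("region")["amount"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals)
```

Each chunk is aggregated independently and the partial sums are combined, so peak memory stays at one chunk regardless of file size.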