Processing large CSV files
Millions of rows? No problem: OpenClaw runs Python locally and does the heavy lifting for you.
The annoying parts of CSV processing
Excel starts struggling at 500,000 rows and crashes outright at a million. Even when the file finally opens, dragging the scroll bar freezes it for five seconds.
Cleaning data by hand is an even worse nightmare: dates come in three different formats, phone numbers may or may not include a country code, duplicate rows reappear right after you delete them, and you can never decide whether to fill missing values with 0 or drop the rows entirely. You spend a week cleaning one dataset, then discover a few columns you forgot to process.
Hand the CSV to OpenClaw and it launches a Python script locally, using pandas or polars as needed. Two million rows? Read in a few seconds.
The key point: not a single byte of your data is uploaded to any server. Company sales figures, user PII, financial reports: process them directly, with no worries about data leaving your machine.
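To get a feel for that claim, here is a rough sketch (assuming pandas and NumPy are installed) that generates two million synthetic rows and times a group-by aggregation. The column names and values are made up for illustration, and exact timings depend on your machine:

```python
# Sketch: a quick local benchmark on synthetic data (hypothetical columns).
import time
import numpy as np
import pandas as pd

n = 2_000_000
df = pd.DataFrame({
    "region": np.random.choice(["north", "south", "east", "west"], size=n),
    "amount": np.random.rand(n) * 100,
})

start = time.perf_counter()
totals = df.groupby("region")["amount"].sum()  # aggregate 2M rows in memory
elapsed = time.perf_counter() - start
print(f"{n:,} rows aggregated in {elapsed:.2f}s")
```

On a typical laptop this kind of aggregation finishes in well under a second, which is why multi-million-row files that choke Excel are routine for a local Python process.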
3 data-processing prompts you can copy and use
From summaries and analysis to cleaning and multi-table merges, grab whichever one you need.
Read ~/data/sales_2025.csv (about 2 million rows) and process it as follows:
1. Aggregate total sales by month and output the monthly trend
2. Find the top 10 products by sales, listing product name and total amount
3. Group by region and compute each region's order count and average order value
4. Export the results as summary.csv to ~/data/output/
Use pandas, and watch memory usage (specify dtypes; read in chunks if needed).
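A minimal sketch of what the memory hints in that prompt translate to, using a tiny inline sample instead of the real file; the column names and dtypes are assumptions:

```python
# Sketch: memory-conscious monthly aggregation with pandas.
# File contents, column names, and dtypes are hypothetical.
import io
import pandas as pd

csv_data = io.StringIO(
    "date,product,region,amount\n"
    "2025-01-05,Widget,north,120.0\n"
    "2025-01-20,Gadget,south,80.5\n"
    "2025-02-02,Widget,north,200.0\n"
)

# category and float32 dtypes shrink memory use on wide, repetitive columns
dtypes = {"product": "category", "region": "category", "amount": "float32"}
monthly_totals = None

# chunksize streams the file instead of loading it all at once;
# partial groupby results are combined across chunks.
for chunk in pd.read_csv(csv_data, dtype=dtypes, parse_dates=["date"], chunksize=2):
    part = chunk.groupby(chunk["date"].dt.to_period("M"))["amount"].sum()
    monthly_totals = part if monthly_totals is None else monthly_totals.add(part, fill_value=0)

print(monthly_totals)
```

For a real 2-million-row file you would pass the path instead of a StringIO and pick a chunksize in the hundreds of thousands; the combine-partial-results pattern stays the same.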
Clean the file ~/data/raw_customers.csv:
1. Remove fully duplicated rows
2. Normalize the date column to YYYY-MM-DD (the raw data mixes formats such as 2025/01/15, 01-15-2025, and 2025 Jan 15)
3. Normalize phone numbers to 11 digits (strip country codes, spaces, and dashes)
4. Handle missing values: fill numeric columns with the median and categorical columns with "unknown"
5. Output a cleaning report: how many rows were processed and what was done to each column
Save the result as cleaned_customers.csv.
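The cleaning steps above might look roughly like this in pandas. The sample rows, column names, and the digit-stripping rule are illustrative assumptions, and `format="mixed"` requires pandas >= 2.0 (which the config below pins):

```python
# Sketch of the four cleaning steps on hypothetical sample data.
import io
import pandas as pd

raw = pd.read_csv(io.StringIO(
    "name,signup_date,phone,age\n"
    "Ann,2025/01/15,+86 138-0013-8000,34\n"
    "Ann,2025/01/15,+86 138-0013-8000,34\n"
    "Bob,01-15-2025,13900139000,\n"
    ",2025-02-01,139 0013 9001,28\n"
))

before = len(raw)
df = raw.drop_duplicates()  # 1. drop exact duplicate rows

# 2. parse mixed date formats, then render uniformly as YYYY-MM-DD
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.strftime("%Y-%m-%d")

# 3. keep digits only, then keep the last 11 (drops a leading country code)
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True).str[-11:]

# 4. numeric columns get the median, text columns get "unknown"
df["age"] = df["age"].fillna(df["age"].median())
df["name"] = df["name"].fillna("unknown")

print(f"removed {before - len(df)} duplicate rows, {len(df)} rows remain")
```

A real cleaning script would loop over columns by dtype rather than naming them one by one, and would log per-column counts for the cleaning report the prompt asks for.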
The ~/data/ directory contains 5 CSV files:
- users.csv (user ID, name, registration time, region)
- orders.csv (order ID, user ID, product ID, amount, order time)
- products.csv (product ID, category, brand, unit price)
- reviews.csv (user ID, product ID, rating, review time)
- returns.csv (order ID, return reason, return time)
Please:
1. Join the 5 tables on user ID and product ID into a single wide table
2. Handle one-to-many relationships (one user, many orders)
3. Add derived fields: total spend per user, purchase count, average rating, return rate
4. Export the result as merged_wide_table.csv
5. Output a data-quality report: join match rates and counts of unmatched records
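A sketch of the join-and-derive logic, with tiny inline tables standing in for the real CSVs; every column name here is an assumption, and pandas' `merge(..., indicator=True)` is one way to measure the match rate the report asks for:

```python
# Sketch: multi-table join plus derived per-user fields (hypothetical data).
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "region": ["north", "south"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 1, 2],
                       "product_id": ["a", "b", "a"], "amount": [50.0, 30.0, 20.0]})
products = pd.DataFrame({"product_id": ["a", "b"], "category": ["tools", "toys"]})
returns = pd.DataFrame({"order_id": [11], "reason": ["damaged"]})

# left-joins keep unmatched orders; indicator adds a _merge column
# showing whether each order found a matching return
wide = (orders.merge(users, on="user_id", how="left")
              .merge(products, on="product_id", how="left")
              .merge(returns, on="order_id", how="left", indicator=True))

# derived per-user fields; the one-to-many user->orders relation
# collapses here via groupby
per_user = wide.groupby("user_id").agg(
    total_spend=("amount", "sum"),
    order_count=("order_id", "nunique"),
    return_rate=("_merge", lambda s: (s == "both").mean()),
)
print(per_user)
```

Chaining left-joins from the orders table keeps one row per order (the natural grain for this data); deriving user-level fields afterwards avoids the row-explosion you would get by joining everything onto users first.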
Recommended configuration for large datasets
Before processing big files, tweak these settings for a smoother run.
# Recommended config for processing large datasets
sandbox:
  memory_limit: 8GB   # large CSVs need more memory
  timeout: 600        # complex jobs can run for several minutes
  allowed_paths:
    - ~/data/         # read/write access to the data directory
    - ~/output/       # output directory
python:
  packages:           # pre-install common data-processing libraries
    - pandas>=2.0
    - polars          # pandas alternative, up to 10x faster
    - openpyxl        # read/write Excel files
    - pyarrow         # Parquet format support
model: claude-opus-4  # Opus recommended for data work: higher code quality
CSV processing: OpenClaw vs. ChatGPT Code Interpreter
Both can run Python, but the differences are significant.
OpenClaw:
- Runs locally with no file-size limit; a 10 GB CSV is fine
- Data never leaves your machine, so privacy is preserved
- Can read local databases directly and reach internal-network resources
- Results are saved locally and survive the end of the session
- Install whatever Python libraries you want, with no restrictions
ChatGPT Code Interpreter:
- Uploads are capped at roughly 500 MB, so large datasets are out
- Data must be sent to OpenAI's servers; many companies won't allow that
- The sandboxed environment blocks many library installs
- Files vanish when the session ends, so you have to download them quickly
- On a slow connection, uploads take forever