AI Prompts for Data Analysts
Generate insights, build queries, and summarize datasets without exposing proprietary business data.
"Optimize the following SQL query that takes 45 seconds to run on a table with 10 million rows: 'SELECT u.name, u.email, COUNT(o.id) as order_count, SUM(o.total) as lifetime_value FROM users u LEFT JOIN orders o ON u.id = o.user_id WHERE o.created_at >= DATE_SUB(NOW(), INTERVAL 1 YEAR) AND u.status = "active" GROUP BY u.id, u.name, u.email HAVING lifetime_value > 1000 ORDER BY lifetime_value DESC LIMIT 100;' Identify every performance issue, suggest the optimal indexes to create (with exact CREATE INDEX statements), rewrite the query for better performance, explain why each change helps, and estimate the expected improvement. Include how to verify the improvement using EXPLAIN ANALYZE."
"Design an executive KPI dashboard for the CEO of a B2B SaaS company. Include: the 8 most important metrics to display (MRR, growth rate, churn, NRR, CAC, LTV, burn rate, runway), the ideal visualization type for each metric and why (sparkline vs. big number vs. gauge vs. bar chart), how to show trends (MoM, QoQ, YoY comparisons), color coding conventions (red/yellow/green thresholds for each metric), a drill-down hierarchy (what the CEO should be able to click into), data refresh frequency, and a wireframe layout described in text showing the spatial arrangement. Include 3 common dashboard design mistakes that make executives ignore dashboards and how to avoid them."
"Analyze the following A/B test results and provide a recommendation: Control group (existing checkout page): 12,450 visitors, 498 conversions (4.0%). Variant (simplified checkout): 12,380 visitors, 534 conversions (4.31%). Calculate: the absolute and relative lift, the p-value using a two-proportion z-test (show the formula and calculation), the 95% confidence interval for the difference, whether the result is statistically significant, the estimated annual revenue impact assuming $85 average order value and 500,000 annual visitors, and a clear recommendation (ship, continue testing, or abandon). Also explain: what sample size would have been needed to detect this effect size with 80% power, and common mistakes in interpreting these results."
"Write a complete Python exploratory data analysis (EDA) script using pandas, matplotlib, and seaborn for analyzing a customer churn dataset. The dataset has columns: customer_id, signup_date, last_active_date, plan_type (free/basic/premium), monthly_spend, support_tickets, feature_usage_score, and churned (0/1). The script should: load and inspect the data (shape, dtypes, missing values, duplicates), generate summary statistics with interpretation, create 5 key visualizations (churn rate by plan, spend distribution by churn status, correlation heatmap, feature usage vs. churn, cohort analysis by signup month), identify the top 3 predictive features for churn, and output a summary of findings in markdown format. Include comments explaining each analytical decision."
"Create a template for translating technical data analysis into a stakeholder-friendly insights report. Then demonstrate it with this scenario: you analyzed 6 months of customer support ticket data and found that response time has increased 40%, ticket volume spikes every Monday, 3 product features account for 60% of all tickets, and customers who submit more than 2 tickets in their first month have 3x higher churn. Structure the report as: executive summary (3 bullet points a VP can act on), key findings with visualizations described, root cause analysis, business impact quantified in dollars, and 5 prioritized recommendations with expected impact and effort. Keep it under 2 pages. Avoid technical jargon entirely."
"Design an ETL pipeline that combines data from 3 sources into a single analytics warehouse: (1) a PostgreSQL production database with user and transaction data, (2) a Stripe API for payment and subscription data, and (3) Google Analytics 4 event data via BigQuery export. Cover: extraction strategy for each source (full vs. incremental, CDC for Postgres), transformation logic (deduplication, schema normalization, currency conversion, timezone handling), loading strategy (upsert vs. append), a star schema design for the target warehouse with fact and dimension tables, scheduling approach (Airflow DAG structure), data quality checks at each stage, alerting for pipeline failures, and a monitoring dashboard for pipeline health. Include the SQL for creating the core fact table."
"Create a step-by-step guide for building a cohort retention analysis for a SaaS product. Include: the SQL query to generate monthly signup cohorts and calculate retention rates for each subsequent month, how to create a retention triangle/heatmap visualization in Python (complete code using pandas and seaborn), how to interpret the results (what good vs. bad retention curves look like), how to identify the 'aha moment' from cohort data (the action that correlates with higher retention), benchmarks for B2B SaaS retention by month (Month 1: 80-90%, Month 3: 60-70%, Month 12: 40-50%), and 3 actionable recommendations based on common cohort patterns (early drop-off, gradual decay, cliff at month X)."
"Walk through building a customer lifetime value (LTV) prediction model using Python scikit-learn. The dataset has: customer tenure, plan type, monthly spend, number of seats, industry, support tickets filed, feature adoption score, and NPS score. Cover: data preprocessing (encoding categoricals, handling missing values, feature scaling), feature selection (correlation analysis, VIF for multicollinearity), model selection (compare Linear Regression, Random Forest, and Gradient Boosting), train/test split strategy (80/20 with time-based split to avoid leakage), evaluation metrics (MAE, RMSE, R-squared with interpretation), feature importance analysis, and how to deploy the model for the sales team to use (a simple scoring function). Include complete Python code with comments."
"Create a comprehensive data quality audit checklist to run on any new dataset before starting analysis. Cover 6 dimensions of data quality: Completeness (missing value patterns, required fields), Accuracy (outlier detection, range validation, cross-field consistency), Consistency (format standardization, duplicate detection, referential integrity), Timeliness (data freshness, lag analysis, timezone consistency), Uniqueness (primary key validation, near-duplicate detection), and Validity (business rule compliance, enum value checks). For each dimension, provide: specific checks to run, Python/SQL code snippets for automated detection, severity classification (blocker vs. warning), and a remediation recommendation. Include a data quality scorecard template that produces a single quality score (0-100)."
"Create a metrics definition document for a SaaS company that eliminates ambiguity in how key metrics are calculated. Define 10 critical metrics: MRR, ARR, Net Revenue Retention, Gross Churn, Logo Churn, CAC (blended and paid), LTV, Payback Period, DAU/MAU ratio, and Activation Rate. For each metric: provide the exact formula with SQL pseudocode, specify what's included and excluded (e.g., does MRR include annual contracts divided by 12?), list edge cases and how to handle them (refunds, upgrades mid-month, free trials), show a worked example with specific numbers, and note common miscalculations that lead to inflated or deflated numbers. Include a data lineage diagram showing which source tables feed each metric."
"Design an automated anomaly detection system for monitoring key business metrics (daily revenue, signups, API errors, page load time). Cover: choosing the right algorithm for each metric type (Z-score for normally distributed, IQR for skewed, seasonal decomposition for time series with patterns), setting dynamic thresholds that account for day-of-week and seasonal effects, implementing in Python with complete code (using statsmodels for seasonal decomposition and scipy for statistical tests), an alerting framework (what triggers an alert vs. a warning), reducing false positives (minimum consecutive anomaly periods, business calendar awareness), and a Slack notification template that includes the metric, expected range, actual value, and a link to the dashboard. Include a testing strategy using historical data."
"Create a data storytelling framework for presenting analytical findings to non-technical stakeholders. Cover: the 3-act structure for data presentations (Setup: what question we asked, Conflict: what we found that was surprising, Resolution: what we should do about it), choosing the right chart for the message (comparison, composition, distribution, relationship — with specific chart recommendations for each), the 5-second rule (can someone understand your chart in 5 seconds?), annotation best practices (what to highlight and how), color usage principles (max 3 colors, consistent meaning), how to handle uncertainty and caveats without undermining your recommendation, and a slide template for the 4 most common data presentation scenarios (trend analysis, comparison, deep-dive, recommendation). Include before/after examples of bad vs. good data slides."
Ready to try these prompts?
Start a secure chat on Okara and experience private AI with 20+ open-source models.