A step-by-step journey through data cleaning, star schema design, SQL Server, and Power BI. This is Part 1 of a four-phase sales analytics series — covering descriptive analysis: what happened?

Introduction

As a data analyst, one of the most common challenges is turning a messy raw dataset into a clear, actionable dashboard that business stakeholders can actually use. In this project, I built a complete end-to-end sales analytics pipeline — from a raw 9,994-row CSV file all the way to an interactive Power BI dashboard — using Python, SQL Server, and Power BI.

This is the first of four analysis phases:

  • Descriptive Analysis ← (this post)
  • Diagnostic Analysis
  • Predictive Analysis
  • Prescriptive Analysis

The Business Problem

The goal of descriptive analysis is to answer the fundamental question: what happened? Specifically, the business wanted to understand which products generate the most revenue, how sales performed across different regions and categories, what monthly revenue trends look like over time, and what the overall KPIs are — total revenue, profit, orders, and units sold.

The dataset used is a retail sales dataset spanning 2011 to 2014, containing over 9,900 transaction records across the United States.

The Dataset

The raw CSV contained 20 columns covering the full sales lifecycle:

Step 1: Data Cleaning with Python

The first challenge was understanding the data properly. An initial cleaning attempt reduced the dataset from 9,994 rows to just 2,650 — a red flag that something had gone wrong.

The culprit? A common mistake:

One order can contain multiple products. Deduplicating by Order ID alone deletes all but one line item per order — losing thousands of valid transactions. The correct approach removes only truly identical rows:

The final cleaning script:

Result: 9,994 clean rows, 5,009 unique orders, 1,862 unique products.

Step 2: Star Schema Design

Rather than dumping the flat CSV directly into Power BI (a common shortcut), I designed a proper star schema in SQL Server. This is industry-standard data modeling that makes dashboards faster, cleaner, and easier to maintain.

A star schema has two types of tables: a fact table that stores measurable numbers (Sales, Profit, Quantity, Discount), and dimension tables that store descriptive context (Who, What, Where, When). The fact_sales table sits at the center, connected to 5 dimension tables via foreign keys. Every metric lives in the fact table. Every label lives in a dimension.

Key lesson: Always use DECIMAL(10,2) for financial columns — never FLOAT. Float has rounding errors that corrupt money calculations.

Step 3: Loading Data with Python + pyodbc

With the schema in place, I wrote a Python script to extract each dimension table from the cleaned CSV and load them into SQL Server in the correct order — dimensions first, fact table last.

Challenges encountered and fixed:

ErrorCauseFix
DateTime overflowPandas Timestamp not accepted by SQL ServerConvert to string "%Y-%m-%d" first
String truncationProduct names exceeded VARCHAR(100)Increased to VARCHAR(255)
Wrong row countdayfirst=True broke date parsingUsed format="%Y-%m-%d" explicitly

Final verification after loading:

star schema

Step 4: Descriptive Analysis SQL Queries

Before touching Power BI, I wrote the analytical queries that would power the dashboard visuals:

Step 5: Power BI Dashboard

With the data warehouse and queries ready, I connected Power BI Desktop to SQL Server and built the dashboard. Power BI auto-detected most relationships — all 5 foreign key relationships were set to One-to-Many from dimension to fact table.

The DAX measure for Profit Margin:

Formatted as Percentage in Measure tools — gives a clean 12.47% display. Dashboard visuals built:

VisualTypePurpose
KPI CardsNew CardTotal Revenue, Profit, Orders, Units Sold, Profit Margin
Top 10 ProductsClustered Bar ChartIdentify best performing products
Sales by Region & CategoryClustered Bar ChartCompare regional performance
Monthly Revenue TrendsLine ChartIdentify seasonal patterns

A year range slicer was added to allow stakeholders to filter by year interactively.

Key Business Insights

Overall Performance (2011–2014):

  • Total Revenue: $2.30M
  • Total Profit: $286.40K
  • Profit Margin: 12.47%
  • Total Orders: 5,009
  • Total Units Sold: 38K
Top Product: Canon imageCLASS 2200 Advanced Copier at ~$60K revenue — significantly ahead of all other products.
  • Regional Performance: West leads with the highest revenue, followed by East. South is the weakest performing region across all categories.
  • Seasonal Pattern: Sales consistently spike in Q4 (November–December) every year — classic holiday/year-end demand pattern. Q1 is consistently the weakest period.
  • Category Breakdown: Technology and Office Supplies are the dominant revenue drivers across all regions.

What Made This Project Different

Most Superstore dataset projects on Kaggle follow a simple path: download CSV → load to Power BI → make charts. This project took a different approach:

  • Built a proper data pipeline instead of loading raw CSV directly
  • Designed a star schema following data warehouse best practices
  • Used SQL Server as a proper data store, not just Power BI's in-memory model
  • Followed a structured analytical framework (Descriptive → Diagnostic → Predictive → Prescriptive)
  • Backed every visual with SQL queries for transparency and reproducibility

What's Next

In the next phase — Diagnostic Analysis — I investigate the why behind these numbers:

  • Why does West consistently outperform South?
  • Why is Furniture the weakest category nationally?
  • Which products are losing the company money?
  • What is the relationship between discounting and profit margin?

Want To See the Source Code?

Click the "gi" button below to see the codes and technologies I used.