Synthetic Data Generation#
ProRCA includes a dedicated module, data_generators, designed to create realistic datasets for testing, demonstration, and validation purposes. While you are encouraged to use ProRCA with your own operational data, the synthetic data generator provides a valuable tool, especially when you need a controlled environment with known ground truth.
Why Use Synthetic Data?#
Algorithm Validation: Test ProRCA’s anomaly detection and root cause analysis capabilities against datasets where the anomalies and their causes are known beforehand.
Demonstration & Learning: Provide clear examples and tutorials without requiring sensitive real-world data.
Benchmarking: Create standardized datasets to compare different configurations or versions of analysis techniques.
Filling Gaps: Use synthetic data when real data is scarce, incomplete, or unavailable.
Realism and Quality#
The data generation process within ProRCA has been meticulously crafted to ensure a high degree of realism:
Expert Validation: The underlying model and parameters, particularly for the initial retail generator, have been developed and validated in consultation with industry experts to reflect common operational dynamics.
Causal Structure: The data is generated based on plausible causal relationships between different business metrics (e.g., price affects sales, quantity affects costs), providing a solid foundation for causal analysis.
Realistic Distributions: Variables are often generated using distributions (like Log-Normal or Poisson) that mimic real-world patterns, rather than simple uniform randomness.
Configurability: Key parameters like date ranges, product details, and pricing strategies can be adjusted.
Current Capabilities: Retail Data#
The first release of ProRCA includes a generator specifically focused on fashion retail transactions.
Generate Retail Data#
Function:
generate_fashion_data_with_brandPurpose: Creates a realistic DataFrame mimicking retail sales, costs, and profit calculations over time, simulating aspects like different brands, product hierarchies, sales channels, promotions, and customer loyalty.
Module:
data_generators.synthetic_sales_data- Arguments:
start_date(str): The beginning date for the data generation period (e.g., “YYYY-MM-DD”).end_date(str): The end date for the data generation period (e.g., “YYYY-MM-DD”).
Returns: A
pandas.DataFramecontaining the synthetic transaction data. Columns include operational metrics likeORDERDATE,QUANTITYORDERED,PRICEEACH,UNIT_COST,SALES,DISCOUNT,NET_SALES,FULFILLMENT_COST,MARKETING_COST,RETURN_COST,COST_OF_GOODS_SOLD,SHIPPING_REVENUE,PROFIT,PROFIT_MARGIN, as well as contextual dimensions likeBRAND,MERCHANDISE_HIERARCHY,SALES_CHANNEL, etc.
from data_generators.synthetic_sales_data import generate_fashion_data_with_brand
# Generate synthetic retail data for the year 2023
df_synthetic = generate_fashion_data_with_brand(start_date="2023-01-01", end_date="2023-12-31")
print("Generated DataFrame sample:")
print(df_synthetic.head())
Injecting Controlled Anomalies#
A key feature for testing is the ability to inject specific, known anomalies into the dataset. This allows you to verify if ProRCA correctly identifies the root causes you intentionally introduced.
Function:
inject_anomalies_by_datePurpose: Modifies specific metrics on given dates according to a predefined schedule. Crucially, after modifying the source metric (e.g.,
DISCOUNTorUNIT_COST), it recalculates all causally dependent downstream metrics (likeNET_SALES,PROFIT,PROFIT_MARGIN) based on the system’s defined logic. This ensures that the effect of the injected anomaly propagates realistically through the causal chain, providing a valid test case for root cause analysis.Module:
data_generators.synthetic_sales_data- Arguments:
df(pd.DataFrame): The input DataFrame (can be synthetic or your own data if you want to simulate ‘what-if’ scenarios).anomaly_schedule(dict): A dictionary defining the anomalies to inject.Keys: Date strings (‘YYYY-MM-DD’) specifying when the anomaly occurs.
- Values: A tuple containing four elements:
anomaly_type(str): Defines which metric is directly impacted (e.g., ‘ExcessiveDiscount’, ‘COGSOverstatement’, ‘FulfillmentSpike’, ‘ReturnSurge’).severity(float): A factor controlling the magnitude and direction of the change. The interpretation depends on the anomaly_type. For example, a severity of 0.5 for ‘ExcessiveDiscount’ means the discount is increased by 50% of sales; a severity of 3.0 for ‘FulfillmentSpike’ means the cost triples (original * (1+3)).root_cause(str): A user-defined label for the simulated underlying reason (e.g., ‘PricingError’, ‘SupplierIssue’). This is added to the output DataFrame for validation purposes.scope(str): Specifies the segment affected by the anomaly (e.g., a specific product category like ‘Apparel’, ‘Footwear’, or a sales channel).
Returns: A new
pandas.DataFramecontaining the original data modified by the injected anomalies. It includes additional columns:ANOMALY_TYPE,SEVERITY, andROOT_CAUSE, marking the rows where anomalies were injected.
from data_generators.synthetic_sales_data import inject_anomalies_by_date
# Assume df_synthetic is the DataFrame generated in the previous step
# Define a schedule of specific anomalies
anomaly_schedule = {
'2023-01-10': ('ExcessiveDiscount', 0.5, 'PricingError', 'Apparel'), # 50% discount rate for Apparel
'2023-06-10': ('COGSOverstatement', -0.8, 'SupplierIssue', 'Footwear'), # Simulates 80% higher unit cost impact for Footwear
'2023-09-10': ('FulfillmentSpike', 3.0, 'LogisticsIssue', 'Beauty'), # Fulfillment cost quadruples for Beauty
'2023-12-10': ('ReturnSurge', 10.0, 'QualityIssue', 'Accessories') # Return cost increases 10x for Accessories
}
# Inject these anomalies into the synthetic data
df_anomalous = inject_anomalies_by_date(df_synthetic, anomaly_schedule)
print("\nDataFrame sample after injecting anomalies:")
# Notice the new columns ANOMALY_TYPE, SEVERITY, ROOT_CAUSE might be None for non-anomalous rows/dates
print(df_anomalous[df_anomalous['ORDERDATE'] == '2023-01-10'].head())
Future Directions#
We plan to significantly expand the data_generators module in future releases. Our roadmap includes adding synthetic data generators for various other sectors, potentially including:
Manufacturing (e.g., sensor data, production line metrics)
Finance (e.g., transaction data, market indicators)
IT Operations (e.g., server logs, application performance metrics)
Supply Chain & Logistics
This will broaden the applicability of ProRCA and provide tailored testing environments for diverse use cases.