This Python script generates a dataset of fake financial transactions for audit training and testing. The data mixes normal transactions with several types of irregularities, ranging from simple anomalies to sophisticated patterns that require advanced analytical techniques to detect.
It creates two CSV files:
- A file containing a mix of normal transactions and various types of irregularities.
- A file listing all the irregularities applied, including their types and descriptions.
- Generates a configurable number of transactions over a specified date range
- Includes both recurring and random transactions
- Implements Benford's Law for realistic amount distribution
- Introduces a variety of configurable irregularities and anomalies
- Provides a separate log of all applied irregularities for easy reference
- Flexible and modular design
- Python 3.6 or higher
- No additional libraries required (uses only Python standard library)
- Clone this repository or download the generate_transactions.py script.
- Ensure you have Python 3.6 or higher installed on your system.
- Create a configuration file (e.g., config.json) with your desired settings.
- Run the script using the following command:
python generate_transactions.py -c config.json -o transactions.csv -a irregularities.csv
- -c or --config: Path to the configuration file (default: 'config.json')
- -o or --output: Name of the output CSV file for transactions (default: 'fake_transactions.csv')
- -a or --anomalies: Name of the output CSV file for irregularities (default: 'irregularities.csv')
The script will generate transactions based on your configuration and save them to the specified output files.
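As an illustration of the command-line interface described above, here is a minimal sketch of how the three options and their defaults could be declared with argparse. The flag names and defaults come from this README; the function name build_parser and the help strings are assumptions, and the script's own parsing code may differ.

```python
import argparse


def build_parser():
    """Declare the -c, -o and -a options with the defaults listed in this README."""
    parser = argparse.ArgumentParser(
        description="Generate fake financial transactions for audit training."
    )
    parser.add_argument("-c", "--config", default="config.json",
                        help="Path to the configuration file")
    parser.add_argument("-o", "--output", default="fake_transactions.csv",
                        help="Output CSV file for transactions")
    parser.add_argument("-a", "--anomalies", default="irregularities.csv",
                        help="Output CSV file for irregularities")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["-o", "transactions.csv"])
print(args.config, args.output, args.anomalies)
```

Any option left off the command line falls back to its default, so running the script with no arguments reads config.json and writes fake_transactions.csv and irregularities.csv.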
- Transactions CSV file:
  - Columns: Transaction ID, Date, Type, Amount, Account, Description, Vendor
- Irregularities CSV file:
  - Columns: Transaction ID, Irregularity Type, Description
The irregularities file provides a detailed log of all anomalies introduced into the dataset, making it easier to validate detection algorithms or train auditors.
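To use the irregularities file as an answer key, the two outputs can be joined on Transaction ID. The sketch below uses only the column names listed above; the sample rows, the function name label_transactions, and the exact strings in the Irregularity Type column are made up for illustration. Rows produced by the Missing ID irregularity have no ID to join on, so they would need separate handling.

```python
import csv
import io
from collections import defaultdict


def label_transactions(transactions_file, irregularities_file):
    """Pair every transaction row with the irregularity types logged for its ID."""
    flagged = defaultdict(list)
    for row in csv.DictReader(irregularities_file):
        flagged[row["Transaction ID"]].append(row["Irregularity Type"])
    return [
        (row, flagged.get(row["Transaction ID"], []))
        for row in csv.DictReader(transactions_file)
    ]


# Made-up sample data standing in for the two generated files.
transactions_csv = io.StringIO(
    "Transaction ID,Date,Type,Amount,Account,Description,Vendor\n"
    "T001,2023-03-15,Debit,500.00,ACC-1001,Monthly Utility Bill,City Power & Utilities\n"
    "T002,2023-03-16,Debit,98750.00,ACC-1001,Office Supplies,ABC Office Supplies\n"
)
irregularities_csv = io.StringIO(
    "Transaction ID,Irregularity Type,Description\n"
    "T002,high_amount,Unusually large transaction amount\n"
)

labelled = label_transactions(transactions_csv, irregularities_csv)
for row, types in labelled:
    print(row["Transaction ID"], row["Amount"], types or "clean")
```

With real output, replace the io.StringIO objects with files opened via open(path, newline="").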
The config.json file allows you to customize various aspects of the generated data. Here's an example structure:
{
  "num_transactions": 1000,
  "start_date": "2023-01-01",
  "end_date": "2023-12-31",
  "irregularities": {
    "high_amount": {"count": 10},
    "frequency_change": {"count": 10},
    "double_spend": {"count": 10},
    "missing_id": {"count": 2},
    "incorrect_date": {"count": 10},
    "mismatched_description": {"count": 10},
    "wrong_account": {"count": 10},
    "personal_expense": {"count": 10},
    "benford_violation": {"count": 10},
    "subtle_skimming": {"count": 30},
    "seasonal_anomaly": {"count": 10},
    "round_number_bias": {"count": 10},
    "cumulative_irregularity": {
      "enabled": true,
      "count": 36,
      "threshold": 0.005
    }
  },
  "vendors": [
    "ABC Office Supplies",
    "XYZ Tech Solutions",
    "123 Cleaning Services"
  ],
  "personal_vendors": [
    "Luxury Resort & Spa",
    "Designer Clothing Co.",
    "Gourmet Restaurant"
  ],
  "personal_expense_descriptions": [
    "Team Building Retreat",
    "Client Entertainment",
    "Office Decor"
  ],
  "recurring_transactions": [
    {
      "vendor": "City Power & Utilities",
      "amount": 500,
      "day": 15,
      "description": "Monthly Utility Bill"
    },
    {
      "vendor": "Prime Office Rentals",
      "amount": 2000,
      "day": 1,
      "description": "Office Rent"
    }
  ]
}
In this configuration:
- total sets the total number of irregularities to generate
- Each irregularity type has a count field to specify how many of that type to generate
- If the sum of individual counts is less than total, additional random irregularities will be added to reach the total
- cumulative_irregularity has its own configuration with enabled, threshold, and probability fields
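The count bookkeeping described above can be sketched as follows. This is an illustration under assumptions, not the script's actual logic: the example configuration does not show where total lives, so the sketch assumes it sits inside the irregularities object, and the helper name planned_irregularities is hypothetical.

```python
import json


def planned_irregularities(config):
    """Map each enabled irregularity type to its configured count."""
    plan = {}
    for name, settings in config.get("irregularities", {}).items():
        if not isinstance(settings, dict):
            continue  # skip scalar entries such as the assumed "total" key
        if settings.get("enabled", True):
            plan[name] = settings.get("count", 0)
    return plan


# A trimmed-down configuration; the placement of "total" is an assumption.
config = json.loads("""
{
  "irregularities": {
    "total": 20,
    "high_amount": {"count": 10},
    "missing_id": {"count": 2},
    "cumulative_irregularity": {"enabled": false, "count": 36, "threshold": 0.005}
  }
}
""")

plan = planned_irregularities(config)
explicit = sum(plan.values())
total = config["irregularities"].get("total", explicit)
shortfall = max(0, total - explicit)  # how many random extras would be needed
print(plan, explicit, shortfall)
```

Here the explicit counts add up to 12, so 8 further irregularities would have to be chosen at random to reach a total of 20.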
- Recurring transactions (e.g., rent, utilities)
- Random transactions with amounts following Benford's Law
- High Amount: Unusually large transaction amounts
- Frequency Change: Altered dates for recurring transactions
- Double Spend: Duplicated transactions with slight time differences
- Missing ID: Transactions with no ID
- Incorrect Date: Future-dated transactions
- Mismatched Description: Descriptions not matching transaction types
- Wrong Account: Incorrect account number formats
- Personal Expense: Personal expenses disguised as business expenses
- Benford's Law Violation: Amounts not following Benford's Law distribution
- Subtle Skimming: Slight reductions across multiple transactions
- Seasonal Anomaly: Out-of-season transactions
- Round Number Bias: Suspiciously round transaction amounts
- Cumulative Irregularity: Small discrepancies spread across multiple transactions
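Benford's Law says that the leading digit d of many naturally occurring amounts appears with probability log10(1 + 1/d), so about 30% of amounts start with 1 and under 5% start with 9. This README does not describe how the script samples its amounts; one common approach, shown here as a sketch, is to draw the exponent uniformly so that the amounts are log-uniform. The function names and the 10 to 10,000 range are assumptions.

```python
import math
import random
from collections import Counter


def benford_amount(rng, min_exp=1, max_exp=4):
    """Draw a log-uniform amount between 10**min_exp and 10**max_exp."""
    return round(10 ** rng.uniform(min_exp, max_exp), 2)


def leading_digit(amount):
    """Return the first non-zero digit of a positive amount."""
    return int(next(ch for ch in f"{amount:.2f}" if ch in "123456789"))


rng = random.Random(42)  # fixed seed so the run is repeatable
amounts = [benford_amount(rng) for _ in range(10_000)]
observed = Counter(leading_digit(a) for a in amounts)

# Compare the observed share of each leading digit with Benford's prediction.
for digit in range(1, 10):
    expected = math.log10(1 + 1 / digit)
    share = observed[digit] / len(amounts)
    print(f"{digit}: observed {share:.3f}, expected {expected:.3f}")
```

The same leading-digit tally is a simple first check for the Benford's Law Violation and Round Number Bias irregularities: a batch whose digit shares drift far from the expected column deserves a closer look.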
You can easily customize the script by:
- Modifying the configuration file to change transaction parameters, vendors, etc.
- Adding new irregularity types by defining new functions and adding them to the irregularity_functions dictionary in apply_irregularities()
- Adjusting the logic in existing irregularity functions to match specific scenarios
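As a rough sketch of the extension pattern above, the example below registers a made-up weekend_posting irregularity. Only the names irregularity_functions and apply_irregularities come from this README; the function signatures, the dict-based transaction format, the ISO date format, and the log layout are assumptions, so adapt them to the actual code in generate_transactions.py.

```python
import random
from datetime import datetime, timedelta


def weekend_posting(transaction):
    """Hypothetical irregularity: move the date to the next Saturday."""
    posted = datetime.strptime(transaction["Date"], "%Y-%m-%d").date()
    days_ahead = (5 - posted.weekday()) % 7 or 7  # Saturday is weekday 5
    moved = posted + timedelta(days=days_ahead)
    changed = dict(transaction, Date=moved.isoformat())
    return changed, f"Date moved from {posted} to Saturday {moved}"


def apply_irregularities(transactions, counts, rng):
    """Apply each requested irregularity and return a log of what was changed."""
    # New irregularity types are registered here, keyed by their config name.
    irregularity_functions = {"weekend_posting": weekend_posting}
    log = []
    for name, count in counts.items():
        func = irregularity_functions[name]
        chosen = rng.sample(range(len(transactions)), min(count, len(transactions)))
        for index in chosen:
            transactions[index], description = func(transactions[index])
            log.append((transactions[index]["Transaction ID"], name, description))
    return log


transactions = [{"Transaction ID": "T001", "Date": "2023-03-15", "Amount": 500.0}]
log = apply_irregularities(transactions, {"weekend_posting": 1}, random.Random(0))
print(transactions[0]["Date"])
print(log)
```

Returning a description alongside the modified transaction keeps the irregularities log in step with the data, which is what makes the log usable as an answer key.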
This dataset is designed for:
- Training auditors and financial analysts
- Testing fraud detection algorithms
- Practicing data analysis and visualization techniques
- Developing and validating audit procedures
The transactions and irregularities in this dataset are entirely fictional and generated for educational and testing purposes only. They do not represent real financial data or actual fraudulent activities.
Feel free to fork this project and submit pull requests with improvements or additional features. Some ideas for enhancements:
- Add more types of irregularities
- Implement additional configuration options
- Create visualizations of the generated data
- Develop companion analysis tools
This project is open-source and available under the MIT License.