Tuesday, 17 October 2023

Dynamically create a fake dataset based on the subset of another (real) dataset

I have several datasets, and for each one I'd like to create a fake dataset that is roughly representative of it. This has to be done dynamically, based only on the type of each column (numeric or object).

Here's an example:

import pandas as pd
import random

# Create a dictionary with columns as lists
data = {
    'ObjectColumn1': [f'Object1_{i}' for i in range(1, 11)],
    'ObjectColumn2': [f'Object2_{i}' for i in range(1, 11)],
    'ObjectColumn3': [f'Object3_{i}' for i in range(1, 11)],
    'NumericColumn1': [random.randint(1, 100) for _ in range(10)],
    'NumericColumn2': [random.uniform(1.0, 10.0) for _ in range(10)],
    'NumericColumn3': [random.randint(1000, 2000) for _ in range(10)],
    'NumericColumn4': [random.uniform(10.0, 20.0) for _ in range(10)]
}

# Create the DataFrame
df = pd.DataFrame(data)


Let's say the above dataset has m (=3) object columns, n (=4) numeric columns, and x (=10) rows. I'd like to create a fake dataset of N (=10,000) rows such that:

  1. ObjectColumn1, ObjectColumn2, ..., ObjectColumn_m in the fake dataset are random selections of the entries in ObjectColumn1, ObjectColumn2, ..., ObjectColumn_m of the original data, respectively.
  2. ExtraObjectColumn in the fake dataset is an added fake column whose values are randomly selected from a given list (e.g. list = [ran1, ran2, ran3]).
  3. Every numeric column in the fake data holds randomly generated numbers between the minimum and the median of the corresponding column in the original data. For example, NumericColumn1 in the fake data would hold random values between 3 and 71.5.
  4. I don't want the columns to be hard-coded. Imagine that m, n, x, and N are all dynamic. I need to use this on many datasets, so the function has to detect the object and numeric columns and do the above dynamically. The only column that is NOT dynamic is ExtraObjectColumn, which has to be created from a list that is passed in.
  5. Performance needs to be reasonable, since N is usually large (at least 10,000).

Here's what fake_data should look like if N = 4:

[Image: example of fake_data with N = 4]
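
One possible way to meet the five requirements above is sketched below. It is only a sketch under a few assumptions, not a definitive answer: the function name make_fake_dataset and its parameters (n_rows, extra_values, extra_name, seed) are made up for illustration; "numeric" is taken to mean whatever pandas' select_dtypes(include="number") returns, and every other column is treated as an object column; numeric values are drawn uniformly, so integer columns come back as floats. Each column is generated with a single vectorized NumPy call, which should keep large N reasonably fast.

```python
import numpy as np
import pandas as pd


def make_fake_dataset(df, n_rows, extra_values,
                      extra_name="ExtraObjectColumn", seed=None):
    """Build a fake DataFrame with n_rows rows modelled on df.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = np.random.default_rng(seed)

    # Detect numeric columns dynamically; all others are treated as "object".
    numeric_cols = set(df.select_dtypes(include="number").columns)

    fake = {}
    for col in df.columns:  # keeps the original column order
        if col in numeric_cols:
            # Uniform draw between the column's minimum and median
            # (assumption: a uniform distribution is acceptable here).
            low, high = df[col].min(), df[col].median()
            fake[col] = rng.uniform(low, high, size=n_rows)
        else:
            # Sample existing entries with replacement.
            fake[col] = rng.choice(df[col].to_numpy(), size=n_rows)

    # The one non-dynamic column, sampled from the caller's list.
    fake[extra_name] = rng.choice(np.asarray(extra_values, dtype=object),
                                  size=n_rows)
    return pd.DataFrame(fake)
```

With the df from the example, a call could look like make_fake_dataset(df, n_rows=10_000, extra_values=["ran1", "ran2", "ran3"]). Passing a seed makes the output reproducible, which can help when comparing fake datasets across runs.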



