Weighted List Generation

Jeffmn Posts: 2
Can someone shed some light on how the weighting works in the weighted list generator?

I wrote a custom weighted list generator that imports a set of values from a .csv file and builds the generator XML. That .csv file has several thousand items that will ultimately be populated into a database containing many millions of rows.

My question is: how is the weighting computed? On one hand, the documentation says I need to express each item on the list as a percentage, with the values totalling 100%, but elsewhere the documentation says those values are expressed as ratios.

I tried expressing each item as a percentage, but Data Generator sets the minimum value at 1, so that's out of the question: with 1000 items in my list, a weight of at least 1 each already adds up to far more than 100.

I then changed all the values to larger numbers ranging from 1 up to 20,000 or so. How does this translate when my 1000-item list is used to populate a 10 million row table? I'm trying to understand what the individual weight values mean in the 10 million row tables I'm populating.

Comments

  • Hi there,

    A while ago we had a similar query about the weighted list generator where even at a simple level it behaved unexpectedly - for instance if you tried to generate 10 rows of values x, y and z on a 20, 20, 60 basis, you'd expect to get 2 x, 2 y, and 6 z. But it would often not produce this.

    I queried it with the developers and apparently it's working as designed, in their words: "The values are generated at random using the weightings. Not generated in the weighted ratio then randomized."
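
    The distinction the developers describe can be illustrated with a small simulation. This is only a sketch in plain Python (3.6 or later, for `random.choices`), not Data Generator's actual implementation; `random.choices` stands in for whatever the tool does internally, and the function names are made up for illustration.

```python
import random
from collections import Counter

VALUES = ["x", "y", "z"]
WEIGHTS = [20, 20, 60]

def draw_at_random(n_rows, seed):
    """Pick each row independently, using the weights as relative odds."""
    rng = random.Random(seed)
    return Counter(rng.choices(VALUES, weights=WEIGHTS, k=n_rows))

def strict_ratio(n_rows):
    """Build rows in the exact weighted ratio (rounding down per value)."""
    total = sum(WEIGHTS)
    rows = []
    for value, weight in zip(VALUES, WEIGHTS):
        rows.extend([value] * (n_rows * weight // total))
    return Counter(rows)

if __name__ == "__main__":
    # Independent draws: counts differ from seed to seed.
    for seed in range(5):
        print(seed, dict(draw_at_random(10, seed)))
    # Strict ratio: always 2 x, 2 y, 6 z for 10 rows.
    print("strict", dict(strict_ratio(10)))
```

    With only 10 rows, the random draws often miss the 2/2/6 split; over millions of rows the proportions settle close to the weights, but individual counts are still not exact.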

    As for how it works, it seems both ratios and percentages should be feasible, as the popup help states:
    For example, if you enter 2 for value Yes and 1 for value No, Yes will occur twice as many times as No in the selected column.
    To specify as percentages, ensure all the weight ratios add up to 100.
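
    If the weights are relative, as the popup help suggests, the expected share of rows for one item is its weight divided by the sum of all weights. A minimal sketch of that arithmetic follows; the item names and weights are invented for illustration, and the function `expected_rows` is not part of Data Generator.

```python
def expected_rows(weights, n_rows):
    """Expected number of rows per item when each row is drawn independently
    with probability weight / sum(weights)."""
    total = sum(weights.values())
    return {item: n_rows * w / total for item, w in weights.items()}

# Hypothetical weights, not taken from the original CSV file.
weights = {"rare": 1, "typical": 4999, "common": 20000}
counts = expected_rows(weights, 10_000_000)
for item, count in counts.items():
    print(item, round(count))
```

    Here a weight of 1 against a total of 25,000 works out to about 400 rows in 10 million, and a weight of 20,000 to about 8 million. Because rows are drawn at random, the actual counts will vary around those figures.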

    The new version of Data Generator has an option to use a Python Script as a generator, and they were kind enough to produce a sample that would lead to a more predictable result, which I've pasted below. Hopefully it's of some use although I see you're actually working with a CSV file of values, so I'm not sure how easily you'll be able to convert it across.
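    # Python script to generate strings in a strict ratio
    __randomize__ = True

    # (value, weight) pairs: each value is emitted `weight` times per cycle
    weightedStrings = (('xxx', 2), ('yyy', 2), ('zzz', 6))

    def main(config):
        n_rows = config["n_rows"]
        return list(next_string(n_rows))

    def next_string(n_rows):
        # Cycle through the weighted values, stopping after exactly n_rows
        produced = 0
        while produced < n_rows:
            for string, weight in weightedStrings:
                for _ in range(weight):
                    if produced >= n_rows:
                        return
                    yield string
                    produced += 1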
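
    For a list that lives in a CSV file, the same idea might be adapted along these lines. This is a sketch in standard Python using the `csv` module, not a tested Data Generator script: the two-column `value,weight` layout and the names `load_weights` and `strict_ratio_rows` are assumptions, and whether the generator's Python environment can read files this way would need checking.

```python
import csv
import io

def load_weights(csv_file):
    """Read (value, weight) pairs from a two-column CSV with no header row."""
    return [(row[0], int(row[1])) for row in csv.reader(csv_file) if row]

def strict_ratio_rows(pairs, n_rows):
    """Repeat each value `weight` times per cycle, stopping at exactly n_rows."""
    if sum(weight for _, weight in pairs) <= 0:
        raise ValueError("at least one weight must be positive")
    rows = []
    while len(rows) < n_rows:
        for value, weight in pairs:
            rows.extend([value] * weight)
            if len(rows) >= n_rows:
                break
    return rows[:n_rows]

# In-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO("xxx,2\nyyy,2\nzzz,6\n")
pairs = load_weights(sample)
rows = strict_ratio_rows(pairs, 20)
print(rows.count("xxx"), rows.count("yyy"), rows.count("zzz"))  # prints: 4 4 12
```

    The ratio is exact only when the row count is a multiple of the total weight; otherwise the last, partial cycle favours the values listed first. With weights as large as 20,000 a single cycle is long, so dividing the weights by a common factor first may be worthwhile.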
    
    Systems Software Engineer

    Redgate Software
