lundi 25 juin 2018

Random selection in SQL Server

I have two sets and for each value in the first set I want to apply a number of random values from the second. The approach I have chosen uses a select from the first with a cross apply from the second. A simplified MWE is as follows:

DROP TABLE IF EXISTS #S;
CREATE TABLE #S (c CHAR(1));
INSERT INTO #S VALUES ('A'), ('B');
DROP TABLE IF EXISTS #T;
WITH idGen(id) AS (
    SELECT 1 
    UNION ALL 
    SELECT id + 1 FROM idGen WHERE id < 1000
) 
SELECT id INTO #T FROM idGen OPTION(MAXRECURSION 0);
DROP TABLE IF EXISTS #R;
SELECT c, id INTO #R FROM #S
CROSS APPLY (
    SELECT id, ROW_NUMBER() OVER (
        /* 
        -- this gives 100% overlap
        PARTITION BY c
        ORDER BY RAND(CHECKSUM(NEWID()))
        */
        -- this gives the expected ~10% overlap
        ORDER BY RAND(CHECKSUM(NEWID()) + CHECKSUM(c))
    ) AS R
    FROM #T 
) t
WHERE t.R <= 100;
SELECT COUNT(*) AS PercentOverlap -- ~10%
FROM #R rA JOIN #R rB
ON rB.id = rA.id AND rB.c = 'B'
WHERE rA.c = 'A';

While this solution works, I am wondering why changing to the (commented) partitioning method does not? Also, are there any caveats using this solution, seeing as it feels sort of dirty to add two checksums?

In the actual problem there is also a count in the first set containing the number of random values to select from the second set, which replaces the static 100 in the example above. However, using the fixed 100 made it easy to verify the expected overlap.




Aucun commentaire:

Enregistrer un commentaire