Friday, July 31, 2020

PySpark/Spark: correlated scalar subqueries with ORDER BY

I'm trying to join a column value from a table, expl, onto my main table, cu, under an equality condition in Spark SQL. The catch is that many rows of expl match each row of cu, so I want to join only one random matching row and use its column value.

But I'm running into correlated scalar subquery errors, either on the subquery's SELECT list or on its ORDER BY. There are no posts on SO that deal with the ORDER BY part of a subquery, or with retrieving a random row from a subquery, in Spark.

In my first attempt I used a random number generator, which does not work: it seems RAND() would somehow need to be inside an aggregate.

cooccurrences = spark.sql("""
    SELECT C.word AS word, C.word2, (
            SELECT FIRST(e.word2) word2 FROM expl e
            WHERE e.word = C.word ORDER BY RAND()
        ) word3
    FROM cu C
""")

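For comparison, the analyzer does accept the correlated subquery once it is a plain aggregate with no ORDER BY. A sketch of the form that compiles (it just no longer picks a random row, which is the whole point):

cooccurrences = spark.sql("""
    SELECT C.word AS word, C.word2, (
            SELECT FIRST(e.word2) FROM expl e
            WHERE e.word = C.word
        ) word3
    FROM cu C
""")
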
In my second attempt I built a column with a random value, which can be used to order the rows, so a random number generator isn't needed and instead just the row with the max value is returned. But I dislike this because it could return the same row for duplicate groups.
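
The rand_n column was attached to cu beforehand, roughly like this (a sketch reconstructed from the query below; cu_df is a placeholder name for the DataFrame backing the cu view):

from pyspark.sql import functions as F

# Give every row of cu a random sort key (cu_df is hypothetical)
cu_df = cu_df.withColumn("rand_n", F.rand())
cu_df.createOrReplaceTempView("cu")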

cooccurrences = spark.sql("""
    SELECT
        e.word,
        e.word2,
        (SELECT z.word2 FROM
            (SELECT
                    FIRST(C.word2) word2, MAX(C.rand_n) rand_n
                FROM cu C
                WHERE e.word = C.word
            ) z
        ) word3
    FROM expl AS e
""")

Both queries throw a variant of the error Correlated scalar subqueries must be aggregated.
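
For reference, the restriction can be sidestepped by avoiding the correlated subquery entirely: pick one random expl row per word with a window function first, then join the result in. A sketch, assuming cu and expl are registered as temp views (expl_pick is a name introduced here):

random_pick = spark.sql("""
    SELECT word, word2 AS word3
    FROM (
        SELECT e.word, e.word2,
               ROW_NUMBER() OVER (PARTITION BY e.word ORDER BY RAND()) AS rn
        FROM expl e
    ) t
    WHERE t.rn = 1
""")
random_pick.createOrReplaceTempView("expl_pick")

cooccurrences = spark.sql("""
    SELECT C.word, C.word2, p.word3
    FROM cu C
    LEFT JOIN expl_pick p ON C.word = p.word
""")

Spark permits RAND() in a window's ORDER BY, so this avoids the correlated scalar subquery restriction, at the cost of materializing one pick per word up front.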



