I'm trying to join a column value from a table, expl, onto my main table, cu, under an equality condition in Spark SQL. The catch is that many rows of expl match each row of cu, so I want to join only one random matching row and use its column value.
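For reference, a minimal setup reproducing my situation looks roughly like this (the toy data is made up; only the table and column names match my real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Main table: one row per co-occurrence pair.
spark.createDataFrame(
    [("cat", "dog"), ("cat", "fish"), ("dog", "bone")],
    ["word", "word2"],
).createOrReplaceTempView("cu")

# Lookup table: many candidate word2 values per word.
spark.createDataFrame(
    [("cat", "mouse"), ("cat", "bird"), ("dog", "stick"), ("dog", "ball")],
    ["word", "word2"],
).createOrReplaceTempView("expl")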
But I'm running into "Correlated scalar subqueries" errors, triggered either by the subquery's SELECT list or by its ORDER BY. I haven't found posts on SO that deal with the ORDER BY part of the subquery, or with retrieving a random row from a subquery in Spark.
In my first attempt, I used a random number generator to order the candidate rows, which does not work: it seems RAND() inside the correlated subquery would somehow need to be an aggregate.
cooccurrences = spark.sql("""
    SELECT c.word, c.word2,
           (SELECT FIRST(e.word2)
            FROM expl e
            WHERE e.word = c.word AND e.word2 IS NOT NULL
            ORDER BY RAND()) AS word3
    FROM cu c
""")
In my second attempt, I added a column of random values, rand_n, to cu, so the rows can be ordered without calling a random number generator inside the subquery; I just return the row with the maximum rand_n. But I dislike this because it could return the same row for duplicate groups.
cooccurrences = spark.sql("""
    SELECT e.word,
           e.word2,
           (SELECT z.word2
            FROM (SELECT FIRST(c.word2) AS word2, MAX(c.rand_n) AS rand_n
                  FROM cu c
                  WHERE e.word = c.word) z) AS word3
    FROM expl AS e
""")
These queries all throw a variant of "Correlated scalar subqueries must be aggregated".
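The fallback I'm leaning towards sidesteps the correlated subquery entirely: tag each expl row with a random rank per word using a window function, keep rank 1, and join that back onto cu. This runs in Spark SQL as far as I can tell, but it carries the same caveat as my rand_n attempt: every cu row with the same word receives the same pick.

# Pick one random expl row per word with a window function, then join.
cooccurrences = spark.sql("""
    SELECT c.word, c.word2, r.word2 AS word3
    FROM cu c
    LEFT JOIN (
        SELECT word, word2,
               ROW_NUMBER() OVER (PARTITION BY word ORDER BY RAND()) AS rn
        FROM expl
    ) r
    ON c.word = r.word AND r.rn = 1
""")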