The columns of A don't match the number of elements of x. A: 768, x: 1536 #14368
Comments
thanks @SidWeng, we will look into this

@maziyarpanahi I found the root cause but I'm guessing it is not a bug, please take a look

Hi @SidWeng Yes, that's exactly the root cause. We are working on adding a parameter to …

I totally missed that you are using … Until we implement a simple averaging to put everything together, here are a few options:
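A minimal sketch of that kind of averaging workaround, assuming the per-sentence vectors land in an `array<vector>` column named `features` (the column name and the `finished` DataFrame are illustrative, not from this thread). Since `BertSentenceEmbeddings` emits one 768-dim vector per sentence, a two-sentence document can surface downstream as 1536 elements, which matches the error; averaging collapses each document back to a single 768-dim vector:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Element-wise average of the per-sentence vectors, so every document
// contributes exactly one fixed-size (768-dim) vector to the LSH stage.
val averageVectors = udf { vectors: Seq[Vector] =>
  val dim = vectors.head.size
  val sums = new Array[Double](dim)
  vectors.foreach(_.foreachActive((i, x) => sums(i) += x))
  Vectors.dense(sums.map(_ / vectors.size))
}

// `finished` is the DataFrame after EmbeddingsFinisher (illustrative name)
val pooled = finished.withColumn("features", averageVectors(col("features")))
```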
Discussed in #14362
Originally posted by SidWeng August 8, 2024
I use the following pipeline with BioBERT Sentence Embeddings. However, it throws

`The columns of A don't match the number of elements of x. A: 768, x: 1536`

when executing `pipeline.fit()` (full stack trace below). I traced the code and found that the dimension of `randMatrix` used by `BucketedRandomProjectionLSHModel` is determined by `DatasetUtils.getNumFeatures()`. Does this imply something is wrong with the data I feed into `fit()`? The data is a DataFrame with a String column `code` and a String column `text`; the longest `text` has a length of 229.
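For reference, a minimal sketch of a pipeline of this shape (the BioBERT model name, column names, and LSH parameters are illustrative assumptions, not the code from the original post):

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{BertSentenceEmbeddings, SentenceDetector}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// Any 768-dim BioBERT sentence model; this name is illustrative
val embeddings = BertSentenceEmbeddings
  .pretrained("sent_biobert_pubmed_base_cased", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

// Emits one Vector per *sentence*: a multi-sentence document yields
// several 768-dim vectors, not a single one
val finisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("features")
  .setOutputAsVector(true)

val nlpPipeline = new Pipeline().setStages(
  Array(documentAssembler, sentenceDetector, embeddings, finisher))

// The LSH stage expects a single fixed-size Vector column
val lsh = new BucketedRandomProjectionLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setBucketLength(2.0)
  .setNumHashTables(3)
```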
```
24/08/08 03:19:13.581 [task-result-getter-3] WARN o.a.spark.scheduler.TaskSetManager - Lost task 7.2 in stage 10.0 (TID 370) (10.0.0.12 executor 4): org.apache.spark.SparkException: Failed to execute user defined function (LSHModel$$Lambda$5263/1056329262: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>)
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:177)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
	at org.sparkproject.guava.collect.Ordering.leastOf(Ordering.java:670)
	at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
	at org.apache.spark.rdd.RDD.$anonfun$takeOrdered$2(RDD.scala:1539)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 768, x: 1536
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:579)
	at org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction(BucketedRandomProjectionLSH.scala:87)
	at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
	... 22 more
```
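The mismatch itself is easy to reproduce in plain Spark ML: `BucketedRandomProjectionLSH` sizes its `randMatrix` from the first row it sees (via `DatasetUtils.getNumFeatures()`), so a later row whose vector has a different size fails inside `BLAS.gemv` with exactly this message. A self-contained sketch, assuming an active `SparkSession` named `spark`:

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

// Row 0 is 3-dim, row 1 is 6-dim: randMatrix is sized from the first
// row, so hashing the second row throws
// "The columns of A don't match the number of elements of x. A: 3, x: 6"
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 2.0, 3.0)),
  (1, Vectors.dense(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
)).toDF("id", "features")

val model = new BucketedRandomProjectionLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .fit(df)

model.transform(df).show() // IllegalArgumentException on the 6-dim row
```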