
The columns of A don't match the number of elements of x. A: 768, x: 1536 #14368

Closed
maziyarpanahi opened this issue Aug 14, 2024 Discussed in #14362 · 4 comments · Fixed by #14381
@maziyarpanahi (Member)

Discussed in #14362

Originally posted by SidWeng August 8, 2024
I use the following pipeline with BioBERT Sentence Embeddings.
However, it throws The columns of A don't match the number of elements of x. A: 768, x: 1536 when executing pipeline.fit(). I traced the code and found that the dimension of the randMatrix used by BucketedRandomProjectionLSHModel is determined by DatasetUtils.getNumFeatures().
Does this imply something is wrong with the data I feed into fit()? The data I feed is a DataFrame with a String column code and a String column text. The longest text is 229 characters.

```scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

val document_similarity_ranker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val document_similarity_ranker_finisher = new DocumentSimilarityRankerFinisher()
  .setInputCols("doc_similarity_rankings")
  .setOutputCols("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
  .setExtractNearestNeighbor(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    embeddings,
    document_similarity_ranker,
    document_similarity_ranker_finisher
  ))
```

```
24/08/08 03:19:13.581 [task-result-getter-3] WARN o.a.spark.scheduler.TaskSetManager - Lost task 7.2 in stage 10.0 (TID 370) (10.0.0.12 executor 4): org.apache.spark.SparkException: Failed to execute user defined function (LSHModel$$Lambda$5263/1056329262: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => array<struct<type:tinyint,size:int,indices:array<int>,values:array<double>>>)
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:177)
at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at org.sparkproject.guava.collect.Ordering.leastOf(Ordering.java:670)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD.$anonfun$takeOrdered$2(RDD.scala:1539)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 768, x: 1536
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:579)
at org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction(BucketedRandomProjectionLSH.scala:87)
at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
... 22 more
```
@maziyarpanahi (Member, Author)

Thanks @SidWeng, we will look into this.

@SidWeng commented Aug 15, 2024

@maziyarpanahi I found the root cause, but I suspect it is not a bug. Please take a look:
#14362 (comment)

@danilojsl (Contributor) commented Aug 16, 2024

Hi @SidWeng

Yes, that's exactly the root cause. We are working on adding a parameter to DocumentSimilarityRankerApproach to choose the aggregation method when a document has multiple sentences. I hope we can include it in the next release.
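
The aggregation the maintainers describe presumably amounts to combining the per-sentence vectors into a single document-level vector, e.g. by element-wise averaging. A minimal plain-Scala sketch of that idea (a hypothetical helper for illustration, not the library's API):

```scala
// Hypothetical helper: average several same-dimension sentence embeddings
// into one document-level vector, so the LSH stage sees a single 768-dim
// vector per document instead of a concatenation of several.
def averageEmbeddings(sentences: Seq[Array[Float]]): Array[Float] = {
  require(sentences.nonEmpty, "need at least one sentence embedding")
  val dim = sentences.head.length
  require(sentences.forall(_.length == dim), "all embeddings must share one dimension")
  val sums = new Array[Float](dim)
  for (vec <- sentences; i <- 0 until dim) sums(i) += vec(i)
  sums.map(_ / sentences.length) // element-wise mean
}
```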

@maziyarpanahi (Member, Author)

Hi @SidWeng @danilojsl

I totally missed that you are using SentenceDetector. The DocumentSimilarityRankerApproach annotator is designed to work only with document-level embeddings.

Until we implement simple averaging to aggregate the sentence embeddings, here are a few options:

  • explode the sentences so you end up with one sentence per row (if sentence-level similarity is important)
  • remove the SentenceDetector and go directly from document to sentence embeddings
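
As a concrete illustration of the second option, the original pipeline can be reworked so the embeddings stage reads the document column directly. A sketch only, assuming BertSentenceEmbeddings accepts DOCUMENT-type input (stages not shown are unchanged from the snippet above); it needs a live Spark session and a model download to actually run:

```scala
// Workaround sketch: skip SentenceDetector so BertSentenceEmbeddings
// produces exactly one 768-dim embedding per row, matching what
// DocumentSimilarityRankerApproach expects.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")
  .setInputCols("document") // was "sentence"
  .setOutputCol("sentence_embeddings")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    embeddings,
    document_similarity_ranker,          // unchanged
    document_similarity_ranker_finisher  // unchanged
  ))
```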

@maziyarpanahi maziyarpanahi linked a pull request Aug 28, 2024 that will close this issue