As an AI researcher, I trained a large language model on @OpenLedger ($OPEN) to generate articles on "cryptocurrency market analysis." The training data comprises millions of articles, reports, and community discussion posts. When the #OpenLedger model generates a prediction about Bitcoin price fluctuations, I want to know which training data it relied on for that judgment.
Traditional attribution methods are either too slow to compute or can only roughly estimate the contribution of the dataset as a whole, making it impossible to trace specific documents or paragraphs. This is where Infini-gram comes in: the #OpenLedger system establishes a symbolic correspondence between each n-gram the model outputs and the training corpus, matching efficiently via suffix-array structures.
The results show:
When the @OpenLedger model predicted a "short-term Bitcoin pullback," it mainly drew on specific paragraphs from three market analysis articles and one community discussion post. Each document's influence was quantified, so I could see exactly which text contributed most to the model's decision. This process enables me to:
Verify model decisions: ensure the model does not learn from biased data;
Reward data contributors: quantify contributions and distribute rewards through OpenLedger ($OPEN);
Optimize the dataset: identify high-impact data to enhance model performance.
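The reward step above can be sketched as a simple proportional split. The document names, influence scores, and pool size below are made up for illustration; this post does not specify OpenLedger's actual reward logic.

```python
# Hypothetical per-document influence scores produced by attribution
# for one model output (values sum to 1.0 here, but the split below
# normalizes by the total, so that is not required).
influence = {
    "market_analysis_1": 0.42,
    "market_analysis_2": 0.27,
    "market_analysis_3": 0.19,
    "community_post_7":  0.12,
}

reward_pool = 100.0  # total $OPEN allocated to this output's contributors

# Each contributor's payout is proportional to their influence share.
total = sum(influence.values())
payouts = {doc: reward_pool * score / total for doc, score in influence.items()}

for doc, amount in sorted(payouts.items(), key=lambda kv: -kv[1]):
    print(f"{doc}: {amount:.2f} OPEN")
```

Because payouts are normalized by total influence, the full pool is always distributed, however many documents the attribution surfaces.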
For my team and me, Infini-gram is more than a technology: it makes the value of every piece of data transparent. Every judgment the model makes has a clear, traceable source, and data contributors receive on-chain recognition, establishing a fair and verifiable AI ecosystem.
