nerdculture.de is one of the many independent Mastodon servers you can use to participate in the fediverse.
Be excellent to each other, live humanism, no Nazis, no hate speech. Not only for nerds, but the domain is somewhat cool. ;) No bots in general. Languages: DE, EN, FR, NL, ES, IT

Server stats: 1.2K active users

#mllm

0 posts · 0 participants · 0 posts today

"OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us"

Headline of the week. 🥰

OpenAI shocked that an AI company would train on someone else's data without permission or compensation.

404media.co/openai-furious-dee… (no-charge subscription wall for full article)

#OpenAI #DeepSeek
#AI #LLM #MLLM
#GenAI #GenerativeAI

Survey: Multimodal #LLMs like GPT-4V are redefining AI by excelling at tasks like image-based storytelling & OCR-free math reasoning, hinting at AGI potential. This paper reviews their progress, architectures, and challenges while exploring new horizons for research. 🚀 #AI #MLLM

@peer Then the #LLM would be a Multimodal Large Language Model (#MLLM). That is exactly the point: if you sit in a room where Chinese radio is playing, you do not learn Chinese. An LLM does. It learns in a completely different way than we do. It learns only the distribution of language parts; we learn with grounding. That will come in AI as well, but it is not there yet, which is why LLMs are not yet proof that Chomsky was wrong. But 1) we already knew that before the LLMs, and 2) the LLMs make it plausible even to laypeople and hardcore Chomskyans (who simply had not read the literature before).
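
To make the "learns only the distribution of language parts" point concrete, here is a minimal sketch (my own illustration, not from the thread): a bigram model that estimates P(next token | previous token) from raw text. Everything it "knows" is co-occurrence statistics; there is no grounding in the world at all.

```python
# Minimal illustration (assumption: simple whitespace tokenization) of learning
# nothing but the distribution of language parts: a bigram model.
from collections import Counter, defaultdict

def train_bigram(corpus: str):
    """Count how often each token follows each other token, then normalize."""
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Conditional probabilities P(next | prev) from raw counts.
    return {
        prev: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for prev, ctr in counts.items()
    }

model = train_bigram("the cat sat on the mat the cat ate")
print(model["the"])  # {'cat': 0.666..., 'mat': 0.333...}
```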

The signs are mounting that #apple will soon ship one or more usable #LLM (in this case an #MLLM):

arxiv.org/abs/2403.09611

arXiv.org · MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
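
For readers unfamiliar with the components the abstract names, here is a rough sketch of the generic MLLM pattern: an image encoder produces visual features, a vision-language connector projects them into the LLM's token space, and the LLM processes image and text tokens together. This is an illustration of the pattern only, not Apple's MM1 code; every module, name, and size below is a placeholder.

```python
# Toy sketch of the generic MLLM pipeline (image encoder -> connector -> LLM).
# Assumption: stand-in modules; a real system would use a pretrained ViT/CLIP
# encoder and a full decoder-only LLM instead of these tiny layers.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, d_model=64, vocab_size=1000, n_image_tokens=4):
        super().__init__()
        self.d_model = d_model
        self.n_image_tokens = n_image_tokens
        # Stand-in image encoder: flattens a small image into n_image_tokens features.
        self.image_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, d_model * n_image_tokens)
        )
        # Vision-language connector: projects image features into the LLM token space.
        self.connector = nn.Linear(d_model, d_model)
        # Stand-in "LLM": one Transformer layer plus a language-modeling head.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.llm = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, text_ids):
        b = image.shape[0]
        img = self.image_encoder(image).view(b, self.n_image_tokens, self.d_model)
        img_tokens = self.connector(img)                  # image tokens in LLM space
        txt_tokens = self.embed(text_ids)                 # ordinary text embeddings
        seq = torch.cat([img_tokens, txt_tokens], dim=1)  # image tokens prefix the text
        return self.lm_head(self.llm(seq))                # per-position token logits

model = ToyMultimodalLM()
logits = model(torch.randn(1, 3, 32, 32), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 12, 1000])
```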