
Multimodal LLMs Are Here (updated)

“What’s in this picture?” “Looks like a duck.” “That’s not a duck. Then what’s it?” “Looks more like a bunny.”

Earlier this week, Microsoft revealed Kosmos-1, a large language model “capable of perceiving multimodal input, following instructions, and performing in-context learning for not only language tasks but also multimodal tasks.” Or as Ars Technica put it, it can “analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions.”

Researchers at Microsoft provided details about the capabilities of Kosmos-1 in “Language Is Not All You Need: Aligning Perception with Language Models”. It’s impressive. Here’s a sample of exchanges with Kosmos-1:

And here are some more:

Selected examples generated from KOSMOS-1. Blue boxes are input prompt and pink boxes are KOSMOS-1 output. The examples include (1)-(2) visual explanation, (3)-(4) visual question answering, (5) web page question answering, (6) simple math equation, and (7)-(8) number recognition.

The researchers write:

Properly handling perception is a necessary step toward artificial general intelligence.

The capability of perceiving multimodal input is critical to LLMs. First, multimodal perception enables LLMs to acquire commonsense knowledge beyond text descriptions. Second, aligning perception with LLMs opens the door to new tasks, such as robotics, and document intelligence. Third, the capability of perception unifies various APIs, as graphical user interfaces are the most natural and unified way to interact with. For example, MLLMs can directly read the screen or extract numbers from receipts. We train the KOSMOS-1 models on web-scale multimodal corpora, which ensures that the model robustly learns from diverse sources. We not only use a large-scale text corpus but also mine high-quality image-caption pairs and arbitrarily interleaved image and text documents from the web.
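
To make the “arbitrarily interleaved image and text documents” idea a bit more concrete: a multimodal LLM of this sort consumes a single flat sequence in which image content (encoded as patch embeddings by a vision encoder) is spliced in between text token embeddings. Here is a toy sketch of that interleaving. It is my own illustration, not code from the paper, and every function in it is a hypothetical stand-in.

```python
# Toy sketch of interleaving text tokens and image-patch embeddings into one
# input sequence for a multimodal LLM (illustration only; not Kosmos-1's code).

from typing import List, Union

Vector = List[float]

def fake_tokenizer(text: str) -> List[str]:
    """Stand-in for a real subword tokenizer."""
    return text.split()

def fake_text_embedder(token: str) -> Vector:
    """Stand-in for an embedding-table lookup (a toy 4-dim hash embedding)."""
    return [float((hash(token) >> (8 * i)) % 97) / 97.0 for i in range(4)]

def build_sequence(segments: List[Union[str, List[Vector]]]) -> List[Vector]:
    """Flatten interleaved text strings and image-patch embeddings into one
    ordered embedding sequence for the language model to consume."""
    seq: List[Vector] = []
    for seg in segments:
        if isinstance(seg, str):
            seq.extend(fake_text_embedder(t) for t in fake_tokenizer(seg))
        else:
            seq.extend(seg)  # image patch embeddings pass through unchanged
    return seq

if __name__ == "__main__":
    image_patches = [[0.1, 0.2, 0.3, 0.4]] * 16  # pretend vision-encoder output
    prompt = build_sequence(["Question: what is this?", image_patches, "Answer:"])
    print(len(prompt))  # 4 text tokens + 16 patches + 1 text token = 21
```

The point of the flat sequence is that the transformer itself doesn’t need separate machinery per modality; once everything is an embedding, text and image content can attend to each other in a single stream.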

Their plans for further development of Kosmos-1 include scaling up the model size and integrating speech capability. You can read more about Kosmos-1 here.

Philosophers, I’ve said it before and will say it again: there is a lot to work on here. There are major philosophical questions not just about the technologies themselves (the ones in existence and the ones down the road), but also about their use, and about their effects on our lives, relationships, societies, work, government, etc.


P.S. Just a reminder that quite possibly the stupidest response to this technology is to say something along the lines of, “it’s not conscious/thinking/intelligent, so no big deal.”


UPDATE (3/3/23): While we’re on the subject of machine “vision,” researchers have recently made advances in machines being able to determine and reconstruct what a human is seeing simply by looking at what’s happening in the person’s brain. Basically, they trained an image-oriented model (the neural network Stable Diffusion) on data obtained by observing what people’s brains are doing as they look at different images: the data included fMRIs of the brains of 4 subjects as they looked at thousands of images, along with the images themselves. They then showed the neural network fMRIs that had been excluded from the training set (taken while the subjects looked at images that were also excluded from the training set) and had it reconstruct what it thought the person was looking at when the fMRI was taken. Here are some of the results:

The leftmost column is what the four subjects were shown. The remaining four columns are the neural network’s reconstructions of what each subject saw, based on their individual fMRIs.

More details are in the paper: “High-resolution image reconstruction with latent diffusion models from human brain activity”. (via Marginal Revolution)
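
Procedurally, the reconstruction pipeline amounts to: learn a mapping from fMRI features to the diffusion model’s latent representations using the training pairs, then apply that mapping to the held-out fMRIs and have the diffusion model render the predicted latents as images. Here is a toy sketch of that train/held-out logic, assuming a simple ridge regression as the fMRI-to-latent mapping and a stubbed-out decoder; the actual paper’s regression targets and decoding procedure are more involved, and all the data below is fake.

```python
# Toy sketch of the fMRI -> image-reconstruction pipeline described above
# (illustration only; the paper's regression targets and Stable Diffusion
# decoding are more involved).

import numpy as np

def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Ridge-regression weights mapping fMRI features X to latent targets Y:
    W = (X'X + alpha*I)^-1 X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def decode_with_diffusion(latent: np.ndarray) -> np.ndarray:
    """Stand-in for rendering a predicted latent with the diffusion model's
    decoder; here it just reshapes the latent into a fake 8x8 'image'."""
    return latent.reshape(8, 8)

rng = np.random.default_rng(0)

# Fake data: 1000 training scans and 20 held-out scans that (like the images
# they correspond to) never appear in training. fMRI feature dim 256,
# diffusion latent dim 64.
X_train = rng.normal(size=(1000, 256))
Z_train = rng.normal(size=(1000, 64))   # latents of the images subjects saw
X_test = rng.normal(size=(20, 256))     # fMRIs taken on held-out images

W = fit_ridge(X_train, Z_train)          # learn the fMRI -> latent mapping
Z_pred = X_test @ W                      # predict latents for unseen scans
reconstructions = [decode_with_diffusion(z) for z in Z_pred]
print(len(reconstructions), reconstructions[0].shape)  # 20 reconstructions
```

The crucial bit, as in the paper, is the held-out split: the model is never trained on the test subjects’ scans of the test images, so any resemblance between reconstruction and stimulus reflects what the learned mapping actually recovers from brain activity.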
