A medium-close shot shows a red panda wearing a gold-trimmed cap and travel satchel on a bright seaside wave with a painted surfboard, foam spray, and a glowing summer sky. Subject fills frame; premium detail, clear focus, lively eyes, readable motion. tracking shot. It rides the wave, lifts one paw in balance, and laughs as spray catches the light.
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance is a 3B native unified multimodal model for image and video understanding, generation, and editing. It is trained from scratch with a staged multi-task recipe on a budget of at most 128 GPUs.
Text-to-Video
Nine text-conditioned cases focused on character motion, fantasy animals, two-person interaction, and cinematic dreamlike scenes.
A premium animated-film shot shows a brass robot playing violin in a lantern-lit city square with one puppy seated nearby under warm evening light. The main subject occupies at least two-thirds of the frame and remains the clear visual focus. The scene is whimsical, beautiful, and richly detailed, with strong character focus and elegant atmosphere. fixed shot. The robot draws the bow in smooth arcs while the puppy listens quietly.
A medium-close shot shows a Persian cat wearing ornate spectacles and a velvet academic robe inside a candlelit salon with carved shelves, chandeliers, and mosaic floors. The cat fills the frame with crisp fur detail and lively eyes. fixed shot. It lifts a slender magic wand and traces a soft glowing arc through the air.
A cinematic landscape shot shows a tropical coastline at sunset with pink sky, moving waves, black rocks, and palms swaying in warm wind. The scene is majestic, highly aesthetic, and rich in layered natural detail, with refined atmosphere and premium scenic clarity. wide shot. The sun sinks toward the horizon while wave foam advances and retreats along the shore.
A close-to-medium cinematic shot shows a handsome motorcyclist riding a classic black motorcycle along a coastal road with cliffs, sea spray, and dramatic sky. The background stays bright, layered, and aesthetically refined, with luminous depth and elegant environmental variation while remaining secondary to the main subject. The eyes are lively and expressive, with subtle blinking, natural gaze shifts, and gentle movement in the brows and mouth that keep the face vivid on camera. The subject is beautiful, highly detailed, and photographed with a premium cinematic aesthetic. The subject occupies at least two-thirds of the frame, with beautiful styling, refined facial detail, convincing skin texture, and anatomically correct hands. The rider's body posture matches the bike's motion and the hands grip the handlebars naturally. the camera follows from the side as the motorcycle leans through a curve.
A detailed cinematic portrait begins from a medium view and gradually moves into a close facial framing of a beautiful young woman shaping clay on a pottery wheel in a bright ceramic workshop with sunlit shelves, bowls, and hanging tools. The person is the dominant subject in the frame, styled with a tied-back apron, delicate earrings, rolled sleeves, and a simple pendant, and shown with premium skin detail, expressive eyes, subtle brow and cheek motion, anatomically convincing hands, and rich costume texture. Her hands guide the spinning clay in one smooth controlled motion as her expression moves from serene focus into a soft smile. Her gaze starts on the camera, follows the clay, briefly rises toward the window light, and returns to the lens while her head inclines naturally with the wheel.
A detailed cinematic portrait begins from a medium view and gradually moves into a close facial framing of a beautiful young woman playing a grand piano in a luminous marble music hall with tall windows, gold sconces, flowing curtains, polished floors, and refined floral arrangements. Styled with pearl earrings, a delicate crystal hairpin, and a layered silver necklace above an elegant satin gown. Subject dominates; sharp face, open eyes, subtle micro-expressions, correct visible hands. Both hands stay clearly visible on the piano keys, and every finger movement is elegant, natural, and easy to read as she plays a calm melodic phrase; her head gives a subtle natural sway in time with the music while the smile slowly grows warmer.
An elegant medium-close shot centers a shiba inu and a chrome boxing robot inside a palace-inspired championship ring with carved ivory columns, bright gold trim, glossy stone steps, and sweeping crystal chandeliers. The shiba inu wears an embroidered brocade boxing robe, a jeweled waist sash, and refined round goggles, and both fighters wear premium boxing gloves; robot has exposed polished mechanical body. Bright luxury arena; fighters dominate frame; slow readable boxing. steady camera. Controlled footwork and visible punches, with brief pauses after each exchange.
A cinematic shot shows two young adults meeting again on a quiet train platform in warm sunset light with drifting steam and long shadows. Subject fills frame; premium face/detail, correct hands and posture. medium shot. They pause in disbelief, step closer, and embrace tightly; the camera then pushes into a close-up of their tearful relieved faces.
Displayed with 2x super-resolution and 2x frame interpolation.
Video Editing
Nine prompt-driven single-step and compositional editing cases spanning background transformation, object addition and removal, subject replacement, appearance restyling, stylization, and action edits.
Replace the background with a campfire.
Add a row of colorful balloons.
Change the boy to a girl with black shirt.
Change the dog to a cat.
Change the style to watercolor painting, soft colors, natural and dreamy.
Make the car a shiny red color and add a snowy street background.
Have the woman raise her right hand to gently brush her hair, slightly turn her body to the right, soften her expression, and shift her gaze to the right.
Add a scarf around her neck and replace the background with a snowy park.
Remove face stickers.
Multi-turn Consistency Editing
Source video followed by four linked edits on the same subject: replacement, accessory addition, background rewrite, and motion update.
Replace short straight hair with French curly hair.
Add a floral headband with red and white flowers to her hair.
Change the background to a fairytale castle by a lake.
Make her raise one hand to wave slowly.
Intelligent Video Generation
Structured planning and physics-oriented examples that probe control over multi-step spatial behavior.
Describe the key elements of the input maze image (layout, white path, black walls, blue star, red flag, and overall background), then generate a 2D animation. The blue star should slide smoothly along the white path, stop exactly on the red flag, and then acquire a trophy. Ensure the blue star never crosses or enters the black maze walls. Keep the camera as a static top-down view showing the entire maze.
(The same prompt is used for all six maze examples.)
Video Understanding
Selected video question answering and captioning cases that evaluate temporal reasoning, motion recognition, and concise-to-detailed description.
How many times did the person launch objects on the table?
Options:
(A) 3
(B) 2
(C) 4
Response (A) 3
The person makes sets of repeated actions. How many distinct repeated actions did the person do?
Options:
(A) 2
(B) 3
(C) 4
Response (A) 2
In which direction does the purple sphere move in the video?
Options:
(A) Down and to the right.
(B) Up and to the left.
(C) Up and to the right.
(D) The object is stationary.
Response (A) Down and to the right.
What is the unrealistic phenomenon displayed in the video?
Options:
(A) The man can manipulate time via phone.
(B) Man grabs an object through a phone screen.
(C) Chocolate transforms into different objects.
(D) Visible means of propulsion enables flight.
Response (B) Man grabs an object through a phone screen.
Offer a succinct account of the culinary process shown in this video.
Response Add tomato puree and mix it well with chicken pieces.
Provide a detailed description of the given video, capturing its key moments.
Response In a sunlit meadow, a small tortoiseshell butterfly rests on a purple flower. A bee, with black and yellow stripes, lands on the same flower. The butterfly flaps its wings gently, while the bee busies itself, collecting nectar. The flower sways slightly in the breeze. The butterfly then takes off, and the bee follows, both heading to the next flower. The scene is a vivid display of insect interaction in a natural setting, with the colors of the butterfly and the bee contrasting against the green background of the meadow. The video captures this peaceful moment in a short 6-second duration.
Text-to-Image
Representative text-to-image outputs spanning photorealistic, stylized, compositional, and typography-heavy prompts.
Image Editing
Instruction-guided image editing cases showing local replacement, style transfer, object-aware modifications, and layout-preserving transformations.
Image Understanding
Six selected visual question answering cases spanning charts, trade data, OCR, documents, landmarks, and natural phenomena.
Is the largest segment greater than sum of all the other segments?
Response Yes, the largest segment in the pie chart is the blue one, which is the most populous segment.
What percentage of respondents want better border security?
Response 29%
What is the license plate number of the car?
Response The license plate number of the car is BX62 BFY.
According to the proprietary market research data, how much was spent on promotional meetings and events during 1998?
Response According to the data from the proprietary market research, the total amount spent on the promotional meetings and events during 1998 was approximately $1.3 billion.
What is the appearance of the Colosseum in Rome, Italy?
Response The Colosseum in Rome, Italy is a magnificent ancient amphitheater made of stone and concrete. It is characterized by its large, arched openings and its semi-circular shape. The Colosseum has been partially restored, and its original color has faded, but the architectural details, like the columns and the semi-circular arches, are still visible. The lighting in the image highlights the grandeur and historical significance of the Colosseum, making it stand out against the backdrop of the evening sky.
How does a total solar eclipse look like from Earth?
Response A total solar eclipse looks like a dark circle with a bright, white outer edge. This is a result of the Earth's rotation and the Earth's shadow, which partially obscures the sun's light. The bright white edge of the eclipse is caused by the sun's high energy and its interaction with the Earth's atmosphere, while the dark part of the eclipse is due to the Earth's shadow and the surrounding air currents. The solar eclipse's shape, with its bright white edge and dark center, is similar to the shape of a full moon or a dark disk. It is a natural phenomenon that occurs in the atmosphere of the Earth and is an important part of the solar system.
Framework
Lance keeps a shared interleaved sequence for text, image, and video context, then separates semantic understanding and visual generation through dedicated experts.
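The split described above can be pictured with a toy routing sketch. Everything in it is hypothetical: the `Segment` type, the `route` rule, and the expert names are illustrative assumptions, not Lance's actual implementation, which this page does not specify. It only shows the idea of one interleaved sequence whose segments are dispatched to either an understanding expert or a generation expert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One span of the shared interleaved sequence (hypothetical type)."""
    modality: str    # "text", "image", or "video"
    is_target: bool  # True if this span is visual content to be generated
    length: int      # number of tokens in the span


def route(segment: Segment) -> str:
    """Assumed rule: visual targets go to the generation expert,
    all context (text or visual) goes to the understanding expert."""
    if segment.is_target and segment.modality in ("image", "video"):
        return "generation_expert"
    return "understanding_expert"


# A video-editing style request: instruction text, source video, then the
# edited video as the generation target, all in a single sequence.
sequence = [
    Segment("text", False, 12),
    Segment("video", False, 256),
    Segment("video", True, 256),
]

plan = [(s.modality, route(s)) for s in sequence]
for modality, expert in plan:
    print(f"{modality:>5} -> {expert}")
```

Under this assumed rule the source video is read as context while only the target video is handled by the generation expert; the real model may partition the work differently.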
Comparison on multimodal benchmarks
Radar charts compare Lance with representative unified and task-specialized baselines. Detailed tables: GenEVAL, DPG-Bench, GEdit-Bench, VBench, and MVBench.
Image generation on GenEVAL
GenEVAL measures object count, color, position, and attribute binding. Lance ties TUNA for the best overall score (0.90) among the listed unified models while remaining a compact 3B model.
| Method | # Params. | Overall↑ | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. |
|---|---|---|---|---|---|---|---|---|
| Generation-only models | ||||||||
| FLUX.1-dev | 12B | 0.82 | 0.98 | 0.93 | 0.75 | 0.93 | 0.68 | 0.65 |
| GPT Image 1 | - | 0.84 | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 |
| Qwen-Image | 20B | 0.87 | 0.99 | 0.92 | 0.89 | 0.88 | 0.76 | 0.77 |
| Unified models | ||||||||
| MetaQuery-XL† | 7B | 0.80 | - | - | - | - | - | - |
| OmniGen2 | 4B | 0.80 | 1.00 | 0.95 | 0.64 | 0.88 | 0.55 | 0.76 |
| Show-o2 | 7B | 0.76 | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 |
| UniWorld-V1 | 13B | 0.80 | 0.99 | 0.93 | 0.79 | 0.89 | 0.49 | 0.70 |
| BAGEL† | 7B | 0.88 | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 |
| Mogao | 7B | 0.89 | 1.00 | 0.97 | 0.83 | 0.93 | 0.84 | 0.80 |
| TUNA | 7B | 0.90 | 1.00 | 0.97 | 0.81 | 0.91 | 0.88 | 0.83 |
| Lance | 3B | 0.90 | 1.00 | 0.94 | 0.84 | 0.97 | 0.87 | 0.81 |
† indicates methods that use LLM rewriters for prompt rewriting before generation.
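As a sanity check on the table, the Overall column appears consistent with an unweighted mean of the six sub-scores. The sketch below recomputes it for three rows copied from the table; treating Overall as a plain mean is an assumption checked only against these published, already rounded values, so small rounding differences are expected.

```python
# Sub-scores copied from the GenEVAL table, in column order:
# Single Obj., Two Obj., Counting, Colors, Position, Color Attri.
SUBSCORES = {
    "BAGEL": (0.98, 0.95, 0.84, 0.95, 0.78, 0.77),
    "TUNA": (1.00, 0.97, 0.81, 0.91, 0.88, 0.83),
    "Lance": (1.00, 0.94, 0.84, 0.97, 0.87, 0.81),
}
# Overall values as reported in the table.
REPORTED = {"BAGEL": 0.88, "TUNA": 0.90, "Lance": 0.90}


def overall(scores):
    """Unweighted mean of the six task scores (assumed aggregation)."""
    return sum(scores) / len(scores)


for name, scores in SUBSCORES.items():
    print(f"{name}: recomputed={overall(scores):.3f} reported={REPORTED[name]:.2f}")
```

The recomputed values agree with the reported ones to within rounding of the two-decimal inputs.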
Image generation on DPG-Bench
DPG-Bench stresses complex prompt following across global, entity, attribute, relation, and other compositional dimensions. Lance is especially strong on relation grounding, scoring 93.38 on Relation, the highest among the listed unified models.
| Method | # Params. | Overall↑ | Global | Entity | Attribute | Relation | Other |
|---|---|---|---|---|---|---|---|
| Generation-only models | |||||||
| PixArt-α | 0.6B | 71.11 | 74.97 | 79.32 | 78.60 | 82.57 | 76.96 |
| SDXL | 3.5B | 74.65 | 83.27 | 82.43 | 80.91 | 86.76 | 80.41 |
| Hunyuan-DiT | 1.5B | 78.87 | 84.59 | 80.59 | 88.01 | 74.36 | 86.41 |
| Playground v2.5 | - | 75.47 | 83.06 | 82.59 | 81.20 | 84.08 | 83.50 |
| DALL-E 3 | - | 83.50 | 90.97 | 89.61 | 88.39 | 90.58 | 89.83 |
| SD3-Medium | 2B | 84.08 | 87.90 | 91.01 | 88.83 | 80.70 | 88.68 |
| Emu3-Gen | 8B | 80.60 | 85.21 | 86.68 | 86.84 | 90.22 | 83.15 |
| FLUX.1-dev | 12B | 83.84 | 74.35 | 90.00 | 88.96 | 90.87 | 88.33 |
| Qwen-Image | 20B | 88.32 | 91.32 | 91.56 | 92.02 | 94.31 | 92.73 |
| Unified models | |||||||
| Emu3-DPO | 8B | 81.60 | - | - | - | - | - |
| Janus | - | 79.68 | 82.33 | 87.38 | 87.70 | 85.46 | 86.41 |
| Janus-Pro-7B | 7B | 84.19 | 86.90 | 88.90 | 89.40 | 89.32 | 89.48 |
| Ovis-U1 | 1.2B | 83.72 | 82.37 | 90.08 | 88.68 | 93.35 | 85.20 |
| OmniGen2 | 4B | 83.57 | 88.81 | 88.83 | 90.18 | 89.37 | 90.27 |
| Show-o2 | 7B | 86.14 | 89.00 | 91.78 | 89.96 | 91.81 | 91.64 |
| UniWorld-V1 | 13B | 81.38 | 83.64 | 88.39 | 88.44 | 89.27 | 87.22 |
| BAGEL† | 7B | 85.07 | 88.94 | 90.37 | 91.29 | 90.82 | 88.67 |
| Mogao | 7B | 84.33 | 82.37 | 90.03 | 88.26 | 93.18 | 85.40 |
| InternVL-U | 1.7B | 85.18 | 90.39 | 90.78 | 90.68 | 90.29 | 88.77 |
| TUNA | 7B | 86.76 | 90.42 | 91.68 | 90.94 | 91.87 | 90.73 |
| Lance | 3B | 84.67 | 83.89 | 91.07 | 89.36 | 93.38 | 80.80 |
† indicates methods that use LLM rewriters for prompt rewriting before generation.
Image editing on GEdit-Bench
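GEdit-Bench evaluates instruction-guided edits such as background, color, material, subject, style, and tone changes. Lance reports the best average score (7.30) among listed unified models.
Column abbreviations: BC = background change, CA = color alteration, MM = material modification, MC = motion change, PB = portrait beautification, ST = style transfer, SA = subject addition, SR = subject removal, SRp = subject replacement, TM = text modification, TT = tone transformation.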
| Method | # Params. | Avg/G-O↑ | BC | CA | MM | MC | PB | ST | SA | SR | SRp | TM | TT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation-only models | |||||||||||||
| Gemini 2.0 | - | 6.32 | - | - | - | - | - | - | - | - | - | - | - |
| GPT Image 1 | - | 7.49 | 6.96 | 6.85 | 7.10 | 5.41 | 6.74 | 7.44 | 7.51 | 8.73 | 8.55 | 8.45 | 8.69 |
| Qwen-Image-Edit | 20B | 8.01 | 8.23 | 8.30 | 7.33 | 8.05 | 7.49 | 6.74 | 8.57 | 8.09 | 8.29 | 8.48 | 8.50 |
| Unified models | |||||||||||||
| Lumina-DiMOO | 8B | 3.91 | 3.43 | 4.27 | 3.08 | 2.77 | 4.74 | 5.19 | 4.44 | 3.80 | 4.38 | 2.68 | 4.20 |
| Ovis-U1 | 1.2B | 6.42 | 7.49 | 6.88 | 6.21 | 4.79 | 5.98 | 6.46 | 7.49 | 7.25 | 7.27 | 4.48 | 6.31 |
| BAGEL | 7B | 6.52 | 7.32 | 6.91 | 6.38 | 4.75 | 4.57 | 6.15 | 7.90 | 7.16 | 7.02 | 7.32 | 6.22 |
| InternVL-U | 1.7B | 6.66 | 7.08 | 7.05 | 6.38 | 7.02 | 6.03 | 6.27 | 7.13 | 6.55 | 6.33 | 6.59 | 6.85 |
| InternVL-U (w/ CoT) | 1.7B | 6.88 | 7.05 | 7.87 | 6.50 | 6.99 | 5.77 | 6.10 | 7.33 | 7.16 | 7.12 | 7.36 | 6.46 |
| Lance | 3B | 7.30 | 7.73 | 7.74 | 7.28 | 7.83 | 7.50 | 7.03 | 7.64 | 7.85 | 7.71 | 4.46 | 7.57 |
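The Avg column can be cross-checked against the per-category scores. The sketch below assumes the average is an unweighted mean over the eleven category columns, which matches the Lance row copied from the table, and also reports the lowest-scoring category to show where that row is weakest.

```python
# Category columns in table order, with the Lance row copied from the table.
CATEGORIES = ("BC", "CA", "MM", "MC", "PB", "ST", "SA", "SR", "SRp", "TM", "TT")
LANCE = (7.73, 7.74, 7.28, 7.83, 7.50, 7.03, 7.64, 7.85, 7.71, 4.46, 7.57)


def category_mean(scores):
    """Unweighted mean over categories (assumed aggregation)."""
    return sum(scores) / len(scores)


def weakest_category(names, scores):
    """Return (name, score) for the lowest-scoring category."""
    score, name = min(zip(scores, names))
    return name, score


mean = category_mean(LANCE)
weak_name, weak_score = weakest_category(CATEGORIES, LANCE)
print(f"recomputed average: {mean:.2f}")
print(f"weakest category: {weak_name} ({weak_score:.2f})")
```

For this row the recomputed mean rounds to the reported 7.30, and the TM column is the clear outlier, well below every other category.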
Video generation on VBench
VBench covers video quality, semantic alignment, object attributes, spatial relations, and motion-related dimensions. Lance obtains the top total score in the unified model group.
| Model | # Params. | Total Score↑ | Quality Score | Semantic Score | Subj. Consist. | Bkg. Consist. | Temp. Flicker | Motion Smooth. | Dynamic Degree | Aesthetic Quality | Imaging Quality | Object Class | Multi. Objects | Human Action | Color | Spatial Relation | Scene | Appear. Style | Temp. Style | Overall Consist. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation-only models | ||||||||||||||||||||
| ModelScope | 1.7B | 75.75 | 78.05 | 66.54 | 89.87 | 95.29 | 98.28 | 95.79 | 66.39 | 52.06 | 58.57 | 82.25 | 38.98 | 92.40 | 81.72 | 33.68 | 39.26 | 23.39 | 25.37 | 25.67 |
| LaVie | 3B | 77.08 | 78.78 | 70.31 | 91.41 | 97.47 | 98.30 | 96.38 | 49.72 | 54.94 | 61.90 | 91.82 | 33.32 | 96.80 | 86.39 | 34.09 | 52.69 | 23.56 | 25.93 | 26.41 |
| Show-1 | 6B | 78.93 | 80.42 | 72.98 | 95.53 | 98.02 | 99.12 | 98.24 | 44.44 | 57.35 | 58.66 | 93.07 | 45.47 | 95.60 | 86.35 | 53.50 | 47.03 | 23.06 | 25.28 | 27.46 |
| AnimateDiff-V2 | - | 80.27 | 82.90 | 69.75 | 95.30 | 97.68 | 98.75 | 97.76 | 40.83 | 67.16 | 70.10 | 90.90 | 36.88 | 92.60 | 87.47 | 34.60 | 50.19 | 22.42 | 26.03 | 27.04 |
| VideoCrafter-2.0 | - | 80.44 | 82.20 | 73.42 | 96.85 | 98.22 | 98.41 | 97.73 | 42.50 | 63.13 | 67.22 | 92.55 | 40.66 | 95.00 | 92.92 | 35.86 | 55.29 | 25.13 | 25.84 | 28.23 |
| CogVideoX | 5B | 81.61 | 82.75 | 77.04 | 96.23 | 96.52 | 98.66 | 96.92 | 70.97 | 61.98 | 62.90 | 85.23 | 62.11 | 99.40 | 82.81 | 66.35 | 53.20 | 24.91 | 25.38 | 27.59 |
| Kling | - | 81.85 | 83.39 | 75.68 | 98.33 | 97.60 | 99.30 | 99.40 | 46.94 | 61.21 | 65.62 | 87.24 | 68.05 | 93.40 | 89.90 | 73.03 | 50.86 | 19.62 | 24.17 | 26.42 |
| Open-Sora-2.0 | - | 81.71 | 82.10 | 80.14 | 98.75 | 98.00 | 99.40 | 99.49 | 20.74 | 64.33 | 65.62 | 94.50 | 77.72 | 95.40 | 85.98 | 76.18 | 52.71 | 22.98 | 25.91 | 27.57 |
| Gen-3 | - | 82.32 | 84.11 | 75.17 | 97.10 | 96.62 | 98.61 | 99.23 | 60.14 | 63.34 | 66.82 | 87.81 | 53.64 | 96.40 | 80.90 | 65.09 | 54.57 | 24.31 | 24.71 | 26.69 |
| Step-Video-T2V | 30B | 81.83 | 84.46 | 71.28 | 98.05 | 97.67 | 99.40 | 99.08 | 53.06 | 61.23 | 70.63 | 80.56 | 50.55 | 94.00 | 88.25 | 71.47 | 24.38 | 23.17 | 26.01 | 27.12 |
| Hunyuan Video | - | 83.43 | 85.07 | 76.88 | 97.22 | 97.60 | 99.39 | 99.05 | 71.94 | 60.28 | 67.24 | 83.48 | 66.71 | 94.40 | 89.79 | 72.13 | 54.46 | 22.21 | 24.52 | 26.95 |
| Wan2.1-T2V | 14B | 83.69 | 85.59 | 76.11 | 97.52 | 98.09 | 99.46 | 98.30 | 65.46 | 66.07 | 69.43 | 86.28 | 69.58 | 95.40 | 88.59 | 75.39 | 45.75 | 22.64 | 23.19 | 25.91 |
| Unified models | ||||||||||||||||||||
| HaploOmni | 7B | 78.10 | - | - | 96.40 | 97.60 | - | 96.80 | 65.30 | - | - | - | - | - | - | - | 34.60 | - | - | - |
| Emu3 | 8B | 80.96 | - | - | 95.32 | 97.69 | - | 98.93 | 79.27 | 59.64 | - | 86.17 | 44.64 | 77.71 | - | 68.73 | 37.11 | 20.92 | - | - |
| VILA-U | 7B | 74.01 | 76.26 | 65.04 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Show-o2 | 2B | 81.34 | 82.10 | 78.31 | 97.28 | 96.78 | 97.68 | 98.25 | 40.83 | 65.15 | 67.06 | 94.81 | 76.01 | 95.20 | 80.89 | 62.61 | 57.67 | 23.29 | 25.27 | 27.00 |
| TUNA | 1.5B | 84.06 | 84.32 | 83.04 | 95.99 | 96.72 | 98.02 | 98.33 | 69.39 | 65.88 | 66.83 | 95.41 | 92.31 | 97.50 | 87.67 | 78.12 | 58.59 | 23.18 | 24.68 | 27.71 |
| Lance | 3B | 85.11 | 85.14 | 84.96 | 94.52 | 94.28 | 99.66 | 95.93 | 75.83 | 64.33 | 66.78 | 96.58 | 93.86 | 97.80 | 92.61 | 93.61 | 64.75 | 23.14 | 25.53 | 27.04 |
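The Total Score column can be related to the Quality and Semantic columns. The sketch below assumes the 4:1 quality-to-semantic weighting used by the public VBench leaderboard; that weighting is not stated on this page, so it is checked here only against three rows copied from the table. Because the inputs are already rounded to two decimals, differences of about 0.01 are expected.

```python
# Assumed VBench weighting: quality counts four times as much as semantic.
QUALITY_WEIGHT = 4.0
SEMANTIC_WEIGHT = 1.0

# (Quality Score, Semantic Score, reported Total Score) copied from the table.
ROWS = {
    "Wan2.1-T2V": (85.59, 76.11, 83.69),
    "TUNA": (84.32, 83.04, 84.06),
    "Lance": (85.14, 84.96, 85.11),
}


def vbench_total(quality, semantic):
    """Weighted average of quality and semantic scores."""
    weighted = QUALITY_WEIGHT * quality + SEMANTIC_WEIGHT * semantic
    return weighted / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)


for name, (quality, semantic, reported) in ROWS.items():
    total = vbench_total(quality, semantic)
    print(f"{name}: recomputed={total:.2f} reported={reported:.2f}")
```

Under this weighting a model with a high Semantic Score gains comparatively little in Total Score, which is why Lance's margin on Total is smaller than its margin on Semantic.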
Video understanding on MVBench
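MVBench evaluates video understanding across action, object, spatial, temporal, and reasoning categories. Lance achieves the best average score (62.0) among listed unified models.
Column abbreviations: AS = Action Sequence, AP = Action Prediction, AA = Action Antonym, FA = Fine-grained Action, UA = Unexpected Action, OE = Object Existence, OI = Object Interaction, OS = Object Shuffle, MD = Moving Direction, AL = Action Localization, ST = Scene Transition, AC = Action Count, MC = Moving Count, MA = Moving Attribute, SC = State Change, CO = Character Order, EN = Egocentric Navigation, ER = Episodic Reasoning, CI = Counterfactual Inference.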
| Model | # Params. | Avg.↑ | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | CO | EN | ER | CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Understanding-only models | |||||||||||||||||||||
| Video-LLaMA | 7B | 34.1 | 27.5 | 25.5 | 51.0 | 29.0 | 39.0 | 48.0 | 40.5 | 38.0 | 22.5 | 22.5 | 43.0 | 34.0 | 22.5 | 32.5 | 45.5 | 40.0 | 30.0 | 21.0 | 37.0 |
| LLaMA-Adapter | 7B | 31.7 | 23.0 | 28.0 | 51.0 | 30.0 | 33.0 | 53.5 | 32.5 | 33.5 | 25.5 | 21.5 | 30.5 | 29.0 | 22.5 | 41.5 | 39.5 | 31.5 | 22.5 | 28.0 | 32.0 |
| Video-ChatGPT | 7B | 32.7 | 23.5 | 26.0 | 62.0 | 22.5 | 26.5 | 54.0 | 28.0 | 40.0 | 23.0 | 20.0 | 31.0 | 30.5 | 25.5 | 39.5 | 48.5 | 33.0 | 29.5 | 26.0 | 35.5 |
| VideoChat | 7B | 35.5 | 33.5 | 26.5 | 56.0 | 33.5 | 40.5 | 53.0 | 40.5 | 30.0 | 25.5 | 27.0 | 48.5 | 35.0 | 20.5 | 42.5 | 46.0 | 41.0 | 23.5 | 23.5 | 36.0 |
| VideoChat2 | 7B | 51.1 | 66.0 | 47.5 | 83.5 | 49.5 | 60.0 | 58.0 | 71.5 | 42.5 | 23.0 | 23.0 | 88.5 | 39.0 | 42.0 | 58.5 | 44.0 | 36.5 | 35.0 | 40.5 | 65.5 |
| ST-LLM | 7B | 54.9 | 66.0 | 53.5 | 84.0 | 44.0 | 58.5 | 80.5 | 73.5 | 38.5 | 42.5 | 31.0 | 86.5 | 36.5 | 56.5 | 78.5 | 43.0 | 46.5 | 34.5 | 41.5 | 58.5 |
| GPT-4V | - | 43.5 | 55.5 | 63.5 | 72.0 | 46.5 | 73.5 | 18.5 | 59.0 | 29.5 | 12.0 | 40.5 | 83.5 | 39.0 | 12.0 | 22.5 | 45.0 | 52.0 | 31.0 | 59.0 | 11.0 |
| PLLaVA | 34B | 58.1 | 67.5 | 53.0 | 82.0 | 47.0 | 79.0 | 68.5 | 67.5 | 36.5 | 37.5 | 49.5 | 91.0 | 40.5 | 43.0 | 70.0 | 51.5 | 66.5 | 39.5 | 63.5 | 59.0 |
| Video-CCAM | 9B | 64.6 | 83.0 | 67.0 | 89.5 | 49.0 | 72.0 | 86.5 | 81.0 | 45.0 | 28.0 | 29.0 | 90.0 | 59.0 | 67.0 | 85.0 | 63.5 | 77.0 | 34.0 | 73.5 | 59.0 |
| Qwen2.5-VL | 3B | 67.0 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| TimeMarker | 8B | 67.4 | 79.0 | 74.5 | 89.0 | 53.5 | 77.0 | 94.0 | 76.0 | 41.5 | 52.5 | 47.0 | 91.5 | 53.0 | 76.5 | 92.5 | 57.0 | 70.5 | 23.5 | 53.5 | 82.5 |
| InternVideo2 | 7B | 67.3 | 86.0 | 70.0 | 87.0 | 56.0 | 75.0 | 91.0 | 86.0 | 40.0 | 48.0 | 53.0 | 90.0 | 41.0 | 73.0 | 92.0 | 52.0 | 56.0 | 33.0 | 57.0 | 74.0 |
| Unified models | |||||||||||||||||||||
| Show-o2 | 1.5B | 50.6 | 63.8 | 59.5 | 63.5 | 40.0 | 70.5 | 54.5 | 66.0 | 36.5 | 36.0 | 27.0 | 88.0 | 43.5 | 43.0 | 58.0 | 44.5 | 54.0 | 28.5 | 39.5 | 45.0 |
| Show-o2 | 7B | 55.7 | 60.1 | 67.0 | 68.0 | 45.5 | 78.0 | 51.0 | 73.5 | 44.5 | 36.0 | 39.0 | 92.5 | 51.5 | 36.0 | 59.5 | 52.0 | 64.0 | 38.0 | 60.0 | 43.0 |
| TUNA | 1.5B | 54.4 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| UniVideo | 7B | 46.3 | 54.3 | 41.5 | 77.5 | 50.0 | 62.5 | 68.2 | 50.5 | 37.5 | 36.0 | 29.5 | 35.5 | 28.5 | 52.5 | 70.5 | 33.5 | 40.5 | 37.5 | 36.5 | 38.0 |
| Lance | 3B | 62.0 | 73.9 | 76.5 | 71.5 | 49.0 | 63.5 | 96.0 | 72.5 | 33.0 | 63.5 | 33.0 | 86.0 | 41.0 | 82.0 | 97.5 | 43.0 | 47.5 | 31.5 | 40.0 | 77.0 |
Citation
@misc{fu2026lanceunifiedmultimodalmodeling,
title = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
author = {Fengyi Fu and Mengqi Huang and Shaojin Wu and Yunsheng Jiang and Yufei Huo and Hao Li and Yinghang Song and Fei Ding and Jianzhu Guo and Qian He and Zheren Fu and Zhendong Mao and Yongdong Zhang},
year = {2026},
eprint = {2605.18678},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.18678},
}