State-of-the-art solutions adopt the DETR-like framework and mainly develop complex decoders, e.g., regarding pose estimation as keypoint box detection and combining it with human detection in ED-Pose, or hierarchically predicting with a pose decoder and a joint (keypoint) decoder in PETR.
In this paper, we study end-to-end multi-person pose estimation and present a simple yet effective transformer approach, named Group Pose. We simply regard K-keypoint pose estimation as predicting a set of N×K keypoint positions, each from a keypoint query, as well as representing each pose with an instance query for scoring N pose predictions.
Motivated by the intuition that interaction among across-instance queries of different types is not directly helpful, we make a simple modification to the decoder self-attention. We replace the single self-attention over all N×(K+1) queries with two subsequent group self-attentions: (i) N within-instance self-attentions, each over the K keypoint queries and one instance query of a single pose, and (ii) (K+1) same-type across-instance self-attentions, each over the N queries of the same type. The resulting decoder removes the interaction among across-instance, type-different queries, easing optimization and thus improving performance. Experimental results on MS COCO and CrowdPose show that our approach, without human box supervision, is superior to previous methods with complex decoders, and is even slightly better than ED-Pose, which uses human box supervision.
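The factorization above can be made concrete with a minimal NumPy sketch. This is an illustrative simplification, not the paper's implementation: it uses single-head, projection-free scaled dot-product attention, whereas the actual decoder uses multi-head attention with learned projections. The function names and the (N, K+1, D) query layout are assumptions for illustration.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over x of shape (num_queries, dim)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def group_self_attention(queries):
    """Two subsequent group self-attentions over queries of shape (N, K+1, D):
    N instances, each with K keypoint queries plus 1 instance query."""
    n_instances, n_types, _ = queries.shape
    # (i) N within-instance self-attentions: each acts over the K+1 queries of one pose.
    out = np.stack([self_attention(queries[i]) for i in range(n_instances)], axis=0)
    # (ii) (K+1) same-type across-instance self-attentions: each acts over the
    # N queries of one type (one keypoint position, or the instance query).
    out = np.stack([self_attention(out[:, t]) for t in range(n_types)], axis=1)
    return out
```

Note that neither group attends across both instances and types at once, so the interaction among across-instance, type-different queries is removed by construction, while full attention over all N×(K+1) queries would include it.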