



Zi Dong Tai Chu is a full-modality large model jointly developed by the Institute of Automation, Chinese Academy of Sciences, and the Wuhan Artificial Intelligence Research Institute. The first version, “Zi Dong Tai Chu 1.0,” was released in 2021 as the world’s first tri-modal large model, covering image, text, and audio with unified representation and mutual generation among the three. The upgraded “Zi Dong Tai Chu 2.0” followed in 2023 and further strengthened multimodal understanding and generation.

Model Versions

Zi Dong Tai Chu 1.0

  • Release Date: 2021
  • Key Features:
    • The world’s first tri-modal large model spanning image, text, and audio, with unified representation and mutual generation across the three modalities.
    • Primarily applied to basic multimodal tasks such as image recognition, text generation, and speech recognition.

Zi Dong Tai Chu 2.0

  • Release Date: June 2023
  • Key Features:
    • Builds on version 1.0 by adding video, sensor signals, and 3D point clouds as modalities, further strengthening multimodal understanding and generation.
    • Achieves breakthroughs in key technologies such as cognition-enhanced multimodal association, giving the model full-modality understanding, generation, and association capabilities.
    • Extends application scenarios to healthcare, legal consultation, traffic management, smart manufacturing, smart cities, and other fields.

Zi Dong Tai Chu 3.0

  • Expected Release: First half of 2024
  • Expected Features:
    • Further strengthens the model’s usefulness across industries, including the ability to select and use tools autonomously to support deeper logical interaction.
    • In intelligent driving, uses large language model and multimodal capabilities to substantially shorten and optimize the training process and to improve how efficiently intelligent vehicles perceive their surroundings.
    • Further optimizes the fused understanding of speech, video, and text, along with functions such as commonsense reasoning.

Pricing Models

Charge by Number of Calls or Data Volume
The API usage of Zi Dong Tai Chu is generally charged based on the number of API calls or the data volume processed. This pricing model is flexible and easy to understand, allowing users to pay according to their actual usage.

  • Number of Calls: Each API call is counted towards the total, and users are charged based on the number of calls.
  • Data Volume: Charges are based on the amount of data processed, with larger data volumes incurring higher fees.
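
The two charging bases above amount to simple multiplication. The following is a minimal sketch only: the `Rates` class, the function names, and every price in it are hypothetical placeholders chosen for illustration, not published Zi Dong Tai Chu prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices; real values must come from the provider."""
    per_call: float  # assumed price per API call
    per_mb: float    # assumed price per megabyte of data processed


def cost_by_calls(calls: int, rates: Rates) -> float:
    """Cost when billing is based on the number of API calls."""
    return calls * rates.per_call


def cost_by_volume(megabytes: float, rates: Rates) -> float:
    """Cost when billing is based on the volume of data processed."""
    return megabytes * rates.per_mb


# Placeholder rates, not real prices.
example_rates = Rates(per_call=0.5, per_mb=0.25)
print(cost_by_calls(1000, example_rates))   # 1000 calls at 0.5 each
print(cost_by_volume(200, example_rates))   # 200 MB at 0.25 each
```

With these placeholder rates, 1000 calls cost 500.0 and 200 MB cost 50.0 in whatever currency unit the rates are expressed in.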

Token-Based Billing
Zi Dong Tai Chu uses the token as its billing unit. A token is the basic unit of text that a language model processes, and the approximate conversion depends on the language:

  • Chinese Text: 1 token corresponds to around 1-2 Chinese characters.
  • English Text: 1 token corresponds to around 3-4 letters.
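
The ratios above can be turned into a rough pre-flight estimate of how many tokens a piece of text will consume. This is a sketch under stated assumptions: `estimate_tokens` is a hypothetical helper, it counts only common CJK ideographs and ASCII letters (digits, punctuation, and spaces are ignored), and the real tokenizer may count differently.

```python
import math


def estimate_tokens(text: str) -> tuple[int, int]:
    """Return a rough (low, high) token estimate from the stated ratios.

    Assumes 1 token covers 1-2 Chinese characters or 3-4 English letters.
    """
    # Count characters in the main CJK Unified Ideographs block.
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Count ASCII letters only; other characters are ignored in this sketch.
    letters = sum(1 for ch in text if ch.isascii() and ch.isalpha())

    # Fewest tokens: each token covers 2 characters or 4 letters.
    low = math.ceil(chinese / 2) + math.ceil(letters / 4)
    # Most tokens: each token covers 1 character or 3 letters.
    high = chinese + math.ceil(letters / 3)
    return low, high


print(estimate_tokens("你好世界"))     # 4 Chinese characters
print(estimate_tokens("multimodal"))  # 10 English letters
```

For these inputs the estimate is 2 to 4 tokens for the four Chinese characters and 3 to 4 tokens for the ten-letter English word.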

Subscription Packages
Users can choose a subscription package that matches their usage. The official provider may offer options such as:

  • Basic Package: Suitable for small-scale use with lower costs.
  • Advanced Package: Suitable for large-scale use, offering more calls and data processing at a higher cost.

Free Tier
The official provider may offer a free tier, allowing users to try the service within a certain limit before requiring payment for any usage beyond that.
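
Under a free tier of this kind, only usage beyond the free allowance is billed. A minimal sketch, assuming the allowance is measured in tokens; the function name and the quota figures are hypothetical, since the source does not state the actual limits.

```python
def billable_tokens(used: int, free_quota: int) -> int:
    """Tokens that would be charged after subtracting a free allowance."""
    if used < 0 or free_quota < 0:
        raise ValueError("usage and quota must be non-negative")
    # Usage inside the free allowance costs nothing, so never go below zero.
    return max(0, used - free_quota)


# Hypothetical quota of 1000 free tokens.
print(billable_tokens(1200, 1000))  # 200 tokens over the allowance
print(billable_tokens(800, 1000))   # still within the allowance
```

Here 1200 tokens of usage leaves 200 billable tokens, while 800 tokens stays within the allowance and leaves 0.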

Customized Services
For enterprises or developers with special needs, customized services may be provided, typically with a separate fee structure.

Application Scenarios


Healthcare

  • Neurosurgery Navigation: During neurosurgery, it fuses visual, tactile, and other multimodal information in real time to support the surgeon's reasoning and judgment, improving surgical precision and safety.
  • Multimodal Differential Diagnosis: By analyzing multimodal medical data (such as images, text, and signals), it provides more accurate diagnostic results and assists doctors with complex cases.

Legal Consultation

  • Legal Consultation Services: Through the analysis of multimodal data, it enhances the accuracy and efficiency of legal consultation, handling complex legal issues and providing high-quality legal advice.


Traffic Management

  • Traffic Violation Image Analysis: Combines image and audio data to analyze traffic scenes and identify violations, improving the efficiency and accuracy of traffic management.
  • Intelligent Driving: Uses large language model and multimodal capabilities to substantially shorten and optimize the training process and to improve how efficiently intelligent vehicles perceive their surroundings.

Smart Manufacturing

  • Production Process Optimization: By intelligently analyzing and optimizing production processes, it improves production efficiency and product quality, applicable to all stages of smart manufacturing.

Smart Cities

  • City Management: Supports city management, traffic scheduling, and public safety by analyzing and processing multimodal data, enhancing the intelligence of urban management.

Smart Tourism

  • Cultural Tourism: In smart tourism, it fuses and analyzes multimodal data to provide personalized travel recommendations and intelligent guided-tour services, improving the visitor experience.

Smart Education

  • Educational Assistance: In the smart education field, it analyzes and processes multimodal data to provide personalized learning suggestions and intelligent tutoring, improving education quality and learning outcomes.

Creativity and Entertainment

  • Music Understanding and Generation: Capable of understanding and generating music content, applicable to music creation and audio editing.
  • Image Generation and Editing: Generates images based on text descriptions, supporting creative design and advertising production.
  • Video Generation and Editing: Capable of generating and editing video content, applicable to short video production and visual effects in film and television.

Open-Source Strategy

Zi Dong Tai Chu is offered primarily in open-source versions. The open-source strategy is intended to promote technological innovation, reduce usage costs, and speed up the development of large model technology through community collaboration.
