©2024 Zhejiang Zhiben Law Firm. All rights reserved.
LABELS: AI, Intellectual Property, Digital Economy
Generative artificial intelligence ("AIGC"), as a revolutionary technology in the field of artificial intelligence, is rapidly reshaping the content creation ecosystem. Through deep learning models, AIGC can automatically generate text, images, audio, video, and other forms of content, bringing unprecedented opportunities to the creative industry. However, as the technology develops rapidly, data-related issues have gradually emerged and become a key factor constraining the healthy development of AIGC. The Interim Measures for the Management of Generative Artificial Intelligence Services (the "Interim Measures"), which came into effect on August 15, 2023, aim to guide and promote the compliant use of AIGC technology through legal means, protect data security, respect intellectual property rights and personal privacy, and prevent data bias and discrimination. This article analyzes the potential data compliance risks of AIGC across three stages (model training, model application, and model optimization) and offers compliance recommendations for AIGC technical support providers, AIGC platform operators, and AIGC service users.
1、 Model training stage
Article 7 of the Interim Measures stipulates that providers of generative artificial intelligence services shall carry out training data processing activities such as pre-training and optimization training in accordance with the law, and comply with the following requirements: (1) use data and foundation models from legitimate sources; (2) where intellectual property rights are involved, not infringe the intellectual property rights lawfully enjoyed by others; (3) where personal information is involved, obtain the individual's consent or satisfy other circumstances provided by laws and administrative regulations; (4) take effective measures to improve the quality of training data and enhance its authenticity, accuracy, objectivity, and diversity; (5) comply with other relevant provisions of laws and administrative regulations such as the Cybersecurity Law of the People's Republic of China, the Data Security Law of the People's Republic of China, and the Personal Information Protection Law of the People's Republic of China, as well as the regulatory requirements of the relevant competent authorities. Article 8 stipulates that, for data annotation during the research and development of generative artificial intelligence technology, providers shall formulate clear, specific, and operable annotation rules that meet the requirements of the Measures; conduct quality assessments of data annotation and verify the accuracy of annotated content through sampling; and provide necessary training to annotation personnel, enhance their awareness of respecting and abiding by the law, and supervise and guide them to carry out annotation work in a standardized manner.
Based on the above regulations, the factors related to data compliance during the model training phase mainly involve two aspects: training data sources and data quality.
(1) Data Source
The legality of training data sources is the starting point for any discussion of training data compliance. AIGC technical support providers often obtain training data through public collection, self-collection, third-party procurement, and other methods. The main risks they may face in this process are as follows:
1. Infringement of intellectual property rights
Article 53 of the Copyright Law of the People's Republic of China stipulates that whoever commits any of the following acts of infringement shall, depending on the circumstances, bear the civil liability provided in Article 52 of the Law: (1) reproducing, distributing, performing, screening, broadcasting, compiling, or disseminating a work to the public via information networks without the permission of the copyright owner, except as otherwise provided in this Law.
Based on the above provisions, if the data obtained by an AIGC technical support provider includes material protected by copyright or other intellectual property rights and full authorization has not been obtained, infringement of copyright or other intellectual property rights is often involved. For example, where web crawlers are used to obtain data, articles, images, user comments, and even a website's own database may all constitute original works within the meaning of copyright law, regardless of whether they are freely accessible on the original website. Crawling and using such data without authorization may constitute copyright infringement.
It is worth exploring whether the conduct of AIGC technical support providers in obtaining data for model training qualifies as "fair use". On the one hand, AIGC technical support providers generally copy or download the relevant training data to their own or third-party servers for convenient use. This conduct typically constitutes "reproduction" in the copyright sense, and because AIGC technical support providers generally use training data for their own commercial purposes, it seems difficult to satisfy the fair use circumstances expressly enumerated in the Copyright Law. On the other hand, large model training involves only "intermediate reproduction" of works: although the training stage may involve reproducing training data (which may contain a large number of works protected by copyright law), such copies are not the final form of the large model product, and AIGC technical support providers normally do not disseminate or display them to the public. Moreover, from the perspective of purpose, AIGC technical support providers reproduce training data and perform preprocessing steps such as cleaning and annotation in order to transform the training data into numerical representations that machines can process, so as to summarize and learn its inherent patterns and characteristics. Whether the fair use doctrine applies to large model training therefore merits in-depth exploration.
In addition, under the Anti-Unfair Competition Law of the People's Republic of China (the "Anti-Unfair Competition Law"), trade secrets are technical, business, or other commercial information that is not known to the public, has commercial value, and is subject to corresponding confidentiality measures taken by the rights holder. In the process of obtaining training data, if the relevant data constitutes trade secrets and the AIGC technical support provider fails to identify them and uses such data without authorization, this may constitute trade secret infringement and give rise to corresponding liability.
2. Unfair competition
In practice, AIGC technical support providers often obtain training data through crawling and other technical means, which may pose a risk of unfair competition. Article 127 of the Civil Code of the People's Republic of China stipulates that where the law provides for the protection of data and online virtual property, such provisions shall apply. This is the legal basis for protecting data rights and interests; however, the clause is only a framework and referential provision and does not specify the nature of data rights or the requirements for their protection. In judicial practice, for illegal data crawling, courts tend to apply the relevant provisions of the Anti-Unfair Competition Law.
Article 2 of the Anti-Unfair Competition Law stipulates that operators shall, in market transactions, follow the principles of voluntariness, equality, fairness, and good faith, and comply with recognized business ethics. Using web crawling technology to bypass the Robots protocol (especially the Disallow directives of the target website) and crawl relevant data may be deemed a violation of the "recognized business ethics" mentioned above and thus constitute unfair competition, and users of the relevant technology may bear liability such as ceasing the infringement and compensating for damages. Furthermore, if the use of crawlers interferes with the normal operation of the visited website, or the crawled data is used to substitute for the services of the crawled party, a finding of unfair competition becomes more likely.
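To illustrate, compliance with the Robots protocol discussed above can be verified programmatically before any crawling takes place. The sketch below, in Python, uses the standard library's `urllib.robotparser`; the rules shown are a hypothetical example of a target site's robots.txt, not those of any real website:

```python
from urllib.robotparser import RobotFileParser

def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules supplied as text (no network access needed)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

# Hypothetical rules a target site might publish at /robots.txt
RULES = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = make_parser(RULES)
print(rp.can_fetch("my-crawler", "/public/listing"))  # True  - path is allowed
print(rp.can_fetch("my-crawler", "/private/data"))    # False - path is disallowed
print(rp.crawl_delay("my-crawler"))                   # 5     - seconds to wait between requests
```

In a real crawler, a `False` result should abort the request, and the crawl delay should be honored between requests to avoid burdening the target server (which, as discussed above, is also relevant to criminal-law risk).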
For example, in a case involving the crawling and use of housing data from a real estate transaction information platform [2], the court held that Company S had used technical means to crawl the data at issue on a massive scale, stored it on its own servers, removed the original platform's watermark, added the watermarks of other entities, and disseminated the data to social media and third-party real estate information platforms, thereby providing important tools and convenient conditions for the publication of "fake listings" and objectively promoting their spread, in clear violation of the principle of good faith and the business ethics of the real estate brokerage industry. Moreover, although Company S expressly undertook in the litigation to immediately cease the accused conduct, it continued that conduct in a more covert manner, demonstrating obvious subjective bad faith. The accused conduct seized user traffic that originally belonged to Company L, undermined user stickiness and trust, directly harmed consumers' rights to know, to choose, and to transaction security through "fake listings", prevented operators relying on honest operation from obtaining the effective incentives of competitive advantage, and disrupted the competitive ecology and order of the real estate brokerage industry, thus constituting unfair competition.
3. Infringement of personality rights
Article 990 of the Civil Code stipulates that personality rights are the rights to life, body, health, name, portrait, reputation, honor, privacy, and other such rights enjoyed by civil subjects; in addition to the personality rights stipulated in the preceding paragraph, natural persons enjoy other personality rights and interests arising from personal freedom and personal dignity. Article 991 stipulates that the personality rights of civil subjects are protected by law and shall not be infringed by any organization or individual. Specifically, Article 1018 stipulates that natural persons enjoy the right of portrait and may lawfully make, use, publicly disclose, or license others to use their own portrait. Article 1019 stipulates that no organization or individual may infringe the portrait rights of others by vilifying or defacing a portrait, or by forging one through information technology; without the consent of the portrait holder, no one may make, use, or publicly disclose the portrait holder's portrait, except as otherwise provided by law. Article 1023 stipulates that the protection of a natural person's voice shall be governed by reference to the relevant provisions on the protection of portrait rights.
In practice, given that training data may include images, videos, and other content, if the portraits or voices in such data can reflect the characteristics of natural persons, or if the general public can associate the relevant images or sounds with specific natural persons, such images or sounds may fall within the scope of natural persons' portrait rights and voice interests. AIGC technical support providers should obtain authorization for the use of training data containing such natural persons' portraits or voices; otherwise, the use may constitute infringement.
4. Infringement of personal information
Article 44 of the Cybersecurity Law of the People's Republic of China stipulates that no individual or organization may steal or otherwise illegally obtain personal information. Article 27 of the Personal Information Protection Law of the People's Republic of China (the "Personal Information Protection Law") stipulates that a personal information processor may, within a reasonable scope, process personal information that an individual has disclosed voluntarily or that has otherwise been lawfully disclosed, unless the individual explicitly refuses; where the processing of disclosed personal information has a significant impact on an individual's rights and interests, the processor shall obtain the individual's consent in accordance with the Law. In particular, processing sensitive personal information requires the individual's separate consent. Therefore, if the data obtained by an AIGC technical support provider for model training contains personal information, the provider should comply with the above provisions of the Personal Information Protection Law; collecting users' personal information without their consent may constitute unlawful infringement of personal information.
5. Processing core and important data
Article 21 of the Data Security Law of the People's Republic of China provides that core data refers to "data related to national security, the lifeline of the national economy, important aspects of people's livelihoods, major public interests, and the like". Article 19 of the Measures for Security Assessment of Data Export provides that important data refers to "data that, once tampered with, destroyed, leaked, or illegally obtained or used, may endanger national security, economic operation, social stability, public health and safety, and the like". At present, many localities, industries, and pilot zones have issued rules or catalogs clarifying the scope of core and important data. For example, the Ministry of Industry and Information Technology has refined the criteria for identifying important and core data in its field in the Measures for the Management of Data Security in the Field of Industry and Information Technology (Trial), and the Several Provisions on the Management of Automotive Data Security (Trial), jointly issued by five departments, define the scope of six categories of important data in the automotive industry.
If the data used by an AIGC technical support provider to train a large model involves core data or important data, a series of stricter obligations must be fulfilled, and the specific manner of performance varies by industry, including but not limited to: (1) obligations toward regulators, such as filing with the industry regulatory authority of the provider's region, completing change procedures whenever the filed content changes, conducting risk assessments and submitting risk assessment reports, and regularly reporting on data security management; (2) data security management obligations, such as establishing a data security work system for the relevant departments of the organization, clarifying data security responsibilities, and taking security measures commensurate with the data's security classification level.
Therefore, AIGC technical support providers need to identify the important and core data that may be contained in training data and then fulfill the relevant compliance obligations according to their industry, region, and the data's classification level. However, the current standards and catalogs for identifying core and important data are scattered across departmental regulations, industry standards, and local laws and regulations, so AIGC technical support providers may find it difficult to ensure the accuracy and completeness of this identification work, and consequently difficult to fully fulfill the compliance obligations that depend on it.
6. Criminal risk
Under Articles 285 and 286 of the Criminal Law of the People's Republic of China, unauthorized access to "data stored, processed, or transmitted in computer information systems", "illegal control of computer information systems", or interference with the functions of computer information systems may, where the circumstances are serious, result in criminal punishment. For example, if an AIGC technical support provider intentionally circumvents or forcibly breaks through a website's anti-crawling measures, or intrudes into computer information systems other than those specified in Article 285(1) of the Criminal Law, or runs web crawlers with such speed or volume of repeated access that they occupy large amounts of server bandwidth and computing power, significantly increase the processing burden, and thereby interfere with the normal operation of computer information systems with serious consequences, criminal liability may be involved.
TIPS for AIGC technical support providers on obtaining training data during the model training stage:
Obtain authorization and consent from the rights holders of training data: In practice, model training requires massive amounts of data, and obtaining authorization from every data subject is generally impractical. However, for certain high-risk data, such as sensitive personal information including biometrics, religious beliefs, specific identities, medical health, financial accounts, and whereabouts, AIGC technical support providers should obtain separate authorization and consent from the relevant rights holders.
Use crawlers and other technical means lawfully: When obtaining training data through crawlers and other technical means, AIGC technical support providers shall not break through or bypass technical protection measures and shall comply with the Robots protocol; avoid crawling personal information, others' copyrighted works, and similar content; and avoid massive, high-frequency crawling that disrupts the normal operation of the target website. In addition, when crawling and using open-source datasets, AIGC technical support providers must also comply with the requirements of the applicable open-source licenses.
Avoid collecting and processing core and important data: In principle, avoid collecting and processing training data that contains core or important data, and pay close attention to identifying such data. Once data used for model training is identified as core or important data, the AIGC technical support provider must focus on protecting it and fulfill the relevant obligations of a data processor.
Strictly review third-party data sources: When purchasing training data from third-party data suppliers, AIGC technical support providers should sign clear cooperation agreements requiring the suppliers to make non-infringement representations and warranties covering the intellectual property rights and civil rights (including but not limited to personality rights and personal information) associated with the relevant training data, and requiring the suppliers to ensure the integrity of the authorization chain.
Establish data compliance management and technical response plans: AIGC technical support providers should also comply with relevant data protection regulations and AI ethics standards, and use technical means such as data encryption and anonymization to establish and improve risk response plans, strictly control the scope of use and disclosure of training data, protect training data from unauthorized access, and reduce the risk of infringement.
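As a concrete illustration of the anonymization measures mentioned in the tips above, the Python sketch below shows two common techniques: salted one-way hashing of direct identifiers (pseudonymization) and masking of phone numbers in free text. The field names, the salt handling, and the 11-digit mobile number format are illustrative assumptions, not requirements drawn from the Interim Measures:

```python
import hashlib
import re

def pseudonymize(value: str, salt: str) -> str:
    """Replace a direct identifier with a salted, one-way SHA-256 digest.

    The salt should be stored separately from the data so the mapping
    cannot be trivially reversed by precomputed-hash lookup.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]  # a truncated digest is enough to serve as a join key

def mask_phone_numbers(text: str) -> str:
    """Mask the middle four digits of 11-digit mobile numbers in free text."""
    return re.sub(r"\b(\d{3})\d{4}(\d{4})\b", r"\1****\2", text)

# Hypothetical raw record before it enters a training corpus
record = {"name": "Zhang San", "phone": "13812345678"}
safe_record = {
    "name_key": pseudonymize(record["name"], salt="per-project-secret"),
    "phone": mask_phone_numbers(record["phone"]),
}
print(safe_record["phone"])  # 138****5678
```

Note that pseudonymized data may still count as personal information under the Personal Information Protection Law if re-identification remains possible, so such measures reduce, rather than eliminate, compliance obligations.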
(2) Data Quality
Training large models requires large-scale, high-quality, multimodal datasets, typically collected from multiple fields and data sources. Data quality directly affects the effectiveness of model training: high-quality data should be accurate and representative, comprehensively reflecting the features and patterns the model needs to learn. The accuracy of data annotation is likewise crucial to the model's understanding. Annotation should not only be precise but also follow ethical and legal standards, respecting the rights of everyone involved in the data, including avoiding bias and discrimination and ensuring the diversity and inclusiveness of the data. Specifically, the main training data quality risks that AIGC technical support providers may face are as follows:
1. Uneven annotation quality generates misleading content
On the one hand, inconsistent data labeling may bias the model's recognition of specific categories. For example, in an image recognition task, if annotators apply different standards to the objects in images, the model may confuse categories, producing content inconsistent with reality. On the other hand, errors and noise in the dataset weaken the model's generalization ability: when the dataset contains many mislabeled samples, the model may learn these erroneous features instead of the true data distribution, degrading its performance on new data. In addition, bias in data annotation may lead the model to generate discriminatory content; if annotators are influenced by their own biases during annotation, the model may learn and replicate those biases, producing unfair output.
2. Lack of diversity in training data leads to value bias
On the one hand, the lack of diversity in training data may lead to biases in the model's understanding of certain groups or cultures. If the training data mainly comes from specific regions or social groups, the model may overemphasize the values and perspectives of these groups while ignoring the voices of other groups, resulting in defects in the generated content in terms of cultural diversity and inclusiveness. On the other hand, the limitations of training data may also result in poor performance of the model when dealing with complex topics and abstract concepts. Complex themes and abstract concepts often require broader knowledge and deeper understanding. If the training data lacks these aspects of data, the model may not be able to generate in-depth and comprehensive content, which may affect its application effectiveness in professional fields. In addition, deviations in training data may also lead to models exhibiting unfair tendencies when generating content. If there are biases such as gender, race, or social status in the training data, the model may replicate these biases when generating content, resulting in discriminatory content.
3. The timeliness bias of training data reduces the credibility of the model
On the one hand, the timeliness bias of training data may make the model appear inadequate when dealing with the latest events or trends. For example, in fields such as news reporting or market analysis, if the model relies on outdated data, the generated content may not accurately reflect the latest developments, thereby misleading user decisions. On the other hand, insufficient timeliness of training data may affect the professionalism and authority of the model in specific fields. In professional fields such as law and healthcare, knowledge updates rapidly. If the training data relied upon by the model fails to keep up with the latest research results or regulatory changes, the generated content may lose its professionalism and even become misleading. In addition, the timeliness of training data may also lead to user distrust. Users expect the model to provide accurate and reliable information. If the model frequently outputs outdated or inaccurate content, users may question the credibility of the model, thereby affecting its long-term development.
TIPS for AIGC technical support providers on training data quality management during the model training stage:
Adopt strict data quality management measures: Strictly manage the quality of training data through data cleaning, annotator training, multiple rounds of annotation and verification, and similar measures; continuously monitor and evaluate the model's output to ensure its quality and safety, minimize the risks caused by uneven annotation quality, and improve the reliability and effectiveness of the model.
Enhance the diversity of training data: Ensure that the training data has sufficient representativeness, covering different cultures, regions, and social groups; Conduct detailed analysis and screening of training data to ensure its quality and diversity; Continuously monitor and evaluate the model to ensure that its generated content complies with social values and ethical standards.
Regularly update and monitor training data: Regularly update training data to ensure that the information reflected is consistent with the current actual situation; Establish effective data monitoring and feedback mechanisms to promptly identify and correct timeliness issues in training data; Strengthen cooperation with professional fields to ensure that the model can timely absorb the latest research results and knowledge updates.
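The sampling-based verification of annotation accuracy required by Article 8 of the Interim Measures, and the multi-round annotation and verification mentioned in the tips above, can be operationalized in many ways. One minimal Python sketch follows; the 5% sample rate and the inter-annotator agreement metric are illustrative choices, not regulatory requirements:

```python
import random

def sample_for_review(records: list, rate: float = 0.05, seed: int = 42) -> list:
    """Draw a reproducible random sample of annotated records for manual QA."""
    rng = random.Random(seed)  # fixed seed so the audit sample can be reproduced
    k = max(1, int(len(records) * rate))
    return rng.sample(records, k)

def agreement_rate(labels_a: list, labels_b: list) -> float:
    """Fraction of items on which two annotators assigned the same label.

    A low rate signals unclear annotation rules or insufficient annotator
    training, both of which Article 8 asks providers to address.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be the same length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

# Hypothetical annotated dataset
annotations = [{"id": i, "label": "cat" if i % 2 else "dog"} for i in range(100)]
review_batch = sample_for_review(annotations)
print(len(review_batch))  # 5
```

In practice the sampled batch would be re-labeled by a second annotator, and a low `agreement_rate` would trigger a revision of the annotation rules before further labeling proceeds.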
2、 Model application stage
Article 11 of the Interim Measures stipulates that providers shall fulfill their obligation to protect users' input information and usage records in accordance with the law; shall not collect unnecessary personal information; shall not illegally retain input information and usage records from which users' identities can be identified; and shall not illegally provide users' input information and usage records to others. Providers shall promptly accept and handle individuals' requests to access, copy, correct, supplement, or delete their personal information in accordance with the law. During the model application stage, AIGC service providers need to process the data entered by AIGC service users when using AIGC services. In this process, both AIGC service providers and AIGC service users may face data compliance risks, mainly as follows:
(1) Data processing
1. Processing personal information without a legal basis
Article 5 of the Personal Information Protection Law stipulates that the processing of personal information shall follow the principles of lawfulness, legitimacy, necessity, and good faith, and shall not be conducted through misleading, fraud, coercion, or other such means. Article 6 stipulates that the processing of personal information shall have a clear and reasonable purpose, be directly related to that purpose, and adopt the method with the least impact on individual rights and interests; the collection of personal information shall be limited to the minimum scope necessary to achieve the processing purpose, and excessive collection is prohibited. Article 7 stipulates that the processing of personal information shall follow the principles of openness and transparency, with the processing rules disclosed and the purpose, method, and scope of processing expressly stated. Article 10 stipulates that no organization or individual may illegally collect, use, process, or transmit the personal information of others; illegally trade, provide, or disclose the personal information of others; or engage in personal information processing activities that endanger national security or the public interest. The relevant provisions of Article 11 of the Interim Measures restate these principles of the Personal Information Protection Law in the artificial intelligence context. In practice, the AIGC service provider that directly provides services to AIGC service users is usually responsible for fulfilling the above obligations; where an AIGC service provider processes the personal information of AIGC service users beyond the permitted scope or otherwise illegally, it must bear corresponding legal liability.
2. Risk of cross-border data transmission
Where AIGC service providers access services provided by overseas providers through APIs, or deploy their own servers overseas, the data uploaded by AIGC service users may be transmitted abroad. Given the significant uncertainty about the types of data AIGC service providers may transfer overseas, relevant data export compliance obligations may be triggered. Under the Data Security Law, the Personal Information Protection Law, the Measures for Security Assessment of Data Export, and other relevant regulations, China has established three main paths for data export: passing a security assessment organized by the national cyberspace administration, obtaining personal information protection certification from a professional institution, or concluding a contract with the overseas recipient in accordance with the standard contract formulated by the national cyberspace administration to define the rights and obligations of both parties. At the same time, the Provisions on Promoting and Regulating Cross-border Data Flows set out several exemptions: for example, where the exported data contains no personal information or important data, or where the provider expects to transfer the personal information of fewer than 100,000 individuals overseas within one year, there is no need to declare a security assessment, file a standard contract, or obtain certification.
3. Lack of protection of data subject rights
Through its general provisions, the Personal Information Protection Law makes clear that individuals have the right to know and the right to decide regarding the processing of their personal information, as well as the right to restrict or refuse the processing of their personal information by others; it also specifies the rights to access, copy, transfer, correct and supplement, delete, and request an explanation. At the same time, the Law requires enterprises, as personal information processors, to establish convenient mechanisms for accepting and handling individuals' requests to exercise their rights, and to explain the reasons where such a request is refused; if a processor refuses an individual's request to exercise their rights, the data subject may file a lawsuit with a court. AIGC service providers should therefore handle users' rights requests prudently and respond promptly, and should not use practical difficulty as an excuse for failing to process requests or for delay.
TIPS for AIGC service providers on data processing during the model application stage:
Ensure a legal basis for processing personal information: When processing personal information such as AIGC service users' input information and usage records, AIGC service providers should clearly inform users of the processing purpose, method, and retention period; process and retain personal information, within the necessary scope and for clear and reasonable purposes, in the manner and for the period that minimizes the impact on users' rights and interests; and refrain from excessively collecting users' personal information.
Fulfill compliance obligations for cross-border data transfers: Based on their specific business circumstances and the relevant legal provisions, AIGC service providers should choose among applying for a data export security assessment, signing standard contracts with overseas recipients, obtaining personal information protection certification, and other methods to ensure the legality and compliance of data exports.
Establish a response mechanism for personal information subjects' rights: AIGC service providers should systematically sort out the personal information that may be involved in the use of the model, establish and publicize a mechanism for responding to data subjects' rights, and promptly accept and handle requests to access, copy, correct, supplement, delete, and explain personal information.
(2) Data Security
1. The input data contains sensitive data
When AIGC service users use a model, if the input data contains sensitive data, such as sensitive internal files, trade secrets, or personal information, the users inadvertently face a significant risk of data leakage. For example, in the incident in which Samsung employees leaked trade secrets, employees using ChatGPT to optimize code or summarize meeting minutes may have provided the company's confidential information to the supplier, OpenAI, creating a risk of leakage. Furthermore, if the AIGC service provider uses sensitive data entered by users as training data for the model, there is a risk of secondary leakage. For example, an Amazon corporate lawyer reportedly found text in ChatGPT-generated content that was "very similar" to company secrets, possibly because some Amazon employees had entered internal company data when using ChatGPT to generate code and text; the lawyer was concerned that the information entered by employees may have been used as training data for ChatGPT's iterative optimization.
2. Model data security incidents
If the security measures taken by AIGC service providers are insufficient, they also face multiple data breach risks: hackers may gain unauthorized access to data by identifying and exploiting model vulnerabilities, such as software defects or improper configurations, and phishing or other deception may lead internal personnel of AIGC service providers to inadvertently expose sensitive data.
TIPS for AIGC service providers and users on data security management during the model application stage:
Establish a control mechanism for external model use: AIGC service users should impose clear restrictions on employees' use of external models, such as prohibiting the unauthorized uploading of internal data to external models and setting up alert mechanisms, and should encrypt sensitive data so that even if relevant data is uploaded non-compliantly, the file content will not be leaked.
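The alert mechanism described above could, for example, screen text before it leaves the corporate network. The Python sketch below checks an outbound prompt against a small set of patterns; the rule names and regular expressions are illustrative assumptions for this sketch, not an exhaustive or authoritative rule set:

```python
import re

# Hypothetical screening rules; a real deployment would maintain a richer,
# regularly reviewed rule set (keywords, classifiers, file-type checks, etc.)
SENSITIVE_PATTERNS = {
    "prc_id_number": re.compile(r"\b\d{17}[\dXx]\b"),  # 18-character PRC ID card format
    "confidential_tag": re.compile(r"\b(confidential|trade secret)\b", re.IGNORECASE),
}

def check_outbound_prompt(text: str) -> list:
    """Return the names of the rules the prompt triggers; empty means it may be sent."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]

def guard(text: str) -> str:
    """Block the prompt if any rule fires; otherwise pass it through unchanged."""
    hits = check_outbound_prompt(text)
    if hits:
        raise PermissionError(f"prompt blocked, triggered rules: {hits}")
    return text

print(check_outbound_prompt("please summarize these meeting notes"))  # []
print(check_outbound_prompt("this design doc is CONFIDENTIAL"))       # ['confidential_tag']
```

Such a gate would sit between employees and the external model's API, logging hits for the alert mechanism; it reduces, but cannot eliminate, the leakage risks discussed above, since pattern matching misses sensitive content that carries no recognizable marker.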