Neural network technology is developing rapidly, becoming more sophisticated and acquiring new capabilities every day. Neural networks that process images in a variety of ways, from animating photographs to generating complete images from a user's text request, have become especially popular.
The task of such a neural network is to form plausible images for a wide variety of sentences that explore the compositional structure of language. Another task is the simultaneous handling of multiple objects, their attributes and their spatial relations. To interpret a query sentence correctly, the algorithm must not only compose each object's attributes correctly, but also form the right associations. For example, to visualize the sentence "hedgehog in red hat, yellow gloves, blue shirt and green trousers", the neural network must recognize in the text and render the attribute-object combinations (hat, red), (gloves, yellow), (shirt, blue) and (trousers, green) without mixing them [1].
It should also be
noted that the user of this type of neural network cannot yet predict in
advance the visual result that the neural network will produce for the entered
textual query. The correlation between the original query text and the
resulting visual image is a separate class of problems, which is currently
being actively studied and solved by the developers of the largest
text-to-image neural networks, such as Midjourney.
2022 was a pivotal year for many creative professions. In March 2022, the Midjourney image generation neural network opened to the public and quickly gained a large following, not least because it became publicly available before the likes of DALL-E and Stable Diffusion. With a distinctive and already recognisable style, it is rapidly evolving and improving, allowing users to recreate increasingly complex queries in graphical form.
Also in 2022, USA-based
machine learning technology licensing and development company OpenAI unveiled
DALL-E 2, an updated version of the neural network first shown in January 2021.
The new version generates even better and more realistic images from text descriptions in English.
One of the major events in the world of AI image generators was the public release of the Stable Diffusion neural network because, unlike DALL-E and Midjourney, the Stable Diffusion model's source code is open and allows users to conduct their own experiments to improve the neural network. Stable Diffusion has become the basis for dozens of new projects, and their number continues to grow.
Nowadays the main area of application of such neural networks is the media and entertainment industry, although the range of potential applications is almost limitless: from illustrations for presentations to logos, sketches for films and official covers for glossy magazines.
Manufacturers in the fashion world are also turning their attention to the potential of neural network graphics. The rapid, high-quality development of unique designs tailored to individual user requirements (Figure 1) could become closely linked to production and find demand in society.
Figure 1. A unique design for sneakers generated by
Midjourney's neural network using the query "nike sneakers in khokhloma
style" [2].
It can be assumed
that the use of neural networks will evolve in the future as a tool for one of
the most sought-after scenarios for businesses - personalizing content to suit
individual user needs.
However, the
ability of neural networks to quickly and automatically generate an infinite
number of different images from a given textual description opens up
opportunities for scientific work as well.
With the ability
to train off-the-shelf algorithms on thematically selected material (prepared
image database), it is possible to create specialized neural networks adapted
to domain-specific terms and queries.
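As an illustration, the preparation of such a thematic training set can be sketched in a few lines of Python. The folder layout and the captions.json file mapping image names to textual descriptions are purely illustrative assumptions, not part of any particular system.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class DomainImageCaptionDataset(Dataset):
    """Image-caption pairs collected for a specific subject area (hypothetical layout)."""

    def __init__(self, root, captions_file="captions.json", transform=None):
        self.root = Path(root)
        # captions.json is assumed to map image file names to text descriptions
        with open(self.root / captions_file, encoding="utf-8") as f:
            self.captions = json.load(f)
        self.files = sorted(self.captions)
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        name = self.files[idx]
        image = Image.open(self.root / name).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.captions[name]
```

A dataset of this kind could then be fed to the fine-tuning scripts of an open model such as Stable Diffusion.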
For example, the
use of text-to-image neural networks is possible in areas such as environmental
monitoring or biomedical technology.
In order to organize environmental monitoring, it is first necessary to collect data. Because monitoring the processes taking place in the environment draws on many different sources (images, heterogeneous sensor data, textual data and others), the collected data is heterogeneous [3]. After analyzing the data and identifying the main components that have the greatest impact on the overall situation, it becomes possible to summarize what is happening in textual form. An appropriate textual query can then be generated and a visual image modeled from the linguistic data.
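A minimal sketch of this idea is given below; the sensor fields and threshold values are illustrative assumptions, chosen only to show how heterogeneous measurements could be reduced to a short textual prompt for a text-to-image model.

```python
def summarize_observations(obs: dict) -> str:
    """Reduce heterogeneous monitoring data to a short textual description."""
    phrases = []
    if obs.get("sky_color"):
        phrases.append(f"{obs['sky_color']} sky")
    if obs.get("wave_height_m", 0) > 3:      # threshold chosen for illustration
        phrases.append("high waves")
    if obs.get("wind_speed_ms", 0) > 20:     # threshold chosen for illustration
        phrases.append("a storm is approaching")
    return ", ".join(phrases)


# Example readings from buoys and weather stations (illustrative values)
observations = {"sky_color": "crimson", "wave_height_m": 4.2, "wind_speed_ms": 24}
print(summarize_observations(observations))
# -> "crimson sky, high waves, a storm is approaching"
```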
A comprehensive analysis makes it possible to obtain an accurate picture of the processes taking place and to draw adequate conclusions.
The use of neural
network graphics for rapid generation of illustrative images of the processes
under study allows one to get the most complete impression of what is
happening.
With the
availability of textual eyewitness testimonies, it becomes possible to quickly
reconstruct the visual picture of the events and visualize the overall
situation for further analysis.
The aggregate of
various data can be transformed into a visual form without the need to use
human imagination, but with the possibility of online expert corrections to
bring the final representation to the desired form that most accurately
reflects the phenomenon being described. Figure 2 shows a visualization of a
rather general query (query text: "crimson sky, high waves, a storm is
approaching"). Nevertheless, the image is already highly detailed and
presented in four versions, from which the user can choose the one most suitable for their needs and make the necessary adjustments until a satisfactory result is achieved.
Figure 2. Example of a visual image reconstructed from
a textual description.
For more specialized tasks, such as manufacturing or medicine, specially trained neural networks are needed, capable of understanding professional jargon or scientific terminology without allowing ambiguous interpretations.
Given the vast
amount of accumulated material, and the existence of specialized archives for
many fields, it may be a matter of time before specifically oriented graphical
neural networks are developed.
Potentially, their application offers ample opportunities for analyzing various types of data, combining them and displaying them in a clear and understandable way. They can also be used extensively in teaching and learning tools.
For example, a neural network could depict the typical condition of an organ or tissue given a certain set of symptoms listed in a query. If the textual description indicates a pathology, a visual representation can help to highlight it and support the right decision.
Neural networks
are already widely used in different fields of science.
For example, tasks performed by an inpainting function (removing objects and then filling in the empty areas of an image so that the fill is unnoticeable) are in demand in archaeology when it is necessary to recreate a building of which only ruins remain. A neural network can generate an image based on data about similar buildings and architectural styles.
This section presents brief descriptions of the most popular large commercial text-to-image neural networks that have become widely known in the last year. These include Midjourney, which opened in March 2022; DALL-E 2, an updated version of the neural network first demonstrated in January 2021; Stable Diffusion, an open-source neural network that has become the basis for dozens of new projects; and ruDALL-E, a Russian neural network based on generative models from SberDevices and Sber AI.
Midjourney [4] is proprietary software that creates images from text descriptions. The project was founded in February 2022 by scientist and entrepreneur David Holz. The Midjourney team positions itself as an independent research laboratory dedicated to expanding humanity's creative abilities.
Midjourney's work
is enabled by two relatively recent technological breakthroughs in artificial
intelligence: the ability of neural networks to understand human speech and
create images.
The neural network is trained to match textual descriptions with visual images on hundreds of millions of examples, using specially compiled collections that contain billions of images gathered from the internet together with matched image-text pairs. Such datasets can be commercial or open source, such as LAION [5], on which the well-known Stable Diffusion neural network was trained. This training makes it possible to solve various cross-modal tasks: generating pictures from text descriptions, generating text descriptions from pictures, regenerating or redrawing parts of an image, and so on. It also enables progress on such topical tasks as the visualization and completion of incomplete data.
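As a hedged illustration of such cross-modal matching, the publicly released CLIP model [9] can score how well a caption fits an image; the model checkpoint name and the example file below are assumptions made only for the sketch.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A pre-trained text-image matching model (trained on large image-caption collections)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
captions = ["a red car on the road", "a hedgehog in a red hat"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher probability means the caption matches the image better
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```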
Midjourney, like most neural networks of this type, handles general queries that do not go into specifics well. For example, given the query "red car on the road", it will generate quite satisfactory options. You can experiment with the car's colour, size and background, as these are still quite general queries.
Problems may arise with more specific queries. For example, a particular car model may already cause difficulties for a neural network: the rarer the model is in the images available online, the less chance the neural network will be able to draw it.
However, graphical neural networks are currently an extremely fast-developing and progressing area of computer graphics, so Midjourney versions are constantly being updated and improved. The paper [6] provides a comparative review of Midjourney versions v3 and v4, examining the key differences and features of the updated version. In March 2023, Midjourney v5 was released, and its features are only beginning to be explored.
DALL-E 2 [7] is one of the most popular neural network graphics systems, developed by OpenAI. Its predecessor, the original DALL-E, was a 12-billion-parameter version of GPT-3 (Generative Pre-trained Transformer 3, OpenAI's large language model) trained to generate images from text descriptions on a dataset of text-image pairs. DALL-E 2 is able to generate original images from textual descriptions and allows users to upload images and edit them, for example by adding elements. Furthermore, DALL-E can not only generate an image from scratch, it can also regenerate any rectangular area of an existing image.
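These capabilities are exposed through OpenAI's public image API. The sketch below follows that API as of the openai Python SDK version 1.x; the file names, prompt texts and image size are assumptions, and exact parameter names may differ between SDK versions.

```python
from openai import OpenAI

client = OpenAI()  # expects an API key in the OPENAI_API_KEY environment variable

# Generate an image from a textual description
generated = client.images.generate(
    model="dall-e-2",
    prompt="a hedgehog in a red hat, yellow gloves, blue shirt and green trousers",
    n=1,
    size="512x512",
)
print(generated.data[0].url)

# Regenerate a rectangular region of an existing image:
# the transparent area of mask.png marks the region to be redrawn
edited = client.images.edit(
    image=open("photo.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="the same scene with a small sailing boat on the horizon",
    n=1,
    size="512x512",
)
print(edited.data[0].url)
```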
According to the
developers, "DALL-E 2 is an artificial intelligence system that can create
realistic images and drawings from a natural language description".
DALL-E 2 started
as a research project and is primarily of interest due to the publications of
the developers, who have done a lot of work in creating algorithms and studying
the behavior and capabilities of the developed neural network [1, 8, 9].
The neural network can create images in a wide variety of drawing styles and techniques: the result may look like a frame from a cartoon or like a real photograph.
DALL-E 2 was trained on pairs of images and their respective captions. According to the
developers, the pairs were taken from a combination of publicly available and
licensed sources [10].
The software is
now available to a limited number of people, only by subscription. This is due
to both limited server infrastructure capacity and the developers' desire to
control the development and self-learning of the neural network through user
testing. In particular, due to concerns about the misuse of the neural network,
the developers carefully filter content for its training and incoming requests
for prohibited topics (violence, adult content, etc.).
Among the features provided in the latest updates are:
• higher resolution of images;
• query processing in more than 107 languages, including Russian;
• high query recognition accuracy;
• the ability to set colour filters and image style;
• the ability to take an existing image as input and create a creative variation of it;
• the ability to refine an uploaded image.
On 22 August 2022, Stability AI released its open-source image generation model, which could compete with DALL-E 2 in terms of quality.
Stable Diffusion
(SD) stands out from similar neural networks primarily due to its open source
code under the Creative ML OpenRail-M license [11]. This makes it possible to
run SD on your own computer, rather than via the cloud, which is accessed via a
website or API.
For decent results, the developers recommend an NVIDIA 3xxx series GPU with at least 6 GB of VRAM.
Stable Diffusion is a system made up of many components and models responsible for different parts of the process. These include a text understanding component, which converts textual information into a numerical representation, and a component that builds the image information in a compressed latent space, from which the image itself is subsequently drawn by an image decoder. The decoding is done only once, at the end of the process, and produces the finished pixel image. Operating in a latent space in this way speeds up generation compared with previous diffusion models that operate in pixel space (Figure 3).
Figure 3. The main components of Stable Diffusion.
For more on the
work of Stable Diffusion, see [12].
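Because the code and weights are open, the entire pipeline (text encoder, diffusion in the latent space, image decoder) can be run locally. A minimal sketch using the Hugging Face diffusers library is shown below; the checkpoint identifier and the prompt are assumptions made for the example.

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the text encoder, U-Net and image decoder of an open Stable Diffusion checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # a GPU with roughly 6 GB of VRAM or more is recommended

# Text -> diffusion in latent space -> a single decoding pass into a pixel image
image = pipe("crimson sky, high waves, a storm is approaching").images[0]
image.save("storm.png")
```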
ruDALL-E [13] is a
family of generative models from SberDevices and Sber AI. The neural network
was developed and trained by Sber AI researchers with the partner support of
scientists from AIRI Institute of Artificial Intelligence on a combined Sber AI
and SberDevices dataset of 1 billion text-image pairs. Teams from Sber AI,
SberDevices, Samara University, AIRI and SberCloud actively participated in the
project.
Specialists
created and trained two versions of the model, named after two great Russian
abstractionists, Vasily Kandinsky and Kazimir Malevich:
• ruDALL-E Kandinsky (XXL) with 12 billion parameters;
• ruDALL-E Malevich (XL) with 1.3 billion parameters.
Both models are capable of generating colourful images on a variety of topics from a short textual description. According to the developers, Kandinsky uses reverse diffusion and can process queries in 101 languages without any loss of quality or speed. These include both widespread languages, such as Russian and English, and rarer ones, such as Mongolian. The system will cope even if a query contains words in different languages.
Training the ruDALL-E neural network on the Christofari cluster was the largest computational task in Russia. It involved 196 NVIDIA A100 cards, each with 80 GB of memory. The whole training took 14 days, or 65,856 GPU-hours. The model was first trained for 5 days at 256x256 resolution, then for 6 days at 512x512 resolution, and finally for 3 days on the cleanest data.
The ruDALL-E Kandinsky 2.0 system is claimed to be the first multilingual diffusion neural network capable not only of accepting requests in different languages, but also of reflecting the linguistic and visual specifics of different language cultures.
This statement is
supported by a number of experiments [14]. In particular, such queries as
"national dish" or "person with higher education" are
tested (Figures 4 and 5). For the Russian-language query, the neural network
produces predominantly white males, while for the same query in French, the
results are more varied. For the query in Chinese, the results have more
stylized images, but in most cases they also reflect the national component.
Figure 4. Testing the query "photo of a person with higher education" in Russian, French and Chinese.
Figure 5. Testing the query "national dish"
in Russian, Japanese and Hindi.
The author also
conducted an experiment (Figure 6) on the FusionBrain platform [15], which
confirmed the orientation of this neural network to different language
environments. The query "national dish", performed in several
languages, produced completely different results.
Figure 6. Testing the query "national dish" in Russian, Hindi and Italian (rows across).
It is worth noting
that queries in different languages make sense to test either on the
above-mentioned platform or by interacting with developers' repositories
directly. The rudalle.ru platform is not adapted to such queries; it is capable
of perceiving a foreign language, identifying it, translating the query into
Russian, and then generating a visual image.
Such experiments
open up a separate area for research, as preliminary studies suggest that
neural networks of different language groups will have their own distortions
and differences in the interpretation of the same phenomenon, depending on the
mass culture belonging to one or another language group.
This rapid
development of neural network technologies' capabilities in the field of
graphics and photorealistic images brings to the forefront the task of
interaction between humans and neural network technologies using natural
language. The linguistic construct that a human uses to formulate a task often
contains much more meaning and historical context than a neural network, which
focuses on a specific set of parameters and phrases, can understand [16, 17].
For example, since
most neural networks can only understand queries in a certain language (English
being the most common), the linguistic context and subtleties of translation
must be taken into account when dealing with them. This issue also needs
research.
Thus, the experiments popular on the web that visualize well-known Russian proverbs and sayings are of dubious effectiveness, because they are most often carried out in the Midjourney neural network, which specializes in English-language queries and understands requests in other languages poorly. Accordingly, the cultural layer on which it relies belongs rather to the English-language space and reflects its specifics.
Thus, in Figure 7 the user gave the neural network a query in the form of the Russian proverb "волка ноги кормят" (literally, "the wolf is fed by its legs").
Figure 7. Neural network's attempt to generate an image from the Russian phrase "волка ноги кормят".
Unfortunately, the
original article [18] does not provide the exact text of the query, but judging
from the result, we can conclude that there was a direct literal translation of
"wolf feet fed", and the neural network reproduced this query quite
literally. Meanwhile, this proverb has a full English analogue in the idiom "The dog that trots about finds a bone", or the translation offered by the online translator DeepL, "the wolf feeds the wolf"; these imply completely different visual images while carrying the same meaning. Therefore, when giving a neural network a query, the difference between a semantic translation and a literal one should be taken into account, because the results can be drastically different. Making the right query is thus becoming, in a sense, a profession: people who have learned to obtain the intended, high-quality result are already called "prompt engineers", and more and more offers to compose a precise query for a neural network are appearing on freelance exchanges.
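The difference can be checked directly. The sketch below is illustrative only: the prompt texts are assumptions, and the open Stable Diffusion pipeline from the earlier example is reused simply for convenience to render the literal and the idiomatic translation of the same proverb side by side.

```python
import torch
from diffusers import StableDiffusionPipeline

# Literal vs. semantic rendering of the same Russian proverb (prompts are illustrative)
prompts = {
    "literal": "wolf feet fed",                           # word-for-word translation
    "semantic": "the dog that trots about finds a bone",  # idiomatic English analogue
}

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for label, prompt in prompts.items():
    pipe(prompt).images[0].save(f"proverb_{label}.png")
```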
In addition to poor correlation between the linguistic query and the graphical result, which arises when a neural network does not understand the query or recognizes and visualizes only part of the meaning the user put into it, generated images reveal a number of artifacts typical of neural networks.
These artifacts
are widespread and typical for neural network generated images. In particular,
their presence can be used to identify an image generated by a neural network.
Conventionally a
common set of artifacts can be divided into three main groups:
1. "Chimeras": cases when the neural network cannot correctly render the requested object or mixes the given objects with each other, generating surreal and sometimes frightening images. Such results can also be intended by the user, but then the query text itself implies combining incompatible notions.
One of the most famous examples is the human hand, or more precisely the position of the fingers. The most common artifact in the generation of human images is distorted hands, with missing fingers, six or more fingers, or fingers intertwined and bent at anatomically inconceivable angles. There is speculation that the neural network combines multiple hand arrangements but does not filter out minor details such as extra fingers.
2. Distorted composition: in the majority of cases, neural networks cannot create fully realistic or stylized images with a large number of details. Objects merge with one another, and some objects are under-drawn or mismatched.
This phenomenon differs from "chimeras" in that the overall structure of the generated picture seems natural at first glance, but closer examination reveals that some objects are incomplete, positioned relative to each other with distorted perspective, or flowing into one another.
Figure 8 shows one example of this problem. According to the request ("man waiting in line at Mcdonald's in Thailand, detailed facial features, full body, fuji color film, 2005 -v 4"), the picture shows a young man standing in a queue at McDonald's in an appropriate setting and with an appropriate appearance. The image is highly photorealistic and detailed, but a closer look at the background reveals a number of clear signs that it was generated by a neural network. The people blend into each other: for example, one man's face is sunken into another man's T-shirt, his head has no neck, and another man's arm is disproportionately thin and merges with the edge of his blue sleeve, blurring around the edges and seeming to shine through.
Figure 8. Image generated using Midjourney v4 [6].
Another artefact that occurs quite often in Midjourney is bent spoons in food pictures (Figures 9-10).
Figure 9. Presence of artefact - deformed spoon in the
generated image [19].
Figure 10. Presence of artefact - deformed spoon in
the generated image [20].
3. Texture artefacts: in this case the artifacts do not affect the overall image and occur in places where the neural network cannot adequately process some highly detailed area or recreate the desired structure. This could be hair, clothing fabric, or skin.
Such artifacts are
inherent to neural networks that can reconstruct part of the image, enhance the
quality, or generate the image from scratch.
More often than not, zooming in on the location of the artefact reveals a visible difference between the damaged area and the rest of the image. In Figure 11, for example, an odd ripple can be seen in one section of hair, unlike the rest of the hair. A neural network often produces this pixel-grid effect, but in most cases it is only visible at high magnification.
Figure 11. An example of the presence of artefacts in
hair texture [21].
Neural network technologies are currently in their heyday. From individual artists creating jewellery to large companies such as Adobe, users are beginning to employ the fruits of this work en masse for their own purposes. Such a leap is generating quite a few social phenomena. Some companies are already banning the uploading and sale on their sites of illustrations created with AI tools such as DALL-E, Midjourney and Stable Diffusion. Getty Images (the US photo agency that owns one of the world's largest image banks) rejects AI creativity because of possible copyright issues.
Content creation tools are predominantly trained on images taken from the internet and protected by copyright. These sources may include personal art blogs, news sites and stock sites (a stock photo is an image on a particular subject that is sold on publicly available marketplaces and can be used as an illustration or in advertising). Scraping (extracting data from web pages) is recognized as legal in the US and falls under the category of 'fair use'. A number of artists whose work has been copied or imitated by neural network image generators have called for this area to be regulated by law, as a neural network is able to copy a particular style very precisely and reproduce it in its own content (Figure 12).
Figure 12. An original photograph by Richard Avedon (left) and a portrait in the same style generated with the photorealistic Stable Diffusion model dreamlike-photoreal-2.0 (right) [22].
Another emerging
issue is the possibility for users to train ready-made neural network
algorithms on their own data, given the appropriate technical capabilities.
While most other paid neural networks have certain markers set by the
developers that limit a number of user requests, and the system bans the
account in case of abuse (such requests include 18+ topics, shocking content
and violence), open-source neural networks like Stable Diffusion allow users to
experiment relatively freely and train their own neural networks, targeting
certain areas.
This can lead to an uncontrolled flow of unfiltered graphic content into the online information space, including photorealistic images that, if misused, can trigger significant social movements and spread unconfirmed, fake information.
In February 2023, the first precedent of the fraudulent use of images generated by graphical neural networks was reported in the media [23]. Fraudsters used neural network images to cash in on the earthquakes in Turkey and Syria, distributing generated disaster images on Twitter along with cryptocurrency wallet addresses and requests for charitable donations (Figure 13).
Figure 13. Fake photo generated in a neural network.
Thanks to visible
artefacts (distortions of the child's face and fingers on the hands), the
fraudulent scheme was quickly uncovered, but this case risks being only the
first of many. Neural networks are rapidly improving and being updated, and
neural network images are becoming increasingly common in everyday life. It is
quite possible that their use will soon be aimed not only at creative
activities, but also at fraudulent and provocative ones. An instantaneous mass
of fake images of current political events or shocking content, generated at
high speed and in large quantities, can mislead unprepared people and lead to
negative social reactions.
Photorealistic neural networks carry the potential risk of discrediting and destroying the legal value of photographic and video evidence, and of distorting and falsifying historical sources.
Such perspectives
bring to the fore the need to address the task of verification and
identification of generated or processed photorealistic images in order to
effectively counteract their malicious use.
The first measures
proposed are to identify and classify the main features of neural network
images and typical artefacts (as described in section 3.2).
Currently, in most
cases neural network images can be identified by a set of direct and indirect
features, but for a number of images this becomes a difficult task. For
example, these could be single portraits of people with no hands or complex
poses, realistic 'photographs' of animals, abstract landscapes or paintings. In
these cases, the neural network has been trained on a huge database and
produces almost no artefacts.
This in turn raises the challenge of developing algorithms that distinguish the work of a neural network and identify computer "fakes" among original photographs.
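One possible starting point is a conventional image classifier fine-tuned to separate photographs from generated images. The sketch below is a minimal illustration, assuming a hypothetical folder of training images arranged into data/real/ and data/generated/ subfolders; it is not a production-ready detector.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Training images are assumed to be arranged as data/real/... and data/generated/...
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# A standard ResNet-18 with a two-class head: "real photo" vs "generated image"
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one pass over the data, for illustration only
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```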
In this paper,
state-of-the-art text-to-image graphical neural networks and methods of
text-to-image transformation have been examined and the results achieved have
been analyzed.
A number of
problems generated by these systems were considered. Ways of applying neural
network approaches to text-to-image transformation for environmental
monitoring, infrastructure and medical data analysis tasks were proposed.
1. Ramesh A., Pavlov M., Goh G., Gray S., Voss C., Radford A., Chen M., Sutskever I., 2021. Zero-Shot Text-to-Image Generation. arXiv:2102.12092 [cs.CV]. https://doi.org/10.48550/arXiv.2102.12092
2. Telegram Channel «Neurodesign», 2023a. https://t.me/neurodes/343 (19 March 2023)
3. Yazikov E.G., Talovskaya A.V., Nadeina L.V., 2013. Geoecological environmental monitoring: coursebook. Tomsk Polytechnic University.
4. Midjourney. https://www.midjourney.com/ (19 April 2023)
5. LAION. Large-scale Artificial Intelligence Open Network. https://laion.ai/ (19 March 2023)
6. Yubin Ma. 10 Incredible Prompt Styles to Try in Midjourney V4. https://aituts.com/midjourney-v4-prompts-to-try/ (23 January 2023)
7. DALL•E 2. https://openai.com/product/dall-e-2 (19 April 2023)
8. Dhariwal P., Nichol A., 2021. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233. https://doi.org/10.48550/arXiv.2105.05233
9. Radford A., Kim J.W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., Krueger G., Sutskever I., 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]. https://doi.org/10.48550/arXiv.2103.00020
10. DALL•E 2 Preview - Risks and Limitations, 2022. https://github.com/openai/dalle-2-preview/blob/main/system-card.md#model (19 March 2023)
11. Stable Diffusion Online. https://stablediffusionweb.com/ (19 April 2023)
12. Alammar J., 2022. The Illustrated Stable Diffusion. https://jalammar.github.io/illustrated-stable-diffusion/ (19 March 2023)
13. ruDALL-E. https://rudalle.ru/ (19 April 2023)
14. Shakhmatov A., Razhigayev A., Arkhipkin V., Nikolic A., Pavlov I., Kuznetsov A., Dimitrov D., Shavrina T., Markov S. Kandinsky 2.0 - the first multilingual diffusion for text-based image generation. https://habr.com/ru/company/sberbank/blog/701162/ (19 March 2023)
15. FusionBrain. https://fusionbrain.ai/diffusion (19 March 2023)
16. Isola P., Zhu J.-Y., Zhou T., Efros A.A., 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
17. Koh J.Y., Baldridge J., Lee H., Yang Y., 2021. Text-to-image generation grounded by fine-grained user attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 237–246.
18. Midjourney and idioms. https://pikabu.ru/story/midjourney_i_frazeologizmyi_9768400 (23 January 2023)
19. Telegram Channel «Neurodesign», 2023b. https://t.me/neurodes/619 (19 March 2023)
20. Telegram Channel «Neurodesign», 2023c. https://t.me/neurodes/303 (19 March 2023)
21. Telegram Channel «Neurodesign», 2023d. https://t.me/neurodes/750 (19 March 2023)
22. Makushin A. https://t.me/makushinphoto/541 (23 January 2023)
23. Gelbart H., 2023. Scammers are profiting from the earthquake in Turkey by raising money, supposedly to help the victims. https://www.bbc.com/russian/news-64640487 (19 March 2023)