A well-defined relationship between the chemistry of semiconductor processing and the resulting device characteristics is highly desirable. As technology nodes shrink along the semiconductor roadmap, it becomes harder to understand how the device electrical characteristics depend on process parameters such as oxidation and wafer cleaning procedures. In this work, we use a novel machine learning approach, i.e., physics-assisted multitask and transfer learning, to relate the process conditions to the device capacitance–voltage (C–V) curves. While conventional semiconductor process and device modeling is based on physical models, machine learning-based and hybrid approaches have recently drawn significant attention. In general, a huge amount of data is required to train a machine learning model. Since producing data in the semiconductor industry is not an easy task, physics-assisted artificial intelligence is a natural choice to resolve this issue. Hybridizing physics and machine learning improves the predicted C–V: the coefficient of determination (R²) is 0.9442 for semisupervised multitask learning (SS-MTL) and 0.9253 for transfer learning (TL), compared with 0.6108 for a pure machine learning model using multilayer perceptrons. The machine learning architecture used in this work is capable of handling data sparsity and promotes the use of advanced algorithms to model the relationship between complex chemical reactions in semiconductor manufacturing and actual device characteristics. The code is available at https://github.com/albertlin11/moscapssmtl.
Predicting device characteristics as a function of semiconductor process conditions is a difficult but important task.[1] The difficulty lies in the many chemical reactions involved and in the multidisciplinary knowledge that semiconductor manufacturing procedures require. When analytical approaches from chemistry and physics are used to model a semiconductor process and its relation to the device characteristics, the results generally deviate from the experiments. In some cases, no model exists for this purpose at all. With the rise of machine learning,[2−7] it becomes more feasible to use machine learning algorithms to describe the complex relations between semiconductor processes,[1] such as etching, deposition, oxidation, wafer cleaning, and lithography, and semiconductor device characteristics.
Metal oxide semiconductor field-effect transistors (MOSFETs) play a vital role in today’s integrated circuit (IC) technologies.[8] Ideally, for a MOS device, it is desired that
the electrical characteristics should be controlled by the bulk, not
by the interfaces. By analyzing the undesirable charges present in
the interface, capacitance–voltage (C–V) characteristics help in determining the stability and
overall performance. It has been seen that the silicon and silicon
dioxide interface is badly affected by the interface trap and fixed
charge.[9] Thus, it becomes of utmost importance
to study the process parameters,[10,11] which are
in turn influencing the electrical characteristics of devices. Industry
4.0 gives an idea about smart manufacturing[12] in which foundries are looking to artificial intelligence[13] and the Internet of Things for reducing the
cost and maintaining state-of-the-art fabrication quality.[14] Depending on the different capacitance present
in MOS capacitors (MOSCAPs), C–V is divided into three major sections: accumulation, depletion, and
inversion.[8] The total capacitance per area
(C) is expressed as the series combination of surface
differential capacitance per unit area (Cs) and oxide capacitance per unit area (Cox). As transistors shrink and the process becomes more complex, there
are more process steps in semiconductor manufacturing. Conventionally,
people use theory-based approaches to describe device behaviors,[15] but mapping the device characteristics to real
process parameters such as oxidation, RCA cleaning, etc., can be difficult
without the help of regression or machine learning (ML). With the
rise of ML in electronic design automation (EDA) in recent years,[2−7] it becomes more practical to incorporate the process parameters,
such as annealing temperature and plasma power, into the compact model
and circuit simulation.
ML models have performed excellently in many commercial applications such as fraud detection, machine translation,[16] and recommendation systems.[17] The drawback of ML-based compact models (CMs) is that a huge set of labeled data is required to train them.
Due to the requirement of a large set of labeled data for ML models
and the complexity that occurs in pure physical modeling, it may not
be feasible to design either a pure ML or pure physical device model
for emerging technology nodes with process-aware capability. A more practical modeling technique for the semiconductor industry is to combine ideal physics[18,19] with ML algorithms and the limited labeled experimental data that are available. This type of hybrid model
has been accomplished in many diverse fields like in the chemical
synthesis of pharmaceutical products,[20] water management systems,[21] and the semiconductor
design automation industry.[22−25] Specific to IC foundries, the availability of noise-free
practical silicon data is a matter of great concern. Fabricating a semiconductor MOS device requires a large set of machines, which costs a lot of time and money. With limited data, physics-based ML models can still recognize the pattern and behavior of the physical processes and thus enable state-of-the-art modeling. Device physics can also be introduced through the ML hyperparameters, for example, by calibrating the current–voltage characteristics of a tunnel FET with the help of different activation functions.[26] This paradigm shift in semiconductor modeling gives better convergence and improves the accuracy of the model. Neural network simulation models can also be extended to an entire device family, such as a unified resistive random access memory model for the whole memristor family.[27]
The efficacy of supervised learning is very high, but it requires a large amount of labeled data. Semisupervised multitask
learning (SS-MTL) and transfer learning (TL) are advanced ways to address this problem. For classification problems, these algorithms have reached state-of-the-art performance.[28] In the semiconductor industry, TL-based image defect classification is quite common in wafer manufacturing.[29] Because these models have reached human-level classification performance, they help manufacturers lower cost and increase yield. TL also performs well on regression problems, e.g., predicting the phonon characteristics of a wideband semiconductor from its electronic properties.[30] It has also been successfully used to determine the formation energy of materials, which helps improve material properties.[31] When little data is available, new materials can also be studied using TL.[32,33] C. Park et al. presented the application of multitask learning (MTL) to determining the quality of wafers coming from different process chambers.[34] Although the process parameters differ among chambers, the operation is the same. Thus, MTL reduces the need for separate models for each process variation[35] and helps in the production of semiconductor components.
We are presenting a
physics-assisted ML regression model that eases the data dependency. One model is based on a shared-layer semisupervised multitask learning approach,[36−38] and the other is a TL-based approach.[39,40] The tasks are defined as the prediction of C–V at different frequencies, in particular at intermediate frequencies. The measured data at intermediate frequencies are used exclusively for testing and not for training; that is, our models are trained only on the measured data at low and high frequencies. Physical modeling at all frequencies supplements the ML model to improve the prediction accuracy. The knowledge of the physical model is either shared or transferred to facilitate prediction, depending on whether SS-MTL or TL is used. Our models outperform the baseline multilayer perceptron (MLP), which is the most widely used ML algorithm in the electronic design automation industry. We have incorporated the effects of dry/wet oxidation, different wafer cleaning methods, and different metal deposition techniques, and therefore our compact modeling is process-aware. With a little more computational power, our models can readily be extended to a larger number of process parameters in semiconductor manufacturing. The initial results of this work can be found in our previous conference presentation[38] and student theses.[41,42]
Methodology
A three-layered metal-oxide-semiconductor
structure is fabricated
with different process parameters. We have considered different types
of wafer cleaning, thermal oxidation, metal deposition, and metal
electrode lithography as process parameters. The entire flow from
fabrication and measurement to machine learning is illustrated in Figure 1.
Figure 1
Flow of MOS capacitor
modeling.
Data Collection
We have 144,288 theoretical data points and 4896 measured data points. The input variables are the frequency, area, RCA/standard clean, dry/wet oxidation, sputter/e-gun Al, and voltage. The frequency is 1, 3, 5, 10, 50, or 100 kHz. The area is 0.005, 0.0049, 0.01, 0.0102, 0.0025, or 0.0026 mm². The voltage applied across the MOSCAP ranges from −4 to 2.5 V. The training data are the 1 and 100 kHz data, and the test data are the 3, 5, 10, and 50 kHz data. Out of the training data, the 0.01 mm² data serve as the validation set. The output variable is the capacitance in picofarads.
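The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the function name `assign_split` and the toy grid are hypothetical, and real rows would also carry the clean type, oxidation type, metal deposition type, voltage, and capacitance.

```python
from itertools import product

TRAIN_FREQS_KHZ = {1, 100}        # measured frequencies used for training
TEST_FREQS_KHZ = {3, 5, 10, 50}   # intermediate frequencies held out for testing
VALIDATION_AREA_MM2 = 0.01        # device area reserved for validation


def assign_split(frequency_khz, area_mm2):
    """Return 'train', 'validation', or 'test' for one measured sample."""
    if frequency_khz in TEST_FREQS_KHZ:
        return "test"
    if frequency_khz in TRAIN_FREQS_KHZ:
        return "validation" if area_mm2 == VALIDATION_AREA_MM2 else "train"
    raise ValueError(f"unexpected frequency: {frequency_khz} kHz")


# Toy grid of (frequency, area) combinations taken from the text.
frequencies = [1, 3, 5, 10, 50, 100]
areas = [0.005, 0.0049, 0.01, 0.0102, 0.0025, 0.0026]
splits = {(f, a): assign_split(f, a) for f, a in product(frequencies, areas)}
counts = {
    name: list(splits.values()).count(name)
    for name in ("train", "validation", "test")
}
print(counts)
```

Splitting by frequency rather than at random is what makes the test a genuine interpolation task: no measured sample at 3, 5, 10, or 50 kHz is ever seen during training.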
Sample Fabrication
The reliability
and overall yield strongly depend on the amount of contamination present
on the wafer. Consequently, it becomes important to study the effect
of cleaning. First, a p-type wafer is cleaned in two different types
of processes: one is RCA cleaning,[43] and
another is standard cleaning. Chemicals, i.e., NH4OH, H2O2, HCl, H2SO4, etc., are
used in different concentrations in these two cleaning methods. In
the standard cleaning, first, the chemical composition of NH4OH, H2O2, and H2O in a 1:4:20 ratio
is used for removing particles and organic contamination. Next, a
composition of HCl, H2O2, and H2O
in the proportion of 1:1:6 is used for removing metallic ions. During
these chemical treatments, there may be a chance of native oxide growth.
To eliminate this residue, the wafers are dipped into a 1:50 proportionate
solution of HF and H2O. After every chemical treatment
of wafers, a quick dump rinser is also used for maintaining wafer
quality. The entire wafer’s cleaning process is carried out
on a wet bench followed by a spin dryer. Once the cleaning is done,
the next step is to grow the oxide. There are many oxide formation methods, such as thermal oxidation, plasma anodization, and vapor phase reaction. Among these, thermal oxidation is widely used in industry. A 200 Å thick oxide is grown by wet or dry oxidation in a horizontal diffusion furnace. The top aluminum layer is deposited by one of two physical vapor deposition (PVD) techniques: the AST E-GUN PEVA-600I electron-beam PVD (E-gun) or the FSE Cluster PVD sputter.
In the E-gun, the deposition takes place in the presence of an electric field, and high-energy electron beams are generated through thermionic emission. These accelerated and focused beams strike the source material, which is kept in a crucible. The source material evaporates and is deposited on the substrate. In sputtering, high-energy plasma strikes the target material, and the ejected material condenses on the substrate as a thin layer. After the deposition of the MOS structure, lithography is performed on all the samples. Square and circular electrodes are used. During lithography, a thin positive photoresist layer of 8500 ± 150 Å thickness is coated using a TEL CLEAN TRACK MK-8. An exposure dose of 2000 J/m² is used in the Canon FPA-3000i5+ I-line stepper, which has a resolution of nearly 0.35 μm. The track machine is then used again to remove the exposed photoresist. Once the lithography is done, the electrodes are formed by dry etching in a TCP 9600SE etching machine, which gives the desired structure, and an ozone asher is used afterward to remove the residual resist. Due to the limitation of lithography, the wafers are sliced into small dies of 2 cm × 2 cm. During the front-end process, native oxide develops on the backside of the samples. To remove it, the etching machine is used again, and a thin aluminum film is deposited by the AST PEVA 600I e-gun. The process flow of sample fabrication is depicted in Figure 2, and the resulting structure can be seen in the TEM image in Figure 3.
Figure 2
Fabrication flow of MOS
capacitor manufacturing.
Figure 3
(a) TEM image. (b) TEM-EDS
line profile for Al/SiO2/Si.
(c) Scanning TEM and 2D EDS mapping of Al, Pt, Si, and O.
C–V Measurement
We use an Agilent 4284A for MOSCAP measurement.
One probe is placed on the top electrode, which will give the AC signal,
and the back electrode is connected by a chuck and grounded for measurement.
Afterward, the semiconductor device analyzer software (EasyEXPERT)
is used to adjust the sweep voltage range (from −4 to 2.5 V,
step voltage 0.13 V), integration time (medium), and measurement frequency
(from 1 kHz to 1 MHz). Our sample is measured under illumination using
a Schott ACE light source EKE 150W halogen lamp. A detailed description
of fabrication and measurement can be found in student theses.[41,42]
Theoretical C–V Generation
Typical C–V characteristics of a MOS capacitor are affected by many physical parameters.[8,9] Electrons (n) and holes (p) are present in the silicon as mobile charges, and ionized acceptors (Na⁻) and donors (Nd⁺) as fixed charges. Poisson’s equation (eq 1) relates the potential ψ and the electric field ξ to these charges.[9] All symbols are explained in Table 1. Solving eq 1 and integrating from the bulk to the surface gives eq 2.[9] Equations 1 and 2 describe the profile of the electric field, which is given by eq 3.[9]
Table 1
Values Used in the Generation of Theoretical C–V
| parameter | symbol | value |
|---|---|---|
| permittivity of free space | εo | 8.85 × 10⁻¹⁴ F/cm |
| permittivity of oxide | εox | 3.9εo |
| permittivity of silicon | εsi | 11.9εo |
| concentration of acceptor atoms | Na | 5 × 10¹⁵ cm⁻³ |
| intrinsic carrier concentration | ni | 1.5 × 10¹⁰ cm⁻³ |
| oxide thickness | tox | 20 nm |
| energy bandgap | Eg | 1.12 eV |
| silicon electron affinity | χ | 4.05 eV |
| temperature | T | 300 K |
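For orientation, the standard textbook form of the one-dimensional MOS electrostatics for a uniformly doped p-type substrate is sketched below. How these expressions map onto eqs 1–4 is an assumption based on the surrounding description, not a transcription of the original equations. The notation q (elementary charge) and k_B (Boltzmann's constant) is introduced here for illustration, and complete ionization with charge neutrality in the bulk is assumed.

```latex
\begin{align}
\frac{d^2\psi}{dx^2} &= -\frac{d\xi}{dx}
  = -\frac{q}{\varepsilon_{si}}\left[p(x) - n(x) + N_d^{+} - N_a^{-}\right] \\
\frac{d^2\psi}{dx^2} &= -\frac{q}{\varepsilon_{si}}
  \left[N_a\left(e^{-q\psi/k_BT} - 1\right)
  - \frac{n_i^2}{N_a}\left(e^{q\psi/k_BT} - 1\right)\right] \\
\xi^2 = \left(\frac{d\psi}{dx}\right)^2 &= \frac{2k_BTN_a}{\varepsilon_{si}}
  \left[\left(e^{-q\psi/k_BT} + \frac{q\psi}{k_BT} - 1\right)
  + \frac{n_i^2}{N_a^2}\left(e^{q\psi/k_BT} - \frac{q\psi}{k_BT} - 1\right)\right] \\
Q_s &= -\varepsilon_{si}\,\xi_s
\end{align}
```

The second line follows from the first by inserting the Boltzmann relations p = N_a e^{-qψ/k_BT} and n = (n_i²/N_a) e^{qψ/k_BT}. The third follows by multiplying by dψ/dx and integrating from the bulk, where ψ = 0 and dψ/dx = 0, toward the surface.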
According to Gauss’s law, the total charge per unit area (Qs) induced in the semiconductor is defined by eq 4. At the limiting condition of x = 0, ψ = ψs and ξ = ξs, so the total charge density is the surface charge density, and the potential is the surface potential.[9] The overall capacitance is the series combination of the oxide capacitance and the surface differential capacitance:
$$\frac{1}{C} = \frac{1}{C_{ox}} + \frac{1}{C_s} \tag{5}$$
The gate voltage consists of the voltage drop across the oxide and the surface potential. In the ideal case, the flat-band voltage (Vfb) is zero, but in practice it depends on the metal work function and traps. A semiempirical shift in the flat-band voltage brings the calculation closer to the silicon data. In this work, by comparison with the training set data, an empirical voltage shift of −0.78 V is added to Vfb. The voltage across the MOSCAP is expressed by eq 6.
When the gate voltage is positive, the inversion charge starts to increase, and the device gradually enters the inversion region. The inversion region is dominated by minority carriers. Hence, eq 4 can be simplified to eq 7.[9] In accordance with our experiment, we consider gate voltages from 0 to 2.5 V and a lowest frequency of 1 kHz. At a given frequency, the inversion charge increases as the gate voltage increases. As the voltage increases further, the inversion charge saturates, and the threshold of ΔQinv is reached. This is the main reason for the flat capacitance curve in the inversion region. Since the measurement is done under illumination, charge carriers are transferred, and the threshold is expressed by eq 8,[44] where ΔQinvth0 is the threshold of ΔQinv for 1 kHz, ΔQinvth is the threshold for a specific frequency (f), k is a constant, and t is the transfer time, which is equal to the period T = 1/f. In our measurement, the 1 kHz C–V characteristic behaves like the low-frequency characteristic under illumination and is taken as the basis of the low-frequency signal. In the calculation, we assume that 100 kHz is the lowest frequency that shows the ideal high-frequency C–V; this determines ΔQinvth/ΔQinvth0, which in turn gives the value of k. The transferred carriers follow an exponential relationship with the transfer time.[44] The threshold value ΔQinvth for different frequencies is calculated using eq 8.
When ΔQinv reaches ΔQinvth, the capacitance no longer increases because the lateral transport of minority electrons from the illuminated region surrounding the metal electrode becomes insufficient to sustain ΔQinv. We denote the corresponding voltage as Vinvth. When the gate voltage is equal to or larger than Vinvth, the overall capacitance reaches a fixed value, as can be observed in Figure 4. The R² between the experimental and theoretical data is 0.9176 in this work. It should be emphasized that the purpose of this work is to show that analytical principles can assist ML; in the end, however, ML models, rather than physical models, should be the choice for semiconductor manufacturing. The reason is that reasonable models are still lacking for many manufacturing processes. For example, the relation between wafer cleaning and device characteristics is not well described by any physical or chemical model.
Figure 4
Theoretical MOS capacitor C–V characteristics for the frequencies from 1 kHz to 100 kHz.
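The ideal low-frequency part of this calculation can be sketched in Python with the Table 1 values. This is a minimal sketch under stated assumptions, not the authors' MATLAB code. It assumes the standard textbook expression for the surface charge of a p-type substrate, obtains the surface capacitance by numerical differentiation, and uses the gate-voltage relation Vg = Vfb + ψs − Qs/Cox. The frequency-dependent saturation of the inversion charge (eq 8) is deliberately left out, the −0.78 V shift is used as the whole flat-band voltage, the permittivity of free space is taken in F/cm, and all function names are hypothetical.

```python
import math

# Constants in cm-based units, as is common in device physics.
Q = 1.602e-19      # elementary charge, C
KT_Q = 0.02585     # thermal voltage kT/q at 300 K, V
EPS0 = 8.85e-14    # permittivity of free space, F/cm
EPS_OX = 3.9 * EPS0
EPS_SI = 11.9 * EPS0
NA = 5e15          # acceptor concentration, cm^-3
NI = 1.5e10        # intrinsic carrier concentration, cm^-3
TOX = 20e-7        # oxide thickness, cm (20 nm)
VFB = -0.78        # empirical flat-band shift quoted in the text, V
COX = EPS_OX / TOX  # oxide capacitance per unit area, F/cm^2


def surface_charge(psi_s):
    """Semiconductor charge per unit area Qs(psi_s) in C/cm^2 (p-type)."""
    u = psi_s / KT_Q
    f = (math.exp(-u) + u - 1.0) + (NI / NA) ** 2 * (math.exp(u) - u - 1.0)
    magnitude = math.sqrt(2.0 * EPS_SI * Q * KT_Q * NA * max(f, 0.0))
    # Qs is positive in accumulation (psi_s < 0), negative otherwise.
    return -math.copysign(magnitude, psi_s)


def low_frequency_cv(psi_s, dpsi=1e-4):
    """Return (Vg in V, C in F/cm^2) at one surface potential."""
    qs = surface_charge(psi_s)
    # Surface differential capacitance Cs = -dQs/dpsi_s (central difference).
    cs = -(surface_charge(psi_s + dpsi) - surface_charge(psi_s - dpsi)) / (2 * dpsi)
    c_total = COX * cs / (COX + cs)   # series combination of Cox and Cs
    vg = VFB + psi_s - qs / COX       # flat band + surface potential + oxide drop
    return vg, c_total


# Sweep the surface potential from accumulation (-0.2 V) to inversion (0.85 V).
sweep = [low_frequency_cv(-0.2 + 0.05 * i) for i in range(22)]
for vg, c in sweep[::7]:
    print(f"Vg = {vg:6.2f} V, C/Cox = {c / COX:.3f}")
```

Sweeping the surface potential instead of the gate voltage avoids solving the implicit relation between Vg and ψs; the gate voltage is recovered afterward for each point.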
Machine Learning Algorithm
MLP is the most commonly used algorithm in the electronics field. This model works well for a single task, but it is difficult to extend to new data. Advanced algorithms provide leverage to extend the prediction capability to unseen data. The success of an ML model lies in its test data performance; therefore, dealing with overfitting is of utmost importance. The model hyperparameters are listed in Table 2. The first important hyperparameter is the number of training epochs, which is tuned by considering the validation data set performance.[45] Another important consideration in hyperparameter selection is to ensure that the three models, i.e., baseline MLP, SSMTL, and TL, are of the same complexity for a fair comparison. Thus, the number of neurons per hidden layer is uniformly varied from 20 to 60, and two hidden layers are used throughout this work. Varying the neuron number shows that the effectiveness is a general phenomenon rather than one relevant only to a specific setup. Other hyperparameters follow standard ML practice. We deliberately avoid specially tuned hyperparameters so that the demonstrated effectiveness of SSMTL and TL generalizes to other scenarios.
Table 2
List of Model Parameters and Hyperparameters
| | MLP (baseline) | SSMTL | transfer learning (pretrained model / main model) |
|---|---|---|---|
| number of hidden layers | 2 | 2 | 2 / 2ᵃ |
| number of neurons (sweep) | 20–60 | 20–60 | 50 / 20–60 |
| activation function | ReLU | ReLU | ReLU / ReLU |
| batch size | 50 | 50 | 50 / 50 |
| size of training sample | (4080, 6) | (124320, 6) | (120240, 6) / (4080, 6) |
| size of test sample | (9792, 6) | (9792, 6) | (9792, 6) |
| patience in early stopping | 50 | 50 | 50 |
| number of epochs | 164 | 179 | 292 / 30 |
ᵃ The first hidden layer is from the pretrained model (number of neurons = 50).
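The "patience" entry in Table 2 refers to early stopping on the validation loss. The logic can be sketched in plain Python as below. This is an illustration of the stopping rule only, written under the assumption that the validation loss is monitored once per epoch. In Keras the same behavior is provided by the `EarlyStopping` callback with `monitor="val_loss"` and a `patience` argument. The function name and the synthetic loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=50, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. Epochs are 1-indexed.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Synthetic validation curve: improves for 100 epochs, then slowly degrades.
val_losses = [1.0 / e for e in range(1, 101)]
val_losses += [0.01 + 0.001 * e for e in range(1, 201)]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=50)
print(stop_epoch, best_epoch)
```

With a patience of 50, training on this synthetic curve runs 50 epochs past the best epoch before stopping, which is why the reported epoch counts in a table like Table 2 need not be round numbers.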
Multilayer Perceptron (MLP)
MLP is a fully connected feedforward network that consists of an input layer, hidden layers, and an output layer, as shown in Figure 5a. The number of hidden layers and the number of neurons in each layer are two of the most important model hyperparameters. Typically, in a neuron, the incoming signals are multiplied by their respective weights, summed, and added to a bias:
$$y = \sum_{i} W_i X_i + b \tag{9}$$
In eq 9, y is the output from the neuron, i is the index running over the input features, Xi is the input, Wi is the weight, and b is the bias. Activation functions then introduce nonlinearity into the ML model.
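The weighted sum and the layered structure can be sketched in plain Python as follows. This is a minimal forward pass only, with six inputs, two hidden layers of 50 ReLU neurons, and one linear output as in Table 2. The random initialization, the helper names, and the sample input vector are assumptions for illustration and do not reproduce the trained models.

```python
import random


def relu(values):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: y_j = sum_i W[j][i] * x_i + b_j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_forward(x, layers):
    """ReLU on every hidden layer, linear activation on the output layer."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return dense(x, weights, biases)


rng = random.Random(0)
layers = [init_layer(6, 50, rng), init_layer(50, 50, rng), init_layer(50, 1, rng)]
# Six scaled inputs: frequency, area, clean, oxidation, metal, voltage.
x = [0.0, 0.5, 1.0, 0.0, 1.0, 0.3]
capacitance = mlp_forward(x, layers)
print(capacitance)
```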
Figure 5
Machine learning algorithms: (a) multilayered perceptron,[56] (b) semisupervised multitask learning,[37,57] and (c) transfer learning.[39,58]
Semisupervised Multitask Learning (SSMTL)
The idea behind multitask learning (MTL) is to train one model on similar tasks rather than training separate models for each task. Sharing weights and biases among the tasks improves the model performance. Additionally, getting labeled data for every task is difficult and expensive. If the sizes of the labeled data sets for the tasks differ, the methodology is essentially a semisupervised (SS) MTL. Semisupervised learning mixes unlabeled and labeled data to enhance the performance of the model. Figure 5b shows the architecture of SSMTL, in which two similar tasks, i.e., the theoretical and measured C–V characteristics of MOS capacitors, are trained together. The knowledge of the theoretical C–V assists in predicting the capacitance at intermediate frequencies.
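One common way to train two tasks with unequal amounts of labeled data is a shared hidden layer with one output head per task and a loss that skips missing labels. The sketch below illustrates that idea in plain Python. It is an assumption about how such a scheme can be set up, not a description of the authors' implementation: the function names, the `None` convention for unlabeled samples, and the task weight `w_measured` are all hypothetical.

```python
def shared_forward(x, shared_w, head_theory_w, head_measured_w):
    """Shared ReLU hidden layer feeding two linear task heads."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in shared_w]
    theory = sum(w * h for w, h in zip(head_theory_w, hidden))
    measured = sum(w * h for w, h in zip(head_measured_w, hidden))
    return theory, measured


def masked_mse(predictions, targets):
    """Mean squared error over labeled samples; a target of None is unlabeled."""
    errors = [(p - t) ** 2 for p, t in zip(predictions, targets) if t is not None]
    return sum(errors) / len(errors) if errors else 0.0


def multitask_loss(pred_theory, y_theory, pred_measured, y_measured, w_measured=1.0):
    """Sum of the per-task losses; unlabeled measured samples contribute nothing."""
    return masked_mse(pred_theory, y_theory) + w_measured * masked_mse(
        pred_measured, y_measured
    )


# Three samples, e.g., at 1, 10, and 100 kHz. The theoretical task has labels
# at all frequencies, while the measured task has none at 10 kHz.
pred_theory, y_theory = [1.0, 2.0, 3.0], [1.0, 2.5, 3.0]
pred_measured, y_measured = [1.0, 2.0, 3.0], [1.5, None, 3.0]
loss = multitask_loss(pred_theory, y_theory, pred_measured, y_measured)
print(loss)
```

Because both heads read the same hidden features, gradients from the theoretical task at intermediate frequencies shape the representation that the measured-task head relies on, even where no measured label exists.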
Transfer Learning
The intuition behind transfer learning is to reuse the knowledge of a previously trained (pretrained) model, as illustrated in Figure 5c.[46] For example, the knowledge of riding a bicycle can be used in learning to ride a motorcycle. Similarly, in ML-based compact device modeling, the knowledge gained from training on theoretical C–V can be used in predicting real silicon C–V data. In transfer learning, the optimized weights and biases of the pretrained model are transferred to a new model. Hence, the new model contains previously trained layers and needs only one or two more layers for a new task. This also saves a lot of computational power.
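The reuse of a pretrained layer can be sketched in plain Python: the transferred hidden layer is kept frozen, and only a new output layer is fitted to the small target data set. This is a toy illustration under stated assumptions, not the authors' code. The "pretrained" weights, the target function, and the training settings are hypothetical. In Keras, freezing is typically expressed by setting `layer.trainable = False` on the transferred layers.

```python
def hidden_features(x, frozen_weights, frozen_biases):
    """Frozen, pretrained ReLU layer for a scalar input x."""
    return [max(0.0, w * x + b) for w, b in zip(frozen_weights, frozen_biases)]


def train_head(xs, ys, frozen_weights, frozen_biases, epochs=500, lr=0.5):
    """Fit only the new linear output layer by gradient descent on the MSE."""
    feats = [hidden_features(x, frozen_weights, frozen_biases) for x in xs]
    head = [0.0] * len(frozen_weights)
    bias = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * len(head)
        grad_b = 0.0
        for f, y in zip(feats, ys):
            err = sum(w * h for w, h in zip(head, f)) + bias - y
            for j, h in enumerate(f):
                grad_w[j] += 2.0 * err * h / n
            grad_b += 2.0 * err / n
        head = [w - lr * g for w, g in zip(head, grad_w)]
        bias -= lr * grad_b
    return head, bias


def predict(x, frozen_weights, frozen_biases, head, bias):
    feats = hidden_features(x, frozen_weights, frozen_biases)
    return sum(w * h for w, h in zip(head, feats)) + bias


# Hypothetical pretrained layer (stands in for knowledge from theoretical data).
frozen_w, frozen_b = [1.0, -1.0], [0.0, 0.0]
# Small target data set (stands in for scarce measured data): y = 2|x| + 1.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * abs(x) + 1.0 for x in xs]
head, bias = train_head(xs, ys, frozen_w, frozen_b)
print(predict(0.75, frozen_w, frozen_b, head, bias))
```

Only three parameters are trained here, which mirrors the point made above: when most layers are transferred, the new task needs far fewer trainable weights and therefore less data and computation.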
Results and Discussion
The threshold voltage of a MOS device is the most important parameter for evaluating its operating condition, and it is observed as the onset of strong inversion. In our experiments, illumination is the source of the minority carriers that allow the MOSCAP to reach strong inversion, and the capacitance depends on the frequency of the AC sweeping voltage. The illumination leads to optical generation at the periphery of the MOSCAP top metal electrodes. The photogenerated electrons need to diffuse laterally to the region under the top metal electrode to enable the capacitor to enter the inversion region. At higher frequencies, the period of the AC signal is much shorter than the lateral charge transfer time, and thus the inversion capacitance becomes smaller. It is therefore crucial to define the C–V as a function of frequency. Our model can readily be used to analyze the dependence of C–V on frequency and, simultaneously, to optimize the fabrication conditions. Generally, a large volume of data is required for standard deep learning algorithms. In this paper, we demonstrate that the assistance of physics can enable deep learning with a limited amount of measured silicon data. Appending the theoretically generated C–V to the measured C–V leads to the physics-assisted ML model. The theoretical C–V is calculated in MATLAB.[47] The baseline MLP and our advanced deep learning models are implemented using Python 3.7.7,[48] TensorFlow[49,50] 2.0 with Keras, NumPy 1.17.0,[51] SciPy 1.6.2,[52] Matplotlib,[53] scikit-learn 0.23.1,[54] and pandas.[55]
An ML-based compact model depends on the quality of the data, model selection, and hyperparameter tuning based on prediction. The performance of a model is judged on various factors such as training loss, test loss, and fitting performance. The most important performance metrics for describing the loss and the quality of the fit are the root-mean-square error (RMSE) and the R² score. RMSE is the square root of the mean square error (MSE) and, unlike MSE, has the same units as the output:[27]
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2} \tag{10}$$
where Y_i is the measured output, Ŷ_i is the predicted output, n is the total number of data points, and i is the index of a data point. R² is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}{\sum_{i=1}^{n}\left(Y_i - Y_{mean}\right)^2} \tag{11}$$
Equation 11 measures the closeness of the fitted curve to the original curve, and Ymean is the average value of the original output. Values approaching 1 mean that the predicted curve approaches the true values. The validation set is constructed by splitting the main training data set (D) into two parts, D1 and D2, i.e., D = D1 ∪ D2. The validation set D2 does not overlap with the actual training data set D1, i.e., D1 ∩ D2 = ∅. We observed that early stopping based on the validation loss makes our model stable and robust. The loss depends on the number of training epochs. Overfitting and underfitting are very common in regression problems. In general, advanced and complex ML algorithms eliminate the underfitting problem. To mitigate overfitting and obtain the best-fit prediction, we use the early stopping callback provided by TensorFlow.
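The two metrics can be computed directly from their standard definitions, as in the short sketch below. The function names and the sample values are illustrative only and are not taken from the measured data.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative capacitance values in pF (not measured data).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(rmse(measured, predicted), r2_score(measured, predicted))
```

An R² of 1 corresponds to a perfect fit, a value of 0 to a model no better than predicting the mean, and negative values are possible when the fit is worse than the mean.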
Performance of MLP Model
Since MLP is a conventional supervised learning model[59] that requires a huge set of training data for a specific task, it is not an attractive algorithm for designing a unified, robust model, especially when only a limited amount of data is available. In Figure 6a, the baseline feedforward model is trained entirely on experimental silicon data. Two hidden layers are placed between the input and output layers. Figure 6b defines the nomenclature of the symbols used in the equations, and Figure 6c gives the symbol pattern definitions. A structure of 50 neurons in each hidden layer is used for training. The speed and convergence of a model depend on the range of the data set, so preprocessing becomes important. MinMaxScaler from scikit-learn[60] is used for scaling, and this method does not change the significance of outliers. The rectified linear unit (ReLU) activation function provides the nonlinearity in the MLP. The output signals of the previous hidden layer act as the input to the current hidden layer. To make the model robust for general data, no hyperparameter is tuned against the test data set. For the MLP, the validation set consists of the 1 and 100 kHz data of the 0.01 mm² area. The remaining 1 and 100 kHz data for the other device areas are used for training. The equation shown in Figure 6a gives the capacitance for the different frequencies. Table 3 shows that the RMSE and R² score for training are 0.3525 and 0.9946, respectively. Figure 7a–c shows the training set fitting for MLP, SSMTL, and TL, respectively.
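Min-max scaling maps each input feature linearly onto the interval [0, 1], which is the default behavior of scikit-learn's `MinMaxScaler`. The plain-Python sketch below shows the transformation for one feature. Fitting the minimum and maximum on the training data and reusing them for the test data is assumed here as common practice, and the helper names and sample voltages are hypothetical.

```python
def fit_min_max(column):
    """Learn the scaling range from the training data only."""
    return min(column), max(column)


def min_max_scale(column, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


# Gate voltages within the measured sweep range of -4 to 2.5 V.
train_voltage = [-4.0, -1.4, 0.0, 2.5]
lo, hi = fit_min_max(train_voltage)
scaled_train = min_max_scale(train_voltage, lo, hi)

test_voltage = [-4.0, 1.2, 2.5]
scaled_test = min_max_scale(test_voltage, lo, hi)
print(scaled_train, scaled_test)
```

Because the mapping is linear, the relative spacing of the samples, including any outliers, is preserved after scaling.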
Figure 6
MLP architecture used
as a baseline model: (a) architecture, (b)
symbol definition, and (c) symbol pattern definition.
Table 3
Performance Metrics
| data | frequency (kHz) | neural units | RMSE (ideal value = 0): MLP | RMSE: SSMTL | RMSE: TL | R²-score (ideal value = 1): MLP | R²-score: SSMTL | R²-score: TL |
|---|---|---|---|---|---|---|---|---|
| train | 1, 100 | 50 | 0.3525 | 0.2687 | 0.7060 | 0.9946 | 0.9968 | 0.9782 |
| test | 3, 5, 10, 50 | 50 | 3.1387 | 1.1890 | 1.3752 | 0.6108 | 0.9442 | 0.9253 |
| test | 3 | 20 | 2.2544 | 1.6389 | 2.0799 | 0.7750 | 0.8811 | 0.8085 |
| test | 5 | 20 | 3.2344 | 1.9014 | 2.3588 | 0.5414 | 0.8415 | 0.7561 |
| test | 10 | 20 | 4.2492 | 2.0010 | 2.0064 | 0.2719 | 0.8385 | 0.8377 |
| test | 50 | 20 | 1.8813 | 1.0769 | 1.2184 | 0.8725 | 0.9582 | 0.9465 |
| test | 3 | 30 | 2.2358 | 1.2458 | 1.8533 | 0.7787 | 0.9313 | 0.8479 |
| test | 5 | 30 | 3.2173 | 1.4405 | 2.0942 | 0.5462 | 0.9090 | 0.8077 |
| test | 10 | 30 | 4.2535 | 1.6111 | 2.4261 | 0.2704 | 0.8953 | 0.7627 |
| test | 50 | 30 | 2.1803 | 0.8155 | 1.7505 | 0.8287 | 0.9760 | 0.8896 |
| test | 3 | 40 | 2.1837 | 1.5527 | 2.2032 | 0.7889 | 0.8933 | 0.7851 |
| test | 5 | 40 | 3.1399 | 1.9239 | 2.1482 | 0.5678 | 0.8377 | 0.7977 |
| test | 10 | 40 | 4.1207 | 1.8766 | 2.5206 | 0.3153 | 0.8580 | 0.7438 |
| test | 50 | 40 | 1.9595 | 1.0005 | 1.7931 | 0.8616 | 0.9639 | 0.8841 |
| test | 3 | 50 | 2.2633 | 1.3807 | 1.6781 | 0.7732 | 0.9156 | 0.8753 |
| test | 5 | 50 | 3.2599 | 1.4223 | 1.5566 | 0.5341 | 0.9112 | 0.8938 |
| test | 10 | 50 | 4.3049 | 1.2230 | 1.2315 | 0.2527 | 0.9397 | 0.9388 |
| test | 50 | 50 | 2.2637 | 0.4769 | 0.8994 | 0.8153 | 0.9918 | 0.9708 |
| test | 3 | 60 | 2.2934 | 1.5667 | 1.8708 | 0.7671 | 0.8913 | 0.8450 |
| test | 5 | 60 | 3.2755 | 1.8201 | 1.7370 | 0.5297 | 0.8548 | 0.8677 |
| test | 10 | 60 | 4.2911 | 1.5416 | 1.6526 | 0.2575 | 0.9042 | 0.8899 |
| test | 50 | 60 | 2.2666 | 0.6480 | 1.0909 | 0.8149 | 0.9849 | 0.9571 |
Figure 7
C–V response of ML models
on the 1 and 100 kHz, 0.0050 mm²/RCA/Wet/E-Gun training data for (a) baseline MLP, (b) SS-MTL, and (c) transfer learning. Fifty neurons are used in each hidden layer.
The MLP model is trained exclusively on the low-frequency (1 kHz) and high-frequency (100 kHz) measured C–V data, and the training response is shown in Figure 7a. The scatter plot of the MLP predictions on the test data for the different frequencies is shown in Figure 8. The performance for 3, 5, 10, and 50 kHz is not satisfactory because the model is trained only on the 1 and 100 kHz C–V using purely supervised learning.
Figure 8
Scatter plot of 50 neuron
baseline MLP on the test data for all
process parameters, i.e. frequency/area/clean type/oxide deposition/metal
deposition: (a) 3, (b) 10, and (c) 50 kHz data.
No physical data are used in this algorithm, and the model is not exposed to C–V data at the intermediate frequencies. Figures 7a and 8 show the predicted C–V for the training set and the scatter plot for the test set for the MLP model. The MLP is not able to use the knowledge gained in training to predict the test data sets accurately. Thus, knowledge of the physical equations is needed to assist the machine learning prediction, especially when the training data are insufficient or costly to collect. Figure 8 suggests that the midrange capacitance at 3 kHz and the low capacitance values at 50 kHz are not predicted satisfactorily. At 10 kHz, the prediction for both low and midrange capacitance is the worst in Figure 8. The performance metrics table lists the R² score and RMSE values for the MLP model. For the case of 50 neurons, the RMSE values for 3, 5, 10, and 50 kHz are 2.2633, 3.2599, 4.3049, and 2.2637, respectively. The R² scores, which indicate the quality of the fit, for 3, 5, 10, and 50 kHz are 0.7732, 0.5341, 0.2527, and 0.8153, respectively. The prediction of the 10 kHz C–V characteristics is the least satisfactory, as can be seen from the scatter plot in Figure 8. The 10 kHz data are more difficult to predict because this frequency is far from both training frequencies, i.e., 1 kHz and 100 kHz. In addition, the MLP model is not able to map the effect of carrier dynamics at intermediate frequencies because no physical model is available to assist it.
We have also tested support vector regression (SVR) and random forest (RF) as baseline models. The SVR and RF are less effective than MLP. The SVR test set RMSE and R² are 3.0788 and 0.6252, respectively. The RF is slightly better than SVR. Nevertheless, the trend in the data is not predicted well, and thus flat segments are seen in the predicted vs true value plot. The RF RMSE and R² are 2.5503 and 0.7431, respectively. The hyperparameters follow the sklearn library defaults except epsilon = 0.2 for SVR and max_depth = 2 for RF.
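As a rough illustration of the SVR and RF baselines, the sketch below fits both regressors with the stated settings (epsilon = 0.2 for SVR, max_depth = 2 for RF, all other hyperparameters at the sklearn defaults) and reports RMSE and R². The data are a synthetic, C–V-like curve invented for this example, and the fixed `random_state` is an assumption added for reproducibility; neither is taken from this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a smooth step in one "voltage" feature.
X_train = rng.uniform(-3.0, 3.0, size=(200, 1))
y_train = 10.0 + 5.0 * np.tanh(X_train[:, 0]) + rng.normal(0.0, 0.1, size=200)
X_test = np.linspace(-3.0, 3.0, 50).reshape(-1, 1)
y_test = 10.0 + 5.0 * np.tanh(X_test[:, 0])

# Only epsilon and max_depth deviate from the sklearn defaults, as in the text.
models = {
    "SVR": SVR(epsilon=0.2),
    "RF": RandomForestRegressor(max_depth=2, random_state=0),  # random_state: assumption
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2 = float(r2_score(y_test, y_pred))
    scores[name] = (rmse, r2)
    print(f"{name}: RMSE={rmse:.4f}, R2={r2:.4f}")
```

A depth-2 tree has at most four leaves, so each tree in the forest predicts a piecewise-constant function, which is one plausible source of the flat segments mentioned above.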
Performance of the SSMTL Model
In the previous section, we observed that the baseline MLP does not respond well to intermediate-frequency data. The physics behind the C–V at intermediate frequencies is the same, i.e., a change in the frequency of the sweeping voltage signal affects the amount of minority electrons in the inversion layer that can be laterally transferred to the region beneath the top electrode, and this further affects the capacitance across the semiconductor.[8] Figure 9a shows the layered architecture of the SSMTL model. In this model, the first hidden layer is fed with the 1 and 100 kHz real silicon C–V data and the 1, 3, 5, 10, 50, and 100 kHz theoretical C–V data.
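The parameter sharing described above can be pictured with a minimal forward-pass sketch: one first hidden layer receives both theoretical and measured inputs, and each task then has its own head. This is only an illustration under stated assumptions. The number of input features (6), the ReLU activation, the head width, and the names `forward` and `heads` are hypothetical and are not taken from this work; training and the loss are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_SHARED, N_HEAD = 6, 50, 50  # sizes are illustrative assumptions


def dense(n_in, n_out):
    """Return randomly initialised weights and zero biases for one layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


# First hidden layer: shared by the theoretical and the measured task.
W_shared, b_shared = dense(N_FEATURES, N_SHARED)

# Task-specific layers: one hidden layer plus one output layer per task.
heads = {
    task: (dense(N_SHARED, N_HEAD), dense(N_HEAD, 1))
    for task in ("theory", "measured")
}


def forward(x, task):
    """Pass x through the shared layer, then through the head of `task`."""
    h = np.maximum(0.0, x @ W_shared + b_shared)  # shared representation
    (W1, b1), (W2, b2) = heads[task]
    h = np.maximum(0.0, h @ W1 + b1)
    return h @ W2 + b2


# The two data sets may have different sizes; both use the same shared layer.
x_theory = rng.normal(size=(32, N_FEATURES))
x_measured = rng.normal(size=(8, N_FEATURES))
y_theory_hat = forward(x_theory, "theory")
y_measured_hat = forward(x_measured, "measured")
print(y_theory_hat.shape, y_measured_hat.shape)
```

Because gradients from both tasks would flow into `W_shared` during training, the abundant theoretical data shape the representation that the sparse measured data rely on.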
Figure 9
SSMTL layer-wise representation.
If a human brain is trained for a task, the knowledge of that task can automatically be used for similar new tasks. Similarly, the knowledge of the theoretical C–V at all frequencies, present in the first hidden layer, is shared with the measured data, as depicted in Figure 9a. Since the theoretically generated C–V assists the measured C–V in training the model, it is necessary to take some data from both the measured and theoretical data sets as a validation set. One of the device fabrication process parameters is the area of the electrodes, and we have six different areas. Of these six, we use the 0.01 mm² area data at 1 and 100 kHz from the measured data set and the 0.01 mm² area data at all frequencies from the theoretical C–V data set as the validation set. The remaining data are used for training and testing, and there is no overlap among the training, validation, and test data sets. The metrics that show the performance of the SSMTL are RMSE and R², and their values on the training data are 0.2687 and 0.9968, respectively. The performance metrics table shows that the baseline MLP model is less effective than SSMTL. The fitting response on the training data is shown in Figure 7b. Figure 10 shows
the scatter plot of SSMTL on the test data. Compared with the baseline model, the overall RMSE and R² on the test data are improved by 62.12% and 54.58%, respectively. The performance metrics table shows that the R² score and RMSE for all the individual frequencies are significantly improved. For the case of 50 neurons, the RMSE values for 3, 5, 10, and 50 kHz are 1.3807, 1.4223, 1.2230, and 0.4769, respectively, and the R² scores are 0.9156, 0.9112, 0.9397, and 0.9918, respectively. Introducing MOS capacitor physics alongside the measured data, through a domain-knowledge-sharing architecture, improves the quality of the test set prediction. For the case of 50 neurons and the 5 kHz test data, the RMSE values for SSMTL and MLP are 1.4223 and 3.2599, respectively, a clear improvement of 56.37% in RMSE. The performance of SSMTL can be improved further by using a more accurate physical model, which can be constructed by including more process parameters or by using more accurate device simulation. First, including traps, one of the most important sources of variability, can improve performance. Second, the transport of carriers under illumination in the MOS capacitor can be simulated by building a 2D structure with a given value of power rate. TCAD is one possible option for carrying out these simulations.
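The relative improvements quoted in this section can be checked directly against the tabulated RMSE and R² values. The short sketch below does this arithmetic; the helper name `relative_change` is introduced only for this example.

```python
def relative_change(baseline, improved):
    """Magnitude of the change from baseline to improved, in percent of baseline."""
    return abs(improved - baseline) / baseline * 100.0


# Overall test set, 50 neurons: baseline MLP versus SSMTL.
rmse_gain = relative_change(3.1387, 1.1890)
r2_gain = relative_change(0.6108, 0.9442)

# 5 kHz test data, 50 neurons: baseline MLP versus SSMTL.
rmse_gain_5khz = relative_change(3.2599, 1.4223)

print(f"overall RMSE improvement: {rmse_gain:.2f}%")
print(f"overall R2 improvement:   {r2_gain:.2f}%")
print(f"5 kHz RMSE improvement:   {rmse_gain_5khz:.2f}%")
```

Note that the RMSE figure is a reduction relative to the baseline, whereas the R² figure is an increase relative to the baseline.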
Figure 10
Scatterplot
of 50 neuron SSMTL on the test data for all process
parameters, i.e. frequency/area/clean type/oxide deposition/metal
deposition: (a) 3, (b) 10, and (c) 50 kHz data.
In ML, there is always a trade-off between overfitting and underfitting. An excessive number of training cycles leads to overfitting and consumes unnecessary computational power. To handle this, we pass a patience value as an input argument to the Keras early stopping function. From Table , it can be seen that the model attains its optimal response at 179 epochs even though the maximum number of epochs is set to 8000. This approach prevents overfitting and saves a significant amount of computational power, which makes the ML-based compact model a state-of-the-art statistical tool. By analyzing the metrics
from the performance metrics table, it is seen that the 10 kHz C–V data have less satisfactory performance. The prediction error of 1.2230 for 50 neurons and 10 kHz data is more pronounced in the low capacitance region. Nevertheless, compared with the baseline MLP, there is still a drastic improvement in the R² score and RMSE values. Therefore, the interpolation in frequency at 10 kHz is still better using SSMTL. This indicates that the change in inversion charge under the effect of light is well captured by the equation-based model described in the methodology section, even though TCAD has not been used in this work. It is worth mentioning that the accuracy of the physical model depends on how well it covers the different aspects of device physics. If the energy of the incident photon is equal to or greater than the bandgap of the semiconductor, optical absorption takes place and generates electron–hole pairs. Specific to our experiment, electrons are the minority charge carriers and are responsible for the transition from depletion to inversion mode.
The essence of the physics is reflected in the equation-based physical approach, though further improvement of the physical model, such as including the effect of the process parameters, can be valuable. Finally, we note that, in conventional supervised multitask learning, the theoretical and measured data sets should be of the same size. In contrast, semisupervised learning has the advantage that two data sets of different sizes can be accommodated in one model.[57] Specific to our experiment, the shape of the theoretical data is (144288, 7), while that of the measurement data is (4896, 7). Therefore, semisupervision makes the MTL algorithm more versatile, which benefits general industrial applications.
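The patience-based early stopping mentioned earlier in this section can be sketched in a few lines of plain Python. Keras provides this behaviour through its `EarlyStopping` callback and its `patience` argument; the function below only mimics the stopping rule so that the logic is visible. The patience value used in this work is not stated here, so the value of 3 and the loss sequence are purely illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs have passed without
    the validation loss improving on the best value seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The loss kept improving often enough: run to the last available epoch.
    return len(val_losses) - 1, best_epoch


# Illustrative validation losses: improvement stalls after epoch 3.
losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]
stopped_at, best_at = early_stopping_epoch(losses, patience=3)
print(f"stopped at epoch {stopped_at}, best epoch {best_at}")
```

With a rule of this kind, the maximum epoch count acts only as an upper bound, which is consistent with training ending long before the 8000-epoch limit.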
Performance of the Transfer Learning Model
For a user, an ML model is just a black box. It is desirable to reduce the measurement and fabrication load in the semiconductor industry since process development is costly in foundries. This motivates a transfer learning-based model.[61] The structure of TL is shown in Figure 11. Implementing a complex task requires a deep network,[62] and the deeper the network, the more computing resources are needed. Thus, to increase robustness and productivity, transfer learning from a previously trained model can be helpful. Here, the pretrained model is trained on the theoretical C–V data at all frequencies. Afterward, the optimized weights and biases are transferred to the main model. The main model uses only real silicon C–V data. Hyperparameters with their values
are shown in Table .
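The pretrain-then-transfer procedure described above can be sketched with a tiny NumPy network: a model is first trained on abundant "theoretical" data, its hidden layer is copied into a main model and frozen, and only the remaining layer is retrained on sparse "measured" data. Everything specific in this sketch is an assumption made for illustration: the synthetic tanh-shaped data, the single hidden layer of 8 neurons, the learning rate, and the function names. Only the 30 retraining epochs mirror the number quoted in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
N_HIDDEN = 8  # illustrative width; the transferred layer in this work has 50 neurons


def init_params():
    return {
        "W1": rng.normal(0.0, 0.5, (1, N_HIDDEN)), "b1": np.zeros(N_HIDDEN),
        "W2": rng.normal(0.0, 0.5, (N_HIDDEN, 1)), "b2": np.zeros(1),
    }


def forward(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])
    return h, h @ p["W2"] + p["b2"]


def mse(p, x, y):
    return float(np.mean((forward(p, x)[1] - y) ** 2))


def train(p, x, y, epochs, lr=0.02, freeze_hidden=False):
    """Full-batch gradient descent on 0.5 * MSE; optionally freeze the hidden layer."""
    for _ in range(epochs):
        h, y_hat = forward(p, x)
        err = (y_hat - y) / len(x)                   # gradient w.r.t. the output
        grad_h = (err @ p["W2"].T) * (1.0 - h ** 2)  # backprop through tanh
        p["W2"] -= lr * (h.T @ err)
        p["b2"] -= lr * err.sum(axis=0)
        if not freeze_hidden:
            p["W1"] -= lr * (x.T @ grad_h)
            p["b1"] -= lr * grad_h.sum(axis=0)
    return p


# Synthetic stand-ins (assumptions): plentiful "theoretical" and sparse "measured" data.
x_theory = np.linspace(-3.0, 3.0, 400).reshape(-1, 1)
y_theory = np.tanh(2.0 * x_theory)
x_meas = rng.uniform(-3.0, 3.0, (20, 1))
y_meas = 1.1 * np.tanh(2.0 * x_meas) + 0.2

# Step 1: pretrain on the theoretical data.
pretrained = train(init_params(), x_theory, y_theory, epochs=300)

# Step 2: transfer the hidden layer into the main model and keep it frozen.
main = init_params()
main["W1"] = pretrained["W1"].copy()
main["b1"] = pretrained["b1"].copy()

# Step 3: retrain only the output layer on the measured data for 30 epochs.
mse_before = mse(main, x_meas, y_meas)
train(main, x_meas, y_meas, epochs=30, freeze_hidden=True)
mse_after = mse(main, x_meas, y_meas)
print(f"measured-data MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
```

Freezing the transferred layer means the main model only has to fit a small number of parameters, which is one way to read the short retraining schedule reported for the TL model.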
Figure 11
Proposed transfer learning layer-wise representation.
It is observed from the performance metrics table that the R² score and RMSE on the test set for all the individual frequencies are improved relative to the MLP models. In the table, the neuron number of the transferred layer, i.e., H1,T in Figure 11, is fixed at 50, and the neuron number of the retrained layer, i.e., H2,P in Figure 11, is varied from 20 to 60. H2,T is fixed at 50. For the TL model with 50 neurons, the RMSE values for 3, 5, 10, and 50 kHz are 1.6781, 1.5566, 1.2315, and 0.8994, respectively, and the R² scores are 0.8753, 0.8938, 0.9388, and 0.9708, respectively. Thus, the TL-based compact model performs more accurately on the test set. The assistance provided by the theoretical model enhances the conditional probability distribution of the main model. The domain knowledge of the predefined and main models is not exactly the same, but the task of C–V prediction for different frequencies is the same.
Through the transfer of domain knowledge from the predefined model to the
main model, the TL-based compact model outperforms the traditional baseline MLP. SS-MTL is slightly more accurate than TL when the case of 50 neurons in the hidden layers is examined. In general, the relative strengths of SSMTL and TL vary case by case; they are at a roughly similar level, and both are above the baseline MLP models. The advantage of SSMTL over TL is that there is no need to determine a fixed epoch number, i.e., 30 in this work, for training the main model. Owing to the transfer of domain knowledge, the main model requires fewer epochs for training: only 30 cycles are required to obtain the desired results. The RMSE values for 50 neurons in the performance metrics table show that the 3 kHz C–V prediction is less accurate than the C–V prediction at other frequencies. This can be observed from the scatter plot in Figure 12, where the predicted capacitance values deviate slightly more from their measured values, but the R² score of 0.8753 is still satisfactory. One potential further advancement in TL is the application of the diffused model with TL.[63] The spatial and temporal dependency of the data is captured with the diffusion of convolutional neural networks and recurrent neural networks to improve the predictive behavior of the model.
Figure 12
Scatter plot
of 50 neuron TL on the test data for all process parameters,
i.e. frequency/area/clean type/oxide deposition/metal deposition:
(a) 3, (b) 10, and (c) 50 kHz data.
On the other hand, meta transfer learning is another advanced concept, which further reduces the requirement for large labeled data sets and increases the overall performance.[64] The training performance of the baseline MLP on the low-frequency and high-frequency data is shown in Figure 7a. An RMSE of 0.3525 and an R² of 0.9946 are obtained. MLP is a commonly used algorithm for regression-type problems. Nevertheless, when it comes to unseen or temporal data, it can become ineffective. This behavior can be observed in the performance metrics table: the RMSE and R² values are severely degraded on the test data at intermediate frequencies. The performance measures of transfer learning are also listed in the performance metrics table. The pretrained model consists of two hidden layers and is trained with the theoretical data at all frequencies. Afterward, the weights and biases are transferred to a new model that predicts the electrical characteristics at intermediate frequencies. The batch size and the number of epochs of the pretrained model are tuned so that neither the processing cost nor the fitting quality is compromised. From the performance metrics table and the corresponding figures, it can be seen that, after implementing transfer learning and SSMTL, there is a broad improvement in the performance metrics. The baseline MLP model performs well on the training data, but the applicability of a model depends on its test data performance. By including the theoretical C–V, the physics-assisted SSMTL and TL models outperform the baseline model.
Application of ML Models in Device Fabrication
To make the model robust and broadly applicable, it is useful to combine the concept of intelligent manufacturing with SPICE modeling. Our model can readily be used to understand the effect of various fabrication processes on device performance. In our experiment, we have considered the effect of different cleaning techniques, different thermal deposition techniques, and different metal electrode deposition techniques. Figure 13 shows the variation of the modeled C–V characteristics for one type of process, i.e., the type of thermal deposition. From Figure 13, taking the RCA cleaning, 0.01 mm², E-gun metal deposition, and 50 kHz data, it is clear that the effect of process variation on the C–V characteristics can easily be studied with the prescribed architecture of the ML models. Finally, the method is useful in that measuring fewer C–V or I–V curves can shorten the process development time and cycles, especially in the early development stage of semiconductor devices. It is also instructive to discuss the potential of the same approach for predicting capacitance in different scenarios. Predicting missing capacitance values at some voltages for a device will be quite accurate since this is mostly an interpolation problem. If we have to predict the capacitance values across devices with different process conditions, the accuracy will be degraded, and the physics assistance has no effect since the physical model does not encompass process information such as wafer cleaning.
Figure 13
Effect of dry and wet oxidation on the 50 kHz/0.01 mm²/RCA/E-gun test data: (a) baseline MLP, (b) SSMTL, and (c) transfer learning. Fifty neurons are used in a hidden layer.
Conclusion
The prediction of device performance resulting from complex semiconductor manufacturing procedures is always desired. In this work, we demonstrate two advanced data-driven, physics-assisted ML models that effectively map the device C–V to oxidation and wafer cleaning procedures. We observe that semisupervised multitask learning (SSMTL) and transfer learning (TL) achieve better performance than the standard MLP models. The improved data-driven models are based on sharing or transferring the weights and biases of some layers in the ML models. The hybridization uses the theoretical C–V at 1, 3, 5, 10, 50, and 100 kHz together with the measured C–V at 1 and 100 kHz to predict the measured C–V at the intermediate frequencies of 3, 5, 10, and 50 kHz. The samples are fabricated under different fabrication conditions, with the aim of constructing data-driven compact device models with process-aware capability.
The most significant advantage of the presented data-driven model is that it can be extended smoothly to any number of process parameters. In addition, the knowledge sharing and transferring in the SSMTL and TL setups relax the requirements on the amount of data collected in semiconductor manufacturing, which can be costly. This makes the presented model more practical in production lines. The SSMTL and TL algorithms are tested on various unseen C–V data, and R² values of 0.9442 and 0.9253 are obtained on the test set for SSMTL and TL, respectively. We believe that this work will be important for future smart semiconductor manufacturing, in which the massive chemical reactions in semiconductor processes have to be accurately modeled to predict the final device characteristics.