Text Lab 1 3 9 – A Text Transformation Toolkit


By Steven Black

Introduction

This article serves to introduce, illustrate, and explore some of the great ( and not so great ) string handling capabilities of Visual FoxPro.

I always seem to be involved with solving many text-data related problems in my VFP projects. On the surface, handling text isn't very sexy and seemingly not very interesting. I think otherwise, and I hope you'll agree.

This document is split into three sections: Inbound is about getting text into the VFP environment so you can work with it. Processing is about manipulating the text, and Outbound is about sending text on its way when you're done.

To illustrate text handling in VFP, I am using the complete text of Tolstoy's War And Peace, included on the conference CD as WarAndPeace.TXT, which, along with thousands of other works of literature, is available on the web.

This article was originally written using Visual FoxPro version 6, and has since been updated for VFP 7 and VFP 8.

Some facts about VFP strings

Here are a few things you need to know about VFP strings:

In functional terms, there is no difference between a character field and a memo field. All functions that work on characters also work on memos.

The maximum number of characters that VFP can handle in a string is 16,777,184.

Inbound

This section is all about getting text into your processing environment.

Inbound text from table fields

To retrieve text from a table field, simply assign it to a memory variable.

Inbound from text files

There are many ways to retrieve text from files on disk.

FILETOSTR( cFileName ) is used to place the contents of a disk file into a string memory variable. This is among my favorite new functions in VFP 6. It's both useful and fast. For example, the following code executes in one-seventh of a second on my 220Mhz Pentium laptop.

In other words, on a very modest laptop ( by today's standards ) VFP can load the full text from Tolstoy's War And Peace in one-seventh of a second.

Low Level File Functions ( LLFF ) are somewhat more cumbersome but offer great control. LLFF are also very fast. The following example reads the entire contents of Tolstoy's War And Peace from disk into memory:

Given the similar execution times, I think we can conclude that internally, LLFF and FILETOSTR() are implemented similarly. However with the LLFF we also have fine control. For example, FGETS() allows us to read a line at a time. To illustrate, the following code reads the first 15 lines of War And Peace into array wpLines.

We can also retrieve a segment from War And Peace. FSEEK() moves the LLFF pointer, and the FREAD() function is used to read a range. Let's read, say, 1000 bytes about halfway through the book.

Inbound from text files, with pre-processing

Sometimes you need to pre-process text before it is usable. For example, you may have an HTML file from which you need to clean and remove tags. Or maybe you have the problem exhibited by our copy of War and Peace, which has embedded hard-returns at the end of each line. How can we create a streaming document that we can actually format?

Often the answer is to use the APPEND FROM command, which imports from a file into a table, and moreover supports a large variety of file formats. The strategy always works something like this: You create a single-field table, and you use APPEND FROM .. TYPE SDF to load it.

Now you're good to go: You've got a table of records that you can manipulate and transform to your heart's content using VFP's vast collection of functions.

Processing

This section discusses a wide variety of string manipulation techniques in Visual FoxPro. Let's say we've got some text in our environment; now let's muck with it.

Does a sub-string exist?

There are many ways to determine if a sub-string exists in a string. The $ operator returns True if a sub-string is contained in a string and False otherwise. This operator is fast. Try this:

The AT() and ATC() functions are also great for determining if a sub-string exists, the latter having the advantage of being case-insensitive, and, moreover, their return values give you the exact position of the sub-string.

The OCCURS() function will also tell you if a sub-string exists, and moreover tell you how many times the sub-string occurs. This code will count the number of occurrences of a variety of sub-strings in War And Peace.

Locating sub-strings in strings is something VFP does really well.

Locating sub-strings

One of the basic tasks in almost any string manipulation is locating sub-strings within larger strings. Four useful functions for this are AT(), RAT(), ATC(), and RATC(). These return the ordinal position of a sub-string, searching from the left ( AT() ) or from the right ( RAT() ), and both have case-insensitive variants ( ATC() and RATC() ). All these functions are very fast and scale well with file size. For example, let's go look for "THE END" in War And Peace.

You can also check for the nth occurrence of a sub-string, as illustrated below where we find the 1st, 101st, 201st..701st occurrence of the word "Russia" in War And Peace.

Two other functions are useful for locating strings: ATLINE() and ATCLINE(). These return the line number of the first occurrence of a string.

Note: Prior to VFP 7, functions that are sensitive to SET MEMOWIDTH, like ATLINE() and ATCLINE(), among others, are dog-slow on larger strings and so do not scale well at all.

Traversing text line-by-line

Iterating through text, one line at a time, is a common task. Here's the way VFP developers have been doing it for years: Using the MEMLINES() and MLINE() functions. Like this:

That's pathetic performance. 20+ seconds to iterate through 767 lines! Fortunately, there's a trick to using MLINE(), which is to pass the _MLINE system memory variable as the third parameter. Like this.

Now that's more like it: a fifty-fold improvement. A surprising number of VFP developers don't know this idiom with _MLINE even though it's been documented in the FoxPro help since version 2 at least.

Starting in VFP 6 all this is obsolete, since ALINES() is a screaming new addition to the language. Let's see how these routines look and perform with ALINES().

Another twenty-fold improvement in speed. I think the lesson is clear: If you are using MLINE() in your applications, and you are using VFP 6, then it's time to switch to ALINES(). There are just two major differences: First, ALINES() is limited by VFP's 65,000 array element limit, and second, successive lines with only CHR( 13 ) carriage returns are considered as one line. For example:

But if you use carriage return + line feed, CHR( 13 )+CHR( 10 ), you'll get the results you expect.

This is a bit unnerving if blank lines are important, so beware and use CHR( 13 )+CHR( 10 ) to avoid this problem.

Now, just for fun, let's rip through War And Peace using ALINES().

Excuse me, but wow, considering we're creating a 54,337 element array from a file on disk, then we're traversing the entire array assigning each element's contents to a memory variable, and we're back in 3.4 seconds.

What about just creating the array of War And Peace:

So, on my Pentium 233 laptop using VFP 6, we can load War and Peace from disk into a 54,000-item array in 2.2 seconds. On my newer desktop machine, a Pentium 500, this task is subsecond.

Traversing text word-by-word

You could recursively traverse a string word-by-word by using, among other things, the return value from AT( , x, n ) and SUBS( , , ) but, if you are doing that, you're missing a great and little known feature of VFP.

Two new functions are great for word-by-word text processing. The GETWORDCOUNT() and GETWORDNUM() functions return the number of words and individual words respectively.

Prior to VFP 7, use the Words() and WordNum() functions, which are available when you load the FoxTools.FLL library and return the number of words and individual words respectively.

Let's see how they perform. Let's first count the words in War And Peace.

The GETWORDCOUNT() function is also useful for counting all sorts of tokens since you can pass the word delimiters in the second parameter. How many sentences are there in War And Peace?

GETWORDNUM() returns a specific word from a string. What's the 666th word in War And Peace? What about the 500000th?

Similarly to GETWORDCOUNT(), we can use GETWORDNUM() to return a token from a string by specifying the delimiter. What's the 2000th sentence in War And Peace?

Substituting text

VFP has a number of useful functions for substituting text: STRTRAN(), CHRTRAN(), CHRTRANC(), STUFF(), and STUFFC().

STRTRAN() replaces occurrences of a string with another. For example, let's change all occurrences of "Anna" to "the McBride twins" in War And Peace.

That's over 125 replacements per second, which is phenomenal. What about removing strings?

So it appears that STRTRAN() both adds and removes strings with equal aplomb. What of CHRTRAN(), which swaps characters? Let's, say, change all "s" to "ch" in War and Peace.

Which isn't bad considering that there are 159,218 occurrences of character "s" in War And Peace.

However, don't try to use CHRTRAN() when the second parameter is an empty string. The performance of CHRTRAN() in these circumstances is terrible. If you need to suppress sub-strings, use STRTRAN() instead.

String Concatenation

VFP has tremendous concatenation speed if you use it in a particular way. Since many common tasks, like building web pages, involve building documents one element at a time, you should know that string expressions of the form x = x+y are very fast in VFP. Consider this:

The same type of performance applies if you build strings small chunks at a time, which is a typical scenario in dynamic Web pages whether a template engine or raw output is used. For example:

This full optimization occurs as long as the string is adding something to itself and as long as the string concatenated is stored in a variable. Using class properties is somewhat less efficient. String optimization does not occur if the first expression on the right of the = sign is not the same as the string being concatenated. So:

is not optimized in this fashion. The above line, placed in the example above, takes 25 seconds! So appending strings to strings is blazingly fast in most common situations.

Outputting text

So you've got text, maybe a lot of it. What are your options for writing it to disk?

First and foremost, there's the new STRTOFILE() function, which creates a disk file with the contents of a string. Let's write War And Peace to disk.

Which means that you can dish 3+ Mb to disk in about a half-second.

You can also use Low Level File Functions ( LLFF ) to output text. The FWRITE() function dumps all or part of a string to disk. The FPUTS() function outputs a single line from the string, and moves the pointer.

Here again, the similar performance times between FWRITE() and STRTOFILE() are striking, just as they were when comparing FREAD() and FILETOSTR().

Here's an example of outputting War And Peace line-by-line using FPUTS(). Since we're using ALINES(), it's not that onerous a task. In fact, it's very slick!

Conclusion

So, there you have it, a cafeteria-style tour of VFP's text handling capabilities. I personally think that most of the code snippets I've shown here have amazing and borderline unbelievable execution speeds. I hope I've been able to show that VFP really excels at string handling.

It is common knowledge that chemical reactions occur more rapidly at higher temperatures. Milk turns sour much more rapidly if stored at room temperature rather than in a refrigerator; butter goes rancid more quickly in the summer than in the winter; and eggs hard-boil more quickly at sea level than in the mountains. For the same reason, cold-blooded animals such as reptiles and insects tend to be more lethargic on cold days.

The reason for this is not hard to understand. Thermal energy relates directly to motion at the molecular level. As the temperature rises, molecules move faster and collide more vigorously, greatly increasing the likelihood of bond cleavages and rearrangements. Whether it is through the collision theory, transition state theory, or just common sense, chemical reactions are typically expected to proceed faster at higher temperatures and slower at lower temperatures.

By the late 1880s it was common knowledge that higher temperatures speed up reactions, often doubling the rate for a 10-degree rise, but the reasons for this were not clear. In 1889, the Swedish chemist Svante Arrhenius (1859-1927) combined the concepts of activation energy and the Boltzmann distribution law into one of the most important relationships in physical chemistry:

\[ k = A e^{-E_a/RT} \]

Take a moment to focus on the meaning of this equation, neglecting the A factor for the time being.

First, note that this is another form of the exponential decay law discussed in the previous section of this series. What is 'decaying' here is not the concentration of a reactant as a function of time, but the magnitude of the rate constant as a function of the exponent \(-E_a/RT\). And what is the significance of this quantity? Recalling that \(RT\) is the average kinetic energy, it becomes apparent that the exponent is just the ratio of the activation energy \(E_a\) to the average kinetic energy. The larger this ratio, the smaller the rate (hence the negative sign). This means that high temperature and low activation energy favor larger rate constants, and thus speed up the reaction. Because these terms occur in an exponent, their effects on the rate are quite substantial.

The two plots below show the effects of the activation energy (denoted here by E^‡) on the rate constant. Even a modest activation energy of 50 kJ/mol reduces the rate by a factor of 10⁸.
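To get a feel for the size of this effect, the exponential factor can be evaluated directly. The sketch below assumes a temperature of 300 K, which the text does not state; the function name `boltzmann_factor` is likewise just an illustrative choice.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def boltzmann_factor(ea_j_per_mol, temp_k):
    """Return exp(-Ea / (R*T)), the exponential part of the Arrhenius equation."""
    return math.exp(-ea_j_per_mol / (R * temp_k))


# Assumption: T = 300 K (room temperature); the source gives no temperature.
factor_50 = boltzmann_factor(50_000, 300.0)
print(f"Ea = 50 kJ/mol at 300 K: exp(-Ea/RT) = {factor_50:.2e}")
```

Under that assumption the factor comes out on the order of 10⁻⁹ to 10⁻⁸, in line with the reduction quoted above.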

Looking at the role of temperature, a similar effect is observed. (If the x-axis were in 'kilodegrees' the slopes would be more comparable in magnitude with those of the kilojoule plot at the above right.)

Determining the activation energy

The Arrhenius equation,

\[ k = A e^{-E_a/RT} \tag{1} \]

can be written in a non-exponential form that is often more convenient to use and to interpret graphically. Taking the logarithms of both sides and separating the exponential and pre-exponential terms yields

\[ \ln k = \ln \left(A e^{-E_a/RT}\right) = \ln A + \ln \left(e^{-E_a/RT}\right) \tag{2} \]

\[ \ln k = \ln A + \dfrac{-E_a}{RT} = \left(\dfrac{-E_a}{R}\right)\left(\dfrac{1}{T}\right) + \ln A \tag{3} \]

which is the equation of a straight line whose slope is \(-E_a/R\). This affords a simple way of determining the activation energy from values of k observed at different temperatures, by plotting \(\ln k\) as a function of \(1/T\).

Example 1: Isomerization of Cyclopropane

For the isomerization of cyclopropane to propene,

the following data were obtained (calculated values shaded in pink):

T, °C	477	523	577	623
1/T, K^–1 × 10³	1.33	1.25	1.18	1.11
k, s^–1	0.00018	0.0027	0.030	0.26
ln k	–8.62	–5.92	–3.51	–1.35

From the calculated slope, we have

\[ -\dfrac{E_a}{R} = -3.27 \times 10^{4}\ \mathrm{K} \]

\[ E_a = -(8.314\ \mathrm{J\ mol^{-1}\ K^{-1}})(-3.27 \times 10^{4}\ \mathrm{K}) = 272\ \mathrm{kJ\ mol^{-1}} \]

Comment: This activation energy is high, which is not surprising because a carbon-carbon bond must be broken in order to open the cyclopropane ring. (C–C bond energies are typically around 350 kJ/mol.) This is why the reaction must be carried out at high temperature.
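The graphical procedure can also be carried out numerically. The following is a minimal sketch that fits a least-squares line to ln k versus 1/T using the four data points from the table; the helper `least_squares_slope` and the 273.15 offset for converting to kelvin are choices made here, not part of the original example.

```python
import math

R = 8.314  # gas constant in J/(mol*K)

# Data from the table in Example 1.
temps_c = [477, 523, 577, 623]
rate_constants = [0.00018, 0.0027, 0.030, 0.26]  # s^-1

inv_t = [1.0 / (t + 273.15) for t in temps_c]  # x values: 1/T in K^-1
ln_k = [math.log(k) for k in rate_constants]   # y values: ln k


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope = least_squares_slope(inv_t, ln_k)  # equals -Ea/R, in K
ea_kj = -slope * R / 1000.0               # activation energy in kJ/mol
print(f"slope = {slope:.3e} K, Ea = {ea_kj:.0f} kJ/mol")
```

A fitted slope will differ slightly from one read off a hand-drawn plot, so the resulting activation energy should be expected to land near, not exactly on, the value quoted in the example.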

Calculating \(E_a\) without a plot

Because the ln k-vs.-1/T plot yields a straight line, it is often convenient to estimate the activation energy from experiments at only two temperatures. To see how this is done, consider that

\[ \ln k_2 - \ln k_1 = \left(\ln A - \frac{E_a}{RT_2}\right) - \left(\ln A - \frac{E_a}{RT_1}\right) = \boxed{\frac{E_a}{R}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)} \]

(The \(\ln A\) term is eliminated by subtracting the expressions for the two \(\ln k\) terms.) Solving the expression on the right for the activation energy yields

\[ E_a = \dfrac{R \ln \dfrac{k_2}{k_1}}{\dfrac{1}{T_1} - \dfrac{1}{T_2}} \]

Example 2

A widely used rule-of-thumb for the temperature dependence of a reaction rate is that a ten degree rise in the temperature approximately doubles the rate. This is not generally true, especially when a strong covalent bond must be broken. For a reaction that does show this behavior, what would the activation energy be?

Solution

Center the ten degree interval at 300 K. Substituting into the above expression yields

\[ E_a = \dfrac{(8.314)(\ln 2/1)}{\dfrac{1}{295} - \dfrac{1}{305}} = \dfrac{(8.314\ \mathrm{J\ mol^{-1}\ K^{-1}})(0.693)}{0.00339\ \mathrm{K^{-1}} - 0.00328\ \mathrm{K^{-1}}} \]

\[ = \dfrac{5.76\ \mathrm{J\ mol^{-1}\ K^{-1}}}{0.00011\ \mathrm{K^{-1}}} = 52400\ \mathrm{J\ mol^{-1}} = 52.4\ \mathrm{kJ\ mol^{-1}} \]
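The two-temperature formula is easy to wrap in a small function. This is a sketch only; the name `activation_energy` and its argument order are illustrative. Because it does not round the denominator to 0.00011, it returns a value slightly below the hand-calculated 52.4 kJ/mol.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def activation_energy(k1, k2, t1_k, t2_k):
    """Ea in J/mol from rate constants k1 at t1_k and k2 at t2_k (kelvin)."""
    return R * math.log(k2 / k1) / (1.0 / t1_k - 1.0 / t2_k)


# Example 2: the rate doubles between 295 K and 305 K.
ea_doubling = activation_energy(1.0, 2.0, 295.0, 305.0)
print(f"Ea = {ea_doubling / 1000:.1f} kJ/mol")
```

Only the ratio k2/k1 matters, so any pair of values with a ratio of 2 gives the same result.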

Example 3

It takes about 3.0 minutes to cook a hard-boiled egg in Los Angeles, but at the higher altitude of Denver, where water boils at 92°C, the cooking time is 4.5 minutes. Use this information to estimate the activation energy for the coagulation of egg albumin protein.

Solution

The ratio of the rate constants at the elevations of Los Angeles and Denver is 4.5/3.0 = 1.5, and the respective temperatures are \(373\ \mathrm{K}\) and \(365\ \mathrm{K}\). With the subscripts 2 and 1 referring to Los Angeles and Denver respectively:

\[ E_a = \dfrac{(8.314)(\ln 1.5)}{\dfrac{1}{365\ \mathrm{K}} - \dfrac{1}{373\ \mathrm{K}}} = \dfrac{(8.314)(0.405)}{0.00274\ \mathrm{K^{-1}} - 0.00268\ \mathrm{K^{-1}}} \]

\[ = \dfrac{3.37\ \mathrm{J\ mol^{-1}\ K^{-1}}}{5.87 \times 10^{-5}\ \mathrm{K^{-1}}} = 57400\ \mathrm{J\ mol^{-1}} = 57.4\ \mathrm{kJ\ mol^{-1}} \]

Comment: This low value seems reasonable because thermal denaturation of proteins primarily involves the disruption of relatively weak hydrogen bonds; no covalent bonds are broken (although disulfide bonds can interfere with this interpretation).
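The arithmetic for the egg example can be checked in a few lines. The sketch assumes, as the solution does, that the rate constant is inversely proportional to the cooking time, so the ratio of rate constants equals the inverse ratio of times; the variable names are illustrative.

```python
import math

R = 8.314  # gas constant in J/(mol*K)

time_la_min = 3.0      # cooking time in Los Angeles (water boils at 373 K)
time_denver_min = 4.5  # cooking time in Denver (water boils at 365 K)
t_la_k = 373.0
t_denver_k = 365.0

# Rate is taken as proportional to 1/time, so k_LA / k_Denver = 4.5 / 3.0.
rate_ratio = time_denver_min / time_la_min

ea_egg = R * math.log(rate_ratio) / (1.0 / t_denver_k - 1.0 / t_la_k)
print(f"Ea = {ea_egg / 1000:.1f} kJ/mol")
```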

The pre-exponential factor

Up to this point, the pre-exponential term, A in the Arrhenius equation, has been ignored because it is not directly involved in relating temperature and activation energy, which is the main practical use of the equation.

However, because A multiplies the exponential term, its value clearly contributes to the value of the rate constant and thus of the rate. Recall that the exponential part of the Arrhenius equation expresses the fraction of reactant molecules that possess enough kinetic energy to react, as governed by the Maxwell-Boltzmann law. This fraction can run from zero to nearly unity, depending on the magnitudes of \(E_a\) and of the temperature.

If this fraction were unity, the Arrhenius law would reduce to

\[ k = A \]

In other words, \(A\) is the value the rate constant would have if either the activation energy were zero, or if the kinetic energy of all molecules exceeded \(E_a\); admittedly, an uncommon scenario (although barrierless reactions have been characterized).

The role of collisions

What would limit the rate constant if there were no activation energy requirements? The most obvious factor would be the rate at which reactant molecules come into contact. This can be calculated from kinetic molecular theory and is known as the frequency factor or collision factor, \(Z\).

In some reactions, the relative orientation of the molecules at the point of collision is important, so a geometrical or steric factor (commonly denoted by \(\rho\), Greek lower case rho) can be defined. In general, we can express \(A\) as the product of these two factors:

\[ A = Z\rho \]

Values of \(\rho\) are generally very difficult to assess; they are sometimes estimated by comparing the observed rate constant with the one in which A is assumed to be the same as \(Z\).

Introduction

The Arrhenius equation was given a physical justification and interpretation in 1889 by Svante Arrhenius, a Swedish chemist. Arrhenius performed experiments that correlated chemical reaction rate constants with temperature. After observing that many chemical reaction rates depended on the temperature, Arrhenius developed this equation to characterize the temperature-dependent reactions:

\[ \large k = A e^{-\frac{E_a}{RT}} \]

\[ \large \ln k = \ln A - \frac{E_a}{RT} \]

With the following terms:

k: Chemical reaction rate constant

In units of \(\mathrm{s^{-1}}\) (for a 1st-order rate constant) or \(\mathrm{M^{-1}\ s^{-1}}\) (for a 2nd-order rate constant)

A: The pre-exponential factor or frequency factor

Specifically relates to molecular collision
Deals with the frequency of molecules that collide in the correct orientation and with enough energy to initiate a reaction.
It is a factor that is determined experimentally, as it varies with different reactions.
In units of \(\mathrm{L\ mol^{-1}\ s^{-1}}\) or \(\mathrm{M^{-1}\ s^{-1}}\) (for a 2nd-order rate constant) and \(\mathrm{s^{-1}}\) (for a 1st-order rate constant)
Because the frequency factor A is related to molecular collision, it is temperature dependent
It is hard to extrapolate the pre-exponential factor because \(\ln k\) is only linear over a narrow range of temperature

E_a: The activation energy is the threshold energy that the reactant(s) must acquire before reaching the transition state.

Once in the transition state, the reaction can go in the forward direction towards product(s), or in the opposite direction towards reactant(s).
A reaction with a large activation energy requires much more energy to reach the transition state.
Likewise, a reaction with a small activation energy doesn't require as much energy to reach the transition state.
In units of kJ/mol.
The factor \(e^{-E_a/RT}\) resembles the Boltzmann distribution law.

R: The gas constant.

Its value is 8.314 J/(mol K).

T: The absolute temperature at which the reaction takes place.

In units of Kelvin (K).

Implications

The exponential term in the Arrhenius equation implies that the rate constant of a reaction increases exponentially when the activation energy decreases. Because the rate of a reaction is directly proportional to the rate constant of a reaction, the rate increases exponentially as well. Because a reaction with a small activation energy does not require much energy to reach the transition state, it should proceed faster than a reaction with a larger activation energy.

In addition, the Arrhenius equation implies that the rate of an uncatalyzed reaction is more affected by temperature than the rate of a catalyzed reaction. This is because the activation energy of an uncatalyzed reaction is greater than the activation energy of the corresponding catalyzed reaction. Since the exponential term includes the activation energy as the numerator and the temperature as the denominator, a smaller activation energy will have less of an impact on the rate constant compared to a larger activation energy. Hence, the rate of an uncatalyzed reaction is more affected by temperature changes than a catalyzed reaction.
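The difference in temperature sensitivity can be made concrete with a short calculation. The sketch below compares how much the rate constant grows for a 10 K rise at two activation energies; the values 100 kJ/mol and 50 kJ/mol and the temperatures 298 K and 308 K are hypothetical, chosen only for illustration, and the pre-exponential factor is assumed constant so that it cancels.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def rate_constant_ratio(ea_j_per_mol, t1_k, t2_k):
    """Return k(t2_k) / k(t1_k) for a fixed pre-exponential factor A."""
    return math.exp(ea_j_per_mol / R * (1.0 / t1_k - 1.0 / t2_k))


# Hypothetical activation energies for the same reaction without and with a catalyst.
uncatalyzed = rate_constant_ratio(100_000, 298.0, 308.0)
catalyzed = rate_constant_ratio(50_000, 298.0, 308.0)

print(f"uncatalyzed: k increases {uncatalyzed:.2f}-fold")
print(f"catalyzed:   k increases {catalyzed:.2f}-fold")
```

Halving the activation energy halves the exponent, so the uncatalyzed ratio is the square of the catalyzed one.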

The Math in Eliminating the Constant A

To eliminate the constant \(A\), there must be two known temperatures and/or rate constants. With this knowledge, the following equations can be written:

\[ \ln k_1 = \ln A - \dfrac{E_a}{RT_1} \]

at \(T_1\) and

\[ \ln k_2 = \ln A - \dfrac{E_a}{RT_2} \]

at \(T_2\). Rewriting the second equation:

\[ \ln A = \ln k_2 + \dfrac{E_a}{RT_2} \]

and substituting for \(\ln A\) in the first equation:

\[ \ln k_1 = \ln k_2 + \dfrac{E_a}{RT_2} - \dfrac{E_a}{RT_1} \]

This simplifies to:

\[ \ln k_1 - \ln k_2 = -\dfrac{E_a}{RT_1} + \dfrac{E_a}{RT_2} \]

\[ \ln \dfrac{k_1}{k_2} = -\dfrac{E_a}{R}\left(\dfrac{1}{T_1} - \dfrac{1}{T_2}\right) \]

Graphically determining the Activation Energy of a Reaction

A closer look at the Arrhenius equation reveals that the natural logarithm form of the Arrhenius equation is in the form of \(y = mx + b\). In other words, it is the equation of a straight line.

\[ \ln k = \ln A - \dfrac{E_a}{RT} \]

where \(1/T\) plays the role of the independent variable \(x\) and \(\ln k\) that of the dependent variable \(y\). So if one were given a data set of various values of k, the rate constant of a certain chemical reaction at varying temperature \(T\), one could graph \(\ln k\) versus \(1/T\). From the graph, one can then determine the slope of the line and realize that this value is equal to \(-E_a/R\). One can then solve for the activation energy by multiplying through by \(-R\), where R is the gas constant.

Problems

1. Find the activation energy (in kJ/mol) of the reaction if the rate constant is 3.4 \(\mathrm{M^{-1}\ s^{-1}}\) at 600 K and 31.0 \(\mathrm{M^{-1}\ s^{-1}}\) at 750 K.
2. Find the rate constant if the temperature is 289 K, the activation energy is 200 kJ/mol and the pre-exponential factor is 9 \(\mathrm{M^{-1}\ s^{-1}}\).
3. Find the new rate constant at 310 K if the rate constant is 7 \(\mathrm{M^{-1}\ s^{-1}}\) at 370 K and the activation energy is 900 kJ/mol.
4. Calculate the activation energy if the pre-exponential factor is 15 \(\mathrm{M^{-1}\ s^{-1}}\), the rate constant is 12 \(\mathrm{M^{-1}\ s^{-1}}\) and the temperature is 22 K.
5. Find the new temperature if the rate constant at that temperature is 15 \(\mathrm{M^{-1}\ s^{-1}}\) while at 389 K the rate constant is 7 \(\mathrm{M^{-1}\ s^{-1}}\), and the activation energy is 600 kJ/mol.

Solutions

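1. \(E_a\) is the quantity the question asks for. The starting point is

\( \large \ln k = -\frac{E_a}{RT} + \ln A \)

To isolate \(E_a\), subtract \(\ln A\) from both sides and multiply by \(-RT\).

This will give us:

\( E_a = (\ln A - \ln k)RT \)

Because \(A\) is not given in this problem, write the logarithmic form at both temperatures and subtract to eliminate \(\ln A\), which gives \( E_a = R \ln (k_2/k_1) / (1/T_1 - 1/T_2) \). With \(k_1 = 3.4\) at \(T_1 = 600\ \mathrm{K}\) and \(k_2 = 31.0\) at \(T_2 = 750\ \mathrm{K}\), this yields \(E_a \approx 55.1\ \mathrm{kJ\ mol^{-1}}\).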

2. Substitute the numbers into the equation:

\( \ln k = \dfrac{-(200 \times 1000\ \mathrm{J\ mol^{-1}})}{(8.314\ \mathrm{J\ mol^{-1}\ K^{-1}})(289\ \mathrm{K})} + \ln 9 \)

\( k = 6.37 \times 10^{-36}\ \mathrm{M^{-1}\ s^{-1}} \)

3. Use the equation \( \ln (k_1/k_2) = -\dfrac{E_a}{R}\left(\dfrac{1}{T_1} - \dfrac{1}{T_2}\right) \)

\( \ln (7/k_2) = -\dfrac{900 \times 1000}{8.314}\left(\dfrac{1}{370} - \dfrac{1}{310}\right) \)

\( k_2 = 1.788 \times 10^{-24}\ \mathrm{M^{-1}\ s^{-1}} \)

4. Use the equation \( k = A e^{-E_a/RT} \)

\( 12 = 15\, e^{-E_a/[(8.314)(22)]} \)

\( E_a = 40.82\ \mathrm{J\ mol^{-1}} \)

5. Use the equation \( \ln (k_1/k_2) = -\dfrac{E_a}{R}\left(\dfrac{1}{T_1} - \dfrac{1}{T_2}\right) \)

\( \ln (15/7) = -\dfrac{600 \times 1000}{8.314}\left(\dfrac{1}{T_1} - \dfrac{1}{389}\right) \)

\( T_1 = 390.6\ \mathrm{K} \)
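The five problems can be checked numerically with a short script. This is a sketch rather than part of the original solutions; the variable names are illustrative, and Problem 1 is evaluated with the two-temperature form because no pre-exponential factor is given for it.

```python
import math

R = 8.314  # gas constant in J/(mol*K)

# Problem 1: Ea from k = 3.4 at 600 K and k = 31.0 at 750 K (two-point form).
ea_1 = R * math.log(31.0 / 3.4) / (1.0 / 600.0 - 1.0 / 750.0)  # J/mol

# Problem 2: k from A = 9, Ea = 200 kJ/mol, T = 289 K.
k_2 = 9.0 * math.exp(-200_000.0 / (R * 289.0))

# Problem 3: new k at 310 K, given k = 7 at 370 K and Ea = 900 kJ/mol.
k_3 = 7.0 * math.exp(-900_000.0 / R * (1.0 / 310.0 - 1.0 / 370.0))

# Problem 4: Ea from k = 12, A = 15, T = 22 K.
ea_4 = -R * 22.0 * math.log(12.0 / 15.0)  # J/mol

# Problem 5: temperature at which k = 15, given k = 7 at 389 K and Ea = 600 kJ/mol.
inv_t_5 = 1.0 / 389.0 - R * math.log(15.0 / 7.0) / 600_000.0
t_5 = 1.0 / inv_t_5

print(f"1. Ea = {ea_1 / 1000:.1f} kJ/mol")
print(f"2. k  = {k_2:.2e} M^-1 s^-1")
print(f"3. k  = {k_3:.3e} M^-1 s^-1")
print(f"4. Ea = {ea_4:.2f} J/mol")
print(f"5. T  = {t_5:.1f} K")
```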


Contributors and Attributions

Guenevieve Del Mundo, Kareem Moussa, Pamela Chacha, Florence-Damilola Odufalu, Galaxy Mudda, Kan, Chin Fung Kelvin