主题五:过度使用技术
Test automation is based on a simple economic proposition:
测试自动化基于一个简单的经济观点:
· If a manual test costs $X to run the first time, it will cost just about $X to run each time thereafter, whereas:
· 如果第一次手工测试的成本是$X,则其后每次测试的成本大致都是$X,然而:
· If an automated test costs $Y to create, it will cost almost nothing to run from then on.
· 如果创建自动化测试的成本是$Y,则其后的运行成本几乎为零。
$Y is bigger than $X. I've heard estimates ranging from 3 to 30 times as big, with the most commonly cited number seeming to be 10. Suppose 10 is correct for your application and your automation tools. Then you should automate any test that will be run more than 10 times.
$Y比$X大。我了解到的估计范围是从3至30倍,而常常被引用的数值似乎是10。假设10对于应用程序和自动化工具是正确的。这样应当将运行10次以上的测试都进行自动化。
A classic mistake is to ignore these economics, attempting to automate all tests, even those that won't be run often enough to justify it. What tests clearly justify automation?
一个典型错误是忽略这些经济上的考虑,试图自动化所有的测试,甚至包括那些不常运行的测试以至不能证明自动化是必要的。哪些测试能明显地证明自动化是必要的?
· Stress or load tests may be impossible to implement manually. Would you have a tester execute and check a function 1000 times? Are you going to sit 100 people down at 100 terminals?
· 压力或负载测试可能无法手工实现。你会让测试员执行并检查一个函数1000次吗?你会找100个人坐在100个终端前面吗?
· Nightly builds are becoming increasingly common. (See [McConnell96] or [Cusumano95] for descriptions of the procedure.) If you build the product nightly, you must have an automated "smoke test suite". Smoke tests are those that are run after every build to check for grievous errors.
· 夜间构建变得越来越普遍了。(参见[McConnell96]或[Cusumano95]可以了解这个过程的描述)。如果在夜间构建产品,就必须有一个自动化的“冒烟测试套件”。 冒烟测试指的是那些在每次构建之后都去检查严重错误的测试。
· Configuration tests may be run on dozens of configurations.
· 配置测试可能要在数十种配置上运行。
The other kinds of tests are less clear-cut. Think hard about whether you'd rather have automated tests that are run often or ten times as many manual tests, each run once. Beware of irrational, emotional reasons for automating, such as testers who find programming automated tests more fun, a perception that automated tests will lead to higher status (everything else is "monkey testing"), or a fear of not rerunning a test that would have found a bug (thus leading you to automate it, leaving you without enough time to write a test that would have found a different bug).
其他种类的测试不是这个明显。仔细想一下,对于那些多次运行或者运行次数是手工运行次数10倍的自动化测试,你是否只运行一次。要当心实现自动化的不理性的、感情的原因,例如测试员发现程序自动化测试更有趣,认为自动化测试将带来更高的地位(其他测试都是“猴子测试”),或者是害怕不能重新运行一个会发现 bug 的测试(这导致你将它自动化,使你没有足够的时间编写一个会发现其他 bug 的测试)。
You will likely end up in a compromise position, where you have:
你可能在最后有一个折中的方式,你将有:
1. a set of automated tests that are run often.
一套经常运行的自动化测试。
2. a well-documented set of manual tests. Subsets of these can be rerun as necessary. For example, when a critical area of the system has been extensively changed, you might rerun its manual tests. You might run different samples of this suite after each major build.
一套文档齐备的手工测试。这些测试的子集合可以在需要的时候重新运行。例如,当一个系统的关键领域被大规模地改变时,可能会重新运行手工测试。在每一次主要构建之后,都可能会运行这个套件的不同样例。
3. a set of undocumented tests that were run once (including exploratory "bug bash" tests).
一套没有文档的、只运行一次的测试(包括探索性的“bug 大清除”测试)。
Beware of expecting to rerun all manual tests. You will become bogged down rerunning tests with low bug-finding value, leaving yourself no time to create new tests. You will waste time documenting tests that don't need to be documented.
注意不要期望重新运行所有的手工测试。重新运行这些很少能找到 bug 的测试会使你停滞不前,使你自己没有时间创建新的测试。你会把时间浪费在为不需要文档的测试编写文档上。
You could automate more tests if you could lower the cost of creating them. That's the promise of using GUI capture/replay tools to reduce test creation cost. The notion is that you simply execute a manual test, and the tool records what you do. When you manually check the correctness of a value, the tool remembers that correct value. You can then later play back the recording, and the tool will check whether all checked values are the same as the remembered values.
如果你能够降低创建自动测试的成本,就可以多做一些。这也是使用GUI捕获/回放工具能够减少创建测试的成本的承诺。这个想法是你只需要执行手工测试,工具会录制下你所做的操作。当你手工检查一个值是否正确时,工具会记着那个正确值。过后你可以回放录制,工具会检查是否所有检查的值是否都与记忆的值相同。
There are two variants of such tools. What I call the first generation tools capture raw mouse movements or keystrokes and take snapshots of the pixels on the screen. The second generation tools (often called "object oriented") reach into the program and manipulate underlying data structures (widgets or controls).
这类工具有两个变种。我称为第一代的工具只是捕获原始的鼠标移动或击键操作,并记下屏幕上象素的瞬象。第二代工具(常称为“面向对象的”)深入程序并操纵底层数据结构(小配件或控件)。
First generation tools produce unmaintainable tests. Whenever the screen layout changes in the slightest way, the tests break. Mouse clicks are delivered to the wrong place, and snapshots fail in irrelevant ways that nevertheless have to be checked. Because screen layout changes are common, the constant manual updating of tests becomes insupportable.
第一代工具产生的是不可维护的测试。不论什么时候,只要屏幕布局有了非常微小的变化,测试就要中断。鼠标的点击传送到不正确的位置,瞬象以一种不相关的方式失败,必须予以检查。因为屏幕布局变化是常见情况,所以经常手动更新测试也变得无法忍受。
Second generation tools are applicable only to tests where the underlying data structures are useful. For example, they rarely apply to a photograph editing tool, where you need to look at an actual image - at the actual bitmap. They also tend not to work with custom controls. Heavy users of capture/replay tools seem to spend an inordinate amount of time trying to get the tool to deal with the special features of their program - which raises the cost of test automation.
第二代工具只有在底层数据结构有用时才是可行的。例如,它们很少能用于照片编辑工具,因为你需要查看实际的图象,即实际的位图。它们也不大能够与定制的控件一起使用。大量用户的捕获/回放工具似乎都要花费大量时间来使得工具能够处理他们程序的特殊功能——这增加了自动测试的成本。
Second generation tools do not guarantee maintainability either. Suppose a radio button is changed to a pulldown list. All of the tests that use the old controls will now be broken.
第二代工具也不能保证可维护性。假设一个单选按钮改变为下拉列表。所有使用老控件的测试都将中断。
GUI interface changes are of course common, especially between releases. Consider carefully whether an automated test that must be recaptured after GUI changes is worth having. Keep in mind that it can be hard to figure out what a captured test is attempting to accomplish unless it is separately documented.
GUI界面当然是常常会改变的,特别是在不同的发行版之间。仔细考虑一下一个在GUI变化之后必须重新捕获的自动化测试工具是否值得拥有。记住,除非另外使用文档记录下来,否则想要了解一个录制的测试能够完成什么工作是一件困难的事。
As a rule of thumb, it's dangerous to assume that an automated test will pay for itself this release, so your test must be able to survive a reasonable level of GUI change. I believe that capture/replay tests, of either generation, are rarely robust enough.
一个基本原则是,认为自动化测试的投资在这个发行版就能收回的想法是危险的,所以在一个合理的GUI变化范围之内测试必须能够继续使用。我相信不论是第一代还是第二代捕获/回放测试,都不够健壮。
An alternative approach to capture/replay is scripting tests. (Most GUI capture/replay tools also allow scripting.) Some member of the testing team writes a "test API" (application programmer interface) that lets other members of the team express their tests in less GUI-dependent terms. Whereas a captured test might look like this:
捕获/回放的一个替代方法是脚本化测试。(大多数GUI捕获/回放工具都允许编写脚本。)测试小组的某些成员编写一个“测试API(应用编程接口)”,允许小组的其他成员以较少依赖GUI的方式表达他们的测试。一个捕获的测试类似于这样:
· text $main.accountField "12"
click $main.OK
menu $operations
menu $withdraw
click $withdrawDialog.all
...
文本 $main.accountField "12"
点击 $main.OK
菜单 $operations
菜单 $withdraw
点击 $withdrawDialog.all
a script might look like this:
而一个脚本类似于:
· select-account 12
withdraw all
...
select-account 12
withdraw all
The script commands are subroutines that perform the appropriate mouse clicks and key presses. If the API is well-designed, most GUI changes will require changes only to the implementation of functions like withdraw, not to all the tests that use them. Please note that well-designed test APIs are as hard to write as any other good API. That is, they're hard, and you shouldn't expect to get it right the first time.
脚本命令是执行适当的鼠标点击和按键的子程序。如果API设计得好,大多数GUI变化仅需要对函数(例如withdraw)实现变化,而不是所有使用它们的测试。请注意设计精良的API和其他好API一样难写。也就是说,因为它们不容易写,你也不要指望第一次就得到正确结果。
In a variant of this approach, the tests are data-driven. The tester provides a table describing key values. Some tool reads the table and converts it to the appropriate mouse clicks. The table is even less vulnerable to GUI changes because the sequence of operations has been abstracted away. It's also likely to be more understandable, especially to domain experts who are not programmers. See [Pettichord96] for an example of data-driven automated testing.
这个方法的一个变种,是数据驱动的测试。测试员提供一个表来描述键值。某些工具读取表并将它转换为特定的鼠标点击。这个表即使在GUI变化时也不易受到损害,因为操作序列已经被抽象出来了。它也有可能是更易于理解,尤其是对于非程序员的领域专家。查看[Pettichord96]可以获得数据驱动自动化测试的示例。
Note that these more abstract tests (whether scripted or data-driven) do not necessarily test the user interface thoroughly. If the Withdraw dialog can be reached via several routes (toolbar, menu item, hotkey), you don't know whether each route has been tried. You need a separate (most likely manual) effort to ensure that all the GUI components are connected correctly.
注意这些抽象测试(不论是脚本化的还是数据驱动的)不一定会完全测试用户界面。如果“取款”对话框能够通过几个途径(工具条、菜单项)达到,你无法知道是否尝试了每个路线。你需要一个单独的(很可能是手工的)的工作来确保所有的GUI部件都正确地连接。
Whatever approach you take, don't fall into the trap of expecting regression tests to find a high proportion of new bugs. Regression tests discover that new or changed code breaks what used to work. While that happens more often than any of us would like, most bugs are in the product's new or intentionally changed behavior. Those bugs have to be caught by new tests.
不论你采用的是什么方法,不要陷入期望回归测试发现高比例的新 bug 的陷阱。回归测试是发现以前起作用、但新代码或更改后的代码不起作用的现象。虽然它比我们希望的发生的次数更多,但许多 bug 是产品的新的或故意更改的行为。那些 bug 必须通过新测试来捕捉。
Code coverage
代码覆盖率
GUI capture/replay testing is appealing because it's a quick fix for a difficult problem. Another class of tool has the same kind of attraction.
GUI捕获/回放测试因为可以快速修复困难问题而具有吸引力。另一类工具也同样具有吸引力。
The difficult problem is that it's so hard to know if you're doing a good job testing. You only really find out once the product has shipped. Understandably, this makes managers uncomfortable. Sometimes you find them embracing code coverage with the devotion that only simple numbers can inspire. Testers sometimes also become enamored of coverage, though their romance tends to be less fervent and ends sooner.
困难的问题是很难知道你是否圆满地完成了测试工作。可能只有当产品已交付后才能真正知道。可以理解的是,这使得经理们不舒服。有时候你会发现他们热心采用代码覆盖率,认为只有那些简单的数字可以鼓舞士气。候测试员也变得倾心于覆盖率,虽然他们的兴趣没有那么高,而且结束得也快。
What is code coverage? It is any of a number of measures of how thoroughly code is exercised. One common measure counts how many statements have been executed by any test. The appeal of such coverage is twofold:
什么是代码覆盖率?它是代码是否全面执行的数字衡量。一个常见的衡量是计算所有测试共执行了多少条语句。对这种覆盖率的呼吁有两方面:
1. If you've never exercised a line of code, you surely can't have found any of its bugs. So you should design tests to exercise every line of code.
如果你从未执行过某一行代码,你当然不能找出它的任何 bug 。所以应当设计一个可以执行每一行代码的测试。
2. Test suites are often too big, so you should throw out any test that doesn't add value. A test that adds no new coverage adds no value.
测试套件常常很大,所以应该抛弃任何不能增值的测试。一个不增加新覆盖率的测试不能增加任何价值。
Only the first sentences in (1) and (2) are true. I'll illustrate with this picture, where the irregular splotches indicate bugs:
句子(1)和(2)中,只有第一句是正确的。我将用下面的图说明,其中的不规则黑点指示的是 bug :
If you write only the tests needed to satisfy coverage, you'll find bugs. You're guaranteed to find the code that always fails, no matter how it's executed. But most bugs depend on how a line of code is executed. For example, code with an off-by-one error fails only when you exercise a boundary. Code with a divide-by-zero error fails only if you divide by zero. Coverage-adequate tests will find some of these bugs, by sheer dumb luck, but not enough of them. To find enough bugs, you have to write additional tests that "redundantly" execute the code.
如果你仅编写需要满足覆盖率的测试,你会发现 bug 。那些总是失败的代码不论怎样执行,你都肯定能发现它们。但是大多数的 bug 取决于如何执行某一行代码。例如,对于“大小差一”(off-by-one)错误的代码,只有当你执行边界测试时才会失败。只有在被零除的时候,代码才会发生被零除的错误。覆盖率足够的测试会发现这些 bug 中的一部分,全靠运气,但发现得还不够多。要发现足够多的 bug ,你必须编写其他的测试“冗余地”执行代码。
For the same reason, removing tests from a regression test suite just because they don't add coverage is dangerous. The point is not to cover the code; it's to have tests that can discover enough of the bugs that are likely to be caused when the code is changed. Unless the tests are ineptly designed, removing tests will just remove power. If they are ineptly designed, using coverage converts a big and lousy test suite to a small and lousy test suite. That's progress, I suppose, but it's addressing the wrong problem.
同样的原因,因为有些测试不能增加覆盖率而将它们从回归测试套件中去掉也是危险的。关键不是覆盖代码;而是测试那些当代码更改时容易被发现的 bug 。除非测试用例是不熟练的设计,否则去掉测试用例就是去除作用力。如果它们是不熟练的设计,可以使用覆盖率将一个大而粗劣测试用例套件转化成一个小而粗劣的测试用例套件。我想这是进步,但是与这个问题无关。
A grave danger of code coverage is that it is concrete, objective, and easy to measure. Many managers today are using coverage as a performance goal for testers. Unfortunately, a cardinal rule of management applies here: "Tell me how a person is evaluated, and I'll tell you how he behaves." If a person is evaluated by how much coverage is achieved in a given time (or in how little time it takes to reach a particular coverage goal), that person will tend to write tests to achieve high coverage in the fastest way possible. Unfortunately, that means shortchanging careful test design that targets bugs, and it certainly means avoiding in-depth, repetitive testing of "already covered" code.
代码覆盖率的一个重大危险是它是具体、主观而易于衡量的。今天的许多经理都使用覆盖率作为测试员的绩效目标。不幸的是,一个重要的管理规则适用于这里:“告诉我如何评价一个人,然后我才能告诉你他的表现。”如果一个人是通过在给定的时间内覆盖了多少代码(或者是在多么少的时间内达到了特定覆盖目标)来评估的,那么那个人将倾向于以尽可能快的方式达到高覆盖率的测试。不幸的是,这将意味对以发现 bug 为目的的仔细测试设计的偷工减料,这当然也意味着避开了深层次、重复地测试“已经覆盖”的代码。
Using coverage as a test design technique works only when the testers are both designing poor tests and testing redundantly. They'd be better off at least targeting their poor tests at new areas of code. In more normal situations, coverage as a guide to design only decreases the value of the tests or puts testers under unproductive pressure to meet unhelpful goals.
仅当测试员设计了的测试质量不高并且冗余地进行测试时,将测试度作为测试设计技巧才能起作用。至少可以让他们将这些把这些质量不高的测试转移到新的领域中。在正式的场合,覆盖率作为一个设计的指导只会减少测试的价值,或将测试员置于低效率的压力下,以达到没有用处的目标。
Coverage does play a role in testing, not as a guide to test design, but as a rough evaluation of it. After you've run your tests, ask what their coverage is. If certain areas of the code have no or low coverage, you're sure to have tested them shallowly. If that wasn't intentional, you should improve the tests by rethinking their design. Coverage has told you where your tests are weak, but it's up to you to understand how.
覆盖率在测试中确实能起作用,但不是作为测试设计的指导,而是作为一个大致的评估。在运行完测试后,看一下它们的覆盖率是多少。如果某个领域的代码没有覆盖到或覆盖率很低,可以确定你对它们的测试很肤浅。如果不是故意那样做的,你应该考虑重新设计它们以改进测试。覆盖率告诉你测试的哪个部分是薄弱的,但如何理解则取决于你。
You might not entirely ignore coverage. You might glance at the uncovered lines of code (possibly assisted by the programmer) to discover the kinds of tests you omitted. For example, you might scan the code to determine that you undertested a dialog box's error handling. Having done that, you step back and think of all the user errors the dialog box should handle, not how to provoke the error checks on line 343, 354, and 399. By rethinking design, you'll not only execute those lines, you might also discover that several other error checks are entirely missing. (Coverage can't tell you how well you would have exercised needed code that was left out of the program.)
你也不能完全忽略覆盖率。你可以浏览未覆盖的代码行(可能是在程序员的辅助下)以发现你忽略的某种测试。例如,你可能浏览代码以确定你是否对某个对话框的错误处理测试不足。在完成这些之后,你翻回头应该考虑对话框应该处理的所有用户错误,而不是检查第343、354和399行的错误。通过重新思考设计,你不仅能执行那些行,而且可能会发现几个其他完全被忽略了错误。(覆盖率不能告诉你程序之外的、所需要代码的执行情况)。
There are types of coverage that point more directly to design mistakes than statement coverage does (branch coverage, for example). However, none - and not all of them put together - are so accurate that they can be used as test design techniques.
还有几类覆盖率,比语句覆盖率更直接地指向设计错误(例如分支覆盖率)。但是,其他种类——即使把他们都放在一起——也不能够精确到用于测试用例设计技巧。
One final note: Romances with coverage don't seem to end with the former devotee wanting to be "just good friends". When, at the end of a year's use of coverage, it has not solved the testing problem, I find testing groups abandoning coverage entirely. That's a shame. When I test, I spend somewhat less than 5% of my time looking at coverage results, rethinking my test design, and writing some new tests to correct my mistakes. It's time well spent.
最后再说明一下:对覆盖率的兴趣似乎不能以从前的爱好者希望成为“好朋友”而结束。在使用了一年的覆盖率之后,它没有解决测试问题,我发现测试小组完全放弃了覆盖率。这是一件丢人的事情。当我测试的时候,我花大约5%的时间查看覆盖率结果,重新考虑我的测试用例设计,并编写一些新的测试用例校正我的错误。这个时间是值得花的。
文章来源于领测软件测试网 https://www.ltesting.net/